import pandas as pd
data = {'City': ['New York', 'London', 'Paris'],
'Temperature': [70, 60, 75]}
df = pd.DataFrame(data)
print(df)
City Temperature
0 New York 70
1 London 60
2 Paris 75
We will give a brief overview of Pandas in this chapter. We are going to discuss many of the features of Pandas in the latter chapters as part of our analysis of the climate data.
Also check out the excellent book on Python for Data Analysis by Wes Mckinney, https://wesmckinney.com/book/, that gives more information on using Python for data analysis. The book primarily uses Numpy and Pandas. We have learned a lot by reading the book and we highly recommend it. It is available Wes Mckinney’s website for free! What can be better?
We will first learn about the basic data structure in pandas, a DataFrame. Then, we will check some of the operations that we can perform on the data frame to manipulate the data.
In pandas, a DataFrame is a two-dimensional tabular data structure that is similar to a table in a relational database or a spreadsheet. It consists of rows and columns, where each column can have a different data type (e.g., numeric, string, boolean) and each row represents a separate observation or record.
Below is an example of tabular data with two columns or variables: City and Temperature.
City | Temperature (°F) |
---|---|
New York | 70 |
London | 60 |
Paris | 75 |
Tokyo | 68 |
Sydney | 72 |
DataFrames are a powerful tool for data manipulation, analysis, and visualization. They provide a convenient way to organize, manipulate, and analyze structured data. Some key features of DataFrames in pandas include:
Flexible indexing: DataFrames have row and column labels, which allow for easy indexing and selection of specific data points or subsets of the data. In this case, the columns are “City” and “Temperature”. We will see how to use indexing later in the book.
Data alignment: DataFrames automatically align data based on their row and column labels, making it easy to perform operations on multiple columns or rows simultaneously. This is very important for our visualization exercises later in the book.
Data cleaning and transformation: DataFrames provide a wide range of functions and methods for cleaning and transforming data, such as removing missing values, handling duplicates, and applying mathematical or statistical operations to the data.
Data visualization: DataFrames can be easily visualized using various plotting libraries, such as Matplotlib or Plotly. They provide built-in functions for creating different types of plots, including line plots, bar plots, scatter plots, and more. We will use many of them later in the book.
To create a DataFrame in pandas, we can pass various types of data structures, such as lists, dictionaries, or NumPy arrays, to the pd.DataFrame() constructor. You can also read data from external sources, such as CSV files or databases, using functions like pd.read_csv().
Once a DataFrame is created, you can perform various operations on it, such as selecting specific columns or rows, filtering data based on conditions, aggregating data, merging or joining multiple DataFrames, and more.
Here’s an example of creating a simple DataFrame in pandas:
import pandas as pd
data = {'City': ['New York', 'London', 'Paris'],
'Temperature': [70, 60, 75]}
df = pd.DataFrame(data)
print(df)
City Temperature
0 New York 70
1 London 60
2 Paris 75
What did we do in this code?
We import pandas
Here, we define a dictionary called data. The dictionary has two keys: ‘City’ and ‘Temperature’. The corresponding values are lists of strings and integers, respectively. This data represents the cities and their (average) temperatures. But when is the temperature recorded? We need more variables or columns.
This line creates a DataFrame object using the pd.DataFrame() function from the pandas library. The DataFrame is a two-dimensional table-like data structure that can hold data in rows and columns. We pass the data dictionary as an argument to create the DataFrame. The keys of the dictionary become the column names, and the values become the data in the columns.
We can see from the output that
Most of the climate related data that we use will be in a tabular format, many a time in CSV format. The CSV (comma-separated values) format is common non-proprietary format for storing tabular data stored as plain text.
Let us read a real data set, that is in the CSV format, and show some operations that we can do on the data frame.
We will revisit all these operations in latter chapters in more depth.
Since we are in Atlanta, we will start with the climate data at the Hartsfield-Jackson Atlanta International Airport. It is the busiest airport in the world and you may have connected through Atlanta before!
We will not go through all the details of the dataset here as the main goal is to just show how we can access a CSV file over the internet.
by the way, if you see an error below, it is because the National Center for Environmental Information (NCEI) that hosts the data was hit by Hurricane Helene and the data center is down. It is ironical but highlights the reality of climate change and extreme weather.
The first line imports the Pandas library and assigns it the alias pd. Remember, as we mentioned in the installation chapter, we have all the tools in the pandas library available to us now, including reading a CSV file.
The second line reads a CSV file from a given URL and loads it into a Pandas DataFrame.
The pd.read_csv function is a versatile method for reading CSV files, whether they are stored locally or accessible via a URL. In this case, the URL points to a dataset hosted by the National Centers for Environmental Information (NCEI), which contains global summary data of the month for the ATL airport.
The result of this operation is stored in the variable df, which is a Pandas DataFrame. A DataFrame is a two-dimensional tabular data structure similar to a table in a database or an Excel spreadsheet.
Let us look at what is in this CSV file or dataset!
By the way, we chose this particular CSV file as it is relatively clean and easy to handle. We will see how the other datasets that we handle in this book can be more messy 🥺
It is always a good idea to get a sense of the dataset before doing any operations on it.
We can display the first few rows of the dataset with the following code
STATION | DATE | LATITUDE | LONGITUDE | ELEVATION | NAME | ADPT | ADPT_ATTRIBUTES | ASLP | ASLP_ATTRIBUTES | ... | WSF1 | WSF1_ATTRIBUTES | WSF2 | WSF2_ATTRIBUTES | WSF5 | WSF5_ATTRIBUTES | WSFG | WSFG_ATTRIBUTES | WSFM | WSFM_ATTRIBUTES | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | USW00013874 | 1930-01 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | USW00013874 | 1930-02 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | USW00013874 | 1930-03 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | USW00013874 | 1930-04 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | USW00013874 | 1930-05 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 110 columns
The code displays the first 5 rows of the DataFrame by default. You can specify the number of rows to display by passing an integer as an argument.
STATION | DATE | LATITUDE | LONGITUDE | ELEVATION | NAME | ADPT | ADPT_ATTRIBUTES | ASLP | ASLP_ATTRIBUTES | ... | WSF1 | WSF1_ATTRIBUTES | WSF2 | WSF2_ATTRIBUTES | WSF5 | WSF5_ATTRIBUTES | WSFG | WSFG_ATTRIBUTES | WSFM | WSFM_ATTRIBUTES | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | USW00013874 | 1930-01 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | USW00013874 | 1930-02 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | USW00013874 | 1930-03 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | USW00013874 | 1930-04 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | USW00013874 | 1930-05 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | USW00013874 | 1930-06 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | USW00013874 | 1930-07 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | USW00013874 | 1930-08 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | USW00013874 | 1930-09 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | USW00013874 | 1930-10 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 110 columns
Similarly, we display the last 5 rows of the DataFrame by default. Or we can specify the number of rows to display by passing an integer as an argument (e.g., df.tail(10)).
STATION | DATE | LATITUDE | LONGITUDE | ELEVATION | NAME | ADPT | ADPT_ATTRIBUTES | ASLP | ASLP_ATTRIBUTES | ... | WSF1 | WSF1_ATTRIBUTES | WSF2 | WSF2_ATTRIBUTES | WSF5 | WSF5_ATTRIBUTES | WSFG | WSFG_ATTRIBUTES | WSFM | WSFM_ATTRIBUTES | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1132 | USW00013874 | 2024-05 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | 16.03 | ,,,W | 1013.5 | ,,,W | ... | NaN | NaN | 13.9 | ,W | 19.7 | ,W | NaN | NaN | NaN | NaN |
1133 | USW00013874 | 2024-06 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | 17.49 | ,,,W | 1015.0 | ,,,W | ... | NaN | NaN | 11.6 | ,W | 20.6 | ,W | NaN | NaN | NaN | NaN |
1134 | USW00013874 | 2024-07 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | 21.32 | ,,,W | 1017.1 | ,,,W | ... | NaN | NaN | 15.7 | ,W | 20.1 | ,W | NaN | NaN | NaN | NaN |
1135 | USW00013874 | 2024-08 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | 14.8 | ,W | 17.9 | ,W | NaN | NaN | NaN | NaN |
1136 | USW00013874 | 2024-09 | 33.62972 | -84.44224 | 308.2 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 110 columns
We can get a summary of the dataset, including the number of non-null entries, data types of columns, and memory usage.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 977.2+ KB
We can see that the DataFrame has 1137 rows and 110 columns. We can also get a sense of the type of data. We have 58 columns with a float64 datatype and another 52 with an object datatype (which is typically used for string or mixed data types).
The approximate amount of memory used to store this DataFrame is relatively small at 977.2 KB.
We can generate descriptive statistics for numerical columns, including count, mean, standard deviation, min, max, and quartiles.
LATITUDE | LONGITUDE | ELEVATION | ADPT | ASLP | ASTP | AWBT | AWND | CDSD | CLDD | ... | WDF1 | WDF2 | WDF5 | WDFG | WDFM | WSF1 | WSF2 | WSF5 | WSFG | WSFM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1.137000e+03 | 1.137000e+03 | 1137.0 | 220.000000 | 220.000000 | 220.000000 | 220.000000 | 488.000000 | 1137.000000 | 1137.000000 | ... | 230.000000 | 349.000000 | 349.000000 | 308.000000 | 137.000000 | 230.000000 | 349.000000 | 349.000000 | 308.000000 | 137.000000 |
mean | 3.362972e+01 | -8.444224e+01 | 308.2 | 10.336455 | 1017.975000 | 981.115909 | 13.730045 | 3.679303 | 502.067986 | 86.153738 | ... | 233.956522 | 247.048711 | 244.670487 | 252.178571 | 237.153285 | 13.050435 | 14.214327 | 18.668481 | 18.864286 | 14.191241 |
std | 7.108554e-15 | 1.421711e-14 | 0.0 | 7.391857 | 2.489038 | 2.051477 | 6.546238 | 0.612600 | 449.442361 | 99.706672 | ... | 96.960343 | 101.115334 | 97.401396 | 86.333026 | 92.045376 | 2.976102 | 2.313260 | 3.485740 | 4.151514 | 3.397847 |
min | 3.362972e+01 | -8.444224e+01 | 308.2 | -6.650000 | 1011.600000 | 975.500000 | -0.410000 | 2.400000 | 0.000000 | 0.000000 | ... | 20.000000 | 10.000000 | 10.000000 | 23.000000 | 45.000000 | 7.600000 | 8.900000 | 10.300000 | 11.300000 | 8.500000 |
25% | 3.362972e+01 | -8.444224e+01 | 308.2 | 3.957500 | 1016.200000 | 979.700000 | 7.695000 | 3.200000 | 16.300000 | 0.500000 | ... | 150.000000 | 190.000000 | 200.000000 | 225.000000 | 180.000000 | 11.200000 | 13.000000 | 16.100000 | 15.900000 | 12.100000 |
50% | 3.362972e+01 | -8.444224e+01 | 308.2 | 9.845000 | 1017.450000 | 980.900000 | 13.710000 | 3.700000 | 444.000000 | 28.300000 | ... | 280.000000 | 290.000000 | 290.000000 | 270.000000 | 270.000000 | 12.750000 | 13.900000 | 18.300000 | 18.500000 | 13.400000 |
75% | 3.362972e+01 | -8.444224e+01 | 308.2 | 17.385000 | 1019.600000 | 982.300000 | 20.105000 | 4.100000 | 940.000000 | 177.100000 | ... | 310.000000 | 320.000000 | 320.000000 | 315.000000 | 315.000000 | 14.300000 | 15.700000 | 20.600000 | 20.600000 | 15.200000 |
max | 3.362972e+01 | -8.444224e+01 | 308.2 | 21.650000 | 1025.500000 | 987.300000 | 23.440000 | 5.900000 | 1442.200000 | 354.800000 | ... | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 360.000000 | 26.800000 | 21.900000 | 34.000000 | 34.400000 | 30.800000 |
8 rows × 58 columns
Note that the number of columns is the same as the number of float64 datatype in the DataFrame. 58. In case, some of the data is mixed type or for some reason stored as an object datatype, it won’t be included in these descriptive stats.
Let us access one column from this dataset.
0 308.2
1 308.2
2 308.2
3 308.2
4 308.2
...
1132 308.2
1133 308.2
1134 308.2
1135 308.2
1136 308.2
Name: ELEVATION, Length: 1137, dtype: float64
By the way, what is 308.2? what are the units? From the Wikipedia article on ATL airport, it appears to be height about mean sea level in meters.
count 1137.0
mean 308.2
std 0.0
min 308.2
25% 308.2
50% 308.2
75% 308.2
max 308.2
Name: ELEVATION, dtype: float64
Not surprisingly, the elevation of ATL airport is always the same, at 308.2. Hence, the standard deviation is 0.
What if we want to access multiple columns?
Index(['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'NAME', 'ADPT',
'ADPT_ATTRIBUTES', 'ASLP', 'ASLP_ATTRIBUTES',
...
'WSF1', 'WSF1_ATTRIBUTES', 'WSF2', 'WSF2_ATTRIBUTES', 'WSF5',
'WSF5_ATTRIBUTES', 'WSFG', 'WSFG_ATTRIBUTES', 'WSFM',
'WSFM_ATTRIBUTES'],
dtype='object', length=110)
We only see a few columns in the print statement as there are 110 columns. Let us try:
['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'NAME', 'ADPT', 'ADPT_ATTRIBUTES', 'ASLP', 'ASLP_ATTRIBUTES', 'ASTP', 'ASTP_ATTRIBUTES', 'AWBT', 'AWBT_ATTRIBUTES', 'AWND', 'AWND_ATTRIBUTES', 'CDSD', 'CDSD_ATTRIBUTES', 'CLDD', 'CLDD_ATTRIBUTES', 'DP01', 'DP01_ATTRIBUTES', 'DP10', 'DP10_ATTRIBUTES', 'DP1X', 'DP1X_ATTRIBUTES', 'DSND', 'DSND_ATTRIBUTES', 'DSNW', 'DSNW_ATTRIBUTES', 'DT00', 'DT00_ATTRIBUTES', 'DT32', 'DT32_ATTRIBUTES', 'DX32', 'DX32_ATTRIBUTES', 'DX70', 'DX70_ATTRIBUTES', 'DX90', 'DX90_ATTRIBUTES', 'DYFG', 'DYFG_ATTRIBUTES', 'DYHF', 'DYHF_ATTRIBUTES', 'DYNT', 'DYNT_ATTRIBUTES', 'DYSD', 'DYSD_ATTRIBUTES', 'DYSN', 'DYSN_ATTRIBUTES', 'DYTS', 'DYTS_ATTRIBUTES', 'DYXP', 'DYXP_ATTRIBUTES', 'DYXT', 'DYXT_ATTRIBUTES', 'EMNT', 'EMNT_ATTRIBUTES', 'EMSD', 'EMSD_ATTRIBUTES', 'EMSN', 'EMSN_ATTRIBUTES', 'EMXP', 'EMXP_ATTRIBUTES', 'EMXT', 'EMXT_ATTRIBUTES', 'HDSD', 'HDSD_ATTRIBUTES', 'HTDD', 'HTDD_ATTRIBUTES', 'PRCP', 'PRCP_ATTRIBUTES', 'PSUN', 'PSUN_ATTRIBUTES', 'RHAV', 'RHAV_ATTRIBUTES', 'RHMN', 'RHMN_ATTRIBUTES', 'RHMX', 'RHMX_ATTRIBUTES', 'SNOW', 'SNOW_ATTRIBUTES', 'TAVG', 'TAVG_ATTRIBUTES', 'TMAX', 'TMAX_ATTRIBUTES', 'TMIN', 'TMIN_ATTRIBUTES', 'TSUN', 'TSUN_ATTRIBUTES', 'WDF1', 'WDF1_ATTRIBUTES', 'WDF2', 'WDF2_ATTRIBUTES', 'WDF5', 'WDF5_ATTRIBUTES', 'WDFG', 'WDFG_ATTRIBUTES', 'WDFM', 'WDFM_ATTRIBUTES', 'WSF1', 'WSF1_ATTRIBUTES', 'WSF2', 'WSF2_ATTRIBUTES', 'WSF5', 'WSF5_ATTRIBUTES', 'WSFG', 'WSFG_ATTRIBUTES', 'WSFM', 'WSFM_ATTRIBUTES']
Now we see all the columns. Let us select, say STATION, DATE and TAVG (average temperature)
STATION | DATE | TAVG | |
---|---|---|---|
0 | USW00013874 | 1930-01 | 6.00 |
1 | USW00013874 | 1930-02 | 11.03 |
2 | USW00013874 | 1930-03 | 9.60 |
3 | USW00013874 | 1930-04 | 17.48 |
4 | USW00013874 | 1930-05 | 21.36 |
... | ... | ... | ... |
1132 | USW00013874 | 2024-05 | 23.52 |
1133 | USW00013874 | 2024-06 | 27.48 |
1134 | USW00013874 | 2024-07 | 27.97 |
1135 | USW00013874 | 2024-08 | 27.86 |
1136 | USW00013874 | 2024-09 | 24.59 |
1137 rows × 3 columns
Let us select a particular index value. Say 100th entry.
STATION USW00013874
DATE 1938-05
LATITUDE 33.62972
LONGITUDE -84.44224
ELEVATION 308.2
...
WSF5_ATTRIBUTES NaN
WSFG NaN
WSFG_ATTRIBUTES NaN
WSFM NaN
WSFM_ATTRIBUTES NaN
Name: 100, Length: 110, dtype: object
The 100th entry seems to be May 1938. We also notice a number of NaN values.
NaN stands for “Not a Number” and is a standard floating-point representation of missing data in Pandas. It is used to denote missing or null values in a DataFrame. Understanding and handling NaN values is crucial for data cleaning and preprocessing.
Note NaN doesn’t necessarily mean that you can set it to 0!
We can use .isna() method to check whether a specific row has a NaN value.
STATION False
DATE False
LATITUDE False
LONGITUDE False
ELEVATION False
...
WSF5_ATTRIBUTES True
WSFG True
WSFG_ATTRIBUTES True
WSFM True
WSFM_ATTRIBUTES True
Name: 100, Length: 110, dtype: bool
We can get a summary of NaN values to understand the columns.
STATION 0
DATE 0
LATITUDE 0
LONGITUDE 0
ELEVATION 0
...
WSF5_ATTRIBUTES 788
WSFG 829
WSFG_ATTRIBUTES 829
WSFM 1000
WSFM_ATTRIBUTES 1000
Length: 110, dtype: int64
How can we drop NaN values?
STATION | DATE | LATITUDE | LONGITUDE | ELEVATION | NAME | ADPT | ADPT_ATTRIBUTES | ASLP | ASLP_ATTRIBUTES | ... | WSF1 | WSF1_ATTRIBUTES | WSF2 | WSF2_ATTRIBUTES | WSF5 | WSF5_ATTRIBUTES | WSFG | WSFG_ATTRIBUTES | WSFM | WSFM_ATTRIBUTES |
---|
0 rows × 110 columns
What happened? Why did the dataset has 0 rows? Using dropna() drops all rows where any of the columns has a missing value. It seems that all of the rows have at least one column with a missing value!
How about we drop only if a specific column has a missing value? First, we need to subset the data using a subset parameter. It allows us to specify which column(s) to check for missing values, in this case TAVG
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 977.2+ KB
Now we can see that only one observation is dropped and we have 1136 rows.
We can fill the NaN or substitute NaN with a specific value. We need to be very careful with this operation as missing or NaN is not the same as a zero value. For example, TAVG is missing means that the average is not computed because the station may have some technical difficulties or may be in maintenance and the temperature is not measured. It doesn’t mean it is 0!.
But in case, we need to do it, we can use .fillna() method
by the way, we created new datasets df_cleaned and df_filled so that the original dataset df is not changed. It is important to keep track of the datasets and columns.
How about if we want to get more than one column? Note, we are not saving the results.
STATION USW00013874
DATE 1938-05
TAVG 21.8
Name: 100, dtype: object
We can also filter rows by a condition.
<class 'pandas.core.frame.DataFrame'>
Index: 459 entries, 4 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 398.0+ KB
We can see that the number of rows have come down from 1137 to 458.
That’s because we only kept observations or rows where the average temperature, TAVG exceeded 20F
If we increase the threshold to 35 degree FH, we will see that there are fewer rows that satisfy the condition, only 226 out of the 1137 that we started.
<class 'pandas.core.frame.DataFrame'>
Index: 226 entries, 6 to 1135
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 196.0+ KB
How about selecting a few columns from the dataset and create a smaller dataset? Let’s say, DATE and TVG?
DATE TAVG
0 1930-01 6.00
1 1930-02 11.03
2 1930-03 9.60
3 1930-04 17.48
4 1930-05 21.36
... ... ...
1132 2024-05 23.52
1133 2024-06 27.48
1134 2024-07 27.97
1135 2024-08 27.86
1136 2024-09 24.59
[1137 rows x 2 columns]
Let’s break it down step by step:
Let us look at a few more useful functions
DATE | TAVG | |
---|---|---|
0 | 1930-01 | 6.00 |
1 | 1930-02 | 11.03 |
2 | 1930-03 | 9.60 |
3 | 1930-04 | 17.48 |
4 | 1930-05 | 21.36 |
The dataset is already sorted by DATE so we don’t see any change.
What if we want to sort in descending order, that is, have the latest date first?
DATE | TAVG | |
---|---|---|
1136 | 2024-09 | 24.59 |
1135 | 2024-08 | 27.86 |
1134 | 2024-07 | 27.97 |
1133 | 2024-06 | 27.48 |
1132 | 2024-05 | 23.52 |
We can see that monthly average for September of 2024 is not computed, but we have August 2024 data.
Let us look at the top temperatures for the ATL airport during 1930-2024. We can sort by TAVG and use the head function as the data is already sorted.
DATE | TAVG | |
---|---|---|
931 | 2007-08 | 29.78 |
762 | 1993-07 | 29.63 |
606 | 1980-07 | 29.49 |
786 | 1995-07 | 29.06 |
678 | 1986-07 | 28.93 |
We can see that July seems to be the month where 4 of the 5 average hottest temperatures are recorded at the ATL airport. The highest average monthly temperature seems to be 29.78°C or approximately 85.60°F. It does feel much hotter in August!
We can also get the largest values and smallest values without sorting
931 29.78
762 29.63
606 29.49
786 29.06
678 28.93
Name: TAVG, dtype: float64
The nlargest method is used to retrieve the top n largest values from the Series. In this case, n is set to 5, meaning the method will return the five largest values from the ‘TAVG’ column. The result is another Series containing these top values, sorted in descending order.
How about the lowest temperatures? Also, it would be good to have the DATE also as we don’t when the highest or lowest temperatures occurred.
DATE | TAVG | |
---|---|---|
564 | 1977-01 | -1.51 |
120 | 1940-01 | -1.31 |
576 | 1978-01 | 0.90 |
407 | 1963-12 | 1.92 |
480 | 1970-01 | 2.18 |
df[[‘DATE’, ‘TAVG’]]: This part of the code selects two columns from the DataFrame df, namely ‘DATE’ and ‘TAVG’. The double square brackets [[‘DATE’, ‘TAVG’]] are used to create a new DataFrame that contains only these two columns. This is useful when you want to focus on specific columns for further operations.
.nsmallest(5, ‘TAVG’): The nsmallest method is then called on the newly created DataFrame. This method is used to return the first n rows ordered by columns in ascending order. In this case, n is set to 5, meaning the method will return the 5 rows with the smallest values in the ‘TAVG’ column. The second argument, ‘TAVG’, specifies the column to sort by.
It looks like at the ATL airport, rarely does the average monthly temperature dip below zero Celsius!
We have used [[]] keep some columns before. Let us add one more column to the list.
DATE | TAVG | ADPT | |
---|---|---|---|
0 | 1930-01 | 6.00 | NaN |
1 | 1930-02 | 11.03 | NaN |
2 | 1930-03 | 9.60 | NaN |
3 | 1930-04 | 17.48 | NaN |
4 | 1930-05 | 21.36 | NaN |
... | ... | ... | ... |
1132 | 2024-05 | 23.52 | 16.03 |
1133 | 2024-06 | 27.48 | 17.49 |
1134 | 2024-07 | 27.97 | 21.32 |
1135 | 2024-08 | 27.86 | NaN |
1136 | 2024-09 | 24.59 | NaN |
1137 rows × 3 columns
Let’s look at code to drop some columns.
Index(['DATE', 'TAVG'], dtype='object')
We see that instead of 3 columns, we have 2 columns and ADPT is no longer in the column list. By the way we used .columns to get the list of columns in the dataset
DATE | TAVG | |
---|---|---|
0 | 1930-01 | 6.00 |
1 | 1930-02 | 11.03 |
2 | 1930-03 | 9.60 |
3 | 1930-04 | 17.48 |
4 | 1930-05 | 21.36 |
... | ... | ... |
1132 | 2024-05 | 23.52 |
1133 | 2024-06 | 27.48 |
1134 | 2024-07 | 27.97 |
1135 | 2024-08 | 27.86 |
1136 | 2024-09 | 24.59 |
1137 rows × 2 columns
Sometimes, we may need to rename the column names.
The code typically looks like this:
For our dataset, let us rename ‘NAME’ to ‘STATION_NAME’
df_columns = df[['DATE', 'TAVG', 'NAME']]
df_columns = df_columns.rename(columns={'NAME': 'STATION_NAME'})
df_columns.columns
Index(['DATE', 'TAVG', 'STATION_NAME'], dtype='object')
We can see the column name is changed. We can also see it by printing the dataset:
DATE | TAVG | STATION_NAME | |
---|---|---|---|
0 | 1930-01 | 6.00 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
1 | 1930-02 | 11.03 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
2 | 1930-03 | 9.60 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
3 | 1930-04 | 17.48 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
4 | 1930-05 | 21.36 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
... | ... | ... | ... |
1132 | 2024-05 | 23.52 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
1133 | 2024-06 | 27.48 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
1134 | 2024-07 | 27.97 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
1135 | 2024-08 | 27.86 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
1136 | 2024-09 | 24.59 | ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... |
1137 rows × 3 columns
Let us look at a general purpose .apply function.
The code typically looks like this
Let us say, we want to create a new variable where we take the average temperature in Celsius and convert to °F.
We need to do two things
import pandas as pd
def Celsius_to_Fahrenheit(celsius):
return (celsius * 9/5) + 32
df_columns = df[['DATE', 'TAVG']]
df_columns = df_columns.rename(columns={'TAVG': 'TAVG_C'})
# Apply the celsius_to_fahrenheit function to each element in the 'TAVG' column
df_columns['TAVG_F'] = df_columns['TAVG_C'].apply(Celsius_to_Fahrenheit)
df_columns
DATE | TAVG_C | TAVG_F | |
---|---|---|---|
0 | 1930-01 | 6.00 | 42.800 |
1 | 1930-02 | 11.03 | 51.854 |
2 | 1930-03 | 9.60 | 49.280 |
3 | 1930-04 | 17.48 | 63.464 |
4 | 1930-05 | 21.36 | 70.448 |
... | ... | ... | ... |
1132 | 2024-05 | 23.52 | 74.336 |
1133 | 2024-06 | 27.48 | 81.464 |
1134 | 2024-07 | 27.97 | 82.346 |
1135 | 2024-08 | 27.86 | 82.148 |
1136 | 2024-09 | 24.59 | 76.262 |
1137 rows × 3 columns
Let us break down the code step by step
It is easier to show the code where we actually need it. So, we will close this chapter with writing the data back to a CSV file or an Excel file so that we have a copy of the file.
Similarly, we can write the data to an Excel file
We started by reading a CSV or Excel file, and we ended this chapter by writing the data back as a CSV or Excel file.