3 Pandas

We will give a brief overview of Pandas in this chapter. We are going to discuss many of the features of Pandas in the latter chapters as part of our analysis of the climate data.

Also check out the excellent book on Python for Data Analysis by Wes Mckinney, https://wesmckinney.com/book/, that gives more information on using Python for data analysis. The book primarily uses Numpy and Pandas. We have learned a lot by reading the book and we highly recommend it. It is available Wes Mckinney’s website for free! What can be better?

3.1 Pandas DataFrame

We will first learn about the basic data structure in pandas, a DataFrame. Then, we will check some of the operations that we can perform on the data frame to manipulate the data.

In pandas, a DataFrame is a two-dimensional tabular data structure that is similar to a table in a relational database or a spreadsheet. It consists of rows and columns, where each column can have a different data type (e.g., numeric, string, boolean) and each row represents a separate observation or record.

Below is an example of tabular data with two columns or variables: City and Temperature.

City	Temperature (°F)
New York	70
London	60
Paris	75
Tokyo	68
Sydney	72

DataFrames are a powerful tool for data manipulation, analysis, and visualization. They provide a convenient way to organize, manipulate, and analyze structured data. Some key features of DataFrames in pandas include:

Flexible indexing: DataFrames have row and column labels, which allow for easy indexing and selection of specific data points or subsets of the data. In this case, the columns are “City” and “Temperature”. We will see how to use indexing later in the book.
Data alignment: DataFrames automatically align data based on their row and column labels, making it easy to perform operations on multiple columns or rows simultaneously. This is very important for our visualization exercises later in the book.
Data cleaning and transformation: DataFrames provide a wide range of functions and methods for cleaning and transforming data, such as removing missing values, handling duplicates, and applying mathematical or statistical operations to the data.
Data visualization: DataFrames can be easily visualized using various plotting libraries, such as Matplotlib or Plotly. They provide built-in functions for creating different types of plots, including line plots, bar plots, scatter plots, and more. We will use many of them later in the book.

To create a DataFrame in pandas, we can pass various types of data structures, such as lists, dictionaries, or NumPy arrays, to the pd.DataFrame() constructor. You can also read data from external sources, such as CSV files or databases, using functions like pd.read_csv().

Once a DataFrame is created, you can perform various operations on it, such as selecting specific columns or rows, filtering data based on conditions, aggregating data, merging or joining multiple DataFrames, and more.

Here’s an example of creating a simple DataFrame in pandas:

import pandas as pd
data = {'City': ['New York', 'London', 'Paris'],
   'Temperature': [70, 60, 75]}

df = pd.DataFrame(data)
print(df)

       City  Temperature
0  New York           70
1    London           60
2     Paris           75

What did we do in this code?

We import pandas
Here, we define a dictionary called data. The dictionary has two keys: ‘City’ and ‘Temperature’. The corresponding values are lists of strings and integers, respectively. This data represents the cities and their (average) temperatures. But when is the temperature recorded? We need more variables or columns.
This line creates a DataFrame object using the pd.DataFrame() function from the pandas library. The DataFrame is a two-dimensional table-like data structure that can hold data in rows and columns. We pass the data dictionary as an argument to create the DataFrame. The keys of the dictionary become the column names, and the values become the data in the columns.

We can see from the output that

the DataFrame has two columns: City and Temperature.
There are three records or observation.
Also note that the index starts at 0.

3.1.1 CSV Files

Most of the climate related data that we use will be in a tabular format, many a time in CSV format. The CSV (comma-separated values) format is common non-proprietary format for storing tabular data stored as plain text.

Let us read a real data set, that is in the CSV format, and show some operations that we can do on the data frame.

We will revisit all these operations in latter chapters in more depth.

Since we are in Atlanta, we will start with the climate data at the Hartsfield-Jackson Atlanta International Airport. It is the busiest airport in the world and you may have connected through Atlanta before!

We will not go through all the details of the dataset here as the main goal is to just show how we can access a CSV file over the internet.

Warning

by the way, if you see an error below, it is because the National Center for Environmental Information (NCEI) that hosts the data was hit by Hurricane Helene and the data center is down. It is ironical but highlights the reality of climate change and extreme weather. NCEI Asheville Outage

import pandas as pd
df = pd.read_csv("https://www.ncei.noaa.gov/data/global-summary-of-the-month/access/USW00013874.csv")

The first line imports the Pandas library and assigns it the alias pd. Remember, as we mentioned in the installation chapter, we have all the tools in the pandas library available to us now, including reading a CSV file.

The second line reads a CSV file from a given URL and loads it into a Pandas DataFrame.

The pd.read_csv function is a versatile method for reading CSV files, whether they are stored locally or accessible via a URL. In this case, the URL points to a dataset hosted by the National Centers for Environmental Information (NCEI), which contains global summary data of the month for the ATL airport.

The result of this operation is stored in the variable df, which is a Pandas DataFrame. A DataFrame is a two-dimensional tabular data structure similar to a table in a database or an Excel spreadsheet.

Let us look at what is in this CSV file or dataset!

By the way, we chose this particular CSV file as it is relatively clean and easy to handle. We will see how the other datasets that we handle in this book can be more messy 🥺

It is always a good idea to get a sense of the dataset before doing any operations on it.

We can display the first few rows of the dataset with the following code

df.head()

	STATION	DATE	LATITUDE	LONGITUDE	ELEVATION	NAME	ADPT	ADPT_ATTRIBUTES	ASLP	ASLP_ATTRIBUTES	...	WSF1	WSF1_ATTRIBUTES	WSF2	WSF2_ATTRIBUTES	WSF5	WSF5_ATTRIBUTES	WSFG	WSFG_ATTRIBUTES	WSFM	WSFM_ATTRIBUTES
0	USW00013874	1930-01	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	USW00013874	1930-02	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	USW00013874	1930-03	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	USW00013874	1930-04	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	USW00013874	1930-05	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 110 columns

The code displays the first 5 rows of the DataFrame by default. You can specify the number of rows to display by passing an integer as an argument.

df.head(10)

	STATION	DATE	LATITUDE	LONGITUDE	ELEVATION	NAME	ADPT	ADPT_ATTRIBUTES	ASLP	ASLP_ATTRIBUTES	...	WSF1	WSF1_ATTRIBUTES	WSF2	WSF2_ATTRIBUTES	WSF5	WSF5_ATTRIBUTES	WSFG	WSFG_ATTRIBUTES	WSFM	WSFM_ATTRIBUTES
0	USW00013874	1930-01	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	USW00013874	1930-02	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	USW00013874	1930-03	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	USW00013874	1930-04	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	USW00013874	1930-05	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	USW00013874	1930-06	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	USW00013874	1930-07	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	USW00013874	1930-08	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	USW00013874	1930-09	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
9	USW00013874	1930-10	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

10 rows × 110 columns

Similarly, we display the last 5 rows of the DataFrame by default. Or we can specify the number of rows to display by passing an integer as an argument (e.g., df.tail(10)).

df.tail()

	STATION	DATE	LATITUDE	LONGITUDE	ELEVATION	NAME	ADPT	ADPT_ATTRIBUTES	ASLP	ASLP_ATTRIBUTES	...	WSF1	WSF1_ATTRIBUTES	WSF2	WSF2_ATTRIBUTES	WSF5	WSF5_ATTRIBUTES	WSFG	WSFG_ATTRIBUTES	WSFM	WSFM_ATTRIBUTES
1132	USW00013874	2024-05	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	16.03	,,,W	1013.5	,,,W	...	NaN	NaN	13.9	,W	19.7	,W	NaN	NaN	NaN	NaN
1133	USW00013874	2024-06	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	17.49	,,,W	1015.0	,,,W	...	NaN	NaN	11.6	,W	20.6	,W	NaN	NaN	NaN	NaN
1134	USW00013874	2024-07	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	21.32	,,,W	1017.1	,,,W	...	NaN	NaN	15.7	,W	20.1	,W	NaN	NaN	NaN	NaN
1135	USW00013874	2024-08	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	14.8	,W	17.9	,W	NaN	NaN	NaN	NaN
1136	USW00013874	2024-09	33.62972	-84.44224	308.2	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 110 columns

We can get a summary of the dataset, including the number of non-null entries, data types of columns, and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 977.2+ KB

We can see that the DataFrame has 1137 rows and 110 columns. We can also get a sense of the type of data. We have 58 columns with a float64 datatype and another 52 with an object datatype (which is typically used for string or mixed data types).

The approximate amount of memory used to store this DataFrame is relatively small at 977.2 KB.

We can generate descriptive statistics for numerical columns, including count, mean, standard deviation, min, max, and quartiles.

df.describe()

	LATITUDE	LONGITUDE	ELEVATION	ADPT	ASLP	ASTP	AWBT	AWND	CDSD	CLDD	...	WDF1	WDF2	WDF5	WDFG	WDFM	WSF1	WSF2	WSF5	WSFG	WSFM
count	1.137000e+03	1.137000e+03	1137.0	220.000000	220.000000	220.000000	220.000000	488.000000	1137.000000	1137.000000	...	230.000000	349.000000	349.000000	308.000000	137.000000	230.000000	349.000000	349.000000	308.000000	137.000000
mean	3.362972e+01	-8.444224e+01	308.2	10.336455	1017.975000	981.115909	13.730045	3.679303	502.067986	86.153738	...	233.956522	247.048711	244.670487	252.178571	237.153285	13.050435	14.214327	18.668481	18.864286	14.191241
std	7.108554e-15	1.421711e-14	0.0	7.391857	2.489038	2.051477	6.546238	0.612600	449.442361	99.706672	...	96.960343	101.115334	97.401396	86.333026	92.045376	2.976102	2.313260	3.485740	4.151514	3.397847
min	3.362972e+01	-8.444224e+01	308.2	-6.650000	1011.600000	975.500000	-0.410000	2.400000	0.000000	0.000000	...	20.000000	10.000000	10.000000	23.000000	45.000000	7.600000	8.900000	10.300000	11.300000	8.500000
25%	3.362972e+01	-8.444224e+01	308.2	3.957500	1016.200000	979.700000	7.695000	3.200000	16.300000	0.500000	...	150.000000	190.000000	200.000000	225.000000	180.000000	11.200000	13.000000	16.100000	15.900000	12.100000
50%	3.362972e+01	-8.444224e+01	308.2	9.845000	1017.450000	980.900000	13.710000	3.700000	444.000000	28.300000	...	280.000000	290.000000	290.000000	270.000000	270.000000	12.750000	13.900000	18.300000	18.500000	13.400000
75%	3.362972e+01	-8.444224e+01	308.2	17.385000	1019.600000	982.300000	20.105000	4.100000	940.000000	177.100000	...	310.000000	320.000000	320.000000	315.000000	315.000000	14.300000	15.700000	20.600000	20.600000	15.200000
max	3.362972e+01	-8.444224e+01	308.2	21.650000	1025.500000	987.300000	23.440000	5.900000	1442.200000	354.800000	...	360.000000	360.000000	360.000000	360.000000	360.000000	26.800000	21.900000	34.000000	34.400000	30.800000

8 rows × 58 columns

Note that the number of columns is the same as the number of float64 datatype in the DataFrame. 58. In case, some of the data is mixed type or for some reason stored as an object datatype, it won’t be included in these descriptive stats.

Let us access one column from this dataset.

df['ELEVATION']

0       308.2
1       308.2
2       308.2
3       308.2
4       308.2
        ...  
1132    308.2
1133    308.2
1134    308.2
1135    308.2
1136    308.2
Name: ELEVATION, Length: 1137, dtype: float64

By the way, what is 308.2? what are the units? From the Wikipedia article on ATL airport, it appears to be height about mean sea level in meters.

df['ELEVATION'].describe()

count    1137.0
mean      308.2
std         0.0
min       308.2
25%       308.2
50%       308.2
75%       308.2
max       308.2
Name: ELEVATION, dtype: float64

Not surprisingly, the elevation of ATL airport is always the same, at 308.2. Hence, the standard deviation is 0.

What if we want to access multiple columns?

columns = df.columns
print(columns)

Index(['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'NAME', 'ADPT',
       'ADPT_ATTRIBUTES', 'ASLP', 'ASLP_ATTRIBUTES',
       ...
       'WSF1', 'WSF1_ATTRIBUTES', 'WSF2', 'WSF2_ATTRIBUTES', 'WSF5',
       'WSF5_ATTRIBUTES', 'WSFG', 'WSFG_ATTRIBUTES', 'WSFM',
       'WSFM_ATTRIBUTES'],
      dtype='object', length=110)

We only see a few columns in the print statement as there are 110 columns. Let us try:

columns_list = list(columns)
print(columns_list)

['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'NAME', 'ADPT', 'ADPT_ATTRIBUTES', 'ASLP', 'ASLP_ATTRIBUTES', 'ASTP', 'ASTP_ATTRIBUTES', 'AWBT', 'AWBT_ATTRIBUTES', 'AWND', 'AWND_ATTRIBUTES', 'CDSD', 'CDSD_ATTRIBUTES', 'CLDD', 'CLDD_ATTRIBUTES', 'DP01', 'DP01_ATTRIBUTES', 'DP10', 'DP10_ATTRIBUTES', 'DP1X', 'DP1X_ATTRIBUTES', 'DSND', 'DSND_ATTRIBUTES', 'DSNW', 'DSNW_ATTRIBUTES', 'DT00', 'DT00_ATTRIBUTES', 'DT32', 'DT32_ATTRIBUTES', 'DX32', 'DX32_ATTRIBUTES', 'DX70', 'DX70_ATTRIBUTES', 'DX90', 'DX90_ATTRIBUTES', 'DYFG', 'DYFG_ATTRIBUTES', 'DYHF', 'DYHF_ATTRIBUTES', 'DYNT', 'DYNT_ATTRIBUTES', 'DYSD', 'DYSD_ATTRIBUTES', 'DYSN', 'DYSN_ATTRIBUTES', 'DYTS', 'DYTS_ATTRIBUTES', 'DYXP', 'DYXP_ATTRIBUTES', 'DYXT', 'DYXT_ATTRIBUTES', 'EMNT', 'EMNT_ATTRIBUTES', 'EMSD', 'EMSD_ATTRIBUTES', 'EMSN', 'EMSN_ATTRIBUTES', 'EMXP', 'EMXP_ATTRIBUTES', 'EMXT', 'EMXT_ATTRIBUTES', 'HDSD', 'HDSD_ATTRIBUTES', 'HTDD', 'HTDD_ATTRIBUTES', 'PRCP', 'PRCP_ATTRIBUTES', 'PSUN', 'PSUN_ATTRIBUTES', 'RHAV', 'RHAV_ATTRIBUTES', 'RHMN', 'RHMN_ATTRIBUTES', 'RHMX', 'RHMX_ATTRIBUTES', 'SNOW', 'SNOW_ATTRIBUTES', 'TAVG', 'TAVG_ATTRIBUTES', 'TMAX', 'TMAX_ATTRIBUTES', 'TMIN', 'TMIN_ATTRIBUTES', 'TSUN', 'TSUN_ATTRIBUTES', 'WDF1', 'WDF1_ATTRIBUTES', 'WDF2', 'WDF2_ATTRIBUTES', 'WDF5', 'WDF5_ATTRIBUTES', 'WDFG', 'WDFG_ATTRIBUTES', 'WDFM', 'WDFM_ATTRIBUTES', 'WSF1', 'WSF1_ATTRIBUTES', 'WSF2', 'WSF2_ATTRIBUTES', 'WSF5', 'WSF5_ATTRIBUTES', 'WSFG', 'WSFG_ATTRIBUTES', 'WSFM', 'WSFM_ATTRIBUTES']

Now we see all the columns. Let us select, say STATION, DATE and TAVG (average temperature)

df[['STATION', 'DATE', 'TAVG']]

	STATION	DATE	TAVG
0	USW00013874	1930-01	6.00
1	USW00013874	1930-02	11.03
2	USW00013874	1930-03	9.60
3	USW00013874	1930-04	17.48
4	USW00013874	1930-05	21.36
...	...	...	...
1132	USW00013874	2024-05	23.52
1133	USW00013874	2024-06	27.48
1134	USW00013874	2024-07	27.97
1135	USW00013874	2024-08	27.86
1136	USW00013874	2024-09	24.59

1137 rows × 3 columns

Let us select a particular index value. Say 100th entry.

df.iloc[100]

STATION            USW00013874
DATE                   1938-05
LATITUDE              33.62972
LONGITUDE            -84.44224
ELEVATION                308.2
                      ...     
WSF5_ATTRIBUTES            NaN
WSFG                       NaN
WSFG_ATTRIBUTES            NaN
WSFM                       NaN
WSFM_ATTRIBUTES            NaN
Name: 100, Length: 110, dtype: object

The 100th entry seems to be May 1938. We also notice a number of NaN values.

Note

NaN stands for “Not a Number” and is a standard floating-point representation of missing data in Pandas. It is used to denote missing or null values in a DataFrame. Understanding and handling NaN values is crucial for data cleaning and preprocessing.

Note NaN doesn’t necessarily mean that you can set it to 0!

We can use .isna() method to check whether a specific row has a NaN value.

nan_check_row = df.iloc[100].isna()
print(nan_check_row)

STATION            False
DATE               False
LATITUDE           False
LONGITUDE          False
ELEVATION          False
                   ...  
WSF5_ATTRIBUTES     True
WSFG                True
WSFG_ATTRIBUTES     True
WSFM                True
WSFM_ATTRIBUTES     True
Name: 100, Length: 110, dtype: bool

We can get a summary of NaN values to understand the columns.

nan_summary = df.isna().sum()
print(nan_summary)

STATION               0
DATE                  0
LATITUDE              0
LONGITUDE             0
ELEVATION             0
                   ... 
WSF5_ATTRIBUTES     788
WSFG                829
WSFG_ATTRIBUTES     829
WSFM               1000
WSFM_ATTRIBUTES    1000
Length: 110, dtype: int64

How can we drop NaN values?

df_cleaned = df.dropna()
df_cleaned

	STATION	DATE	LATITUDE	LONGITUDE	ELEVATION	NAME	ADPT	ADPT_ATTRIBUTES	ASLP	ASLP_ATTRIBUTES	...	WSF1	WSF1_ATTRIBUTES	WSF2	WSF2_ATTRIBUTES	WSF5	WSF5_ATTRIBUTES	WSFG	WSFG_ATTRIBUTES	WSFM	WSFM_ATTRIBUTES

0 rows × 110 columns

What happened? Why did the dataset has 0 rows? Using dropna() drops all rows where any of the columns has a missing value. It seems that all of the rows have at least one column with a missing value!

How about we drop only if a specific column has a missing value? First, we need to subset the data using a subset parameter. It allows us to specify which column(s) to check for missing values, in this case TAVG

df_cleaned = df.dropna(subset=['TAVG'])
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 977.2+ KB

Now we can see that only one observation is dropped and we have 1136 rows.

We can fill the NaN or substitute NaN with a specific value. We need to be very careful with this operation as missing or NaN is not the same as a zero value. For example, TAVG is missing means that the average is not computed because the station may have some technical difficulties or may be in maintenance and the temperature is not measured. It doesn’t mean it is 0!.

But in case, we need to do it, we can use .fillna() method

df_filled = df.fillna(0)

by the way, we created new datasets df_cleaned and df_filled so that the original dataset df is not changed. It is important to keep track of the datasets and columns.

How about if we want to get more than one column? Note, we are not saving the results.

df[['STATION', 'DATE', 'TAVG']].iloc[100]

STATION    USW00013874
DATE           1938-05
TAVG              21.8
Name: 100, dtype: object

We can also filter rows by a condition.

filtered_df = df[df['TAVG'] > 20]
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 459 entries, 4 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 398.0+ KB

We can see that the number of rows have come down from 1137 to 458.

That’s because we only kept observations or rows where the average temperature, TAVG exceeded 20F

If we increase the threshold to 35 degree FH, we will see that there are fewer rows that satisfy the condition, only 226 out of the 1137 that we started.

filtered_df = df[df['TAVG'] > 25]
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 226 entries, 6 to 1135
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 196.0+ KB

How about selecting a few columns from the dataset and create a smaller dataset? Let’s say, DATE and TVG?

df_columns = df[['DATE', 'TAVG']]
print(df_columns)

         DATE   TAVG
0     1930-01   6.00
1     1930-02  11.03
2     1930-03   9.60
3     1930-04  17.48
4     1930-05  21.36
...       ...    ...
1132  2024-05  23.52
1133  2024-06  27.48
1134  2024-07  27.97
1135  2024-08  27.86
1136  2024-09  24.59

[1137 rows x 2 columns]

Let’s break it down step by step:

The code starts by defining a DataFrame called df_columns.
The data assigned to df_columns is obtained from a pandas DataFrame called df.
The double square brackets [[]] are used to select specific columns from the DataFrame df.
Inside the double square brackets, we can specify the names of the columns we want to select. In this case, the columns with the names ‘DATE’ and ‘TAVG’ are selected. We separate the columns by a comma.
The selected columns are then assigned to the dataset df_columns.

Let us look at a few more useful functions

df_sorted = df_columns.sort_values(by='DATE')
df_sorted.head()

	DATE	TAVG
0	1930-01	6.00
1	1930-02	11.03
2	1930-03	9.60
3	1930-04	17.48
4	1930-05	21.36

The dataset is already sorted by DATE so we don’t see any change.

What if we want to sort in descending order, that is, have the latest date first?

df_sorted = df_columns.sort_values(by='DATE', ascending=False)
df_sorted.head()

	DATE	TAVG
1136	2024-09	24.59
1135	2024-08	27.86
1134	2024-07	27.97
1133	2024-06	27.48
1132	2024-05	23.52

We can see that monthly average for September of 2024 is not computed, but we have August 2024 data.

Let us look at the top temperatures for the ATL airport during 1930-2024. We can sort by TAVG and use the head function as the data is already sorted.

df_sorted = df_columns.sort_values(by='TAVG', ascending=False)
df_sorted.head()

	DATE	TAVG
931	2007-08	29.78
762	1993-07	29.63
606	1980-07	29.49
786	1995-07	29.06
678	1986-07	28.93

We can see that July seems to be the month where 4 of the 5 average hottest temperatures are recorded at the ATL airport. The highest average monthly temperature seems to be 29.78°C or approximately 85.60°F. It does feel much hotter in August!

We can also get the largest values and smallest values without sorting

df_columns['TAVG'].nlargest(5)

931    29.78
762    29.63
606    29.49
786    29.06
678    28.93
Name: TAVG, dtype: float64

The nlargest method is used to retrieve the top n largest values from the Series. In this case, n is set to 5, meaning the method will return the five largest values from the ‘TAVG’ column. The result is another Series containing these top values, sorted in descending order.

How about the lowest temperatures? Also, it would be good to have the DATE also as we don’t when the highest or lowest temperatures occurred.

df[['DATE', 'TAVG']].nsmallest(5, 'TAVG')

	DATE	TAVG
564	1977-01	-1.51
120	1940-01	-1.31
576	1978-01	0.90
407	1963-12	1.92
480	1970-01	2.18

df[[‘DATE’, ‘TAVG’]]: This part of the code selects two columns from the DataFrame df, namely ‘DATE’ and ‘TAVG’. The double square brackets [[‘DATE’, ‘TAVG’]] are used to create a new DataFrame that contains only these two columns. This is useful when you want to focus on specific columns for further operations.
.nsmallest(5, ‘TAVG’): The nsmallest method is then called on the newly created DataFrame. This method is used to return the first n rows ordered by columns in ascending order. In this case, n is set to 5, meaning the method will return the 5 rows with the smallest values in the ‘TAVG’ column. The second argument, ‘TAVG’, specifies the column to sort by.

It looks like at the ATL airport, rarely does the average monthly temperature dip below zero Celsius!

We have used [[]] keep some columns before. Let us add one more column to the list.

df_columns = df[['DATE', 'TAVG', 'ADPT']]
df_columns

	DATE	TAVG	ADPT
0	1930-01	6.00	NaN
1	1930-02	11.03	NaN
2	1930-03	9.60	NaN
3	1930-04	17.48	NaN
4	1930-05	21.36	NaN
...	...	...	...
1132	2024-05	23.52	16.03
1133	2024-06	27.48	17.49
1134	2024-07	27.97	21.32
1135	2024-08	27.86	NaN
1136	2024-09	24.59	NaN

1137 rows × 3 columns

Let’s look at code to drop some columns.

df_columns = df_columns.drop(columns = ['ADPT'])
df_columns.columns

Index(['DATE', 'TAVG'], dtype='object')

We see that instead of 3 columns, we have 2 columns and ADPT is no longer in the column list. By the way we used .columns to get the list of columns in the dataset

df_columns

	DATE	TAVG
0	1930-01	6.00
1	1930-02	11.03
2	1930-03	9.60
3	1930-04	17.48
4	1930-05	21.36
...	...	...
1132	2024-05	23.52
1133	2024-06	27.48
1134	2024-07	27.97
1135	2024-08	27.86
1136	2024-09	24.59

1137 rows × 2 columns

Sometimes, we may need to rename the column names.

The code typically looks like this:

df = df.rename(columns={'old_column_name': 'new_column_name'})

For our dataset, let us rename ‘NAME’ to ‘STATION_NAME’

df_columns = df[['DATE', 'TAVG', 'NAME']]
df_columns = df_columns.rename(columns={'NAME': 'STATION_NAME'})
df_columns.columns

Index(['DATE', 'TAVG', 'STATION_NAME'], dtype='object')

We can see the column name is changed. We can also see it by printing the dataset:

df_columns

	DATE	TAVG	STATION_NAME
0	1930-01	6.00	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1	1930-02	11.03	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
2	1930-03	9.60	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
3	1930-04	17.48	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
4	1930-05	21.36	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
...	...	...	...
1132	2024-05	23.52	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1133	2024-06	27.48	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1134	2024-07	27.97	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1135	2024-08	27.86	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1136	2024-09	24.59	ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...

1137 rows × 3 columns

Let us look at a general purpose .apply function.

3.1.1.1 Apply a function to a column

The code typically looks like this

df['new_column'] = df['column_name'].apply(function)

Let us say, we want to create a new variable where we take the average temperature in Celsius and convert to °F.

We need to do two things

write a function to take input of temperature in Celsius and convert to °F.
apply to every row of the data so that we have a new column.

import pandas as pd

def Celsius_to_Fahrenheit(celsius):
    return (celsius * 9/5) + 32

df_columns = df[['DATE', 'TAVG']]
df_columns = df_columns.rename(columns={'TAVG': 'TAVG_C'})

# Apply the celsius_to_fahrenheit function to each element in the 'TAVG' column
df_columns['TAVG_F'] = df_columns['TAVG_C'].apply(Celsius_to_Fahrenheit)

df_columns

	DATE	TAVG_C	TAVG_F
0	1930-01	6.00	42.800
1	1930-02	11.03	51.854
2	1930-03	9.60	49.280
3	1930-04	17.48	63.464
4	1930-05	21.36	70.448
...	...	...	...
1132	2024-05	23.52	74.336
1133	2024-06	27.48	81.464
1134	2024-07	27.97	82.346
1135	2024-08	27.86	82.148
1136	2024-09	24.59	76.262

1137 rows × 3 columns

Let us break down the code step by step

We imported the pandas with alias pd
The function Celsius_to_Fahrenheit is defined to convert a temperature from Celsius to Fahrenheit. It takes a single argument, Celsius, and returns the equivalent temperature in Fahrenheit using the formula (celsius * 9/5) + 32.
Next, the code selects two columns, ‘DATE’ and ‘TAVG’, from an existing DataFrame df and assigns this subset to a new DataFrame df_columns. The ‘TAVG’ column, which contains average temperature data in Celsius, is then renamed to ‘TAVG_C’ to make its contents more explicit.
The Celsius_to_Fahrenheit function is then applied to each element in the ‘TAVG_C’ column using the apply method. This method is highly efficient for element-wise operations in Pandas. The results of this conversion are stored in a new column, ‘TAVG_F’, within the df_columns DataFrame.
Finally, the modified df_columns DataFrame, now containing the original date and temperature in both Celsius and Fahrenheit, is returned or displayed, depending on the context in which this code is executed. This allows for easy comparison and analysis of temperature data in both units.

It is easier to show the code where we actually need it. So, we will close this chapter with writing the data back to a CSV file or an Excel file so that we have a copy of the file.

df.to_csv('ATL_Temp.csv', index = False)

The df.to_csv method is a function provided by the Pandas library that allows you to write the contents of a DataFrame to a CSV file.
The first argument, ‘ATL_Temp.csv’, specifies the name of the CSV file to which the DataFrame will be written. In this case, the file will be named ATL_Temp.csv. If the file does not already exist, it will be created. If it does exist, it will be overwritten. Note that it will be written in the current working directory.
The index=False parameter indicates that the row indices of the DataFrame should not be written to the CSV file. By default, the to_csv method includes the row indices in the output file. Setting index=False ensures that only the data columns are written, without the additional index column.

Similarly, we can write the data to an Excel file

df.to_excel('ATL_Temp.xlsx', index=False)

We started by reading a CSV or Excel file, and we ended this chapter by writing the data back as a CSV or Excel file.

3.1 Pandas DataFrame

3.1.1 CSV Files

3.1.1.1 Apply a function to a column

3.2 Useful References and Websites