
3  Pandas

We will give a brief overview of Pandas in this chapter. We will discuss many more of its features in later chapters as part of our analysis of the climate data.

Also check out the excellent book Python for Data Analysis by Wes McKinney, https://wesmckinney.com/book/, which gives more information on using Python for data analysis. The book primarily uses NumPy and Pandas. We have learned a lot from reading it and we highly recommend it. It is available for free on Wes McKinney’s website! What could be better?

3.1 Pandas DataFrame

We will first learn about the basic data structure in pandas, the DataFrame. Then, we will look at some of the operations that we can perform on a DataFrame to manipulate the data.

In pandas, a DataFrame is a two-dimensional tabular data structure that is similar to a table in a relational database or a spreadsheet. It consists of rows and columns, where each column can have a different data type (e.g., numeric, string, boolean) and each row represents a separate observation or record.

Below is an example of tabular data with two columns or variables: City and Temperature.

City       Temperature (°F)
New York   70
London     60
Paris      75
Tokyo      68
Sydney     72

DataFrames are a powerful tool for data manipulation, analysis, and visualization. They provide a convenient way to organize, manipulate, and analyze structured data. Some key features of DataFrames in pandas include:

  • Flexible indexing: DataFrames have row and column labels, which allow for easy indexing and selection of specific data points or subsets of the data. In this case, the columns are “City” and “Temperature”. We will see how to use indexing later in the book.

  • Data alignment: DataFrames automatically align data based on their row and column labels, making it easy to perform operations on multiple columns or rows simultaneously. This is very important for our visualization exercises later in the book.

  • Data cleaning and transformation: DataFrames provide a wide range of functions and methods for cleaning and transforming data, such as removing missing values, handling duplicates, and applying mathematical or statistical operations to the data.

  • Data visualization: DataFrames can be easily visualized using various plotting libraries, such as Matplotlib or Plotly. They provide built-in functions for creating different types of plots, including line plots, bar plots, scatter plots, and more. We will use many of them later in the book.

To create a DataFrame in pandas, we can pass various types of data structures, such as lists, dictionaries, or NumPy arrays, to the pd.DataFrame() constructor. We can also read data from external sources, such as CSV files or databases, using functions like pd.read_csv().

Once a DataFrame is created, you can perform various operations on it, such as selecting specific columns or rows, filtering data based on conditions, aggregating data, merging or joining multiple DataFrames, and more.
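
As a quick preview, here is a minimal sketch of what those operations look like, using made-up city data; we will revisit each of them with the real climate data in this and later chapters.

import pandas as pd

df = pd.DataFrame({'City': ['New York', 'London', 'Paris'],
                   'Temperature': [70, 60, 75]})

# Select one column (this returns a Series)
temps = df['Temperature']

# Filter rows based on a condition
warm = df[df['Temperature'] > 65]

# Aggregate a column
avg_temp = df['Temperature'].mean()

# Merge with another DataFrame on a shared key column
countries = pd.DataFrame({'City': ['New York', 'London', 'Paris'],
                          'Country': ['USA', 'UK', 'France']})
merged = df.merge(countries, on='City')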

Here’s an example of creating a simple DataFrame in pandas:

import pandas as pd

data = {'City': ['New York', 'London', 'Paris'],
        'Temperature': [70, 60, 75]}

df = pd.DataFrame(data)
print(df)
       City  Temperature
0  New York           70
1    London           60
2     Paris           75

What did we do in this code?

  1. We import pandas with the alias pd.

  2. We define a dictionary called data. The dictionary has two keys: ‘City’ and ‘Temperature’. The corresponding values are lists of strings and integers, respectively. This data represents the cities and their (average) temperatures. But when is the temperature recorded? We need more variables or columns; see the sketch below.

  3. This line creates a DataFrame object using the pd.DataFrame() function from the pandas library. The DataFrame is a two-dimensional table-like data structure that can hold data in rows and columns. We pass the data dictionary as an argument to create the DataFrame. The keys of the dictionary become the column names, and the values become the data in the columns.

We can see from the output that

  1. the DataFrame has two columns: City and Temperature,
  2. there are three records or observations, and
  3. the index starts at 0.
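
As noted above, we do not know when these temperatures were recorded. A minimal sketch of a fix (the month values here are made up for illustration) is to add another column:

data = {'City': ['New York', 'London', 'Paris'],
        'Month': ['July', 'July', 'July'],
        'Temperature': [70, 60, 75]}

df_with_month = pd.DataFrame(data)
print(df_with_month)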

3.1.1 CSV Files

Most of the climate-related data that we use will be in a tabular format, often in CSV format. CSV (comma-separated values) is a common non-proprietary format for storing tabular data as plain text.

Let us read a real dataset in CSV format and show some operations that we can perform on the DataFrame.

We will revisit all these operations in later chapters in more depth.

Since we are in Atlanta, we will start with the climate data at the Hartsfield-Jackson Atlanta International Airport. It is the busiest airport in the world and you may have connected through Atlanta before!

We will not go through all the details of the dataset here as the main goal is to just show how we can access a CSV file over the internet.

Warning

By the way, if you see an error below, it is because the National Centers for Environmental Information (NCEI), which hosts the data, was hit by Hurricane Helene and the data center is down. It is ironic, but it highlights the reality of climate change and extreme weather. NCEI Asheville Outage


import pandas as pd
df = pd.read_csv("https://www.ncei.noaa.gov/data/global-summary-of-the-month/access/USW00013874.csv")

The first line imports the Pandas library and assigns it the alias pd. Remember, as we mentioned in the installation chapter, we have all the tools in the pandas library available to us now, including reading a CSV file.

The second line reads a CSV file from a given URL and loads it into a Pandas DataFrame.

The pd.read_csv function is a versatile method for reading CSV files, whether they are stored locally or accessible via a URL. In this case, the URL points to a dataset hosted by the National Centers for Environmental Information (NCEI), which contains Global Summary of the Month data for the ATL airport.

The result of this operation is stored in the variable df, which is a Pandas DataFrame. A DataFrame is a two-dimensional tabular data structure similar to a table in a database or an Excel spreadsheet.
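
Given the warning above about the NCEI outage, it can be handy to keep a local copy of the CSV and fall back to it when the server is unreachable. Here is a minimal sketch (the local filename is our own choice, not part of the dataset):

import pandas as pd

url = "https://www.ncei.noaa.gov/data/global-summary-of-the-month/access/USW00013874.csv"
local_copy = "USW00013874.csv"  # hypothetical cache filename

try:
    df = pd.read_csv(url)
    df.to_csv(local_copy, index=False)  # save a copy for next time
except Exception:
    # fall back to the cached copy if the server is unreachable
    df = pd.read_csv(local_copy)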

Let us look at what is in this CSV file or dataset!

By the way, we chose this particular CSV file because it is relatively clean and easy to handle. We will see that some of the other datasets we handle in this book are messier 🥺

It is always a good idea to get a sense of the dataset before doing any operations on it.

We can display the first few rows of the dataset with the following code

df.head()
STATION DATE LATITUDE LONGITUDE ELEVATION NAME ADPT ADPT_ATTRIBUTES ASLP ASLP_ATTRIBUTES ... WSF1 WSF1_ATTRIBUTES WSF2 WSF2_ATTRIBUTES WSF5 WSF5_ATTRIBUTES WSFG WSFG_ATTRIBUTES WSFM WSFM_ATTRIBUTES
0 USW00013874 1930-01 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 USW00013874 1930-02 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 USW00013874 1930-03 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 USW00013874 1930-04 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 USW00013874 1930-05 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 110 columns

The code displays the first 5 rows of the DataFrame by default. You can specify the number of rows to display by passing an integer as an argument.

df.head(10)
STATION DATE LATITUDE LONGITUDE ELEVATION NAME ADPT ADPT_ATTRIBUTES ASLP ASLP_ATTRIBUTES ... WSF1 WSF1_ATTRIBUTES WSF2 WSF2_ATTRIBUTES WSF5 WSF5_ATTRIBUTES WSFG WSFG_ATTRIBUTES WSFM WSFM_ATTRIBUTES
0 USW00013874 1930-01 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 USW00013874 1930-02 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 USW00013874 1930-03 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 USW00013874 1930-04 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 USW00013874 1930-05 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 USW00013874 1930-06 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 USW00013874 1930-07 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 USW00013874 1930-08 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 USW00013874 1930-09 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 USW00013874 1930-10 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10 rows × 110 columns

Similarly, df.tail() displays the last 5 rows of the DataFrame by default, or we can specify the number of rows to display by passing an integer as an argument (e.g., df.tail(10)).

df.tail()
STATION DATE LATITUDE LONGITUDE ELEVATION NAME ADPT ADPT_ATTRIBUTES ASLP ASLP_ATTRIBUTES ... WSF1 WSF1_ATTRIBUTES WSF2 WSF2_ATTRIBUTES WSF5 WSF5_ATTRIBUTES WSFG WSFG_ATTRIBUTES WSFM WSFM_ATTRIBUTES
1132 USW00013874 2024-05 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... 16.03 ,,,W 1013.5 ,,,W ... NaN NaN 13.9 ,W 19.7 ,W NaN NaN NaN NaN
1133 USW00013874 2024-06 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... 17.49 ,,,W 1015.0 ,,,W ... NaN NaN 11.6 ,W 20.6 ,W NaN NaN NaN NaN
1134 USW00013874 2024-07 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... 21.32 ,,,W 1017.1 ,,,W ... NaN NaN 15.7 ,W 20.1 ,W NaN NaN NaN NaN
1135 USW00013874 2024-08 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN 14.8 ,W 17.9 ,W NaN NaN NaN NaN
1136 USW00013874 2024-09 33.62972 -84.44224 308.2 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO... NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 110 columns

We can get a summary of the dataset, including the number of non-null entries, data types of columns, and memory usage.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 977.2+ KB

We can see that the DataFrame has 1137 rows and 110 columns. We can also get a sense of the types of data: 47 columns have a float64 datatype, 11 have int64, and the remaining 52 have an object datatype (which is typically used for string or mixed data types).

The approximate amount of memory used to store this DataFrame is relatively small, about 977.2 KB (the + indicates a lower bound, since the memory used by object columns is not fully counted by default).
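
We can verify the dtype counts reported by info() directly (a small sketch):

print(df.dtypes.value_counts())
# Expected output, matching info() above:
# object     52
# float64    47
# int64      11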

We can generate descriptive statistics for numerical columns, including count, mean, standard deviation, min, max, and quartiles.

df.describe()
LATITUDE LONGITUDE ELEVATION ADPT ASLP ASTP AWBT AWND CDSD CLDD ... WDF1 WDF2 WDF5 WDFG WDFM WSF1 WSF2 WSF5 WSFG WSFM
count 1.137000e+03 1.137000e+03 1137.0 220.000000 220.000000 220.000000 220.000000 488.000000 1137.000000 1137.000000 ... 230.000000 349.000000 349.000000 308.000000 137.000000 230.000000 349.000000 349.000000 308.000000 137.000000
mean 3.362972e+01 -8.444224e+01 308.2 10.336455 1017.975000 981.115909 13.730045 3.679303 502.067986 86.153738 ... 233.956522 247.048711 244.670487 252.178571 237.153285 13.050435 14.214327 18.668481 18.864286 14.191241
std 7.108554e-15 1.421711e-14 0.0 7.391857 2.489038 2.051477 6.546238 0.612600 449.442361 99.706672 ... 96.960343 101.115334 97.401396 86.333026 92.045376 2.976102 2.313260 3.485740 4.151514 3.397847
min 3.362972e+01 -8.444224e+01 308.2 -6.650000 1011.600000 975.500000 -0.410000 2.400000 0.000000 0.000000 ... 20.000000 10.000000 10.000000 23.000000 45.000000 7.600000 8.900000 10.300000 11.300000 8.500000
25% 3.362972e+01 -8.444224e+01 308.2 3.957500 1016.200000 979.700000 7.695000 3.200000 16.300000 0.500000 ... 150.000000 190.000000 200.000000 225.000000 180.000000 11.200000 13.000000 16.100000 15.900000 12.100000
50% 3.362972e+01 -8.444224e+01 308.2 9.845000 1017.450000 980.900000 13.710000 3.700000 444.000000 28.300000 ... 280.000000 290.000000 290.000000 270.000000 270.000000 12.750000 13.900000 18.300000 18.500000 13.400000
75% 3.362972e+01 -8.444224e+01 308.2 17.385000 1019.600000 982.300000 20.105000 4.100000 940.000000 177.100000 ... 310.000000 320.000000 320.000000 315.000000 315.000000 14.300000 15.700000 20.600000 20.600000 15.200000
max 3.362972e+01 -8.444224e+01 308.2 21.650000 1025.500000 987.300000 23.440000 5.900000 1442.200000 354.800000 ... 360.000000 360.000000 360.000000 360.000000 360.000000 26.800000 21.900000 34.000000 34.400000 30.800000

8 rows × 58 columns

Note that describe() reports 58 columns, which matches the number of numeric columns (47 float64 plus 11 int64) in the DataFrame. If some of the data is of mixed type or is for some reason stored with an object datatype, it won’t be included in these descriptive statistics.
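
If we also want a summary of the non-numeric columns, we can ask describe() to include them (a minimal sketch):

# Summarize only the object (string) columns: count, unique, top, freq
df.describe(include='object')

# Or pick columns by dtype explicitly before describing them
df.select_dtypes(include='float64').describe()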

Let us access one column from this dataset.

df['ELEVATION']
0       308.2
1       308.2
2       308.2
3       308.2
4       308.2
        ...  
1132    308.2
1133    308.2
1134    308.2
1135    308.2
1136    308.2
Name: ELEVATION, Length: 1137, dtype: float64

By the way, what is 308.2? What are the units? From the Wikipedia article on the ATL airport, it appears to be the height above mean sea level in meters.

df['ELEVATION'].describe()
count    1137.0
mean      308.2
std         0.0
min       308.2
25%       308.2
50%       308.2
75%       308.2
max       308.2
Name: ELEVATION, dtype: float64

Not surprisingly, the elevation of the ATL airport is always the same, 308.2 meters. Hence, the standard deviation is 0.

What if we want to access multiple columns?

columns = df.columns
print(columns)
Index(['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'NAME', 'ADPT',
       'ADPT_ATTRIBUTES', 'ASLP', 'ASLP_ATTRIBUTES',
       ...
       'WSF1', 'WSF1_ATTRIBUTES', 'WSF2', 'WSF2_ATTRIBUTES', 'WSF5',
       'WSF5_ATTRIBUTES', 'WSFG', 'WSFG_ATTRIBUTES', 'WSFM',
       'WSFM_ATTRIBUTES'],
      dtype='object', length=110)

We only see a few of the columns in the printed output, because there are 110 of them and pandas truncates long indexes. Let us try:

columns_list = list(columns)
print(columns_list)
['STATION', 'DATE', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'NAME', 'ADPT', 'ADPT_ATTRIBUTES', 'ASLP', 'ASLP_ATTRIBUTES', 'ASTP', 'ASTP_ATTRIBUTES', 'AWBT', 'AWBT_ATTRIBUTES', 'AWND', 'AWND_ATTRIBUTES', 'CDSD', 'CDSD_ATTRIBUTES', 'CLDD', 'CLDD_ATTRIBUTES', 'DP01', 'DP01_ATTRIBUTES', 'DP10', 'DP10_ATTRIBUTES', 'DP1X', 'DP1X_ATTRIBUTES', 'DSND', 'DSND_ATTRIBUTES', 'DSNW', 'DSNW_ATTRIBUTES', 'DT00', 'DT00_ATTRIBUTES', 'DT32', 'DT32_ATTRIBUTES', 'DX32', 'DX32_ATTRIBUTES', 'DX70', 'DX70_ATTRIBUTES', 'DX90', 'DX90_ATTRIBUTES', 'DYFG', 'DYFG_ATTRIBUTES', 'DYHF', 'DYHF_ATTRIBUTES', 'DYNT', 'DYNT_ATTRIBUTES', 'DYSD', 'DYSD_ATTRIBUTES', 'DYSN', 'DYSN_ATTRIBUTES', 'DYTS', 'DYTS_ATTRIBUTES', 'DYXP', 'DYXP_ATTRIBUTES', 'DYXT', 'DYXT_ATTRIBUTES', 'EMNT', 'EMNT_ATTRIBUTES', 'EMSD', 'EMSD_ATTRIBUTES', 'EMSN', 'EMSN_ATTRIBUTES', 'EMXP', 'EMXP_ATTRIBUTES', 'EMXT', 'EMXT_ATTRIBUTES', 'HDSD', 'HDSD_ATTRIBUTES', 'HTDD', 'HTDD_ATTRIBUTES', 'PRCP', 'PRCP_ATTRIBUTES', 'PSUN', 'PSUN_ATTRIBUTES', 'RHAV', 'RHAV_ATTRIBUTES', 'RHMN', 'RHMN_ATTRIBUTES', 'RHMX', 'RHMX_ATTRIBUTES', 'SNOW', 'SNOW_ATTRIBUTES', 'TAVG', 'TAVG_ATTRIBUTES', 'TMAX', 'TMAX_ATTRIBUTES', 'TMIN', 'TMIN_ATTRIBUTES', 'TSUN', 'TSUN_ATTRIBUTES', 'WDF1', 'WDF1_ATTRIBUTES', 'WDF2', 'WDF2_ATTRIBUTES', 'WDF5', 'WDF5_ATTRIBUTES', 'WDFG', 'WDFG_ATTRIBUTES', 'WDFM', 'WDFM_ATTRIBUTES', 'WSF1', 'WSF1_ATTRIBUTES', 'WSF2', 'WSF2_ATTRIBUTES', 'WSF5', 'WSF5_ATTRIBUTES', 'WSFG', 'WSFG_ATTRIBUTES', 'WSFM', 'WSFM_ATTRIBUTES']

Now we see all the columns. Let us select, say, STATION, DATE, and TAVG (average temperature):

df[['STATION', 'DATE', 'TAVG']]
STATION DATE TAVG
0 USW00013874 1930-01 6.00
1 USW00013874 1930-02 11.03
2 USW00013874 1930-03 9.60
3 USW00013874 1930-04 17.48
4 USW00013874 1930-05 21.36
... ... ... ...
1132 USW00013874 2024-05 23.52
1133 USW00013874 2024-06 27.48
1134 USW00013874 2024-07 27.97
1135 USW00013874 2024-08 27.86
1136 USW00013874 2024-09 24.59

1137 rows × 3 columns

Let us select a particular index value, say the entry at index 100.

df.iloc[100]
STATION            USW00013874
DATE                   1938-05
LATITUDE              33.62972
LONGITUDE            -84.44224
ELEVATION                308.2
                      ...     
WSF5_ATTRIBUTES            NaN
WSFG                       NaN
WSFG_ATTRIBUTES            NaN
WSFM                       NaN
WSFM_ATTRIBUTES            NaN
Name: 100, Length: 110, dtype: object

The entry at index 100 seems to be May 1938. We also notice a number of NaN values.

Note

NaN stands for “Not a Number” and is a standard floating-point representation of missing data in Pandas. It is used to denote missing or null values in a DataFrame. Understanding and handling NaN values is crucial for data cleaning and preprocessing.

Note that NaN doesn’t necessarily mean that you can set it to 0!
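
Here is a tiny illustration of why, using a made-up Series rather than the climate data:

import pandas as pd

s = pd.Series([10.0, None, 30.0])   # None becomes NaN

print(s.mean())             # 20.0 -- NaN is skipped by default
print(s.fillna(0).mean())   # 13.33... -- treating NaN as 0 distorts the average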

We can use the .isna() method to check whether a specific row has NaN values.

nan_check_row = df.iloc[100].isna()
print(nan_check_row)
STATION            False
DATE               False
LATITUDE           False
LONGITUDE          False
ELEVATION          False
                   ...  
WSF5_ATTRIBUTES     True
WSFG                True
WSFG_ATTRIBUTES     True
WSFM                True
WSFM_ATTRIBUTES     True
Name: 100, Length: 110, dtype: bool

We can get a per-column count of NaN values to understand the columns better.

nan_summary = df.isna().sum()
print(nan_summary)
STATION               0
DATE                  0
LATITUDE              0
LONGITUDE             0
ELEVATION             0
                   ... 
WSF5_ATTRIBUTES     788
WSFG                829
WSFG_ATTRIBUTES     829
WSFM               1000
WSFM_ATTRIBUTES    1000
Length: 110, dtype: int64

How can we drop NaN values?

df_cleaned = df.dropna()
df_cleaned
STATION DATE LATITUDE LONGITUDE ELEVATION NAME ADPT ADPT_ATTRIBUTES ASLP ASLP_ATTRIBUTES ... WSF1 WSF1_ATTRIBUTES WSF2 WSF2_ATTRIBUTES WSF5 WSF5_ATTRIBUTES WSFG WSFG_ATTRIBUTES WSFM WSFM_ATTRIBUTES

0 rows × 110 columns

What happened? Why does the DataFrame have 0 rows? dropna() drops every row in which any column has a missing value. It seems that every row has at least one column with a missing value!
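
A middle ground (a sketch) is the thresh parameter of dropna(), which keeps a row only if it has at least a given number of non-null values:

# Keep rows with at least 50 non-null values out of the 110 columns
df_thresh = df.dropna(thresh=50)
df_thresh.info()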

How about dropping a row only if a specific column has a missing value? We can use the subset parameter, which allows us to specify which column(s) to check for missing values, in this case TAVG.

df_cleaned = df.dropna(subset=['TAVG'])
df_cleaned.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 977.2+ KB

The output shows that no rows were dropped this time: TAVG has no missing values in this dataset, so we still have all 1137 rows.

We can fill, or substitute, NaN with a specific value. We need to be very careful with this operation, as a missing value is not the same as a zero value. For example, a missing TAVG means the average was not computed, perhaps because the station had technical difficulties or was under maintenance and the temperature was not measured. It doesn’t mean the temperature was 0!

But if we do need to do it, we can use the .fillna() method:

df_filled = df.fillna(0)

By the way, we created new DataFrames df_cleaned and df_filled so that the original DataFrame df is not changed. It is important to keep track of the datasets and columns.
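
If we do fill, it is usually safer to fill per column with a value that makes sense for that column. A sketch using ADPT, which the describe() output above showed has many missing values (whether the column mean is a sensible fill value depends on the analysis):

# Fill missing values only in ADPT, using the column mean instead of 0
df_filled_adpt = df.fillna({'ADPT': df['ADPT'].mean()})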

What if we want to get more than one column for a particular row? Note that we are not saving the result here.

df[['STATION', 'DATE', 'TAVG']].iloc[100]
STATION    USW00013874
DATE           1938-05
TAVG              21.8
Name: 100, dtype: object

We can also filter rows by a condition.

filtered_df = df[df['TAVG'] > 20]
filtered_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 459 entries, 4 to 1136
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 398.0+ KB

We can see that the number of rows has come down from 1137 to 459.

That’s because we only kept observations, or rows, where the average temperature TAVG exceeded 20°C. Remember that the temperatures in this dataset are in Celsius, not Fahrenheit.

If we increase the threshold to 25°C, we will see that even fewer rows satisfy the condition, only 226 out of the 1137 we started with.

filtered_df = df[df['TAVG'] > 25]
filtered_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 226 entries, 6 to 1135
Columns: 110 entries, STATION to WSFM_ATTRIBUTES
dtypes: float64(47), int64(11), object(52)
memory usage: 196.0+ KB
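
We can also combine conditions with & (and) and | (or); each condition needs its own parentheses. A minimal sketch using PRCP, the monthly precipitation column:

# Months where the average temperature exceeded 25°C and some precipitation fell
hot_and_wet = df[(df['TAVG'] > 25) & (df['PRCP'] > 0)]
hot_and_wet.info()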

How about selecting a few columns from the dataset to create a smaller dataset? Let’s say, DATE and TAVG?

df_columns = df[['DATE', 'TAVG']]
print(df_columns)
         DATE   TAVG
0     1930-01   6.00
1     1930-02  11.03
2     1930-03   9.60
3     1930-04  17.48
4     1930-05  21.36
...       ...    ...
1132  2024-05  23.52
1133  2024-06  27.48
1134  2024-07  27.97
1135  2024-08  27.86
1136  2024-09  24.59

[1137 rows x 2 columns]

Let’s break it down step by step:

  1. The code creates a new DataFrame called df_columns.
  2. The data assigned to df_columns is obtained from the existing DataFrame df.
  3. The double square brackets [[]] are used to select specific columns from df (single vs. double brackets matter; see the sketch after this list).
  4. Inside the double square brackets, we specify the names of the columns we want to select, separated by commas. In this case, ‘DATE’ and ‘TAVG’ are selected.
  5. The selected columns are then assigned to the new DataFrame df_columns.
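
As mentioned in the list above, single brackets return a Series, while double brackets return a DataFrame. A small sketch:

print(type(df['TAVG']))     # <class 'pandas.core.series.Series'>
print(type(df[['TAVG']]))   # <class 'pandas.core.frame.DataFrame'>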

Let us look at a few more useful functions.

df_sorted = df_columns.sort_values(by='DATE')
df_sorted.head()
DATE TAVG
0 1930-01 6.00
1 1930-02 11.03
2 1930-03 9.60
3 1930-04 17.48
4 1930-05 21.36

The dataset is already sorted by DATE so we don’t see any change.

What if we want to sort in descending order, that is, have the latest date first?

df_sorted = df_columns.sort_values(by='DATE', ascending=False)
df_sorted.head()
DATE TAVG
1136 2024-09 24.59
1135 2024-08 27.86
1134 2024-07 27.97
1133 2024-06 27.48
1132 2024-05 23.52

We can see the most recent months first; the latest month in the dataset is September 2024.

Let us look at the top temperatures at the ATL airport during 1930-2024. We can sort by TAVG in descending order and then use the head function to see the largest values.

df_sorted = df_columns.sort_values(by='TAVG', ascending=False)
df_sorted.head()
DATE TAVG
931 2007-08 29.78
762 1993-07 29.63
606 1980-07 29.49
786 1995-07 29.06
678 1986-07 28.93

We can see that July accounts for 4 of the 5 hottest monthly averages recorded at the ATL airport. The highest monthly average, 29.78°C (approximately 85.6°F), was recorded in August 2007. It does feel much hotter in August!

We can also get the largest and smallest values without sorting the entire DataFrame.

df_columns['TAVG'].nlargest(5)
931    29.78
762    29.63
606    29.49
786    29.06
678    28.93
Name: TAVG, dtype: float64

The nlargest method is used to retrieve the top n largest values from the Series. In this case, n is set to 5, meaning the method will return the five largest values from the ‘TAVG’ column. The result is another Series containing these top values, sorted in descending order.

How about the lowest temperatures? Also, it would be good to include the DATE, as otherwise we don’t know when the highest or lowest temperatures occurred.

df[['DATE', 'TAVG']].nsmallest(5, 'TAVG')
DATE TAVG
564 1977-01 -1.51
120 1940-01 -1.31
576 1978-01 0.90
407 1963-12 1.92
480 1970-01 2.18
  • df[[‘DATE’, ‘TAVG’]]: This part of the code selects two columns from the DataFrame df, namely ‘DATE’ and ‘TAVG’. The double square brackets [[‘DATE’, ‘TAVG’]] are used to create a new DataFrame that contains only these two columns. This is useful when you want to focus on specific columns for further operations.

  • .nsmallest(5, ‘TAVG’): The nsmallest method is then called on the newly created DataFrame. This method is used to return the first n rows ordered by columns in ascending order. In this case, n is set to 5, meaning the method will return the 5 rows with the smallest values in the ‘TAVG’ column. The second argument, ‘TAVG’, specifies the column to sort by.

It looks like the average monthly temperature at the ATL airport rarely dips below zero Celsius!

We have used [[]] to keep some columns before. Let us add one more column to the list.

df_columns = df[['DATE', 'TAVG', 'ADPT']]
df_columns
DATE TAVG ADPT
0 1930-01 6.00 NaN
1 1930-02 11.03 NaN
2 1930-03 9.60 NaN
3 1930-04 17.48 NaN
4 1930-05 21.36 NaN
... ... ... ...
1132 2024-05 23.52 16.03
1133 2024-06 27.48 17.49
1134 2024-07 27.97 21.32
1135 2024-08 27.86 NaN
1136 2024-09 24.59 NaN

1137 rows × 3 columns

Let’s look at code to drop some columns.

df_columns = df_columns.drop(columns=['ADPT'])
df_columns.columns
Index(['DATE', 'TAVG'], dtype='object')

We see that instead of 3 columns, we have 2 columns and ADPT is no longer in the column list. By the way, we used .columns to get the list of columns in the DataFrame.

df_columns
DATE TAVG
0 1930-01 6.00
1 1930-02 11.03
2 1930-03 9.60
3 1930-04 17.48
4 1930-05 21.36
... ... ...
1132 2024-05 23.52
1133 2024-06 27.48
1134 2024-07 27.97
1135 2024-08 27.86
1136 2024-09 24.59

1137 rows × 2 columns
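
By the way, drop can remove several columns in one call by passing a list (a sketch, applied to the full df here; errors='ignore' skips any names that are not present):

df_smaller = df.drop(columns=['WSFG', 'WSFM'], errors='ignore')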

Sometimes, we may need to rename columns.

The code typically looks like this:

df = df.rename(columns={'old_column_name': 'new_column_name'})

For our dataset, let us rename ‘NAME’ to ‘STATION_NAME’.

df_columns = df[['DATE', 'TAVG', 'NAME']]
df_columns = df_columns.rename(columns={'NAME': 'STATION_NAME'})
df_columns.columns
Index(['DATE', 'TAVG', 'STATION_NAME'], dtype='object')

We can see the column name is changed. We can also see it by printing the dataset:

df_columns
DATE TAVG STATION_NAME
0 1930-01 6.00 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1 1930-02 11.03 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
2 1930-03 9.60 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
3 1930-04 17.48 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
4 1930-05 21.36 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
... ... ... ...
1132 2024-05 23.52 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1133 2024-06 27.48 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1134 2024-07 27.97 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1135 2024-08 27.86 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...
1136 2024-09 24.59 ATLANTA HARTSFIELD JACKSON INTERNATIONAL AIRPO...

1137 rows × 3 columns

Let us look at the general-purpose .apply function.

3.1.1.1 Apply a function to a column

The code typically looks like this:

df['new_column'] = df['column_name'].apply(function)

Let us say we want to create a new variable that converts the average temperature from Celsius to °F.

We need to do two things:

  1. write a function that takes a temperature in Celsius and converts it to °F.
  2. apply that function to every row of the data so that we have a new column.

import pandas as pd

def Celsius_to_Fahrenheit(celsius):
    return (celsius * 9/5) + 32

df_columns = df[['DATE', 'TAVG']]
df_columns = df_columns.rename(columns={'TAVG': 'TAVG_C'})

# Apply the Celsius_to_Fahrenheit function to each element in the 'TAVG_C' column
df_columns['TAVG_F'] = df_columns['TAVG_C'].apply(Celsius_to_Fahrenheit)

df_columns
DATE TAVG_C TAVG_F
0 1930-01 6.00 42.800
1 1930-02 11.03 51.854
2 1930-03 9.60 49.280
3 1930-04 17.48 63.464
4 1930-05 21.36 70.448
... ... ... ...
1132 2024-05 23.52 74.336
1133 2024-06 27.48 81.464
1134 2024-07 27.97 82.346
1135 2024-08 27.86 82.148
1136 2024-09 24.59 76.262

1137 rows × 3 columns

Let us break down the code step by step:

  1. We import pandas with the alias pd (it was already imported earlier; repeating it here keeps the snippet self-contained).
  2. The function Celsius_to_Fahrenheit is defined to convert a temperature from Celsius to Fahrenheit. It takes a single argument, celsius, and returns the equivalent temperature in Fahrenheit using the formula (celsius * 9/5) + 32.
  3. Next, the code selects two columns, ‘DATE’ and ‘TAVG’, from the existing DataFrame df and assigns this subset to a new DataFrame df_columns. The ‘TAVG’ column, which contains average temperature data in Celsius, is then renamed to ‘TAVG_C’ to make its contents more explicit.
  4. The Celsius_to_Fahrenheit function is then applied to each element of the ‘TAVG_C’ column using the apply method, and the results are stored in a new column, ‘TAVG_F’. Note that apply calls the Python function once per element; it is convenient for arbitrary functions, but for simple arithmetic a vectorized expression is faster (see the sketch after this list).
  5. Finally, the modified df_columns DataFrame, now containing the date and the temperature in both Celsius and Fahrenheit, is displayed, allowing easy comparison of the two units.
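
As mentioned in step 4, for simple arithmetic a vectorized expression avoids calling a Python function once per element. A sketch that produces the same TAVG_F column:

# Vectorized alternative: operate on the whole column at once
df_columns['TAVG_F'] = df_columns['TAVG_C'] * 9/5 + 32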

It is easier to show more of these operations where we actually need them. So, we will close this chapter by writing the data back to a CSV file or an Excel file so that we have a local copy.

df.to_csv('ATL_Temp.csv', index=False)
  • The df.to_csv method is a function provided by the Pandas library that allows you to write the contents of a DataFrame to a CSV file.
  • The first argument, ‘ATL_Temp.csv’, specifies the name of the CSV file to which the DataFrame will be written. In this case, the file will be named ATL_Temp.csv. If the file does not already exist, it will be created. If it does exist, it will be overwritten. Note that it will be written in the current working directory.
  • The index=False parameter indicates that the row indices of the DataFrame should not be written to the CSV file. By default, the to_csv method includes the row indices in the output file. Setting index=False ensures that only the data columns are written, without the additional index column.

Similarly, we can write the data to an Excel file. (Note that this requires an Excel engine such as openpyxl to be installed.)

df.to_excel('ATL_Temp.xlsx', index=False)

We started this chapter by reading a CSV file, and we ended it by writing the data back out as a CSV or Excel file.

3.2 Useful References and Websites

  1. Introduction to Pandas
  2. Pandas Documentation: https://pandas.pydata.org/docs/
  3. Python for Data Analysis: https://wesmckinney.com/book/