16 NOAA Climate Indicators: Droughts

In this part, we would like to understand and visualize the impact of climate events that can negatively impact human race.

We will start with droughts.

16.2 Droughts

We can get many climate indicators, including droughts, from National Center for Environmental Information at NOAA.

The drought and other climate indicators are available at the climate division level from the NCEI website.

We can see the directory structure as shown below

There are a number of files. But let us first read the drought documentation.

It appears from the readme file that

Period of record is 1895 through latest month available, updated monthly.
The major parameters in this file are sequential climatic divisional, statewide, regional and national monthly
Standardized Precipitation Index (SPI), and
Palmer Drought Indices (PDSI, PHDI, PMDI, and ZNDX).

The definitions are also listed in the readme file.

Palmer Drought Severity Index (PDSI)

This is a meteorological drought index used to assess the severity of dry or wet spells of weather.

This is the monthly value (index) that is generated indicating the severity of a wet or dry spell.

This index is based on the principles of a balance between moisture supply and demand. Man-made changes were not considered in this calculation.

The index generally ranges from -6 to +6, with negative values denoting dry spells and positive values indicating wet spells. There are a few values in the magnitude of +7 or -7.

PDSI values 0 to -.5 = normal;

-0.5 to -1.0 = incipient drought;

-1.0 to -2.0 = mild drought;

-2.0 to -3.0 = moderate drought;

-3.0 to -4.0 = severe drought;

and greater than -4.0 = extreme drought.

Similar adjectives are attached to positive values of wet spells.

Palmer Hydrological Drought Index (PHDI)

This is the monthly value (index) generated monthly that indicates the severity of a wet or dry spell.

This index is based on the principles of a balance between moisture supply and demand. Man-made changes such as increased irrigation, new reservoirs, and added industrial water use were not included in the computation of this index.

The index generally ranges from -6 to +6, with negative values denoting dry spells, and positive values indicating wet spells. There are a few values in the magnitude of +7 or -7.

PHDI values:

0 to -0.5 = normal

-0.5 to -1.0 = incipient drought

-1.0 to -2.0 = mild drought

-2.0 to -3.0 = moderate drought

-3.0 to -4.0 = severe drought

greater than -4.0 = extreme drought

Similar adjectives are attached to positive values of wet spells.

This is a hydrological drought index used to assess long-term moisture supply.

Palmer “Z” Index (ZNDX)

This is the generated monthly Z values, and they can be expressed as the “Moisture Anomaly Index.”
Each monthly Z value is a measure of the departure from normal of the moisture climate for that month.
This index can respond to a month of above-normal precipitation, even during periods of drought.
Table 1 contains expected values of the Z index and other drought parameters.
See Historical Climatology Series 3-6 through 3-9 for a detailed description of the drought indices.

Modified Palmer Drought Severity Index (PMDI)

This is a modification of the Palmer Drought Severity Index. The modification was made by the National Weather Service Climate Analysis Center for operational meteorological purposes. The Palmer drought program calculates three intermediate parallel index values each month. Only one value is selected as the PDSI drought index for the month. This selection is made internally by the program on the basis of probabilities.

If the probability that a drought is over is 100%, then one index is used.
If the probability that a wet spell is over is 100%, then another index is used.
If the probability is between 0% and 100%, the third index is assigned to the PDSI.

The modification (PMDI) incorporates a weighted average of the wet and dry index terms, using the probability as the weighting factor.¹.

The PMDI and PDSI will have the same value during an established drought or wet spell (i.e., when the probability is 100%), but they will have different values during transition periods.

Standardized Precipitation Index (SPxx)

This is a transformation of the probability of observing a given amount of precipitation in xx months.
A zero index value reflects the median of the distribution of precipitation.
A -3 indicates a very extreme dry spell.
A +3 indicates a very extreme wet spell.
The more the index value departs from zero, the drier or wetter an event lasting xx months is when compared to the long-term climatology of the location.
The index allows for comparison of precipitation observations at different locations with markedly different climates.
An index value at one location expresses the same relative departure from median conditions at one location as at another location.
It is calculated for different time scales since it is possible to experience dry conditions over one time scale while simultaneously experiencing wet conditions over a different time scale.

Table 1 Classes for Wet and Dry Periods, expected values of the z-index

Approximate Cumulative Frequency %	PHDI Range	Category	Z Range
> 96	> 4.00	Extreme wetness	> 3.50
90-95	3.00, 3.99	Severe wetness	2.50, 3.49
73-89	1.50, 2.99	Mild to moderate wetness	1.00, 2.49
28-72	-1.49, 1.49	Near normal	-1.24, 0.99
11-27	-1.50, -2.99	Mild to moderate drought	-1.25, -1.99
5-10	-3.00, -3.99	Severe drought	-2.00, -2.74
< 4	<-4.00	Extreme drought	<-2.75

Let us get

Palmer Drought Severity Index (PDSI)
Palmer Hydrological Drought Index (PHDI)
Palmer “Z” Index (ZNDX)

There are two files for each index, one at the state level and one at the division level.

Let us first get the state level files.

import pandas as pd

df_pdsi_st = pd.read_csv("https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-pdsist-v1.0.0-20241021")

df_pdsi_st.info

<bound method DataFrame.info of       0010051895   0.78  -0.59  -0.17  -0.54  -0.55   0.58   0.42   0.77  -0.78  -0.80  -1.41  -1.62   
0      0010051896  -1.65  -1.19  -1.23  -1.70  -2.15 ...                                               
1      0010051897  -3.49  -2.86  -1.58  -1.63  -2.22 ...                                               
2      0010051898  -4.16  -4.61  -5.13  -4.57  -5.37 ...                                               
3      0010051899   1.45   1.79  -0.02  -0.65  -1.62 ...                                               
4      0010051900  -3.12   1.12   0.97   1.83   1.06 ...                                               
...                                                  ...                                               
12474  3650052020   3.34   3.76   3.60   3.83  -0.30 ...                                               
12475  3650052021  -1.94  -1.87  -1.73  -1.65   0.23 ...                                               
12476  3650052022  -2.79  -3.28  -3.18  -2.97  -3.15 ...                                               
12477  3650052023  -1.92  -2.08  -1.54  -1.22  -1.26 ...                                               
12478  3650052024  -3.12  -3.05  -2.66  -2.18  -1.61 ...                                               

[12479 rows x 1 columns]>

As you notice, there is only variable and 11903 rows or observations. Pandas seem to take the first row of the data as headers.

So, we need to tell pandas read_csv that there is no header i.e. the first row also contains data.

import pandas as pd

df_pdsi_st = pd.read_csv("https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-pdsist-v1.0.0-20241021",
                        header = None)

df_pdsi_st.info

<bound method DataFrame.info of                                                        0
0      0010051895   0.78  -0.59  -0.17  -0.54  -0.55 ...
1      0010051896  -1.65  -1.19  -1.23  -1.70  -2.15 ...
2      0010051897  -3.49  -2.86  -1.58  -1.63  -2.22 ...
3      0010051898  -4.16  -4.61  -5.13  -4.57  -5.37 ...
4      0010051899   1.45   1.79  -0.02  -0.65  -1.62 ...
...                                                  ...
12475  3650052020   3.34   3.76   3.60   3.83  -0.30 ...
12476  3650052021  -1.94  -1.87  -1.73  -1.65   0.23 ...
12477  3650052022  -2.79  -3.28  -3.18  -2.97  -3.15 ...
12478  3650052023  -1.92  -2.08  -1.54  -1.22  -1.26 ...
12479  3650052024  -3.12  -3.05  -2.66  -2.18  -1.61 ...

[12480 rows x 1 columns]>

Now we see that there are 12480 rows (observations) and one column (variable).

We need to read the data into different variables. The readme file on the NOAA website lists the position of the variables. But as all the information is one variable, we need to separate or create new variables based on the position number.

An easier way maybe to use read_fwf that can read fixed width data files. There are two ways we can do, use default options and then rename the variables.

import pandas as pd

df_pdsi_st = pd.read_csv("https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-pdsist-v1.0.0-20241021")

df_pdsi_st = df_pdsi_st.rename(columns={0: "LocIndexYear", 
                                        1: "Jan",
                                        2: "Feb",
                                        3: "Mar",
                                        4: "Apr",
                                        5: "May",
                                        6: "Jun",
                                        7: "Jul",
                                        8: "Aug",
                                        9: "Sep",
                                        10: "Oct",
                                        11: "Nov",
                                        12: "Dec" }
                                        )

df_pdsi_st

	0010051895 0.78 -0.59 -0.17 -0.54 -0.55 0.58 0.42 0.77 -0.78 -0.80 -1.41 -1.62
0	0010051896 -1.65 -1.19 -1.23 -1.70 -2.15 ...
1	0010051897 -3.49 -2.86 -1.58 -1.63 -2.22 ...
2	0010051898 -4.16 -4.61 -5.13 -4.57 -5.37 ...
3	0010051899 1.45 1.79 -0.02 -0.65 -1.62 ...
4	0010051900 -3.12 1.12 0.97 1.83 1.06 ...
...	...
12474	3650052020 3.34 3.76 3.60 3.83 -0.30 ...
12475	3650052021 -1.94 -1.87 -1.73 -1.65 0.23 ...
12476	3650052022 -2.79 -3.28 -3.18 -2.97 -3.15 ...
12477	3650052023 -1.92 -2.08 -1.54 -1.22 -1.26 ...
12478	3650052024 -3.12 -3.05 -2.66 -2.18 -1.61 ...

12479 rows × 1 columns

This doesn’t seem to help as all the variables are read into one column or variable.

A better option would be to check the position numbers of each of the variables and read it directly.

Note, we need to be careful about the column numbers. two things to keep in mind 1. index value starts from zero 2. if we give a value of (0,3), the variable is read from positions 0,1 and 2.

import pandas as pd
pd.set_option('display.precision', 2)
pd.set_option('display.max_columns', None)

#let us build a list of tuples with the column / variable names
colspecs = [(0,3), (3,4), (4,6), (6,10), (10,17), (17,24), (24,31), (31,38), (38,45), (45,52), (52,59), (59,66), (66,73), (73,80), (80,87), (87,-1)]

names = ['state', 'division', 'element', 'year','jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']


df_pdsi_st = pd.read_fwf("https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-pdsist-v1.0.0-20241021",
colspecs=colspecs, names=names)

df_pdsi_st

	state	division	element	year	jan	feb	mar	apr	may	jun	jul	aug	sep	oct	nov	dec
0	1	0	5	1895	0.78	-0.59	-0.17	-0.54	-0.55	0.58	0.42	0.77	-0.78	-0.80	-1.41	-1.62
1	1	0	5	1896	-1.65	-1.19	-1.23	-1.70	-2.15	-1.58	-1.70	-2.29	-2.91	-2.88	-2.63	-3.39
2	1	0	5	1897	-3.49	-2.86	-1.58	-1.63	-2.22	-3.12	-3.49	-2.75	-3.65	-4.05	-4.41	-4.13
3	1	0	5	1898	-4.16	-4.61	-5.13	-4.57	-5.37	-5.43	-4.99	0.85	0.61	0.93	1.62	1.21
4	1	0	5	1899	1.45	1.79	-0.02	-0.65	-1.62	-2.25	-1.95	-2.23	-3.12	-3.07	-3.20	-2.92
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
12475	365	0	5	2020	3.34	3.76	3.60	3.83	-0.30	-0.23	-0.55	0.11	0.57	0.66	-0.86	-1.46
12476	365	0	5	2021	-1.94	-1.87	-1.73	-1.65	0.23	1.31	2.39	3.20	-0.28	-0.11	-1.09	-1.96
12477	365	0	5	2022	-2.79	-3.28	-3.18	-2.97	-3.15	-3.76	-4.19	-3.00	-3.72	-3.70	-3.39	-2.99
12478	365	0	5	2023	-1.92	-2.08	-1.54	-1.22	-1.26	-0.87	-0.99	-1.38	-2.08	-2.31	-2.97	-3.64
12479	365	0	5	2024	-3.12	-3.05	-2.66	-2.18	-1.61	-1.81	-1.59	-2.28	-1.41	-99.99	-99.99	-99.99

12480 rows × 16 columns

It is always a good idea to check the original file to see if all the variables are read properly.

let us look at the first two rows of the file

0010051895   0.78  -0.59  -0.17  -0.54  -0.55   0.58   0.42   0.77  -0.78  -0.80  -1.41  -1.62   
0010051896  -1.65  -1.19  -1.23  -1.70  -2.15  -1.58  -1.70  -2.29  -2.91  -2.88  -2.63  -3.39

It appears that the variables are read correctly for the first two rows.

Let us check the last two rows also,

3650052023  -1.92  -2.08  -1.54  -1.22  -1.26  -0.87  -0.99  -1.38  -2.08  -2.31  -2.97  -3.64   
3650052024  -3.12  -3.05  -2.66  -2.18  -1.61  -1.81  -1.59  -2.28  -1.41 -99.99 -99.99 -99.99

Again, it appears that the file is read correctly.

Also from the readme file at NOAA, this file is for PSDI. Let us make sure that this is accurate.

df_pdsi_st['element'].describe()

count    12480.0
mean         5.0
std          0.0
min          5.0
25%          5.0
50%          5.0
75%          5.0
max          5.0
Name: element, dtype: float64

It appears that all of the element values are 5, Palmer Drought Severity Index (PDSI). So, we have used the right data set and read the variables correctly.

Let us check the values of the other variables also. It is a good idea to check to check the data before we do any visualization or analysis.

df_pdsi_st.describe()

	state	division	element	year	jan	feb	mar	apr	may	jun	jul	aug	sep	oct	nov	dec
count	12480.00	12480.0	12480.0	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00
mean	111.05	0.0	5.0	1959.50	0.14	0.13	0.08	0.09	0.10	0.11	0.14	0.15	0.12	-0.58	-0.65	-0.65
std	102.87	0.0	0.0	37.53	2.50	2.44	2.42	2.37	2.39	2.49	2.63	2.63	2.53	9.10	9.10	9.10
min	1.00	0.0	5.0	1895.00	-8.03	-8.12	-8.37	-8.48	-9.01	-9.29	-9.83	-9.63	-8.58	-99.99	-99.99	-99.99
25%	24.75	0.0	5.0	1927.00	-1.61	-1.52	-1.57	-1.55	-1.58	-1.61	-1.67	-1.66	-1.59	-1.58	-1.67	-1.72
50%	74.50	0.0	5.0	1959.50	0.08	0.13	0.04	0.08	0.14	0.19	0.17	0.07	0.02	0.07	-0.05	0.01
75%	207.25	0.0	5.0	1992.00	1.97	1.91	1.83	1.80	1.81	1.92	2.07	2.07	1.91	1.98	1.96	1.98
max	365.00	0.0	5.0	2024.00	9.76	9.19	8.53	9.36	8.99	9.12	9.80	9.56	11.82	12.51	11.31	10.60

We see that division has a value of zero always and it makes sense as this is a state level dataset.

We see that the year ranges from 1895 to 2024.

The describe() method gives count, mean, std, min, 25%, 50%, 75% and max. These descriptive statistics give an idea of how the data is distributed.

What are these terms?

Note

mean: The mean is the average of a set of numbers. It is calculated by summing all the values and then dividing by the number of values.

25%: This refers to the 25th percentile, also known as the first quartile (Q1). It is the value below which 25% of the data falls.

50% or median: The median is the middle value of a dataset when it is ordered from least to greatest. If the dataset has an odd number of observations, the median is the middle number. If it has an even number, the median is the average of the two middle numbers. The 50th percentile is the same as the median.

75%: This refers to the 75th percentile, also known as the third quartile (Q3). It is the value below which 75% of the data falls.

std: This refers to the standard deviation, which measures the amount of variation or dispersion in a set of values. A low standard deviation means that the values tend to be close to the mean, while a high standard deviation means that the values are spread out over a wider range.

For the values of PDSI for some months, we see the minimum is -99.99 and this distorts the distribution and the standard deviation as it is an outlier

What does -99.99 mean. A look at the readmehttps://www.ncei.noaa.gov/pub/data/cirs/climdiv/drought-readme.txt indicates that Missing values in the latest year are indicated by -99.99

This is a common occurrence and that’s why it is always a good idea to check the documentation and the descriptive statistics of the data.

So we need to set -99.99 as missing value na.

Note, we are saving the new dataset using the same name, with only -99.99 as missing value nan.

import numpy as np
df_pdsi_st = df_pdsi_st.replace(-99.99,np.nan)

let us check the variable distribution now.

df_pdsi_st.describe()

	state	division	element	year	jan	feb	mar	apr	may	jun	jul	aug	sep	oct	nov	dec
count	12480.00	12480.0	12480.0	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12480.00	12384.00	12384.00	12384.00
mean	111.05	0.0	5.0	1959.50	0.14	0.13	0.08	0.09	0.10	0.11	0.14	0.15	0.12	0.19	0.12	0.13
std	102.87	0.0	0.0	37.53	2.50	2.44	2.42	2.37	2.39	2.49	2.63	2.63	2.53	2.50	2.51	2.53
min	1.00	0.0	5.0	1895.00	-8.03	-8.12	-8.37	-8.48	-9.01	-9.29	-9.83	-9.63	-8.58	-8.14	-7.62	-7.50
25%	24.75	0.0	5.0	1927.00	-1.61	-1.52	-1.57	-1.55	-1.58	-1.61	-1.67	-1.66	-1.59	-1.54	-1.62	-1.68
50%	74.50	0.0	5.0	1959.50	0.08	0.13	0.04	0.08	0.14	0.19	0.17	0.07	0.02	0.12	-0.03	0.04
75%	207.25	0.0	5.0	1992.00	1.97	1.91	1.83	1.80	1.81	1.92	2.07	2.07	1.91	1.99	1.98	2.00
max	365.00	0.0	5.0	2024.00	9.76	9.19	8.53	9.36	8.99	9.12	9.80	9.56	11.82	12.51	11.31	10.60

Now we can see that the minimum values for some months are not -99.99 and as expected the standard deviation also has come down.

We can check the entire distribution of the values using boxplot().

import plotly.express as px

#get national data
df = df_pdsi_st.loc[df_pdsi_st.state == 110]

fig = px.bar(df, x = 'year', y = 'jul')
fig.show()

Note: PDSI values are given as,

0 to -.5 = normal;
-0.5 to -1.0 = incipient drought;
-1.0 to -2.0 = mild drought;
-2.0to -3.0 = moderate drought;
-3.0 to -4.0 = severe drought; and
greater than - 4.0 = extreme drought.

Let us check the number of extreme drought incidents per each year nationally.

There is some more tidying up to do before we can analyze this dataset. Notice that the month values are in columns. It may be easier for our analysis if we can have these values in a new coulmn. This is called reshaping the data from wide to long format.

Let us do some tidying of the data in order to do analysis easily. we can convert the data from the values in columns to rows i.e.

# let us reshape the  data to Loc*Year*Month

df_long = pd.melt(df_pdsi_st,    
             id_vars=['state', 'division', 'element', 'year'],
             value_vars=['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'],
             var_name='month', 
             value_name='PDSI')

# let us drop cases where missing values are present,  for 2024
df_long = df_long.dropna()

What did the code do?

The code reshapes the DataFrame df_pdsi_st from a wide format to a long format using the pd.melt function.

The function pd.melt() is used to reshape the DataFrame df_pdsi_st from a wide format to a long format. pd.melt() is particularly useful when you want to consolidate multiple columns into a single column, which can make it easier to perform certain types of data analysis or visualization.

The id_vars parameter is set to [‘state’, ‘division’, ‘element’, ‘year’]. These columns will remain as they are in the resulting DataFrame and will not be unpivoted. They act as identifier variables that are repeated for each melted value.
The value_vars parameter is set to [‘jan’, ‘feb’, ‘mar’, ‘apr’, ‘may’, ‘jun’, ‘jul’, ‘aug’, ‘sep’, ‘oct’, ‘nov’, ‘dec’], which represents the months of the year. These columns contain the values that will be unpivoted into a single column.
The var_name parameter is set to ‘month’, which means that the names of the original columns specified in value_vars will be stored in a new column named ‘month’.
Finally, the value_name parameter is set to ‘PDSI’, which means that the values from the original columns specified in value_vars will be stored in a new column named ‘PDSI’.
This code effectively transforms the DataFrame from a format where each month is a separate column into a format where there is a single column for months and another column for the corresponding values. This long format is often more suitable for time series analysis and plotting.
We drop rows with missing values using the dropna method for the year 2024 (note we converted the -99 values to nan).

The original pdsi_st data set has 12480 rows and df_pdsi_st.shape[1] columns. The new data set, after transforming from wide to long, has 149472 rows and 6 columns which is 12480 rows and 16 columns.

After dropping the missing values, we have 149472 observations. It is always a good practice to check the data set information after each major manipulation to see that the size is what we expect.

Now that we have the data in the format that we want, we can analyze it.

Let us count the number of extreme drought or extreme precipitation over the years.

import numpy as np

pdsi_conditions = [
    (df_long['PDSI'] < -4.0),
    (df_long['PDSI'].between(-4.0, -3.0, 'left')),    
    (df_long['PDSI'].between(-3.0, -2.0, 'left')),    
    (df_long['PDSI'].between(-2.0, -1.0, 'left')),    
    (df_long['PDSI'].between(-1.0, -0.5, 'left')),    
    (df_long['PDSI'].between(-0.5, 0.5, 'neither')),
    (df_long['PDSI'].between(0.5, 1.0, 'right')),    
    (df_long['PDSI'].between(1.0, 2.0, 'right')),    
    (df_long['PDSI'].between(2.0, 3.0, 'right')),    
    (df_long['PDSI'].between(3.0, 4.0, 'right')),    
    (df_long['PDSI'] >= 4.0)
]

# create a list of the PDSI labels that we want to assign for each condition based on PDSI
pdsi_labels = [ 
    'extreme drought',
    'severe drought', 
    'moderate drought', 
    'mild drought', 
    'incipient drought', 
    'normal', 
    'incipient wetspell', 
    'mild wetspell', 
    'moderate wetspell', 
    'severe wetspell', 
    'extreme wetspell'
]

# create a new column and use np.select to assign values to it using our lists as arguments
df_long['pdsi_label'] = np.select(pdsi_conditions, pdsi_labels, default='unlabeled')

# display updated DataFrame
df_long.head()

	state	element	year	month	PDSI	pdsi_label
0	1	5	1895	jan	0.78	incipient wetspell
1	1	5	1896	jan	-1.65	mild drought
2	1	5	1897	jan	-3.49	severe drought
3	1	5	1898	jan	-4.16	extreme drought
4	1	5	1899	jan	1.45	mild wetspell

What did we do in this code?

First, a list named pdsi_conditions is created. This list contains a series of boolean conditions that check the values in the ‘PDSI’ column of the DataFrame df_long.
- Each condition corresponds to a specific range of PDSI values.
- For example, the first condition checks if the PDSI value is less than -4.0, indicating an extreme drought, while the last condition checks if the PDSI value is greater than or equal to 4.0, indicating an extreme wet spell.
- The between method is used to check if the PDSI values fall within specific ranges.For example,
  - df_long[‘PDSI’].between(-4.0, -3.0, ‘left’): This checks if the PDSI value is between -4.0 (inclusive) and -3.0 (exclusive), indicating a severe drought.
  - df_long[‘PDSI’].between(-0.5, 0.5, ‘neither’): This checks if the PDSI value is between -0.5 and 0.5 (both exclusive), indicating near-normal conditions.
  - df_long[‘PDSI’].between(3.0, 4.0, ‘right’): This checks if the PDSI value is between 3.0 (exclusive) and 4.0 (inclusive), indicating severe wet conditions
Next, a list named pdsi_labels is defined. This list contains the labels that correspond to each of the conditions in pdsi_conditions. These labels describe the severity of the drought or wet spell, ranging from ‘extreme drought’ to ‘extreme wetspell’.
The np.select function from the NumPy library is then used to create a new column in the DataFrame called ‘pdsi_label’. This function takes the pdsi_conditions and pdsi_labels lists as arguments and assigns the appropriate label to each row in the DataFrame based on the PDSI value.
Finally, the head method is called on df_long to display the first few rows of the updated DataFrame, allowing the user to verify that the new ‘pdsi_label’ column has been correctly added and populated with the appropriate labels. This approach ensures that each PDSI value is categorized accurately, providing a clear and descriptive label for the severity of drought or wet spell conditions.

Let us plot the data to see the distribution of the extreme drought and extreme wetspell over the years.

import plotly.express as px

# Plot the distribution of the 'PDSI' variable
fig = px.histogram(df_long, x='PDSI', nbins=20, title='Distribution of PDSI')

# Update layout for better visualization
fig.update_layout(
    xaxis_title='PDSI',
    yaxis_title='Frequency',
    title_x=0.5
)

# Show the plot
fig.show()

We created a historgram to see the distribution of the variable. We can see that there are few values which are less than -6 or greater than 6.

We can also see that the distribution is skewed to the left, indicating that there are more months with negative PDSI values (indicating drought conditions) than positive values (indicating wet conditions).

Let us count the number of observations for each of the PDSI labels.

import pandas as pd
df_long[['pdsi_label']].value_counts()

pdsi_label        
mild wetspell         21274
mild drought          20721
normal                18166
moderate wetspell     17477
moderate drought      14117
incipient drought     12804
severe wetspell       10437
incipient wetspell    10422
severe drought         8331
extreme wetspell       8050
extreme drought        7244
unlabeled               429
Name: count, dtype: int64

We can see that there are 429 observations with the PDSI Label ‘unlabeled’.

Let us check the values of the PDSI for these observations.

df_long.loc[df_long['pdsi_label'] == 'unlabeled', 'PDSI'].value_counts()

PDSI
-0.5    231
 0.5    198
Name: count, dtype: int64

We can see that for some reason. there are some value of PDSI, -0.5 and +0.5 that are not labeled.

It looks like in the code, we have used neither for the condition -0.5 to 0.5. We should have used ‘both’ as we want to include both -0.5 and 0.5. We used neither as the description in the readme file is not clear.

Let us fix this by changing the conditions assuming that the lower value is inclusive and the upper limit is not inclusive (the documentation is not clear). As this impacts only a small fraction, less than 0.5%, it shouldn’t be much of an issue even if we were wrong.

import numpy as np

pdsi_conditions = [
    (df_long['PDSI'] < -4.0),
    (df_long['PDSI'].between(-4.0, -3.0, 'left')),    
    (df_long['PDSI'].between(-3.0, -2.0, 'left')),    
    (df_long['PDSI'].between(-2.0, -1.0, 'left')),    
    (df_long['PDSI'].between(-1.0, -0.5, 'left')),    
    (df_long['PDSI'].between(-0.5, 0.5, 'both')),
    (df_long['PDSI'].between(0.5, 1.0, 'right')),    
    (df_long['PDSI'].between(1.0, 2.0, 'right')),    
    (df_long['PDSI'].between(2.0, 3.0, 'right')),    
    (df_long['PDSI'].between(3.0, 4.0, 'right')),    
    (df_long['PDSI'] >= 4.0)
]

# create a list of the PDSI labels that we want to assign for each condition based on PDSI
pdsi_labels = [ 
    'extreme drought',
    'severe drought', 
    'moderate drought', 
    'mild drought', 
    'incipient drought', 
    'normal', 
    'incipient wetspell', 
    'mild wetspell', 
    'moderate wetspell', 
    'severe wetspell', 
    'extreme wetspell'
]

# create a new column and use np.select to assign values to it using our lists as arguments
df_long['pdsi_label'] = np.select(pdsi_conditions, pdsi_labels, default='unlabeled')

The only difference is we have used,

(df_long[‘PDSI’].between(-0.5, 0.5, ‘both’)), instead of

(df_long[‘PDSI’].between(-0.5, 0.5, ‘neither’)).

Let us print the results, in a slightly better format.

import pandas as pd
from tabulate import tabulate

# Count the number of occurrences of each label
label_counts = df_long[['pdsi_label']].value_counts().reset_index(name='count')

# Rename columns for better readability
label_counts.columns = ['PDSI Label', 'Count']

# Print the table in a pretty format
print(tabulate(label_counts, headers='keys', tablefmt='pretty'))

+----+--------------------+-------+
|    |     PDSI Label     | Count |
+----+--------------------+-------+
| 0  |   mild wetspell    | 21274 |
| 1  |    mild drought    | 20721 |
| 2  |       normal       | 18595 |
| 3  | moderate wetspell  | 17477 |
| 4  |  moderate drought  | 14117 |
| 5  | incipient drought  | 12804 |
| 6  |  severe wetspell   | 10437 |
| 7  | incipient wetspell | 10422 |
| 8  |   severe drought   | 8331  |
| 9  |  extreme wetspell  | 8050  |
| 10 |  extreme drought   | 7244  |
+----+--------------------+-------+

Now, we can analyze whether extreme drought or exteme wetspells are getting more common nationally or in any state.

Let us now find out the average value of these extreme events at the yearly level and plot them.

# let us count the number of months in which a extreme weather event has happened over the years, nationally

#df = df.loc[df.state == 110 and df.pdsi_label == 'extreme drought']

df = df_long.loc[(df_long.state == 110) & (df_long.pdsi_label == 'extreme drought')]

df_pivot = pd.pivot_table(
    data=df,
    index='year',
    aggfunc='count'
)
df_pivot

	PDSI	division	element	month	pdsi_label	state
year
1910	1	1	1	1	1	1
1911	6	6	6	6	6	6
1918	8	8	8	8	8	8
1925	4	4	4	4	4	4
1931	11	11	11	11	11	11
1933	1	1	1	1	1	1
1934	12	12	12	12	12	12
1935	4	4	4	4	4	4
1936	6	6	6	6	6	6
1937	1	1	1	1	1	1
1939	2	2	2	2	2	2
1940	10	10	10	10	10	10
1952	1	1	1	1	1	1
1953	1	1	1	1	1	1
1954	10	10	10	10	10	10
1955	6	6	6	6	6	6
1956	4	4	4	4	4	4
1957	2	2	2	2	2	2
1963	2	2	2	2	2	2
1964	2	2	2	2	2	2
1981	4	4	4	4	4	4
1988	5	5	5	5	5	5
1989	1	1	1	1	1	1
2000	10	10	10	10	10	10
2012	7	7	7	7	7	7
2013	3	3	3	3	3	3

We notice that df_pivot data set as the count of extreme drought nationally for each of the years that the extreme drought has happened (as we filtered data based on the code for nationally (110) and pdsi_label = extreme drought). However, we can see that all the columns have the same value. This is becuase of the aggregation function that we used, aggfunc = ‘count’.

In this particular case, we need only count the number of months each year when extreme drought conditions were there nationally. We don’t need the other variables. We can change our code slightly.

# let us count the number of months in which a extreme weather event has happened over the years, nationally

#df = df.loc[df.state == 110 and df.pdsi_label == 'extreme drought']

df = df_long.loc[(df_long.state == 110) & (df_long.pdsi_label == 'extreme drought')]

df_pivot = pd.pivot_table(
    data=df[['year', 'pdsi_label']],
    index='year',
    aggfunc='count'
).rename(columns={'pdsi_label': 'count'}).sort_values(by='count', ascending=False)

df_pivot

	count
year
1934	12
1931	11
2000	10
1940	10
1954	10
1918	8
2012	7
1955	6
1936	6
1911	6
1988	5
1981	4
1925	4
1935	4
1956	4
2013	3
1957	2
1963	2
1964	2
1939	2
1989	1
1910	1
1952	1
1937	1
1933	1
1953	1

What do we do in the code?

First, the code filters a DataFrame named df_long to create a new DataFrame df.
- The filtering condition selects rows where the state column equals 110 and the pdsi_label column equals ‘extreme drought’.
- This operation is performed using the .loc accessor, which allows for label-based indexing. The resulting df DataFrame contains only the rows that meet these specific criteria.
Next, the code creates a pivot table from the filtered DataFrame df. The pd.pivot_table function is used to reshape the data.
- The pivot table is constructed using the year and pdsi_label columns from df.
- The index parameter is set to ‘year’, meaning that each unique year will become a row in the pivot table.
- The aggfunc parameter is set to ‘count’, which means that the pivot table will count the number of occurrences of each year.
The resulting pivot table is assigned to a new DataFrame named df_pivot.
- The pivot table contains the count of extreme drought events for each year that they occurred nationally.
- The columns of the pivot table are renamed to ‘count’ for clarity using the rename method.
- The rows of the pivot table are sorted in descending order based on the count values using the sort_values method.
Finally, the pivot table df_pivot is displayed. This table shows the count of extreme drought events for each year that they occurred nationally.

It appears that nationally, 1934 had the highest number of months with extreme drought conditions with all the 12 months in extreme drought. This is followed by 1930 with 11 months in extreme drought. The year 2000 had 10 months in extreme drought.

How about by decade? Let us check the number of months with extreme drought by decade.

df_pivot['decade'] = (df_pivot.index // 10) * 10

# sum up by decade
df_pivot_decade = df_pivot.groupby('decade').sum().sort_values(by='count', ascending=False)
df_pivot_decade

	count
decade
1930	37
1950	24
1910	15
1940	10
1980	10
2000	10
2010	10
1920	4
1960	4

It appears 1930s had the highest number of months with extreme drought conditions nationally.

The number of months with extreme drought are significantly higher since 2000, around 20 months compared to the 1980-2000 (10).

We can see it graphically.

fig = px.bar(df_pivot, x= df_pivot.index, y = 'count')
fig.update_layout(
    title='Extreme Drought Months Over the Years',
    title_x=0.5
)
fig.show()

How about extreme or severe drought? It is easy to change our code slightly.

df = df_long.loc[(df_long.state == 110) & ((df_long.pdsi_label == 'extreme drought') | (df_long.pdsi_label == 'severe drought'))]

df_pivot = pd.pivot_table(
    data=df[['year', 'pdsi_label']],
    index='year',
    aggfunc='count'
).rename(columns={'pdsi_label': 'count'})

fig = px.bar(df_pivot, x= df_pivot.index, y = 'count')
fig.update_layout(
    title='Extreme or Severe Drought Months Over the Years',
    title_x=0.5
)
fig.show()

How about extreme wetspell?

df = df_long.loc[(df_long.state == 110) & (df_long.pdsi_label == 'extreme wetspell')]

df_pivot = pd.pivot_table(
    data=df[['year', 'pdsi_label']],
    index='year',
    aggfunc='count'
).rename(columns={'pdsi_label': 'count'})

fig = px.bar(df_pivot, x= df_pivot.index, y = 'count')
fig.update_layout(
    title='Extreme Wetspell Months Over the Years',
    title_x=0.5
)
fig.show()

When we compare with extreme drought, we see that the decades in which there is extreme drought, there are no extreme wetspells and it makes sense and gives confidence that our data manipulation is doing what we want it to do.

How about extreme drought or extreme wetspell? It is easy to change our code

# let us count the number of months in which a extreme weather event has happened over the years, nationally


#df = df.loc[(df.state == 110) & ((df.pdsi_label == 'extreme drought') | (df.pdsi_label == 'severe drought'))]

df = df_long.loc[(df_long.state == 110) & ((df_long.pdsi_label == 'extreme drought') | (df_long.pdsi_label == 'extreme wetspell'))]

df_pivot = pd.pivot_table(
    data=df[['year', 'pdsi_label']],
    index='year',
    aggfunc='count'
).rename(columns={'pdsi_label': 'count'})

fig = px.bar(df_pivot, x= df_pivot.index, y = 'count')
fig.update_layout(
    title='Extreme Weather Events Over the Years',
    title_x=0.5
)
fig.show()

Let us write a more general code that takes the count for each state and drought condition.

df = df_long.loc[
    (df_long.state == 110) &
    ((df_long.pdsi_label == 'extreme drought') |
    (df_long.pdsi_label == 'severe drought')  | 
    (df_long.pdsi_label == 'extreme wetspell') | 
    (df_long.pdsi_label == 'severe wetspell')
    )]

df_pivot = pd.pivot_table(
    data=df[['year','pdsi_label']],
    index=['year'],
    aggfunc='count'
).rename(columns={'pdsi_label': 'count'})

df_pivot

fig = px.line(df_pivot, x= df_pivot.index, y = 'count', 
              labels={'year': 'Year', 'pdsi_label': 'Number of Extreme Events'})

fig.update_layout(
    title='Extreme Weather Events Over the Years',
    title_x=0.5
)

fig.show()

If we need for any state or climate division, we can just change our code by changing the state code. In fact, we can make it a small function with one input, the state

import plotly.express as px

def pdsi_pivot_state(statecode):
    df = df_long.loc[
        (df_long.state == statecode) &
        ((df_long.pdsi_label == 'extreme drought') |
        (df_long.pdsi_label == 'severe drought')  | 
        (df_long.pdsi_label == 'extreme wetspell') | 
        (df_long.pdsi_label == 'severe wetspell')
        )]

    if df.empty:
        print('There are no severe or extreme weather events in this state')

    else:
        df_pivot_state = pd.pivot_table(
        data=df[['year','pdsi_label']],
        index='year',
        aggfunc='count'
        ).rename(columns={'pdsi_label': 'count'})
        fig = px.bar(df_pivot_state, x= df_pivot_state.index, y = 'count')

        fig.update_layout(
            title=f'Extreme Weather Events for statecode {statecode} Over the Years',
            title_x=0.5
       )
        fig.show()

Let us check for a nationally and for a few states.

#national
pdsi_pivot_state(110)

Since we are in Georgia, we want to check for Georgia.

#georgia
pdsi_pivot_state(9)

California is another state that has been in the news for extreme weather events. Let us check for California.

pdsi_pivot_state(4)

It appears that there are many extreme or severe events in California in the last 2 decades compared to before.

Similarly, let us get the years when the extreme or severe events happened to confirm our visual analysis.

Let us use nlargest and nsmallest to get the years with the most extreme weather events

import plotly.express as px

def pdsi_pivot_state(statecode):
    df = df_long.loc[
        (df_long.state == statecode) &
        ((df_long.pdsi_label == 'extreme drought') |
        (df_long.pdsi_label == 'severe drought')  | 
        (df_long.pdsi_label == 'extreme wetspell') | 
        (df_long.pdsi_label == 'severe wetspell')
        )]

    if df.empty:
        print('There are no severe or extreme weather events in this state')

    else:
        df_pivot_state = pd.pivot_table(
        data=df[['year','pdsi_label']],
        index='year',
        aggfunc='count'
        ).rename(columns={'pdsi_label': 'count'})
        print(df_pivot_state.nlargest(5, 'count'))




        # most number of months with severe or extreme drought or wetspell
        # climate change is real for california!

#national
#pdsi_pivot_state(110)

#georgia
#pdsi_pivot_state(9)

pdsi_pivot_state(4)

      count
year       
1983     12
2014     12
2015     12
2021     12
2007     11

It appears many extreme or severe events in california in the last 2 decades compared to before.

Let us check this more carefully by checking the decades when the extreme or severe events happened.

df_pivot['decade'] = (df_pivot.index // 10) * 10

df_pivot_decade = df_pivot.groupby('decade').sum().sort_values(by='count', ascending=False)
df_pivot_decade

	count
decade
1980	57
1950	56
1930	48
1910	43
1990	41
1900	35
1970	35
2010	33
1940	32
2000	32
1920	24
2020	13
1960	9
1890	1

Our initial observation is not supported by the data. The number of months with extreme or severe events in California is not significantly higher since 2000 as compared to earlier periods when we look by the decades.

We can dig deeper by analyzing the data by state or climate division. We can also analyze the data by month or season.

16.3 Anomalies

Let us use Palmer “Z” Index (ZNDX)

This is the generated monthly Z values, and they can be expressed as the “Moisture Anomaly Index.” Each monthly Z value is a measure of the departure from normal of the moisture climate for that month. This index can respond to a month of above-normal precipitation, even during periods of drought. Table 1 contains expected values of the Z index and other drought parameters. See Historical Climatology Series 3-6 through 3-9 for a detailed description of the drought indices.

The data is available at climdiv-zndxst-vx.y.z-YYYYMMDD as per the readme.txt file

The data seems to be in a similar format as the PDSI data, so we can use the same code as before with a minor modification to the URL.

import pandas as pd
import numpy as np

#let us build a list of tuples with the column / variable names
colspecs = [(0,3), (3,4), (4,6), (6,10), (10,17), (17,24), (24,31), (31,38), (38,45), (45,52), (52,59), (59,66), (66,73), (73,80), (80,87), (87,-1)]

names = ['state', 'division', 'element', 'year','jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']


df_zndx_st = pd.read_fwf("https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-zndxst-v1.0.0-20241021", colspecs=colspecs, names=names)

# replace missing values
df_zndx_st = df_zndx_st.replace(-99.99,np.nan)

df_zndx_st

	state	division	element	year	jan	feb	mar	apr	may	jun	jul	aug	sep	oct	nov	dec
0	1	0	7	1895	2.35	-1.77	1.08	-1.17	-0.18	1.73	-0.30	1.19	-2.35	-0.28	-2.08	-1.07
1	1	0	7	1896	-0.60	0.87	-0.50	-1.80	-1.85	1.02	-0.84	-2.30	-2.54	-0.82	-0.13	-3.11
2	1	0	7	1897	-1.35	0.80	2.96	-0.63	-2.28	-3.39	-2.05	1.13	-3.56	-2.32	-2.34	-0.51
3	1	0	7	1898	-1.38	-2.61	-2.99	0.08	-3.81	-1.82	-0.36	2.54	-0.45	1.14	2.35	-0.72
4	1	0	7	1899	1.08	1.47	-0.07	-1.89	-3.10	-2.38	0.19	-1.44	-3.35	-0.84	-1.32	-0.16
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
12475	365	0	7	2020	1.61	2.31	0.68	1.80	-0.89	0.10	-1.04	0.34	1.40	0.46	-2.58	-2.06
12476	365	0	7	2021	-1.89	-0.38	-0.17	-0.31	0.68	3.31	3.66	3.16	-0.85	0.43	-2.99	-2.94
12477	365	0	7	2022	-3.08	-2.33	-0.72	-0.36	-1.45	-2.81	-2.46	2.29	-3.10	-1.09	-0.20	0.14
12478	365	0	7	2023	2.28	-1.07	0.99	0.49	-0.51	0.80	-0.63	-1.49	-2.53	-1.32	-2.70	-2.93
12479	365	0	7	2024	0.45	-0.77	0.24	0.60	1.05	-1.09	0.09	-2.57	1.91	NaN	NaN	NaN

12480 rows × 16 columns

Notice that the only difference from the PDSI data is that we changed the URL to zndxst instead of pdsist and for more clarity we changed the dataset names correspondingly.

let us tidy the data as before i.e. convert the columns into rows

# let us reshape the  data to Loc*Year*Month

df_zndx_long = pd.melt(df_zndx_st,    
             id_vars=['state', 'division', 'element', 'year'],
             value_vars=['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'],
             var_name='month', 
             value_name='ZNDX')

# let us drop cases where missing values are present, probably in 2024
# note: we don't want to drop the missing values earlier in the wide format as it would have dropped the entire year. now it would only drop the month
df_zndx_long = df_zndx_long.dropna()

Let us assign a label based on the zindex value as we did for PDSI index.

import numpy as np

zndx_conditions = [
    (df_zndx_long['ZNDX'] <= -2.75),
    (df_zndx_long['ZNDX'].between(-2.74,-2.0, 'both')),    
    (df_zndx_long['ZNDX'].between(-1.99,-1.25, 'both')),    
    (df_zndx_long['ZNDX'].between(-1.24, 0.99, 'both')),   
    (df_zndx_long['ZNDX'].between(1.0, 2.49, 'both')),    
    (df_zndx_long['ZNDX'].between(2.50, 3.49, 'both')),    
    (df_zndx_long['ZNDX'] >= 3.50) 
    ]

# create a list of the PDSI labels that we want to assign for each condition based on PDSI
zndx_labels = [ 
                'extreme drought',
                'severe drought', 
                'mild to moderate drought', 
                'near normal', 
                'mild to moderate wetspell', 
                'severe wetspell', 
                'extreme wetspell'
                ]

# create a new column and use np.select to assign values to it using our lists as arguments
df_zndx_long['zndx_label'] = np.select(zndx_conditions, zndx_labels)

# display updated DataFrame
df_zndx_long.head()

	state	element	year	month	ZNDX	zndx_label
0	1	7	1895	jan	2.35	mild to moderate wetspell
1	1	7	1896	jan	-0.60	near normal
2	1	7	1897	jan	-1.35	mild to moderate drought
3	1	7	1898	jan	-1.38	mild to moderate drought
4	1	7	1899	jan	1.08	mild to moderate wetspell

We can use a slight modification of the previous function to get for each state a graph of anamolous extreme or sever weather events

import plotly.express as px

def zndx_pivot_state(statecode):
    df = df_zndx_long.loc[
        (df_zndx_long.state == statecode) &
        ((df_zndx_long.zndx_label == 'extreme drought') |
        ( df_zndx_long.zndx_label == 'severe drought')  | 
        ( df_zndx_long.zndx_label == 'extreme wetspell') | 
        ( df_zndx_long.zndx_label == 'severe wetspell')
        )]

    if df.empty:
        print('There are no severe or extreme weather events in this state')

    else:
        df_pivot_state = pd.pivot_table(
        data=df[['year','zndx_label']],
        index='year',
        aggfunc='count'
        )
        # most number of months with severe or extreme drought or wetspell
        # climate change is real for california!

        fig = px.bar(df_pivot_state, x= df_pivot_state.index, y = 'zndx_label')
        fig.update_layout(
            title=f'Months with Extreme / Severe Drought/ Wetspell for statecode {statecode}',
            title_x=0.5
         )
        fig.show()

We can check for the national level and for a few states.

#national
zndx_pivot_state(110)

Let us try Georgia and California.

zndx_pivot_state(9)

How about California?

zndx_pivot_state(4)

Instead of time-series for each state, let us compute the number of months with extreme or severe events by decade for each state.

def zndx_pivot_decade_state(statecode):
    df = df_zndx_long.loc[
        (df_zndx_long.state == statecode) &
        ((df_zndx_long.zndx_label == 'extreme drought') |
        ( df_zndx_long.zndx_label == 'severe drought')  | 
        ( df_zndx_long.zndx_label == 'extreme wetspell') | 
        ( df_zndx_long.zndx_label == 'severe wetspell')
        )]

    if df.empty:
        print('There are no severe or extreme weather events in this state')

    else:
        df_pivot_state = pd.pivot_table(
        data=df[['year','zndx_label']],
        index='year',
        aggfunc='count'
        )
        df_pivot_state['decade'] = (df_pivot_state.index // 10) * 10

        df_pivot_decade = df_pivot_state.groupby('decade').sum().sort_values(by='zndx_label', ascending=False)
        print(df_pivot_decade)
        fig = px.bar(df_pivot_state, x= df_pivot_state.decade, y = 'zndx_label')
        fig.update_layout(
            title=f'Months with Extreme / Severe Drought/ Wetspell for statecode {statecode}',
            title_x=0.5
         )
        fig.show()

Let us check for California.

zndx_pivot_decade_state(4)

        zndx_label
decade            
1980            38
1970            35
1990            34
1950            33
2010            32
1900            31
2000            29
1910            27
1920            27
1940            26
1960            21
2020            20
1930            19
1890            13

Note that we need to ignore 1890s as the data is only available from 1895 and 2020s as the data is only available till 2024.

At least for California, the number of months with extreme or severe drought or wetspell seem to be higher in 1970-1990s compared to 2000s.

16.4 Programming Lab or Activities

We have given you the tools to get the data and understand the data. what can you do with it? use your imagination!

How is your state on drought and wet spells. play around with the data hint: change the state code in the code above
What are the top 5 years for extreme drought / extreme wetspells for your state?
Get the division level data. Hint: it needs only 2 or 3 lines of modification. How can you use it?

Thomas R. Heddinghause and Paul Sabol, 1991; “A Review of the Palmer Drought Severity Index and Where Do We Go From Here?,” Proceedings of the Seventh Conference on Applied Climatology, pp. 242-246, American Meteorological Society, Boston, MA)↩︎

16.1 Learning Objectives: Climate Insights

16.2 Droughts

16.3 Anomalies

16.4 Programming Lab or Activities