import pandas as pd
from pathlib import Path
Introduction
In this chapter we study historical, country-level indicators of wealth, prosperity and detriment (via CO2 production) across the world.
The data was taken from Kaggle. We will wrangle it into a form suitable for predictive modelling later on (next chapter).
These are the datasets in question:
data_folder = 'data'
!ls -1 {data_folder}/*.csv
data/co2_production.csv
data/gross_national_income_per_capital.csv
data/human_development_index.csv
data/life_expectancy_by_birth.csv
Datasets
Let’s read one of them and print the columns…
data_folder = Path(data_folder)
life_exp = pd.read_csv(data_folder/'life_expectancy_by_birth.csv')
life_exp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ISO3 195 non-null object
1 Country 206 non-null object
2 hdicode 191 non-null object
3 region 151 non-null object
4 hdi_rank_2021 191 non-null float64
5 le_1990 206 non-null float64
6 le_1991 206 non-null float64
7 le_1992 206 non-null float64
8 le_1993 206 non-null float64
9 le_1994 206 non-null float64
10 le_1995 206 non-null float64
11 le_1996 206 non-null float64
12 le_1997 206 non-null float64
13 le_1998 206 non-null float64
14 le_1999 206 non-null float64
15 le_2000 206 non-null float64
16 le_2001 206 non-null float64
17 le_2002 206 non-null float64
18 le_2003 206 non-null float64
19 le_2004 206 non-null float64
20 le_2005 206 non-null float64
21 le_2006 206 non-null float64
22 le_2007 206 non-null float64
23 le_2008 206 non-null float64
24 le_2009 206 non-null float64
25 le_2010 206 non-null float64
26 le_2011 206 non-null float64
27 le_2012 206 non-null float64
28 le_2013 206 non-null float64
29 le_2014 206 non-null float64
30 le_2015 206 non-null float64
31 le_2016 206 non-null float64
32 le_2017 206 non-null float64
33 le_2018 206 non-null float64
34 le_2019 206 non-null float64
35 le_2020 206 non-null float64
36 le_2021 206 non-null float64
dtypes: float64(33), object(4)
memory usage: 59.7+ KB
Now let’s read the rest of the data
co2 = pd.read_csv(data_folder/'co2_production.csv')
gross_income_percapita = pd.read_csv(data_folder/'gross_national_income_per_capital.csv')
hdi_index = pd.read_csv(data_folder/'human_development_index.csv')
Let’s check that all four datasets contain the same number of rows (206)
hdi_index.shape[0] == co2.shape[0] == gross_income_percapita.shape[0] == life_exp.shape[0]
True
We can see the same number of columns too
print(hdi_index.shape[1])
print(co2.shape[1])
print(gross_income_percapita.shape[1])
print(life_exp.shape[1])
37
37
37
37
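Beyond matching counts, it’s worth confirming that each dataset carries one indicator column per year from 1990 to 2021. A small check along these lines (a sketch, not part of the original notebook):
# Hedged sanity check: every dataset should have a column ending in each year 1990-2021
for name, df in [('life_exp', life_exp), ('co2', co2),
                 ('gross_income_percapita', gross_income_percapita),
                 ('hdi_index', hdi_index)]:
    missing = [str(y) for y in range(1990, 2022)
               if not any(col.endswith(str(y)) for col in df.columns)]
    print(name, 'missing years:', missing or 'none')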
Determining ID column
We can treat the ISO3 & Country columns as global identifiers (we’ll single one out later on). Let’s see if there are any null values in the ISO3 column.
life_exp[life_exp.ISO3.isnull()]
| | ISO3 | Country | hdicode | region | hdi_rank_2021 | le_1990 | le_1991 | le_1992 | le_1993 | le_1994 | ... | le_2012 | le_2013 | le_2014 | le_2015 | le_2016 | le_2017 | le_2018 | le_2019 | le_2020 | le_2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 195 | NaN | Very high human development | NaN | NaN | NaN | 73.776652 | 73.931896 | 74.058848 | 73.852640 | 74.007158 | ... | 78.578754 | 78.822196 | 79.057616 | 79.062435 | 79.264933 | 79.428843 | 79.582768 | 79.822662 | 78.789745 | 78.521301 |
| 196 | NaN | High human development | NaN | NaN | NaN | 67.315701 | 67.567552 | 67.966260 | 68.358081 | 68.636211 | ... | 74.212238 | 74.463776 | 74.732246 | 74.944071 | 75.152677 | 75.261530 | 75.600606 | 75.785826 | 75.120548 | 74.709094 |
| 197 | NaN | Medium human development | NaN | NaN | NaN | 58.757754 | 58.991558 | 59.592548 | 59.880467 | 60.250251 | ... | 67.460860 | 67.893291 | 68.411503 | 68.895967 | 69.314997 | 69.694263 | 69.998178 | 70.219458 | 69.517580 | 67.438318 |
| 198 | NaN | Low human development | NaN | NaN | NaN | 50.351409 | 50.608373 | 50.467296 | 50.606850 | 51.070213 | ... | 59.569566 | 60.010001 | 60.326771 | 60.657080 | 61.064843 | 61.424061 | 61.721181 | 62.083426 | 61.675690 | 61.310991 |
| 199 | NaN | Arab States | NaN | NaN | NaN | 62.973324 | 63.225764 | 63.520079 | 64.512882 | 65.562309 | ... | 70.261228 | 70.274253 | 70.570127 | 70.836099 | 71.019940 | 71.508744 | 71.711891 | 71.922194 | 71.002105 | 70.895040 |
| 200 | NaN | East Asia and the Pacific | NaN | NaN | NaN | 67.161758 | 67.383754 | 67.905512 | 68.375688 | 68.640580 | ... | 74.427861 | 74.677412 | 74.936109 | 75.156317 | 75.355177 | 75.419246 | 75.835426 | 76.036110 | 75.968330 | 75.579650 |
| 201 | NaN | Europe and Central Asia | NaN | NaN | NaN | 67.781962 | 67.609336 | 66.896095 | 66.718280 | 67.136734 | ... | 72.606274 | 72.992396 | 73.357649 | 73.681015 | 73.983262 | 74.343005 | 74.505959 | 74.700741 | 72.780992 | 72.856526 |
| 202 | NaN | Latin America and the Caribbean | NaN | NaN | NaN | 67.770568 | 68.127209 | 68.458998 | 68.786974 | 69.166664 | ... | 74.067656 | 74.316945 | 74.521732 | 74.579895 | 74.544240 | 74.705572 | 74.823478 | 75.014905 | 73.008316 | 72.099890 |
| 203 | NaN | South Asia | NaN | NaN | NaN | 58.830494 | 59.102221 | 59.727847 | 60.058622 | 60.441754 | ... | 67.915940 | 68.395439 | 68.905849 | 69.407382 | 69.874722 | 70.245467 | 70.517018 | 70.722315 | 69.972908 | 67.855530 |
| 204 | NaN | Sub-Saharan Africa | NaN | NaN | NaN | 49.868704 | 49.989060 | 49.900445 | 49.817075 | 49.722427 | ... | 57.987394 | 58.539213 | 59.010080 | 59.428775 | 59.954595 | 60.348837 | 60.735440 | 61.120031 | 60.729051 | 60.112467 |
| 205 | NaN | World | NaN | NaN | NaN | 65.144798 | 65.299082 | 65.584015 | 65.753616 | 65.986642 | ... | 71.289119 | 71.581265 | 71.886202 | 72.111793 | 72.370895 | 72.568952 | 72.816114 | 73.012099 | 72.257297 | 71.365465 |
11 rows × 37 columns
The NaN values aren’t tied to individual countries; rather, these rows are aggregate entries (development tiers, regions, and the world as a whole) included for convenience. We will drop them from the table.
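Before dropping them, here’s a quick programmatic look at the labels involved (a one-liner sketch, not in the original):
# The rows lacking ISO3 carry aggregate labels in the Country column
life_exp.loc[life_exp.ISO3.isnull(), 'Country'].tolist()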
life_exp_countries = life_exp.dropna(subset='ISO3', axis=0)
life_exp_countries.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195 entries, 0 to 194
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ISO3 195 non-null object
1 Country 195 non-null object
2 hdicode 191 non-null object
3 region 151 non-null object
4 hdi_rank_2021 191 non-null float64
5 le_1990 195 non-null float64
6 le_1991 195 non-null float64
7 le_1992 195 non-null float64
8 le_1993 195 non-null float64
9 le_1994 195 non-null float64
10 le_1995 195 non-null float64
11 le_1996 195 non-null float64
12 le_1997 195 non-null float64
13 le_1998 195 non-null float64
14 le_1999 195 non-null float64
15 le_2000 195 non-null float64
16 le_2001 195 non-null float64
17 le_2002 195 non-null float64
18 le_2003 195 non-null float64
19 le_2004 195 non-null float64
20 le_2005 195 non-null float64
21 le_2006 195 non-null float64
22 le_2007 195 non-null float64
23 le_2008 195 non-null float64
24 le_2009 195 non-null float64
25 le_2010 195 non-null float64
26 le_2011 195 non-null float64
27 le_2012 195 non-null float64
28 le_2013 195 non-null float64
29 le_2014 195 non-null float64
30 le_2015 195 non-null float64
31 le_2016 195 non-null float64
32 le_2017 195 non-null float64
33 le_2018 195 non-null float64
34 le_2019 195 non-null float64
35 le_2020 195 non-null float64
36 le_2021 195 non-null float64
dtypes: float64(33), object(4)
memory usage: 57.9+ KB
Let’s repeat this step for the other datasets and check that they match in number of rows.
co2_countries = co2.dropna(subset='ISO3', axis=0)
gross_income_percapita_countries = gross_income_percapita.dropna(subset='ISO3', axis=0)
hdi_index_countries = hdi_index.dropna(subset='ISO3', axis=0)
co2_countries.shape[0] == hdi_index_countries.shape[0] == gross_income_percapita_countries.shape[0]
True
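Matching row counts alone don’t guarantee the same countries appear in every file, so here is an extra check (my own addition, not in the original) that compares the Country values directly; it also supports using Country as the merge key later on:
# Compare the Country sets across the four country-level tables
country_sets = [set(df.Country) for df in (life_exp_countries, co2_countries,
                                           gross_income_percapita_countries, hdi_index_countries)]
all(s == country_sets[0] for s in country_sets)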
Time Series Data
Together, these datasets have all the makings of a time-series dataset! We’ll combine them into one big table and save it to disk, which will simplify our modelling in the next chapter.
There is also some per-country metadata that can be put aside, namely the region, hdicode, hdi_rank_2021, and ISO3 columns (we’ll merge on Country instead).
# Putting the misc cols aside so they aren't repeated in the merged table
country_miscellaneous_cols = ['region', 'hdicode', 'hdi_rank_2021', 'ISO3']
country_misc_data = life_exp_countries[['Country']+country_miscellaneous_cols]
time_series_data = life_exp_countries.drop(country_miscellaneous_cols, axis=1)\
    .merge(hdi_index_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')\
    .merge(co2_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')\
    .merge(gross_income_percapita_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')
time_series_data
| | Country | le_1990 | le_1991 | le_1992 | le_1993 | le_1994 | le_1995 | le_1996 | le_1997 | le_1998 | ... | gnipc_2012 | gnipc_2013 | gnipc_2014 | gnipc_2015 | gnipc_2016 | gnipc_2017 | gnipc_2018 | gnipc_2019 | gnipc_2020 | gnipc_2021 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 45.9672 | 46.6631 | 47.5955 | 51.4664 | 51.4945 | 52.5442 | 53.2433 | 53.6342 | 52.9431 | ... | 2125.862821 | 2193.553936 | 2178.507021 | 2101.589319 | 2077.566899 | 2085.487571 | 2054.939895 | 2097.889450 | 1997.852149 | 1824.190915 |
| 1 | Angola | 41.8933 | 43.8127 | 42.2088 | 42.1009 | 43.4217 | 45.8491 | 46.0329 | 46.3065 | 45.0570 | ... | 7280.845666 | 7478.104777 | 7704.231949 | 7652.656486 | 7189.426672 | 6861.575738 | 6381.521946 | 6082.746624 | 5593.142060 | 5465.617791 |
| 2 | Albania | 73.1439 | 73.3776 | 73.7148 | 73.9391 | 74.1313 | 74.3616 | 74.5923 | 73.9039 | 74.9899 | ... | 11146.263030 | 11552.982470 | 11691.648290 | 12016.297600 | 12484.624200 | 12802.148310 | 13302.705960 | 13485.311240 | 12996.762910 | 14131.110390 |
| 3 | Andorra | 78.4063 | 77.9805 | 80.3241 | 78.6633 | 82.6380 | 78.9616 | 80.3340 | 80.9439 | 79.4259 | ... | 47126.814610 | 46385.095200 | 48483.720320 | 49936.874540 | 52267.738320 | 52650.225760 | 53483.306630 | 54465.047400 | 47878.666640 | 51166.626610 |
| 4 | United Arab Emirates | 71.9004 | 72.2414 | 72.3062 | 72.5213 | 72.5982 | 72.6945 | 72.7674 | 72.9367 | 73.0658 | ... | 57445.954750 | 60005.695360 | 62573.505310 | 65577.512240 | 66881.329740 | 67667.508460 | 67195.095230 | 68590.900940 | 63016.401220 | 62573.591810 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | Samoa | 67.6584 | 67.8814 | 68.3855 | 68.7009 | 68.9795 | 69.2984 | 69.5877 | 69.8746 | 70.1711 | ... | 5683.651395 | 5675.889504 | 5757.486050 | 6222.783552 | 6401.830980 | 6287.673021 | 6280.573936 | 6356.988690 | 5812.868328 | 5307.953374 |
| 191 | Yemen | 58.6994 | 59.0490 | 59.4283 | 59.8595 | 59.7135 | 60.4532 | 60.5678 | 61.1193 | 61.3748 | ... | 3152.900024 | 3212.651579 | 2775.842952 | 1785.788608 | 1494.230811 | 1302.425254 | 1341.656234 | 1349.567046 | 1370.601082 | 1314.270189 |
| 192 | South Africa | 63.3753 | 63.2649 | 63.3414 | 63.0447 | 62.6118 | 62.2616 | 61.4593 | 60.8053 | 60.0008 | ... | 13602.253520 | 13732.252520 | 13700.834560 | 13694.728110 | 13545.358590 | 13475.988210 | 13491.221790 | 13366.474640 | 12449.671040 | 12948.373250 |
| 193 | Zambia | 47.9263 | 47.0971 | 46.5119 | 46.2094 | 45.8543 | 45.5534 | 45.2326 | 44.9446 | 44.7011 | ... | 3333.576512 | 3389.478940 | 3263.039162 | 3403.471444 | 3237.505650 | 3330.552717 | 3418.096158 | 3365.410652 | 3178.619722 | 3217.767739 |
| 194 | Zimbabwe | 59.4264 | 58.0911 | 56.4354 | 54.4264 | 52.5878 | 50.5310 | 48.9551 | 47.9933 | 46.8192 | ... | 3618.629526 | 3632.111591 | 3644.856047 | 3638.532892 | 3606.750671 | 3728.918785 | 3864.012419 | 3674.564482 | 3654.289051 | 3809.887158 |
195 rows × 129 columns
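As a quick sanity check (a small assertion of my own), the shape is exactly what we expect: one Country column plus 4 indicators × 32 years = 129 columns across 195 countries.
# 1 Country column + 4 indicators x 32 years (1990-2021) = 129 columns
assert time_series_data.shape == (195, 1 + 4 * 32)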
Melt Dataset
We like this wide form of the dataset, but it has too many columns to read comfortably. Let’s melt it!
The idea is to extract the year from the le_, hdi_, co2_prod_ and gnipc_ columns into a separate year column, place each indicator’s value for that year next to it, and then use (Country, year) as the key to merge all the variables back together.
Each row will then take the form (Country, year, life_exp, hdi_index, co2, gnipc).
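To make the transformation concrete, here is a toy sketch of pd.melt on a made-up two-country frame (the values are invented purely for illustration):
import pandas as pd

# A made-up wide frame: one life-expectancy column per year
toy = pd.DataFrame({'Country': ['A', 'B'],
                    'le_1990': [50.0, 60.0],
                    'le_1991': [51.0, 61.0]})

# Melt the year columns into rows, then parse the year out of the column name
toy_long = toy.melt('Country', var_name='year', value_name='life_exp')
toy_long['year'] = toy_long['year'].str.slice(-4).astype(int)
print(toy_long)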
# Melt the dataset
life_exp_year_columns = tuple(f'le_{year}' for year in range(1990, 2022))
hdi_year_columns = tuple(f'hdi_{year}' for year in range(1990, 2022))
gnipc_year_columns = tuple(f'gnipc_{year}' for year in range(1990, 2022))
co2_year_columns = tuple(f'co2_prod_{year}' for year in range(1990, 2022))
def melt_df_by_var(df, value_name, value_vars):
    # Melt one indicator's year columns into long form, then parse the year from the column suffix
    melted = df.melt('Country', value_vars=value_vars, var_name='year', value_name=value_name)
    melted.year = melted.year.str.slice(-4).astype(int)
    return melted
melted_time_series = melt_df_by_var(time_series_data, 'life_exp', life_exp_year_columns)\
.merge(melt_df_by_var(time_series_data, 'hdi_index', hdi_year_columns), on=('Country', 'year'))\
.merge(melt_df_by_var(time_series_data, 'co2', co2_year_columns), on=('Country', 'year'))\
.merge(melt_df_by_var(time_series_data, 'gnipc', gnipc_year_columns), on=('Country', 'year'))
melted_time_series
| | Country | year | life_exp | hdi_index | co2 | gnipc |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 1990 | 45.9672 | 0.273 | 0.209727 | 2684.550019 |
| 1 | Angola | 1990 | 41.8933 | NaN | 0.429586 | 4845.706901 |
| 2 | Albania | 1990 | 73.1439 | 0.647 | 1.656902 | 4742.215529 |
| 3 | Andorra | 1990 | 78.4063 | NaN | 7.461153 | 43773.146500 |
| 4 | United Arab Emirates | 1990 | 71.9004 | 0.728 | 28.277672 | 102433.136000 |
| ... | ... | ... | ... | ... | ... | ... |
| 6235 | Samoa | 2021 | 72.7675 | 0.707 | 1.238975 | 5307.953374 |
| 6236 | Yemen | 2021 | 63.7534 | 0.455 | 0.327510 | 1314.270189 |
| 6237 | South Africa | 2021 | 62.3410 | 0.713 | 7.620420 | 12948.373250 |
| 6238 | Zambia | 2021 | 61.2234 | 0.565 | 0.357535 | 3217.767739 |
| 6239 | Zimbabwe | 2021 | 59.2531 | 0.593 | 0.708562 | 3809.887158 |
6240 rows × 6 columns
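One last sanity check (a small assertion, not part of the original): 195 countries × 32 years should give exactly the 6,240 rows and six columns shown above.
# 195 countries x 32 years (1990-2021) = 6,240 rows; 6 columns in total
assert melted_time_series.shape == (195 * 32, 6)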
Finally, let’s write it to disk:
# Make sure the output folder exists before writing (assumes data/processed may not exist yet)
(data_folder/'processed').mkdir(parents=True, exist_ok=True)
melted_time_series.to_csv(data_folder/'processed/time_series.csv', index=False)
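As an optional round-trip check (not in the original), we can read the file back and confirm nothing was lost in the write:
# Re-read the saved CSV and compare shapes as a quick integrity check
reloaded = pd.read_csv(data_folder/'processed/time_series.csv')
assert reloaded.shape == melted_time_series.shape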
Conclusion
This wraps up our data-processing steps. We’ll carry out the few remaining preprocessing steps needed before modelling in the next chapter. Stay tuned.