import pandas as pd
from pathlib import Path
Introduction
This is a dataset where we study historical indicators of wealth, prosperity, and detriment (via CO2 production) across the world.
The data was taken from Kaggle. We will wrangle it into a form that is suitable for predictive modelling later on (in the next chapter).
These are the datasets in question:
data_folder = 'data'
!ls -1 {data_folder}/*.csv
data/co2_production.csv
data/gross_national_income_per_capital.csv
data/human_development_index.csv
data/life_expectancy_by_birth.csv
Datasets
Let’s read one of them and print the columns…
data_folder = Path(data_folder)
life_exp = pd.read_csv(data_folder/'life_expectancy_by_birth.csv')
life_exp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ISO3 195 non-null object
1 Country 206 non-null object
2 hdicode 191 non-null object
3 region 151 non-null object
4 hdi_rank_2021 191 non-null float64
5 le_1990 206 non-null float64
6 le_1991 206 non-null float64
7 le_1992 206 non-null float64
8 le_1993 206 non-null float64
9 le_1994 206 non-null float64
10 le_1995 206 non-null float64
11 le_1996 206 non-null float64
12 le_1997 206 non-null float64
13 le_1998 206 non-null float64
14 le_1999 206 non-null float64
15 le_2000 206 non-null float64
16 le_2001 206 non-null float64
17 le_2002 206 non-null float64
18 le_2003 206 non-null float64
19 le_2004 206 non-null float64
20 le_2005 206 non-null float64
21 le_2006 206 non-null float64
22 le_2007 206 non-null float64
23 le_2008 206 non-null float64
24 le_2009 206 non-null float64
25 le_2010 206 non-null float64
26 le_2011 206 non-null float64
27 le_2012 206 non-null float64
28 le_2013 206 non-null float64
29 le_2014 206 non-null float64
30 le_2015 206 non-null float64
31 le_2016 206 non-null float64
32 le_2017 206 non-null float64
33 le_2018 206 non-null float64
34 le_2019 206 non-null float64
35 le_2020 206 non-null float64
36 le_2021 206 non-null float64
dtypes: float64(33), object(4)
memory usage: 59.7+ KB
Now let’s read the rest of the data
co2 = pd.read_csv(data_folder/'co2_production.csv')
gross_income_percapita = pd.read_csv(data_folder/'gross_national_income_per_capital.csv')
hdi_index = pd.read_csv(data_folder/'human_development_index.csv')
Let’s check that all four datasets have the same number of rows (206):
hdi_index.shape[0] == co2.shape[0] == gross_income_percapita.shape[0] == life_exp.shape[0]
True
We can see that they have the same number of columns too:
print(hdi_index.shape[1])
print(co2.shape[1])
print(gross_income_percapita.shape[1])
print(life_exp.shape[1])
37
37
37
37
Determining the ID Column
We can treat the ISO3 and Country columns as global identifiers (we’ll single one out later on). Let’s see if there are any null values in the ISO3 column:
life_exp[life_exp.ISO3.isnull()]
ISO3 | Country | hdicode | region | hdi_rank_2021 | le_1990 | le_1991 | le_1992 | le_1993 | le_1994 | ... | le_2012 | le_2013 | le_2014 | le_2015 | le_2016 | le_2017 | le_2018 | le_2019 | le_2020 | le_2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | NaN | Very high human development | NaN | NaN | NaN | 73.776652 | 73.931896 | 74.058848 | 73.852640 | 74.007158 | ... | 78.578754 | 78.822196 | 79.057616 | 79.062435 | 79.264933 | 79.428843 | 79.582768 | 79.822662 | 78.789745 | 78.521301 |
196 | NaN | High human development | NaN | NaN | NaN | 67.315701 | 67.567552 | 67.966260 | 68.358081 | 68.636211 | ... | 74.212238 | 74.463776 | 74.732246 | 74.944071 | 75.152677 | 75.261530 | 75.600606 | 75.785826 | 75.120548 | 74.709094 |
197 | NaN | Medium human development | NaN | NaN | NaN | 58.757754 | 58.991558 | 59.592548 | 59.880467 | 60.250251 | ... | 67.460860 | 67.893291 | 68.411503 | 68.895967 | 69.314997 | 69.694263 | 69.998178 | 70.219458 | 69.517580 | 67.438318 |
198 | NaN | Low human development | NaN | NaN | NaN | 50.351409 | 50.608373 | 50.467296 | 50.606850 | 51.070213 | ... | 59.569566 | 60.010001 | 60.326771 | 60.657080 | 61.064843 | 61.424061 | 61.721181 | 62.083426 | 61.675690 | 61.310991 |
199 | NaN | Arab States | NaN | NaN | NaN | 62.973324 | 63.225764 | 63.520079 | 64.512882 | 65.562309 | ... | 70.261228 | 70.274253 | 70.570127 | 70.836099 | 71.019940 | 71.508744 | 71.711891 | 71.922194 | 71.002105 | 70.895040 |
200 | NaN | East Asia and the Pacific | NaN | NaN | NaN | 67.161758 | 67.383754 | 67.905512 | 68.375688 | 68.640580 | ... | 74.427861 | 74.677412 | 74.936109 | 75.156317 | 75.355177 | 75.419246 | 75.835426 | 76.036110 | 75.968330 | 75.579650 |
201 | NaN | Europe and Central Asia | NaN | NaN | NaN | 67.781962 | 67.609336 | 66.896095 | 66.718280 | 67.136734 | ... | 72.606274 | 72.992396 | 73.357649 | 73.681015 | 73.983262 | 74.343005 | 74.505959 | 74.700741 | 72.780992 | 72.856526 |
202 | NaN | Latin America and the Caribbean | NaN | NaN | NaN | 67.770568 | 68.127209 | 68.458998 | 68.786974 | 69.166664 | ... | 74.067656 | 74.316945 | 74.521732 | 74.579895 | 74.544240 | 74.705572 | 74.823478 | 75.014905 | 73.008316 | 72.099890 |
203 | NaN | South Asia | NaN | NaN | NaN | 58.830494 | 59.102221 | 59.727847 | 60.058622 | 60.441754 | ... | 67.915940 | 68.395439 | 68.905849 | 69.407382 | 69.874722 | 70.245467 | 70.517018 | 70.722315 | 69.972908 | 67.855530 |
204 | NaN | Sub-Saharan Africa | NaN | NaN | NaN | 49.868704 | 49.989060 | 49.900445 | 49.817075 | 49.722427 | ... | 57.987394 | 58.539213 | 59.010080 | 59.428775 | 59.954595 | 60.348837 | 60.735440 | 61.120031 | 60.729051 | 60.112467 |
205 | NaN | World | NaN | NaN | NaN | 65.144798 | 65.299082 | 65.584015 | 65.753616 | 65.986642 | ... | 71.289119 | 71.581265 | 71.886202 | 72.111793 | 72.370895 | 72.568952 | 72.816114 | 73.012099 | 72.257297 | 71.365465 |
11 rows × 37 columns
The NaN values aren’t tied to any individual country; rather, they correspond to groups of countries (regions and development categories). We will drop these rows from the table.
life_exp_countries = life_exp.dropna(subset=['ISO3'], axis=0)
life_exp_countries.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 195 entries, 0 to 194
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ISO3 195 non-null object
1 Country 195 non-null object
2 hdicode 191 non-null object
3 region 151 non-null object
4 hdi_rank_2021 191 non-null float64
5 le_1990 195 non-null float64
6 le_1991 195 non-null float64
7 le_1992 195 non-null float64
8 le_1993 195 non-null float64
9 le_1994 195 non-null float64
10 le_1995 195 non-null float64
11 le_1996 195 non-null float64
12 le_1997 195 non-null float64
13 le_1998 195 non-null float64
14 le_1999 195 non-null float64
15 le_2000 195 non-null float64
16 le_2001 195 non-null float64
17 le_2002 195 non-null float64
18 le_2003 195 non-null float64
19 le_2004 195 non-null float64
20 le_2005 195 non-null float64
21 le_2006 195 non-null float64
22 le_2007 195 non-null float64
23 le_2008 195 non-null float64
24 le_2009 195 non-null float64
25 le_2010 195 non-null float64
26 le_2011 195 non-null float64
27 le_2012 195 non-null float64
28 le_2013 195 non-null float64
29 le_2014 195 non-null float64
30 le_2015 195 non-null float64
31 le_2016 195 non-null float64
32 le_2017 195 non-null float64
33 le_2018 195 non-null float64
34 le_2019 195 non-null float64
35 le_2020 195 non-null float64
36 le_2021 195 non-null float64
dtypes: float64(33), object(4)
memory usage: 57.9+ KB
Let’s repeat this step for the other datasets and check that they match in number of rows:
co2_countries = co2.dropna(subset=['ISO3'], axis=0)
gross_income_percapita_countries = gross_income_percapita.dropna(subset=['ISO3'], axis=0)
hdi_index_countries = hdi_index.dropna(subset=['ISO3'], axis=0)

co2_countries.shape[0] == hdi_index_countries.shape[0] == gross_income_percapita_countries.shape[0]
True
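Matching row counts don’t guarantee that the country names themselves line up across files. As a quick sanity check, here is a minimal sketch (assuming the Country spellings are consistent across the four tables):
# Do all four country-level tables cover exactly the same set of countries?
country_sets = [
    set(df.Country)
    for df in (life_exp_countries, co2_countries,
               gross_income_percapita_countries, hdi_index_countries)
]
all(s == country_sets[0] for s in country_sets)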
Time Series Data
These datasets, when combined, have all the makings of a time-series dataset! We will merge them into one big table and save it to disk, which will simplify our modelling in the next chapter.
There is some miscellaneous data that can be put aside first; particularly the ISO3, region, hdicode, and hdi_rank_2021 columns:
# Putting misc cols aside so it's not repeated
country_miscellaneous_cols = ['region', 'hdicode', 'hdi_rank_2021', 'ISO3']
country_misc_data = life_exp_countries[['Country']+country_miscellaneous_cols]

time_series_data = life_exp_countries.drop(country_miscellaneous_cols, axis=1)\
    .merge(hdi_index_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')\
    .merge(co2_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')\
    .merge(gross_income_percapita_countries.drop(country_miscellaneous_cols, axis=1), on='Country', how='outer')
time_series_data
Country | le_1990 | le_1991 | le_1992 | le_1993 | le_1994 | le_1995 | le_1996 | le_1997 | le_1998 | ... | gnipc_2012 | gnipc_2013 | gnipc_2014 | gnipc_2015 | gnipc_2016 | gnipc_2017 | gnipc_2018 | gnipc_2019 | gnipc_2020 | gnipc_2021 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 45.9672 | 46.6631 | 47.5955 | 51.4664 | 51.4945 | 52.5442 | 53.2433 | 53.6342 | 52.9431 | ... | 2125.862821 | 2193.553936 | 2178.507021 | 2101.589319 | 2077.566899 | 2085.487571 | 2054.939895 | 2097.889450 | 1997.852149 | 1824.190915 |
1 | Angola | 41.8933 | 43.8127 | 42.2088 | 42.1009 | 43.4217 | 45.8491 | 46.0329 | 46.3065 | 45.0570 | ... | 7280.845666 | 7478.104777 | 7704.231949 | 7652.656486 | 7189.426672 | 6861.575738 | 6381.521946 | 6082.746624 | 5593.142060 | 5465.617791 |
2 | Albania | 73.1439 | 73.3776 | 73.7148 | 73.9391 | 74.1313 | 74.3616 | 74.5923 | 73.9039 | 74.9899 | ... | 11146.263030 | 11552.982470 | 11691.648290 | 12016.297600 | 12484.624200 | 12802.148310 | 13302.705960 | 13485.311240 | 12996.762910 | 14131.110390 |
3 | Andorra | 78.4063 | 77.9805 | 80.3241 | 78.6633 | 82.6380 | 78.9616 | 80.3340 | 80.9439 | 79.4259 | ... | 47126.814610 | 46385.095200 | 48483.720320 | 49936.874540 | 52267.738320 | 52650.225760 | 53483.306630 | 54465.047400 | 47878.666640 | 51166.626610 |
4 | United Arab Emirates | 71.9004 | 72.2414 | 72.3062 | 72.5213 | 72.5982 | 72.6945 | 72.7674 | 72.9367 | 73.0658 | ... | 57445.954750 | 60005.695360 | 62573.505310 | 65577.512240 | 66881.329740 | 67667.508460 | 67195.095230 | 68590.900940 | 63016.401220 | 62573.591810 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
190 | Samoa | 67.6584 | 67.8814 | 68.3855 | 68.7009 | 68.9795 | 69.2984 | 69.5877 | 69.8746 | 70.1711 | ... | 5683.651395 | 5675.889504 | 5757.486050 | 6222.783552 | 6401.830980 | 6287.673021 | 6280.573936 | 6356.988690 | 5812.868328 | 5307.953374 |
191 | Yemen | 58.6994 | 59.0490 | 59.4283 | 59.8595 | 59.7135 | 60.4532 | 60.5678 | 61.1193 | 61.3748 | ... | 3152.900024 | 3212.651579 | 2775.842952 | 1785.788608 | 1494.230811 | 1302.425254 | 1341.656234 | 1349.567046 | 1370.601082 | 1314.270189 |
192 | South Africa | 63.3753 | 63.2649 | 63.3414 | 63.0447 | 62.6118 | 62.2616 | 61.4593 | 60.8053 | 60.0008 | ... | 13602.253520 | 13732.252520 | 13700.834560 | 13694.728110 | 13545.358590 | 13475.988210 | 13491.221790 | 13366.474640 | 12449.671040 | 12948.373250 |
193 | Zambia | 47.9263 | 47.0971 | 46.5119 | 46.2094 | 45.8543 | 45.5534 | 45.2326 | 44.9446 | 44.7011 | ... | 3333.576512 | 3389.478940 | 3263.039162 | 3403.471444 | 3237.505650 | 3330.552717 | 3418.096158 | 3365.410652 | 3178.619722 | 3217.767739 |
194 | Zimbabwe | 59.4264 | 58.0911 | 56.4354 | 54.4264 | 52.5878 | 50.5310 | 48.9551 | 47.9933 | 46.8192 | ... | 3618.629526 | 3632.111591 | 3644.856047 | 3638.532892 | 3606.750671 | 3728.918785 | 3864.012419 | 3674.564482 | 3654.289051 | 3809.887158 |
195 rows × 129 columns
Melt Dataset
We like this form of the dataset, but with 129 columns it is hard to read. Let’s melt it!
Basically, melting involves extracting the year out of each of the le_*, hdi_*, gnipc_*, and co2_prod_* columns and placing it in a separate year column, with that year’s value next to it. We then use (Country, year) as the key to merge all the variables back together, so each row will take the form (Country, year, life_exp, hdi_index, co2, gnipc).
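To make the reshaping concrete, here is a minimal sketch of pd.melt on a tiny made-up frame (the toy values below are hypothetical; the real code follows):
# Toy wide frame: one life-expectancy column per year
toy = pd.DataFrame({
    'Country': ['A', 'B'],
    'le_1990': [50.0, 60.0],
    'le_1991': [51.0, 61.0],
})
# Wide -> long: one (Country, year, life_exp) row per country-year pair
toy_long = toy.melt('Country', var_name='year', value_name='life_exp')
toy_long['year'] = toy_long['year'].str.slice(-4).astype(int)
toy_long
#   Country  year  life_exp
# 0       A  1990      50.0
# 1       B  1990      60.0
# 2       A  1991      51.0
# 3       B  1991      61.0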
# Melt the dataset
life_exp_year_columns = tuple(f'le_{year}' for year in range(1990, 2022))
hdi_year_columns = tuple(f'hdi_{year}' for year in range(1990, 2022))
gnipc_year_columns = tuple(f'gnipc_{year}' for year in range(1990, 2022))
co2_year_columns = tuple(f'co2_prod_{year}' for year in range(1990, 2022))

def melt_df_by_var(df, value_name, value_vars):
    # Wide -> long: one (Country, year, <value_name>) row per country-year pair
    melted = df.melt('Country', value_vars=value_vars, var_name='year', value_name=value_name)
    melted.year = melted.year.str.slice(-4).astype(int)
    return melted

melted_time_series = melt_df_by_var(time_series_data, 'life_exp', life_exp_year_columns)\
    .merge(melt_df_by_var(time_series_data, 'hdi_index', hdi_year_columns), on=['Country', 'year'])\
    .merge(melt_df_by_var(time_series_data, 'co2', co2_year_columns), on=['Country', 'year'])\
    .merge(melt_df_by_var(time_series_data, 'gnipc', gnipc_year_columns), on=['Country', 'year'])
melted_time_series
Country | year | life_exp | hdi_index | co2 | gnipc | |
---|---|---|---|---|---|---|
0 | Afghanistan | 1990 | 45.9672 | 0.273 | 0.209727 | 2684.550019 |
1 | Angola | 1990 | 41.8933 | NaN | 0.429586 | 4845.706901 |
2 | Albania | 1990 | 73.1439 | 0.647 | 1.656902 | 4742.215529 |
3 | Andorra | 1990 | 78.4063 | NaN | 7.461153 | 43773.146500 |
4 | United Arab Emirates | 1990 | 71.9004 | 0.728 | 28.277672 | 102433.136000 |
... | ... | ... | ... | ... | ... | ... |
6235 | Samoa | 2021 | 72.7675 | 0.707 | 1.238975 | 5307.953374 |
6236 | Yemen | 2021 | 63.7534 | 0.455 | 0.327510 | 1314.270189 |
6237 | South Africa | 2021 | 62.3410 | 0.713 | 7.620420 | 12948.373250 |
6238 | Zambia | 2021 | 61.2234 | 0.565 | 0.357535 | 3217.767739 |
6239 | Zimbabwe | 2021 | 59.2531 | 0.593 | 0.708562 | 3809.887158 |
6240 rows × 6 columns
Finally, we write it to disk:
melted_time_series.to_csv(data_folder/'processed/time_series.csv', index=False)
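Note that to_csv won’t create the processed/ folder for us; if it doesn’t already exist, a minimal sketch for creating it beforehand:
# Create data/processed if it doesn't exist yet (a no-op otherwise)
(data_folder/'processed').mkdir(parents=True, exist_ok=True)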
Conclusion
This finalizes our steps to process the data. We’ll carry out a few more preprocessing steps needed before modelling in the next chapter. Stay tuned.