GNI Per Capita and Plastic Pollution Analyss in R

web scraping economics data cleaning linear regression

In this post we will reference analysis done on 20 countries that produce and most mismanaged waste based on their GNI income status in R.

Blake Conrad https://github.com/conradbm
2022-07-31

Introduction

In this post we will be working through some analysis posted here analyzing 20 different countries that produce mismanaged waste based on their GNI income status.

Getting the data

First we want to get a list of dataframes from the wikipedia website for Gross National Income. After some recon, one can readily identify that each table has the class table and subclass wikitable, which is ad-joined to become table.wikitable in CSS language.

The data is available here.

library(tidyverse)
library(tidymodels)
library(janitor)
library(rvest)
library(scales)
# library(bbplot)
library(vip)


#Scraping the GNI data set
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_GNI_(nominal)_per_capita"
list_html <-
  read_html(url) %>% #scraping the interested web page
  rvest::html_elements("table.wikitable") %>% # takes the tables we want
  rvest::html_table() 


list_html
[[1]]
# A tibble: 36 x 4
   Rank  Country              `GNI per capita (US$)[1]`  Year
   <chr> <chr>                <chr>                     <int>
 1 —     Bermuda (UK)         116,540                    2021
 2 1     Liechtenstein        116,440                    2009
 3 2     Switzerland          90,360                     2021
 4 3     Norway               84,090                     2021
 5 —     Isle of Man (UK)     83,920                     2019
 6 4     Luxembourg           81,110                     2020
 7 5     Ireland              74,520                     2021
 8 6     United States        70,430                     2021
 9 7     Denmark              68,110                     2021
10 —     Channel Islands (UK) 66,220                     2007
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows

[[2]]
# A tibble: 36 x 4
   Rank  Country                       `GNI per capita (US$)[1]`  Year
   <chr> <chr>                         <chr>                     <int>
 1 30    Malta                         30,560                     2021
 2 31    Spain                         29,740                     2021
 3 32    Slovenia                      28,240                     2021
 4 33    Cyprus                        28,130                     2021
 5 —     Sint Maarten (Netherlands)    27,510                     2018
 6 34    Bahamas                       27,220                     2021
 7 35    Estonia                       25,920                     2021
 8 36    Czech Republic                24,070                     2021
 9 37    Portugal                      23,730                     2021
10 —     Turks and Caicos Islands (UK) 23,600                     2021
# ... with 26 more rows
# i Use `print(n = ...)` to see more rows

[[3]]
# A tibble: 26 x 4
   Rank  Country     `GNI per capita (US$)[1]`  Year
   <chr> <chr>       <chr>                     <int>
 1 59    Costa Rica  12,310                     2021
 2 —     World       12,070                     2021
 3 60    China       11,890                     2021
 4 61    Russia      11,600                     2021
 5 62    Malaysia    10,930                     2021
 6 63    Mauritius   10,860                     2021
 7 64    Bulgaria    10,720                     2021
 8 65    Argentina   10,050                     2021
 9 66    Turkey      9,830                      2021
10 67    Saint Lucia 9,680                      2021
# ... with 16 more rows
# i Use `print(n = ...)` to see more rows

[[4]]
# A tibble: 27 x 4
    Rank Country                `GNI per capita (US$)[1]`  Year
   <int> <chr>                  <chr>                     <int>
 1    84 Belarus                6,950                      2021
 2    85 Botswana               6,940                      2021
 3    86 Bosnia and Herzegovina 6,770                      2021
 4    87 Tuvalu                 6,760                      2021
 5    88 Peru                   6,520                      2021
 6    89 South Africa           6,440                      2021
 7    90 Colombia               6,160                      2021
 8    91 North Macedonia        6,130                      2021
 9    92 Albania                6,110                      2021
10    93 Ecuador                5,930                      2021
# ... with 17 more rows
# i Use `print(n = ...)` to see more rows

[[5]]
# A tibble: 26 x 4
    Rank Country                        GNI per capita (US$)[1~1  Year
   <int> <chr>                          <chr>                    <int>
 1   111 Palestine                      4,220                     2021
 2   112 El Salvador                    4,140                     2021
 3   112 Indonesia                      4,140                     2021
 4   114 Ukraine                        4,120                     2021
 5   115 Federated States of Micronesia 3,880                     2021
 6   116 Samoa                          3,860                     2021
 7   117 Sri Lanka                      3,820                     2021
 8   118 Mongolia                       3,760                     2021
 9   119 Eswatini                       3,680                     2021
10   120 Algeria                        3,660                     2021
# ... with 16 more rows, and abbreviated variable name
#   1: `GNI per capita (US$)[1]`
# i Use `print(n = ...)` to see more rows

[[6]]
# A tibble: 28 x 4
    Rank Country               `GNI per capita (US$)[1]`  Year
   <int> <chr>                 <chr>                     <int>
 1   137 Laos                  2,520                      2021
 2   138 Ivory Coast           2,450                      2021
 3   139 Ghana                 2,360                      2021
 4   140 Solomon Islands       2,300                      2021
 5   141 Sao Tome and Principe 2,280                      2021
 6   142 India                 2,170                      2021
 7   143 Nigeria               2,100                      2021
 8   144 Nicaragua             2,010                      2021
 9   144 Kenya                 2,010                      2021
10   146 Uzbekistan            1,960                      2021
# ... with 18 more rows
# i Use `print(n = ...)` to see more rows

[[7]]
# A tibble: 13 x 4
    Rank Country       `GNI per capita (US$)[1]`  Year
   <int> <chr>         <chr>                     <int>
 1   165 South Sudan   1,090                      2015
 2   166 Zambia        1,040                      2021
 3   167 Guinea        1,010                      2021
 4   168 Togo          980                        2021
 5   169 Ethiopia      960                        2021
 6   170 Syria         930                        2018
 7   171 Mali          870                        2021
 8   172 Burkina Faso  860                        2021
 9   173 Rwanda        850                        2021
10   174 Uganda        840                        2021
11   175 The Gambia    800                        2021
12   176 Guinea-Bissau 780                        2021
13   177 Yemen         670                        2020

[[8]]
# A tibble: 14 x 4
    Rank Country                          GNI per capita (US$)~1  Year
   <int> <chr>                                             <int> <int>
 1   177 Sudan                                               670  2021
 2   179 Chad                                                650  2021
 3   180 Malawi                                              630  2021
 4   181 Liberia                                             620  2021
 5   182 Eritrea                                             600  2011
 6   183 Niger                                               590  2021
 7   184 Democratic Republic of the Congo                    580  2021
 8   185 Central African Republic                            530  2021
 9   186 Sierra Leone                                        510  2021
10   187 Madagascar                                          500  2021
11   187 Afghanistan                                         500  2020
12   189 Mozambique                                          480  2021
13   190 Somalia                                             450  2021
14   191 Burundi                                             240  2021
# ... with abbreviated variable name 1: `GNI per capita (US$)[1]`

[[9]]
# A tibble: 16 x 5
   `High-income group`           High-income~1 High-~2 High-~3 High-~4
   <chr>                         <chr>         <chr>   <chr>   <chr>  
 1 British Virgin Islands (UK)   <NA>          <NA>    <NA>    <NA>   
 2 Cook Islands (New Zealand)    <NA>          <NA>    <NA>    <NA>   
 3 Faroe Islands (Denmark)       <NA>          <NA>    <NA>    <NA>   
 4 Gibraltar (UK)                <NA>          <NA>    <NA>    <NA>   
 5 Guam (US)                     <NA>          <NA>    <NA>    <NA>   
 6 Monaco                        <NA>          <NA>    <NA>    <NA>   
 7 Northern Mariana Islands (US) <NA>          <NA>    <NA>    <NA>   
 8 Saint Martin (France)         <NA>          <NA>    <NA>    <NA>   
 9 San Marino                    <NA>          <NA>    <NA>    <NA>   
10 Taiwan                        <NA>          <NA>    <NA>    <NA>   
11 U.S. Virgin Islands (US)      <NA>          <NA>    <NA>    <NA>   
12 Upper-middle-income group     Upper-middle~ Upper-~ Upper-~ Upper-~
13 American Samoa (US)           <NA>          <NA>    <NA>    <NA>   
14 Venezuela                     <NA>          <NA>    <NA>    <NA>   
15 Low-income group              Low-income g~ Low-in~ Low-in~ Low-in~
16 North Korea                   <NA>          <NA>    <NA>    <NA>   
# ... with abbreviated variable names 1: `High-income group`,
#   2: `High-income group`, 3: `High-income group`,
#   4: `High-income group`

Next we will use the magic of the purrr package to flow our categories through each dataframe using the map function. Then we will flow the list of updated datafarmes into the rbind function with a do.call. Finally we “consistentize” the columns using janitor’s clean_names function. Last we will select, rename, and mutate columns as needed to get a nice clean wiki scraped data frame.

#gni income levels vector
status <- c("high","high","upper_middle","upper_middle",
            "lower_middle","lower_middle","low","low")

#Building and tidying GNI dataframe
df_gni = 
  #assigning the gni income levels to the corresponding countries
  purrr::map(1:8, function(x){
    list_html[[x]] %>% 
      dplyr::mutate(gni_status=status[x])
  }) %>% 
  do.call(what=rbind) %>% 
  janitor::clean_names() %>%
  dplyr::mutate(gni_status = forcats::as_factor(gni_status)) %>% 
  dplyr::select(country, 
                gni_metric=gni_per_capita_us_1, # renames
                gni_status) %>% 
  dplyr::mutate(gni_metric = stringr::str_remove(gni_metric,",") %>% as.numeric()) 
df_gni
# A tibble: 206 x 3
   country              gni_metric gni_status
   <chr>                     <dbl> <fct>     
 1 Bermuda (UK)             116540 high      
 2 Liechtenstein            116440 high      
 3 Switzerland               90360 high      
 4 Norway                    84090 high      
 5 Isle of Man (UK)          83920 high      
 6 Luxembourg                81110 high      
 7 Ireland                   74520 high      
 8 United States             70430 high      
 9 Denmark                   68110 high      
10 Channel Islands (UK)      66220 high      
# ... with 196 more rows
# i Use `print(n = ...)` to see more rows

How many countries do we know GNI on?

df_gni %>% 
  select(country) %>% 
  unique %>% count()
# A tibble: 1 x 1
      n
  <int>
1   206

At this point we know information about about 206 countries, their gross national income, and their gni_status of high, mid, or low. Now we want to collect data on waste by country. This is available here.

First we get the raw dataframe using the consistentcy of readr::read_csv. Next we “consistentize” the column names, drop missing rows, and rename to match our GNI dataframe.

#Building and tidying waste dataframe
df_waste <- readr::read_csv("https://raw.githubusercontent.com/mesdi/plastic-pollution/main/plastic-waste-mismanaged.csv")

df_waste =
  df_waste %>%  
  janitor::clean_names() %>% 
  tidyr::drop_na() %>% 
  select(country=entity,
         waste_metric=mismanaged_plastic_waste_metric_tons_year_1)

df_waste
# A tibble: 160 x 2
   country             waste_metric
   <chr>                      <dbl>
 1 Albania                    69833
 2 Algeria                   764578
 3 Angola                    236946
 4 Antigua and Barbuda          627
 5 Argentina                 465808
 6 Australia                   5266
 7 Bahamas                     2212
 8 Bahrain                     1043
 9 Bangladesh               1021990
10 Barbados                     872
# ... with 150 more rows
# i Use `print(n = ...)` to see more rows

How many countries do we have waste info on?

df_waste %>% 
  select(country) %>% 
  count
# A tibble: 1 x 1
      n
  <int>
1   160

Now we want to unify the data to speak about GNI and waste. First we will left_join the df_waste with the df_gni. Then only keep unique countries (equivalent to remove_duplicates in dplyr), filter out the world country, and drop missing rows again that the df_gni had as country names, but df_waste did not. This implicitly gives precendence to the df_waste data. This is no different than doing an inner_join where it will just drop those non-matches anyway. The left_join provides the benefit of axamining which, and how many, rows were not matched in the join. Not a bad idea to just do, then filter.

#Binding waste and gni dataframes by country
df_tidy =
  df_waste %>% 
  dplyr::left_join(df_gni) %>% 
  dplyr::distinct(country, .keep_all = TRUE) %>% # remove duplicate countries
  dplyr::filter(!country == "World") %>% 
  tidyr::drop_na()

df_tidy
# A tibble: 142 x 4
   country             waste_metric gni_metric gni_status  
   <chr>                      <dbl>      <dbl> <fct>       
 1 Albania                    69833       6110 upper_middle
 2 Algeria                   764578       3660 lower_middle
 3 Angola                    236946       1770 lower_middle
 4 Antigua and Barbuda          627      14900 high        
 5 Argentina                 465808      10050 upper_middle
 6 Australia                   5266      56760 high        
 7 Bahamas                     2212      27220 high        
 8 Bahrain                     1043      19930 high        
 9 Bangladesh               1021990       2620 lower_middle
10 Barbados                     872      16720 high        
# ... with 132 more rows
# i Use `print(n = ...)` to see more rows

Next identify the 6 worst counties with mismanaged_plastic_waste_metric_tons_year_1, which we renamed waste_metric. The slice_max function is incredible here. It will take your dataframe, let you select a column to order by, then do a max operation and cut it off by some n, effectively leaving you with the top_k or the slice_max of the column ordered by something you care about. This is great for quick assessments like:

  1. What are the top 6 countries by GNI?
  2. What are the top 6 countries w.r.t income to waste?
  3. What are the top 6 countries by mismanaged plastic waste metric tons per year?
df_tidy %>% 
  dplyr::slice_max(order_by = gni_metric, n = 6) %>% 
  dplyr::select(country, gni_metric) %>% 
  dplyr::mutate(gni_metric_label = paste0(round(gni_metric/1000, 2), " thousand"))
# A tibble: 6 x 3
  country       gni_metric gni_metric_label
  <chr>              <dbl> <chr>           
1 Norway             84090 84.09 thousand  
2 Ireland            74520 74.52 thousand  
3 United States      70430 70.43 thousand  
4 Denmark            68110 68.11 thousand  
5 Iceland            64410 64.41 thousand  
6 Singapore          64010 64.01 thousand  
df_tidy %>% 
  dplyr::mutate(gni_to_waste = gni_metric / waste_metric) %>% 
  dplyr::slice_max(order_by = gni_to_waste, n=20) %>% 
    dplyr::select(country, gni_to_waste)
# A tibble: 20 x 2
   country               gni_to_waste
   <chr>                        <dbl>
 1 Montenegro                   581. 
 2 Iceland                      427. 
 3 Seychelles                   402. 
 4 Marshall Islands             316. 
 5 Saint Kitts and Nevis        191. 
 6 Denmark                      175. 
 7 Maldives                     140  
 8 Palau                        124. 
 9 Malta                        118. 
10 Norway                        56.3
11 Brunei                        45.5
12 Estonia                       43.2
13 Kiribati                      39.3
14 Qatar                         37.3
15 Mauritius                     36.3
16 Cyprus                        33.6
17 Slovenia                      33.5
18 Ireland                       27.9
19 New Zealand                   26.5
20 Singapore                     25.9
#Top 6 countries in terms of the amounts of mismanaged plastic waste 
df_6 =
  df_tidy %>% 
  dplyr::slice_max(order_by = waste_metric, n=6) %>% 
  dplyr::mutate(waste = paste0(round(waste_metric/10^6, 2), " million t"))

df_6
# A tibble: 6 x 5
  country     waste_metric gni_metric gni_status   waste          
  <chr>              <dbl>      <dbl> <fct>        <chr>          
1 India           12994100       2170 lower_middle 12.99 million t
2 China           12272200      11890 upper_middle 12.27 million t
3 Philippines      4025300       3640 lower_middle 4.03 million t 
4 Brazil           3296700       7720 upper_middle 3.3 million t  
5 Nigeria          1948950       2100 lower_middle 1.95 million t 
6 Tanzania         1716400       1140 lower_middle 1.72 million t 

Visualizing the data

df_tidy %>% 
  
  # Top 20 waste countries
  dplyr::slice_max(order_by = waste_metric, n=20) %>% 
  
  # X,Y, and color variables
  ggplot(aes(x=gni_metric, y=waste_metric, color=gni_status)) +
  
  # Country names
  geom_text(aes(label = country),
            hjust= 0, vjust= -0.5, size=4, key_glyph= "rect") +
  
  # Top 6 waste labels
  geom_text(data = df_6, aes(x = gni_metric, y = waste_metric, label = waste),
            hjust = 0, vjust=1.2) +
  
  scale_x_log10(labels = scales::label_dollar(accuracy = 2)) +
  scale_y_log10(labels = scales::label_number(scale_cut = cut_si("t"))) +
  scale_color_discrete(labels = c("Upper-Middle", "Lower-Middle", "Low")) +
  labs(title = "Mismanaged Plastic Waste (2019) vs. GNI Income.",
       y = "Plastic Waste (Metric Tons/year)",
       x = "Gross National Income ($)") +
  coord_fixed(ratio = 0.5, clip = "off") +
  theme_minimal()

Modeling with tidymodels

# Remember what we're modeling
df_tidy %>% head
# A tibble: 6 x 4
  country             waste_metric gni_metric gni_status  
  <chr>                      <dbl>      <dbl> <fct>       
1 Albania                    69833       6110 upper_middle
2 Algeria                   764578       3660 lower_middle
3 Angola                    236946       1770 lower_middle
4 Antigua and Barbuda          627      14900 high        
5 Argentina                 465808      10050 upper_middle
6 Australia                   5266      56760 high        
# Build a receipe
#   1. log(y)
#   2. dummies(x)
#
df_rec =
  df_tidy %>% 
  recipes::recipe(waste_metric ~ gni_status) %>% 
  recipes::step_log(waste_metric, base = 10) %>% 
  recipes::step_dummy(gni_status)

df_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          1

Operations:

Log transformation on waste_metric
Dummy variables from gni_status
# Build a model
#   1. Linear regression
#
lm_model =
  linear_reg() %>% 
  set_engine("lm")

lm_model
Linear Regression Model Specification (regression)

Computational engine: lm 
# Build workflow
# 1. Add model
# 2. Add recipe
#
lm_wflow =
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(df_rec)

lm_wflow
== Workflow ==========================================================
Preprocessor: Recipe
Model: linear_reg()

-- Preprocessor ------------------------------------------------------
2 Recipe Steps

* step_log()
* step_dummy()

-- Model -------------------------------------------------------------
Linear Regression Model Specification (regression)

Computational engine: lm 
# Fit the workflow with some data
lm_fit = fit(lm_wflow, df_tidy)

# Descriptive and inferential statistics
lm_fit %>% 
  extract_fit_engine() %>% 
  summary()

Call:
stats::lm(formula = ..y ~ ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2939 -0.6246  0.1494  0.8228  2.6911 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3.5666     0.1595  22.359  < 2e-16 ***
gni_status_upper_middle   0.8313     0.2393   3.474 0.000685 ***
gni_status_lower_middle   1.5965     0.2410   6.625 7.15e-10 ***
gni_status_low            1.0701     0.3512   3.047 0.002768 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.128 on 138 degrees of freedom
Multiple R-squared:  0.2471,    Adjusted R-squared:  0.2307 
F-statistic: 15.09 on 3 and 138 DF,  p-value: 1.501e-08
# ... OR ...

tidy(lm_fit)
# A tibble: 4 x 5
  term                    estimate std.error statistic  p.value
  <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                3.57      0.160     22.4  1.01e-47
2 gni_status_upper_middle    0.831     0.239      3.47 6.85e- 4
3 gni_status_lower_middle    1.60      0.241      6.63 7.15e-10
4 gni_status_low             1.07      0.351      3.05 2.77e- 3

Variable importance

#variable importance  
lm_fit %>% 
  extract_fit_parsnip() %>%
  vip(aesthetics = list(color = "lightblue", fill = "lightblue")) + 
  theme_minimal() +
  ggtitle("Variable Importance", subtitle = "Lower Middle Status Indicates Higher Waste")