Recipes in R

machine learning data pipeline data cleaning r

In this post we will discuss how to use recipes to build effective machine learning pipelines in R.

Blake Conrad https://github.com/conradbm
2022-07-24

Introduction

recipes is a package I found within tidymodels. We wrote a section on it in our last post showing how to 1) add roles, 2) create features, and 3) add recipes to model workflows. In another post we will go through the rsample package, which I discovered through the same means. For now, recipes offers a very clean, dplyr-like syntax for passing data through preprocessing before modeling takes place, which keeps that preprocessing explicit, reusable, and free of leakage between training and test sets.

Enough chit chat, let’s dive in and discuss what’s going on.

Gentle overview

Getting data

library(recipes)

data(ad_data, package="modeldata")

ad_data %>% glimpse
Rows: 333
Columns: 131
$ ACE_CD143_Angiotensin_Converti   <dbl> 2.0031003, 1.5618560, 1.520~
$ ACTH_Adrenocorticotropic_Hormon  <dbl> -1.3862944, -1.3862944, -1.~
$ AXL                              <dbl> 1.09838668, 0.68328157, -0.~
$ Adiponectin                      <dbl> -5.360193, -5.020686, -5.80~
$ Alpha_1_Antichymotrypsin         <dbl> 1.7404662, 1.4586150, 1.193~
$ Alpha_1_Antitrypsin              <dbl> -12.631361, -11.909882, -13~
$ Alpha_1_Microglobulin            <dbl> -2.577022, -3.244194, -2.88~
$ Alpha_2_Macroglobulin            <dbl> -72.65029, -154.61228, -136~
$ Angiopoietin_2_ANG_2             <dbl> 1.06471074, 0.74193734, 0.8~
$ Angiotensinogen                  <dbl> 2.510547, 2.457283, 1.97636~
$ Apolipoprotein_A_IV              <dbl> -1.427116, -1.660731, -1.66~
$ Apolipoprotein_A1                <dbl> -7.402052, -7.047017, -7.68~
$ Apolipoprotein_A2                <dbl> -0.26136476, -0.86750057, -~
$ Apolipoprotein_B                 <dbl> -4.624044, -6.747507, -3.97~
$ Apolipoprotein_CI                <dbl> -1.2729657, -1.2729657, -1.~
$ Apolipoprotein_CIII              <dbl> -2.312635, -2.343407, -2.74~
$ Apolipoprotein_D                 <dbl> 2.0794415, 1.3350011, 1.335~
$ Apolipoprotein_E                 <dbl> 3.7545215, 3.0971187, 2.753~
$ Apolipoprotein_H                 <dbl> -0.15734908, -0.57539617, -~
$ B_Lymphocyte_Chemoattractant_BL  <dbl> 2.2969819, 1.6731213, 1.673~
$ BMP_6                            <dbl> -2.200744, -1.728053, -2.06~
$ Beta_2_Microglobulin             <dbl> 0.69314718, 0.47000363, 0.3~
$ Betacellulin                     <int> 34, 53, 49, 52, 67, 51, 41,~
$ C_Reactive_Protein               <dbl> -4.074542, -6.645391, -8.04~
$ CD40                             <dbl> -0.7964147, -1.2733760, -1.~
$ CD5L                             <dbl> 0.09531018, -0.67334455, 0.~
$ Calbindin                        <dbl> 33.21363, 25.27636, 22.1660~
$ Calcitonin                       <dbl> 1.3862944, 3.6109179, 2.116~
$ CgA                              <dbl> 397.6536, 465.6759, 347.863~
$ Clusterin_Apo_J                  <dbl> 3.555348, 3.044522, 2.77258~
$ Complement_3                     <dbl> -10.36305, -16.10824, -16.1~
$ Complement_Factor_H              <dbl> 3.5737252, 3.6000471, 4.474~
$ Connective_Tissue_Growth_Factor  <dbl> 0.5306283, 0.5877867, 0.641~
$ Cortisol                         <dbl> 10.0, 12.0, 10.0, 14.0, 11.~
$ Creatine_Kinase_MB               <dbl> -1.710172, -1.751002, -1.38~
$ Cystatin_C                       <dbl> 9.041922, 9.067624, 8.95415~
$ EGF_R                            <dbl> -0.1354543, -0.3700474, -0.~
$ EN_RAGE                          <dbl> -3.688879, -3.816713, -4.75~
$ ENA_78                           <dbl> -1.349543, -1.356595, -1.39~
$ Eotaxin_3                        <int> 53, 62, 62, 44, 64, 57, 64,~
$ FAS                              <dbl> -0.08338161, -0.52763274, -~
$ FSH_Follicle_Stimulation_Hormon  <dbl> -0.6516715, -1.6272839, -1.~
$ Fas_Ligand                       <dbl> 3.1014922, 2.9788133, 1.360~
$ Fatty_Acid_Binding_Protein       <dbl> 2.5208712, 2.2477966, 0.906~
$ Ferritin                         <dbl> 3.329165, 3.932959, 3.17687~
$ Fetuin_A                         <dbl> 1.2809338, 1.1939225, 1.410~
$ Fibrinogen                       <dbl> -7.035589, -8.047190, -7.19~
$ GRO_alpha                        <dbl> 1.381830, 1.372438, 1.41267~
$ Gamma_Interferon_induced_Monokin <dbl> 2.949822, 2.721793, 2.76223~
$ Glutathione_S_Transferase_alpha  <dbl> 1.0641271, 0.8670202, 0.889~
$ HB_EGF                           <dbl> 6.559746, 8.754531, 7.74546~
$ HCC_4                            <dbl> -3.036554, -4.074542, -3.64~
$ Hepatocyte_Growth_Factor_HGF     <dbl> 0.58778666, 0.53062825, 0.0~
$ I_309                            <dbl> 3.433987, 3.135494, 2.39789~
$ ICAM_1                           <dbl> -0.1907787, -0.4620172, -0.~
$ IGF_BP_2                         <dbl> 5.609472, 5.347108, 5.18178~
$ IL_11                            <dbl> 5.121987, 4.936704, 4.66591~
$ IL_13                            <dbl> 1.282549, 1.269463, 1.27413~
$ IL_16                            <dbl> 4.192081, 2.876338, 2.61610~
$ IL_17E                           <dbl> 5.731246, 6.705891, 4.14932~
$ IL_1alpha                        <dbl> -6.571283, -8.047190, -8.18~
$ IL_3                             <dbl> -3.244194, -3.912023, -4.64~
$ IL_4                             <dbl> 2.484907, 2.397895, 1.82454~
$ IL_5                             <dbl> 1.09861229, 0.69314718, -0.~
$ IL_6                             <dbl> 0.26936976, 0.09622438, 0.1~
$ IL_6_Receptor                    <dbl> 0.64279595, 0.43115645, 0.0~
$ IL_7                             <dbl> 4.8050453, 3.7055056, 1.005~
$ IL_8                             <dbl> 1.711325, 1.675557, 1.69139~
$ IP_10_Inducible_Protein_10       <dbl> 6.242223, 5.686975, 5.04985~
$ IgA                              <dbl> -6.812445, -6.377127, -6.31~
$ Insulin                          <dbl> -0.6258253, -0.9431406, -1.~
$ Kidney_Injury_Molecule_1_KIM_1   <dbl> -1.204295, -1.197703, -1.19~
$ LOX_1                            <dbl> 1.7047481, 1.5260563, 1.163~
$ Leptin                           <dbl> -1.5290628, -1.4660558, -1.~
$ Lipoprotein_a                    <dbl> -4.268698, -4.933674, -5.84~
$ MCP_1                            <dbl> 6.740519, 6.849066, 6.76734~
$ MCP_2                            <dbl> 1.9805094, 1.8088944, 0.400~
$ MIF                              <dbl> -1.237874, -1.897120, -2.30~
$ MIP_1alpha                       <dbl> 4.968453, 3.690160, 4.04950~
$ MIP_1beta                        <dbl> 3.258097, 3.135494, 2.39789~
$ MMP_2                            <dbl> 4.478566, 3.781473, 2.86663~
$ MMP_3                            <dbl> -2.207275, -2.465104, -2.30~
$ MMP10                            <dbl> -3.270169, -3.649659, -2.73~
$ MMP7                             <dbl> -3.7735027, -5.9681907, -4.~
$ Myoglobin                        <dbl> -1.89711998, -0.75502258, -~
$ NT_proBNP                        <dbl> 4.553877, 4.219508, 4.24849~
$ NrCAM                            <dbl> 5.003946, 5.209486, 4.74493~
$ Osteopontin                      <dbl> 5.356586, 6.003887, 5.01728~
$ PAI_1                            <dbl> 1.00350156, -0.03059880, 0.~
$ PAPP_A                           <dbl> -2.902226, -2.813276, -2.93~
$ PLGF                             <dbl> 4.442651, 4.025352, 4.51086~
$ PYY                              <dbl> 3.218876, 3.135494, 2.89037~
$ Pancreatic_polypeptide           <dbl> 0.5787808, 0.3364722, -0.89~
$ Prolactin                        <dbl> 0.00000000, -0.51082562, -0~
$ Prostatic_Acid_Phosphatase       <dbl> -1.620527, -1.739232, -1.63~
$ Protein_S                        <dbl> -1.784998, -2.463991, -2.25~
$ Pulmonary_and_Activation_Regulat <dbl> -0.8439701, -2.3025851, -1.~
$ RANTES                           <dbl> -6.214608, -6.938214, -6.64~
$ Resistin                         <dbl> -16.475315, -16.025283, -16~
$ S100b                            <dbl> 1.5618560, 1.7566212, 1.435~
$ SGOT                             <dbl> -0.94160854, -0.65392647, 0~
$ SHBG                             <dbl> -1.897120, -1.560648, -2.20~
$ SOD                              <dbl> 5.609472, 5.814131, 5.72358~
$ Serum_Amyloid_P                  <dbl> -5.599422, -6.119298, -5.38~
$ Sortilin                         <dbl> 4.908629, 5.478731, 3.81018~
$ Stem_Cell_Factor                 <dbl> 4.174387, 3.713572, 3.43398~
$ TGF_alpha                        <dbl> 8.649098, 11.331619, 10.858~
$ TIMP_1                           <dbl> 15.204651, 11.266499, 12.28~
$ TNF_RII                          <dbl> -0.06187540, -0.32850407, -~
$ TRAIL_R3                         <dbl> -0.1829004, -0.5007471, -0.~
$ TTR_prealbumin                   <dbl> 2.944439, 2.833213, 2.94443~
$ Tamm_Horsfall_Protein_THP        <dbl> -3.095810, -3.111190, -3.16~
$ Thrombomodulin                   <dbl> -1.340566, -1.675252, -1.53~
$ Thrombopoietin                   <dbl> -0.1026334, -0.6733501, -0.~
$ Thymus_Expressed_Chemokine_TECK  <dbl> 4.149327, 3.810182, 2.79199~
$ Thyroid_Stimulating_Hormone      <dbl> -3.863233, -4.828314, -4.99~
$ Thyroxine_Binding_Globulin       <dbl> -1.4271164, -1.6094379, -1.~
$ Tissue_Factor                    <dbl> 2.04122033, 2.02814825, 1.4~
$ Transferrin                      <dbl> 3.332205, 2.890372, 2.89037~
$ Trefoil_Factor_3_TFF3            <dbl> -3.381395, -3.912023, -3.72~
$ VCAM_1                           <dbl> 3.258097, 2.708050, 2.63905~
$ VEGF                             <dbl> 22.03456, 18.60184, 17.4761~
$ Vitronectin                      <dbl> -0.04082199, -0.38566248, -~
$ von_Willebrand_Factor            <dbl> -3.146555, -3.863233, -3.54~
$ age                              <dbl> 0.9876238, 0.9861496, 0.986~
$ tau                              <dbl> 6.297754, 6.659294, 6.27098~
$ p_tau                            <dbl> 4.348108, 4.859967, 4.40024~
$ Ab_42                            <dbl> 12.019678, 11.015759, 12.30~
$ male                             <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 0, ~
$ Genotype                         <fct> E3E3, E3E4, E3E4, E3E4, E3E~
$ Class                            <fct> Control, Control, Control, ~

The predictors are almost entirely numeric, and the response variable is binary. We can also see it is stored as a factor.

ad_data %>% count(Class) %>% mutate(prop = n/sum(n))
# A tibble: 2 x 3
  Class        n  prop
  <fct>    <int> <dbl>
1 Impaired    91 0.273
2 Control    242 0.727

Simple recipe

ad_recipe = 
  recipe(Class ~ tau + VEGF, data = ad_data) %>% 
  step_normalize(all_numeric_predictors())

ad_recipe
Recipe

Inputs:

      role #variables
   outcome          1
 predictor          2

Operations:

Centering and scaling for all_numeric_predictors()

With the recipes package, you create a standard recipe for the model itself. It declares outcomes and predictors, which are synonymous with dependent and independent variables.

Then, once you create the recipe formula, each step does some type of work against it. For example, the step above centers and scales all numeric predictors. Pretty slick, and easy.
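To see the step do its work, we can prep() the recipe and then bake() it against the original data. A quick sketch (the centering and scaling statistics are estimated from ad_data itself):

```r
library(dplyr)
library(recipes)

data(ad_data, package = "modeldata")

ad_recipe <-
  recipe(Class ~ tau + VEGF, data = ad_data) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the means and standard deviations,
# bake() applies them to a dataset
ad_baked <- ad_recipe %>%
  prep(training = ad_data) %>%
  bake(new_data = ad_data)

# Both predictors should now have mean ~0 and sd ~1
ad_baked %>%
  summarise(across(c(tau, VEGF), list(mean = mean, sd = sd)))
```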

An Example

Getting data

I really like this modeldata package. The tidymodels developers use it often, and it has a rich spread of unique datasets. For this example it is credit card information, which is very interesting and contains enough nominal information to be about as practical as it gets out in the real world.

library(recipes)
library(rsample)
library(modeldata)

data("credit_data")
set.seed(55)

credit_data %>% glimpse
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, goo~
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0~
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, paren~
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60~
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30~
$ Marital   <fct> married, widow, married, single, single, married, ~
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, n~
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, ~
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75~
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 12~
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 400~
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, ~
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1~
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957~

It looks like our outcome is Status, with several other numeric and nominal predictors.

credit_data %>% count(Status) %>% mutate(prop = n/sum(n))
  Status    n      prop
1    bad 1254 0.2815447
2   good 3200 0.7184553
train_test_split = credit_data %>% initial_split(prop = 3/4, strata = Status)
credit_train = training(train_test_split)
credit_test = testing(train_test_split)

credit_train %>% glimpse
Rows: 3,340
Columns: 14
$ Status    <fct> bad, bad, bad, bad, bad, bad, bad, bad, bad, bad, ~
$ Seniority <int> 10, 0, 0, 0, 2, 3, 0, 0, 1, 5, 2, 1, 4, 1, 0, 25, ~
$ Home      <fct> owner, parents, other, ignore, rent, owner, NA, ow~
$ Time      <int> 36, 48, 18, 48, 60, 24, 48, 36, 54, 48, 36, 48, 42~
$ Age       <int> 46, 41, 21, 36, 25, 23, 37, 23, 36, 31, 27, 29, 27~
$ Marital   <fct> married, married, single, married, single, married~
$ Records   <fct> yes, no, yes, no, no, no, no, no, no, no, no, yes,~
$ Job       <fct> freelance, partime, partime, partime, fixed, fixed~
$ Expenses  <int> 90, 90, 35, 45, 46, 75, 35, 45, 70, 44, 48, 85, 35~
$ Income    <int> 200, 80, 50, 130, 107, 85, NA, 122, 99, 90, 128, 1~
$ Assets    <int> 3000, 0, 0, 750, 0, 5000, NA, 2500, 0, 0, 0, 0, 0,~
$ Debt      <int> 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ Amount    <int> 2000, 1200, 400, 1100, 1500, 600, 1500, 400, 950, ~
$ Price     <int> 2985, 1468, 500, 1511, 2189, 1600, 1850, 400, 950,~
credit_test %>% glimpse
Rows: 1,114
Columns: 14
$ Status    <fct> good, good, good, good, bad, good, good, good, bad~
$ Seniority <int> 17, 6, 8, 19, 1, 12, 14, 10, 0, 2, 3, 14, 8, 10, 4~
$ Home      <fct> rent, owner, owner, priv, owner, owner, priv, owne~
$ Time      <int> 60, 48, 60, 36, 60, 36, 24, 24, 60, 60, 12, 24, 48~
$ Age       <int> 58, 34, 30, 37, 45, 54, 51, 56, 32, 43, 37, 55, 31~
$ Marital   <fct> widow, married, married, married, married, married~
$ Records   <fct> no, no, no, no, no, no, no, no, yes, no, no, no, n~
$ Job       <fct> fixed, freelance, fixed, fixed, partime, fixed, pa~
$ Expenses  <int> 48, 60, 75, 75, 105, 60, 75, 45, 45, 75, 60, 90, 9~
$ Income    <int> 131, 125, 199, 170, 112, 130, 198, 208, 142, 71, 1~
$ Assets    <int> 0, 4000, 5000, 3500, 2000, 4000, 1000, 13500, 7000~
$ Debt      <int> 0, 0, 2500, 260, 500, 0, 0, 0, 3000, 0, 0, 0, 0, 0~
$ Amount    <int> 1000, 1150, 1500, 600, 600, 700, 450, 500, 1250, 1~
$ Price     <int> 1658, 1577, 1650, 940, 1332, 2100, 650, 1560, 1566~

Notice that we have a nice 3/4 split. Notice also that there are some nasty NAs in our data.
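Since we stratified on Status, the class balance should be roughly preserved on both sides of the split. A quick sanity check (a sketch; the exact proportions depend on the seed):

```r
library(dplyr)

# Both should sit near 28% bad / 72% good, matching the full data
credit_train %>% count(Status) %>% mutate(prop = n / sum(n))
credit_test  %>% count(Status) %>% mutate(prop = n / sum(n))
```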

vapply(credit_train, FUN = function(x){ mean(!is.na(x))}, FUN.VALUE = numeric(1))
   Status Seniority      Home      Time       Age   Marital   Records 
1.0000000 1.0000000 0.9982036 1.0000000 1.0000000 0.9997006 1.0000000 
      Job  Expenses    Income    Assets      Debt    Amount     Price 
0.9997006 1.0000000 0.9116766 0.9886228 0.9955090 1.0000000 1.0000000 

We can use recipes to impute this missing data instead of dropping it.

An initial recipe

This does no justice to the examples to follow, but it gives a feel for how things will start to look as we create recipes, add steps, then prep and bake it all together for a model (or analysis).

rec_obj = recipe(Status ~ ., data = credit_train)
rec_obj
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         13

Processing Steps

There are so many options for pre-processing steps. The manual page ?selections covers how to pick variables for them, and each step has its own help page.

A way to list all of the available steps is below:

grep("step_", ls("package:recipes"), value = TRUE)
 [1] "step_arrange"            "step_bagimpute"         
 [3] "step_bin2factor"         "step_BoxCox"            
 [5] "step_bs"                 "step_center"            
 [7] "step_classdist"          "step_corr"              
 [9] "step_count"              "step_cut"               
[11] "step_date"               "step_depth"             
[13] "step_discretize"         "step_dummy"             
[15] "step_dummy_extract"      "step_dummy_multi_choice"
[17] "step_factor2string"      "step_filter"            
[19] "step_filter_missing"     "step_geodist"           
[21] "step_harmonic"           "step_holiday"           
[23] "step_hyperbolic"         "step_ica"               
[25] "step_impute_bag"         "step_impute_knn"        
[27] "step_impute_linear"      "step_impute_lower"      
[29] "step_impute_mean"        "step_impute_median"     
[31] "step_impute_mode"        "step_impute_roll"       
[33] "step_indicate_na"        "step_integer"           
[35] "step_interact"           "step_intercept"         
[37] "step_inverse"            "step_invlogit"          
[39] "step_isomap"             "step_knnimpute"         
[41] "step_kpca"               "step_kpca_poly"         
[43] "step_kpca_rbf"           "step_lag"               
[45] "step_lincomb"            "step_log"               
[47] "step_logit"              "step_lowerimpute"       
[49] "step_meanimpute"         "step_medianimpute"      
[51] "step_modeimpute"         "step_mutate"            
[53] "step_mutate_at"          "step_naomit"            
[55] "step_nnmf"               "step_nnmf_sparse"       
[57] "step_normalize"          "step_novel"             
[59] "step_ns"                 "step_num2factor"        
[61] "step_nzv"                "step_ordinalscore"      
[63] "step_other"              "step_pca"               
[65] "step_percentile"         "step_pls"               
[67] "step_poly"               "step_profile"           
[69] "step_range"              "step_ratio"             
[71] "step_regex"              "step_relevel"           
[73] "step_relu"               "step_rename"            
[75] "step_rename_at"          "step_rm"                
[77] "step_rollimpute"         "step_sample"            
[79] "step_scale"              "step_select"            
[81] "step_shuffle"            "step_slice"             
[83] "step_spatialsign"        "step_sqrt"              
[85] "step_string2factor"      "step_time"              
[87] "step_unknown"            "step_unorder"           
[89] "step_window"             "step_YeoJohnson"        
[91] "step_zv"                

And then finally, you call the prep() function at the end of the chain. This estimates everything the steps need from the training data; bake() then applies the trained recipe and returns your final dataset.

You can also select variables using some built-in helper functions and the typical dplyr and tidyselect syntax.
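For example, the same imputation step can target variables by role, by type, or by name. A sketch of interchangeable selectors against the rec_obj recipe above:

```r
# By role
rec_obj %>% step_impute_knn(all_predictors())

# By type, restricted to predictors
rec_obj %>% step_impute_knn(all_numeric_predictors(), all_nominal_predictors())

# By name, or with a tidyselect helper
rec_obj %>% step_impute_knn(Income, Assets, Debt)
rec_obj %>% step_impute_knn(starts_with("A"))
```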

imputed = rec_obj %>% 
  step_impute_knn(all_predictors())
imputed
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         13

Operations:

K-nearest neighbor imputation for all_predictors()
ind_vars = imputed %>% 
  step_dummy(all_nominal_predictors())
ind_vars
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         13

Operations:

K-nearest neighbor imputation for all_predictors()
Dummy variables from all_nominal_predictors()
standardized = ind_vars %>% 
  step_center(all_numeric_predictors()) %>% 
  step_scale(all_numeric_predictors())
standardized
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         13

Operations:

K-nearest neighbor imputation for all_predictors()
Dummy variables from all_nominal_predictors()
Centering for all_numeric_predictors()
Scaling for all_numeric_predictors()
trained_rec = prep(standardized, training = credit_train)
trained_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         13

Training data contained 3340 data points and 321 incomplete rows. 

Operations:

K-nearest neighbor imputation for Seniority, Home, Time, Age, Marital,... [trained]
Dummy variables from Home, Marital, Records, Job [trained]
Centering for Seniority, Time, Age, Expenses, Incom... [trained]
Scaling for Seniority, Time, Age, Expenses, Incom... [trained]
train_data = bake(trained_rec, new_data = credit_train)
test_data = bake(trained_rec, new_data = credit_test)

train_data %>% glimpse
Rows: 3,340
Columns: 23
$ Seniority         <dbl> 0.2326022, -0.9756267, -0.9756267, -0.9756~
$ Time              <dbl> -0.7046334, 0.1143941, -1.9331747, 0.11439~
$ Age               <dbl> 0.819168853, 0.363377561, -1.459787609, -0~
$ Expenses          <dbl> 1.7475123, 1.7475123, -1.0452600, -0.53748~
$ Income            <dbl> 0.75001838, -0.78930019, -1.17412983, -0.1~
$ Assets            <dbl> -0.21062879, -0.48150940, -0.48150940, -0.~
$ Debt              <dbl> -0.2615791, -0.2615791, -0.2615791, -0.261~
$ Amount            <dbl> 2.06539627, 0.35400235, -1.35739158, 0.140~
$ Price             <dbl> 2.44185759, 0.01922555, -1.52665963, 0.087~
$ Status            <fct> bad, bad, bad, bad, bad, bad, bad, bad, ba~
$ Home_other        <dbl> -0.2750668, -0.2750668, 3.6343927, -0.2750~
$ Home_owner        <dbl> 1.0616266, -0.9416688, -0.9416688, -0.9416~
$ Home_parents      <dbl> -0.4693061, 2.1301673, -0.4693061, -0.4693~
$ Home_priv         <dbl> -0.2358159, -0.2358159, -0.2358159, -0.235~
$ Home_rent         <dbl> -0.5338767, -0.5338767, -0.5338767, -0.533~
$ Marital_married   <dbl> 0.6146338, 0.6146338, -1.6264981, 0.614633~
$ Marital_separated <dbl> -0.1774584, -0.1774584, -0.1774584, -0.177~
$ Marital_single    <dbl> -0.5324876, -0.5324876, 1.8774157, -0.5324~
$ Marital_widow     <dbl> -0.1194505, -0.1194505, -0.1194505, -0.119~
$ Records_yes       <dbl> 2.215860, -0.451157, 2.215860, -0.451157, ~
$ Job_freelance     <dbl> 1.8235759, -0.5482089, -0.5482089, -0.5482~
$ Job_others        <dbl> -0.198784, -0.198784, -0.198784, -0.198784~
$ Job_partime       <dbl> -0.3332834, 2.9995509, 2.9995509, 2.999550~
test_data %>% glimpse
Rows: 1,114
Columns: 23
$ Seniority         <dbl> 1.078362381, -0.250689410, -0.009043629, 1~
$ Time              <dbl> 0.9334216, 0.1143941, 0.9334216, -0.704633~
$ Age               <dbl> 1.913067955, -0.274730248, -0.639363282, -~
$ Expenses          <dbl> -0.3851502, 0.2241819, 0.9858471, 0.985847~
$ Income            <dbl> -0.135089799, -0.212055727, 0.737190723, 0~
$ Assets            <dbl> -0.48150940, -0.12033525, -0.03004171, -0.~
$ Debt              <dbl> -0.26157915, -0.26157915, 1.64153404, -0.0~
$ Amount            <dbl> -0.07384614, 0.24704023, 0.99577507, -0.92~
$ Price             <dbl> 0.32265342, 0.19329733, 0.30987751, -0.823~
$ Status            <fct> good, good, good, good, bad, good, good, g~
$ Home_other        <dbl> -0.2750668, -0.2750668, -0.2750668, -0.275~
$ Home_owner        <dbl> -0.9416688, 1.0616266, 1.0616266, -0.94166~
$ Home_parents      <dbl> -0.4693061, -0.4693061, -0.4693061, -0.469~
$ Home_priv         <dbl> -0.2358159, -0.2358159, -0.2358159, 4.2393~
$ Home_rent         <dbl> 1.8725310, -0.5338767, -0.5338767, -0.5338~
$ Marital_married   <dbl> -1.6264981, 0.6146338, 0.6146338, 0.614633~
$ Marital_separated <dbl> -0.1774584, -0.1774584, -0.1774584, -0.177~
$ Marital_single    <dbl> -0.5324876, -0.5324876, -0.5324876, -0.532~
$ Marital_widow     <dbl> 8.3691608, -0.1194505, -0.1194505, -0.1194~
$ Records_yes       <dbl> -0.451157, -0.451157, -0.451157, -0.451157~
$ Job_freelance     <dbl> -0.5482089, 1.8235759, -0.5482089, -0.5482~
$ Job_others        <dbl> -0.198784, -0.198784, -0.198784, -0.198784~
$ Job_partime       <dbl> -0.3332834, -0.3332834, -0.3332834, -0.333~
vapply(train_data, function(x) mean(!is.na(x)), numeric(1))
        Seniority              Time               Age 
                1                 1                 1 
         Expenses            Income            Assets 
                1                 1                 1 
             Debt            Amount             Price 
                1                 1                 1 
           Status        Home_other        Home_owner 
                1                 1                 1 
     Home_parents         Home_priv         Home_rent 
                1                 1                 1 
  Marital_married Marital_separated    Marital_single 
                1                 1                 1 
    Marital_widow       Records_yes     Job_freelance 
                1                 1                 1 
       Job_others       Job_partime 
                1                 1 
vapply(test_data, function(x) mean(!is.na(x)), numeric(1))
        Seniority              Time               Age 
                1                 1                 1 
         Expenses            Income            Assets 
                1                 1                 1 
             Debt            Amount             Price 
                1                 1                 1 
           Status        Home_other        Home_owner 
                1                 1                 1 
     Home_parents         Home_priv         Home_rent 
                1                 1                 1 
  Marital_married Marital_separated    Marital_single 
                1                 1                 1 
    Marital_widow       Records_yes     Job_freelance 
                1                 1                 1 
       Job_others       Job_partime 
                1                 1 

Checks

trained_rec <- trained_rec %>%
  check_missing(contains("Marital"))
trained_rec
Recipe

Inputs:

      role #variables
   outcome          1
 predictor         13

Training data contained 3340 data points and 321 incomplete rows. 

Operations:

K-nearest neighbor imputation for Seniority, Home, Time, Age, Marital,... [trained]
Dummy variables from Home, Marital, Records, Job [trained]
Centering for Seniority, Time, Age, Expenses, Incom... [trained]
Scaling for Seniority, Time, Age, Expenses, Incom... [trained]
Check missing values for contains("Marital")
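The check only fires when the recipe is applied. A hedged sketch of the behavior (the check passes here because imputation runs before it; remove the imputation step and bake() would instead abort, naming the offending columns):

```r
checked_rec <- recipe(Status ~ ., data = credit_train) %>%
  step_impute_knn(all_predictors()) %>%
  check_missing(all_predictors()) %>%  # errors at bake() time if NAs survive
  prep(training = credit_train)

# Succeeds: the imputation step removed the NAs before the check ran
baked_ok <- bake(checked_rec, new_data = credit_test)
```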

Full example

It’s always easier for me to view everything all at once, so here is the full example chained together in the dplyr syntax.

# Get the data
data("credit_data")
credit_data %>% glimpse
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, goo~
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0~
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, paren~
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60~
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30~
$ Marital   <fct> married, widow, married, single, single, married, ~
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, n~
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, ~
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75~
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 12~
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 400~
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, ~
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1~
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957~
# Create train-test split
train_test_split = credit_data %>% initial_split(prop = 3/4, strata = Status)
credit_train = training(train_test_split)
credit_test = testing(train_test_split)

# Create a recipe w/ formula
credit_recipe = 
  recipe(Status ~ ., data = credit_train) %>% 
  step_impute_knn(all_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_center(all_numeric_predictors()) %>% 
  step_scale(all_numeric_predictors()) %>% 
  prep(training = credit_train)


train_data = 
  credit_recipe %>% 
  bake(new_data = credit_train)

test_data =
  credit_recipe %>% 
  bake(new_data = credit_test)

train_data %>% head
# A tibble: 6 x 23
  Seniority    Time     Age Expen~1 Income Assets   Debt Amount  Price
      <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     0.240 -0.733   0.805    1.74   0.792 -0.207 -0.291  2.07   2.51 
2    -0.987 -1.97   -1.47    -1.04  -1.21  -0.462 -0.291 -1.37  -1.58 
3    -0.741  0.915  -1.11    -0.485 -0.446 -0.462 -0.291  0.992  1.20 
4    -0.864  0.915   0.714    2.50  -0.380 -0.292  0.145 -0.941 -0.208
5    -0.987  0.0908 -0.0143  -1.04  -0.193 -0.394 -0.291  0.992  0.644
6    -0.864  0.503  -0.105    0.729 -0.553 -0.462 -0.291 -0.189 -0.836
# ... with 14 more variables: Status <fct>, Home_other <dbl>,
#   Home_owner <dbl>, Home_parents <dbl>, Home_priv <dbl>,
#   Home_rent <dbl>, Marital_married <dbl>, Marital_separated <dbl>,
#   Marital_single <dbl>, Marital_widow <dbl>, Records_yes <dbl>,
#   Job_freelance <dbl>, Job_others <dbl>, Job_partime <dbl>, and
#   abbreviated variable name 1: Expenses
# i Use `colnames()` to see all variable names
test_data %>% head
# A tibble: 6 x 23
  Senio~1    Time    Age Expen~2   Income Assets   Debt  Amount  Price
    <dbl>   <dbl>  <dbl>   <dbl>    <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1   1.10   0.915   1.90   -0.384 -0.127   -0.462 -0.291 -0.0820 0.328 
2  -0.987  0.0908  0.350   1.74  -0.806   -0.462 -0.291  0.348  0.0159
3  -0.251  0.0908 -0.288   0.223 -0.207   -0.122 -0.291  0.240  0.195 
4   0.853 -1.56    1.35   -1.04   2.52     0.939 -0.291  0.992  1.22  
5  -0.987  0.0908 -0.105  -0.535 -0.140   -0.398 -0.291  0.133  0.0866
6   2.32   0.915   0.350   0.931 -0.00713 -0.462 -0.291 -0.189  0.0636
# ... with 14 more variables: Status <fct>, Home_other <dbl>,
#   Home_owner <dbl>, Home_parents <dbl>, Home_priv <dbl>,
#   Home_rent <dbl>, Marital_married <dbl>, Marital_separated <dbl>,
#   Marital_single <dbl>, Marital_widow <dbl>, Records_yes <dbl>,
#   Job_freelance <dbl>, Job_others <dbl>, Job_partime <dbl>, and
#   abbreviated variable names 1: Seniority, 2: Expenses
# i Use `colnames()` to see all variable names

Getting creative

What if we wanted to run PCA on just our dummy-coded categorical columns after imputation? Then get those back as numeric, unite them with the rest of the numeric variables, and center, scale, prep, and bake as usual?

# Create a recipe w/ formula
credit_pca_recipe = 
  
  # Formula
  recipe(Status ~ ., data = credit_train) %>% 
  
  # Add ID roles, if rows have unique identifiers
  # update_role(flight, time_hour, new_role = "ID") %>%
  
  # Impute missing data
  step_impute_knn(all_predictors()) %>% 
  
  # Make dummies for factors
  step_dummy(all_nominal_predictors()) %>% 
  
  # Replace those dummies with principal components
  step_pca(contains("_"), threshold = .90) %>%
  
  # Center&Scale
  step_center(all_numeric_predictors()) %>% 
  step_scale(all_numeric_predictors()) %>% 
  
  # Drop zero variance columns
  step_zv(all_numeric_predictors()) %>% 
  
  # Prep from the training data
  prep(training = credit_train)

train_data_pca = 
  credit_pca_recipe %>% 
  bake(new_data = credit_train)

test_data_pca =
  credit_pca_recipe %>% 
  bake(new_data = credit_test)

train_data_pca %>% head
# A tibble: 6 x 17
  Seniority    Time     Age Expen~1 Income Assets   Debt Amount  Price
      <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     0.240 -0.733   0.805    1.74   0.792 -0.207 -0.291  2.07   2.51 
2    -0.987 -1.97   -1.47    -1.04  -1.21  -0.462 -0.291 -1.37  -1.58 
3    -0.741  0.915  -1.11    -0.485 -0.446 -0.462 -0.291  0.992  1.20 
4    -0.864  0.915   0.714    2.50  -0.380 -0.292  0.145 -0.941 -0.208
5    -0.987  0.0908 -0.0143  -1.04  -0.193 -0.394 -0.291  0.992  0.644
6    -0.864  0.503  -0.105    0.729 -0.553 -0.462 -0.291 -0.189 -0.836
# ... with 8 more variables: Status <fct>, PC1 <dbl>, PC2 <dbl>,
#   PC3 <dbl>, PC4 <dbl>, PC5 <dbl>, PC6 <dbl>, PC7 <dbl>, and
#   abbreviated variable name 1: Expenses
# i Use `colnames()` to see all variable names
test_data_pca %>% head
# A tibble: 6 x 17
  Senio~1    Time    Age Expen~2   Income Assets   Debt  Amount  Price
    <dbl>   <dbl>  <dbl>   <dbl>    <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
1   1.10   0.915   1.90   -0.384 -0.127   -0.462 -0.291 -0.0820 0.328 
2  -0.987  0.0908  0.350   1.74  -0.806   -0.462 -0.291  0.348  0.0159
3  -0.251  0.0908 -0.288   0.223 -0.207   -0.122 -0.291  0.240  0.195 
4   0.853 -1.56    1.35   -1.04   2.52     0.939 -0.291  0.992  1.22  
5  -0.987  0.0908 -0.105  -0.535 -0.140   -0.398 -0.291  0.133  0.0866
6   2.32   0.915   0.350   0.931 -0.00713 -0.462 -0.291 -0.189  0.0636
# ... with 8 more variables: Status <fct>, PC1 <dbl>, PC2 <dbl>,
#   PC3 <dbl>, PC4 <dbl>, PC5 <dbl>, PC6 <dbl>, PC7 <dbl>, and
#   abbreviated variable names 1: Seniority, 2: Expenses
# i Use `colnames()` to see all variable names

Absolutely ridiculous. You just specify which columns you want reduced, and the recipe performs the dimensionality reduction for you and keeps moving. Let’s table this credit_pca_recipe for now.
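Before moving on, note that a prepped recipe can be inspected step by step. As a sketch (assuming the credit_pca_recipe prepped above, where step_pca is the third step), calling tidy() on the recipe with a step number returns what that step learned, here the PCA loadings:

```r
library(tidymodels)  # loads recipes and dplyr

# List the steps and their positions in the recipe
credit_pca_recipe %>% tidy()

# Loadings learned by step_pca (step number 3 in this recipe);
# large absolute values show which dummies drive each component
credit_pca_recipe %>%
  tidy(number = 3) %>%
  filter(component %in% c("PC1", "PC2")) %>%
  arrange(component, desc(abs(value)))
```

This is handy for sanity-checking that the PCA step is actually picking up structure in the dummy columns rather than noise.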

Make a model

Cross-validation and Model performance

It’s only natural now to build a classification model. We will be referencing our example from the tidymodels post.

library(tidymodels)

lr_mod <- logistic_reg() %>% 
  set_engine("glm")

# Cross validation set from train
folds = vfold_cv(credit_train, v = 10)

# Workflow to incorporate folds
credit_workflow_rs = 
  workflows::workflow() %>% 
  workflows::add_model(lr_mod) %>% 
  workflows::add_recipe(credit_recipe) %>% 
  tune::fit_resamples(folds)

# See performance
collect_metrics(credit_workflow_rs)
# A tibble: 2 x 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy binary     0.795    10 0.00743 Preprocessor1_Model1
2 roc_auc  binary     0.829    10 0.00590 Preprocessor1_Model1

This could also be done manually: build the model, predict values on held-out data, then score the predicted probabilities with roc_auc and the predicted class labels with accuracy.
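Here is a quick sketch of that manual route, assuming a workflow fitted on credit_train (as we do in the next section) and the held-out credit_test set. The .pred_bad column name follows parsnip’s convention of naming probability columns after the factor levels, and bad is the first level here:

```r
library(tidymodels)

# Fit once on the full training set (mirrors the workflow below)
credit_fit_manual <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(credit_recipe) %>%
  fit(data = credit_train)

# Collect probabilities, hard labels, and the truth in one tibble
manual_preds <-
  predict(credit_fit_manual, credit_test, type = "prob") %>%
  bind_cols(predict(credit_fit_manual, credit_test)) %>%
  bind_cols(credit_test %>% select(Status))

manual_preds %>% roc_auc(truth = Status, .pred_bad)     # from probabilities
manual_preds %>% accuracy(truth = Status, .pred_class)  # from labels
```

Note this scores a single train/test split rather than cross-validation folds, so the numbers will not match collect_metrics() exactly.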

Fit the whole model

# Workflow for whole train set
credit_workflow = 
  workflows::workflow() %>% 
  workflows::add_model(lr_mod) %>% 
  workflows::add_recipe(credit_recipe)

# Train on the whole now
credit_fit =
  credit_workflow %>%
  fit(data = credit_train)

tidy(credit_fit)
# A tibble: 23 x 5
   term        estimate std.error statistic   p.value
   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)  1.37       0.0549   25.0    1.12e-137
 2 Seniority    0.678      0.0700    9.69   3.19e- 22
 3 Time         0.00568    0.0588    0.0967 9.23e-  1
 4 Age         -0.122      0.0631   -1.94   5.28e-  2
 5 Expenses    -0.273      0.0580   -4.72   2.41e-  6
 6 Income       0.529      0.0614    8.61   7.11e- 18
 7 Assets       0.356      0.0927    3.85   1.20e-  4
 8 Debt        -0.121      0.0501   -2.42   1.57e-  2
 9 Amount      -0.985      0.0948  -10.4    2.54e- 25
10 Price        0.677      0.0932    7.26   3.80e- 13
# ... with 13 more rows
# i Use `print(n = ...)` to see more rows
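These estimates are on the log-odds scale, which is hard to read directly. A sketch, assuming the credit_fit workflow above: extract the underlying glm object and let broom exponentiate the estimates into odds ratios:

```r
library(tidymodels)
library(broom)

credit_fit %>%
  extract_fit_engine() %>%                        # the raw glm object
  tidy(exponentiate = TRUE, conf.int = TRUE) %>%  # odds ratios + 95% CI
  select(term, estimate, conf.low, conf.high) %>%
  head()
```

Because the recipe centers and scales the predictors, each odds ratio is per one standard deviation of the original variable, not per raw unit.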

What was our response?

credit_data %>% glimpse
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, goo~
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0~
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, paren~
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60~
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30~
$ Marital   <fct> married, widow, married, single, single, married, ~
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, n~
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, ~
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75~
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 12~
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 400~
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, ~
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1~
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957~
credit_data %>% select(Status) %>% str
'data.frame':   4454 obs. of  1 variable:
 $ Status: Factor w/ 2 levels "bad","good": 2 2 1 2 2 2 2 2 2 1 ...

Right, good and bad credit. The factor levels are ordered bad first, then good; this matters because yardstick treats the first level as the event of interest by default.
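If you would rather have metrics computed relative to good credit, you can either relevel the factor up front or tell yardstick which level is the event. A sketch (forcats is assumed to be available alongside the tidyverse):

```r
library(dplyr)
library(forcats)

# The current ordering, as shown by str() above
levels(credit_data$Status)
#> [1] "bad"  "good"

# Option 1: make "good" the first (event) level before modeling
credit_data <- credit_data %>%
  mutate(Status = fct_relevel(Status, "good"))

# Option 2: keep the ordering and flip the event at metric time, e.g.
# roc_auc(preds, truth = Status, .pred_good, event_level = "second")
```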

Coefficient visualization

tidy(credit_fit) %>% filter(p.value <= 0.25) %>% 
  ggplot(., aes(fill=p.value)) +
  geom_col(aes(x=reorder(term, -estimate), y=estimate)) +
  geom_hline(yintercept=0, linetype="longdash", color = "black") +
  coord_flip() +
  scale_fill_gradient(low = "green", high = "red") +
  theme_bw() +
  labs(fill = "Statistical p-value", x = "Term", y = "Estimate") +
  ggtitle("Credit Significant Coefficients")

References

https://recipes.tidymodels.org/