Recommender Systems in R

machine learning recommender systems regression r

In this post we will describe the sub-field of recommender systems, simulate some data, and solve the problem analytically using iterative regression.

Blake Conrad https://github.com/conradbm
2022-07-16

Recommender Systems as a Field

Recommender systems are often described as a sub-field of machine learning. This class of learning problems is built entirely around recommendation, and as such the sub-field has its own unique flavor of data. In short, the objective is always some form of recommendation, whether of a movie, a song, or some other item of potential interest to someone who has yet to actually experience it. The two most popular paradigms for approaching the problem are:

  1. Content-based Recommendation
  2. Collaborative Filtering

Content-based Recommendation

In Content-based Recommendation the goal is to acquire highly informative features about the domain. By obtaining a feature vector for each item, we can train on users’ preferences as they experience those items, then predict new observations using the standard classification paradigm. This is very convenient and powerful, as classification is time-tested and can produce great results. One challenge is small datasets: users have to experience items first, which may take hours, days, or weeks in some cases.

Feature Vector \[ \phi(x) \in \mathbb{R}^d \] The feature map takes each item and produces a highly articulate representation of it. How would you describe features of music? What about a movie? How about a trip to a state park? Though complex, many companies have had success at this; Pandora pays experts to accurately tag songs along a high-dimensional feature space. As users experience an item they populate one of these d-vectors, and their positive or negative responses drill into the feature space of preferences, making recommendation possible.

Data Set \[ D_{items}=\{(\phi(x_i), y_i)\}_{i=1..n} \] The data set pairs each item a user experiences, as x, with their response as y. Apply a simple transformation to make x tabular and boom … you’ve got a classification problem.
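To make that concrete, here is a minimal content-based sketch. Everything in it is made up for illustration (the feature matrix, the hidden preference weights, and the unseen item), and it uses plain regression via lm rather than a classifier:

```r
# Content-based sketch: all features and ratings below are simulated.
set.seed(1)

d <- 4                                  # dimension of the feature vector phi(x)
n_seen <- 20                            # items this user has already rated

# phi(x_i) for experienced items, with ratings driven by hidden preferences
phi_seen <- matrix(rnorm(n_seen * d), nrow = n_seen)
ratings  <- as.numeric(phi_seen %*% c(2, -1, 0.5, 0) + rnorm(n_seen, sd = 0.1))

# Standard supervised learning on the (phi(x_i), y_i) pairs
fit <- lm(ratings ~ phi_seen - 1)

# Score an unseen item from its feature vector alone
phi_new <- matrix(rnorm(d), nrow = 1)
predicted_rating <- as.numeric(phi_new %*% coef(fit))
predicted_rating
```

With real items, phi would come from expert tagging or metadata; the learning step stays the same.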

Collaborative Filtering

A far more common approach to solving the problem is Collaborative Filtering. Its data set is an accumulation of all users’ experiences, and it has one form with two common notations.

User Rating Matrix \[ R \in \mathbb{R}^{n \times m} \] The user rating matrix contains a rating \(R_{i,j}\) for every user i in 1..n and every item j in 1..m, typically populated with a 1-to-5 ordinal ranking. This matrix becomes HUGE very quickly; seldom can you actually load it into memory. We will discuss the tricks required to extract meaningful information from it later.

Database \[ D=\{(i,j,r)\}, \quad |D| = l \]

Since the matrix above is so sparse, the database is a bookkeeping data structure containing all of its non-zero elements: user i gave rating r to item j. We assume there are l such tuples, and clearly we expect l to be significantly smaller than n*m, the total number of entries in the matrix.
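As a sketch of that bookkeeping, the triplet store can be a plain data frame, and the Matrix package (bundled with R) can rebuild the sparse rating matrix from it on demand; the tiny example below is made up:

```r
library(Matrix)

# The database D: one row per observed (user i, item j, rating r) tuple
D_triplets <- data.frame(
  i = c(1, 1, 2, 3),   # user ids
  j = c(1, 3, 2, 3),   # item ids
  r = c(2, 5, 4, 1)    # ratings
)

# Only the l = 4 nonzero entries are stored, not all n*m = 9 cells
R_sparse <- sparseMatrix(i = D_triplets$i, j = D_triplets$j,
                         x = D_triplets$r, dims = c(3, 3))
R_sparse
length(R_sparse@x)   # l, the number of stored ratings
```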

Problem Setup

We would like to construct a matrix that essentially fills in the gaps with plausible predicted ratings, then argmax over any particular user’s row and return the top k items as recommendations. We could construct a loss function from scratch, but it might not be too useful. Instead, we will stand on the shoulders of giants and set up our problem as follows:

Model Parameters

User Latent Representation \[ U \in \mathbb{R}^{n \times (k+2)} \] Item Latent Representation \[ V \in \mathbb{R}^{(k+2) \times m} \]

Beyond the k latent components, each user row carries two extra entries: the user’s bias (e.g., how pessimistic I am) and a constant 1. Each item column likewise carries the item’s bias (in general, how well the movie is liked by folks) and a constant 1, arranged so that the dot product of u_i and v_j picks up both bias terms.

U will contain information about each user lodged up in k mystical components. V will contain information about each item lodged up in k mystical components. These become a feature vector of sorts for users and items respectively.

Loss Function

The loss function is as follows:

\[ J(U,V)=\frac{1}{2}\sum_{(i,j,r)\in D}{(u_i^Tv_j - r)^2} + \frac{\lambda}{2}\sum_{i=1}^{n}{||u_i||^2}+ \frac{\lambda}{2}\sum_{j=1}^{m}{||v_j||^2} \]

Note that the bias terms ride along inside u_i and v_j, so u_i^Tv_j already includes them.

At a glance this looks like typical linear regression. Don’t be fooled: take the derivative and set it to zero and you end up with a nasty coupled set of equations in both U and V. One clever trick to solve this problem is what is known as iterative regression (often called alternating least squares).
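The loss is cheap to evaluate from the triplet store, which is handy for checking that the alternating updates we build below are actually helping. A minimal sketch with made-up U, V, and D, assuming the bias terms are folded into the vectors:

```r
# J(U, V): squared error over observed triples plus ridge penalties
loss <- function(U, V, D, lambda = 0) {
  sq_err <- sum(sapply(seq_len(nrow(D)), function(k) {
    (U[D$user_id[k], ] %*% V[, D$item_id[k]] - D$rating[k])^2
  }))
  0.5 * sq_err + (lambda / 2) * (sum(U^2) + sum(V^2))
}

# Toy inputs just to exercise the function
set.seed(2)
U <- matrix(rnorm(2 * 3), nrow = 2)   # 2 users, 3 components
V <- matrix(rnorm(3 * 2), nrow = 3)   # 2 items
D <- data.frame(user_id = c(1, 2), item_id = c(1, 2), rating = c(4, 2))
loss(U, V, D)
```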

Iterative Regression

In iterative regression we leverage the fact that we know how to solve a simple regression problem analytically. If we hold V constant and solve for U using the typical ordinary least squares model, we arrive at coefficients for some user. We can loop this over all users to obtain their estimates. Then, we hold those estimates of U constant and go solve for V for all items. Throughout, we leverage the fact that the following holds for any data set X, response column Y, and coefficient estimate theta from least squares.

Ordinary Least Squares Solution \[ \theta^*=(X^TX)^{-1}X^TY \]
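A quick sanity check that the closed form agrees with R’s built-in fit; the data here is simulated just for the comparison:

```r
set.seed(3)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)
y <- as.numeric(X %*% c(1.5, -2) + rnorm(10, sd = 0.01))

# theta* = (X^T X)^{-1} X^T y, versus lm() with no intercept
theta_closed <- solve(t(X) %*% X) %*% t(X) %*% y
theta_lm     <- coef(lm(y ~ X - 1))

max(abs(theta_closed - theta_lm))   # agreement up to numerical noise
```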

Simulating Data

Let’s see if we can tackle this monster with some fake data.

# Items
M = 3

# Users
N = 10

# Populate ratings matrix
R = matrix(0, nrow=N, ncol=M)

# Populating ratings matrix
R[1,] <- c(2,0,5)
R[2,] <- c(1,5,3)
R[3,] <- c(0,0,5)
R[4,] <- c(4,1,5)
R[5,] <- c(0,0,2)
R[6,] <- c(2,4,0)
R[7,] <- c(0,3,5)
R[8,] <- c(2,0,1)
R[9,] <- c(0,1,2)
R[10,] <- c(2,0,0)

# for(n in 1:N){
#   user_n_ratings = rbinom(M, 5, 0.5)
#   
#   R[n, ] = user_n_ratings
#   
#   # Put some holes in the data
#   R[n, floor(runif(1, min=0, max=4))] = 0
#   # R[n, floor(runif(1, min=0, max=4))] = 0
# }

R
      [,1] [,2] [,3]
 [1,]    2    0    5
 [2,]    1    5    3
 [3,]    0    0    5
 [4,]    4    1    5
 [5,]    0    0    2
 [6,]    2    4    0
 [7,]    0    3    5
 [8,]    2    0    1
 [9,]    0    1    2
[10,]    2    0    0

(User, Movie, Rating) Tuples

We are all fully aware that in the real world this matrix will never fit into memory, so let’s convert it into the structure we would actually be working with.

library(dplyr)
library(tidyr)

D <- R %>% as.data.frame %>%
  setNames(nm = 1:M) %>% 
  dplyr::mutate(user_id=row_number()) %>% 
  tidyr::gather(., key = "item_id", value ="rating", -user_id) %>% 
  dplyr::mutate(user_name = paste("User_", user_id, sep=""),
                item_name = paste("Item_", item_id, sep="")) %>% 
  dplyr::filter(rating != 0)
D
   user_id item_id rating user_name item_name
1        1       1      2    User_1    Item_1
2        2       1      1    User_2    Item_1
3        4       1      4    User_4    Item_1
4        6       1      2    User_6    Item_1
5        8       1      2    User_8    Item_1
6       10       1      2   User_10    Item_1
7        2       2      5    User_2    Item_2
8        4       2      1    User_4    Item_2
9        6       2      4    User_6    Item_2
10       7       2      3    User_7    Item_2
11       9       2      1    User_9    Item_2
12       1       3      5    User_1    Item_3
13       2       3      3    User_2    Item_3
14       3       3      5    User_3    Item_3
15       4       3      5    User_4    Item_3
16       5       3      2    User_5    Item_3
17       7       3      5    User_7    Item_3
18       8       3      1    User_8    Item_3
19       9       3      2    User_9    Item_3

Ordinary Least Squares

That looks better. Now let’s see if we can do some iterative least squares to solve this beast.

library(MASS)

solve.ols <- function(X,y){
  # ginv is the Moore-Penrose pseudoinverse, so this returns a
  # (minimum-norm) solution even when t(X) %*% X is singular
  ginv(t(X) %*% X) %*% t(X) %*% y
}

# Two identical columns make t(X) %*% X singular on purpose; the
# minimum-norm solution splits the weight evenly between them
X = matrix(rep(rnorm(2), 10), nrow = 10)
y = matrix(rnorm(10), ncol=1)
theta = solve.ols(X,y)
theta
             [,1]
[1,] 0.0007774986
[2,] 0.0007774986

Create the U matrix

Neat-o, we can solve a single OLS problem. Let’s define our parameters.

# Latent dimensions
K = 3

# User matrix
U = matrix(rnorm(N*K), nrow=N, ncol=K)

# Append the column of ones that will pick up each item's bias
U = cbind(U, 1)

# Append each user's bias term
U = cbind(U, rnorm(N))


U
            [,1]        [,2]        [,3] [,4]       [,5]
 [1,] -1.3270366 -1.64229706 -0.74543293    1 -0.3785241
 [2,]  0.7363001  0.51817104  1.68335141    1 -0.7233356
 [3,]  0.8175293 -0.06825585 -0.25251234    1 -1.3509502
 [4,]  0.3253387  1.31086086  0.48109986    1 -0.8145052
 [5,]  0.3341991 -1.40574984 -0.97059620    1  0.4811544
 [6,]  0.4674589 -0.82871156  1.51013370    1 -0.4384462
 [7,]  0.4585841  0.90775656 -0.18085370    1  1.5732527
 [8,] -0.5184098 -0.87080546  0.00712267    1  0.1115782
 [9,] -0.0118098  1.33809955  2.14143620    1 -0.3801498
[10,]  0.7271083  0.71103304  1.23705950    1  0.2636276

Create the V matrix

# Item matrix
V = matrix(rnorm(K*M), nrow=K, ncol=M)

# Append each item's bias term
V = rbind(V, rnorm(M))

# Append the row of ones that will pick up each user's bias
V = rbind(V, 1)

V
           [,1]       [,2]       [,3]
[1,] 0.22583268 -0.7298176 -1.5101003
[2,] 0.10789959 -1.0803082  0.6284866
[3,] 0.04698169 -0.5567730 -0.4943936
[4,] 1.48580304 -0.1077601 -0.7066269
[5,] 1.00000000  1.0000000  1.0000000

Those appended runs of ones bundle the bias terms inside both matrices: in the dot product of u_i and v_j, the ones column of U hits the item bias row of V, and the ones row of V hits the user bias column of U. Just remember they are at the end. :)

Excellent so far. Now we need to perform regressions in succession to solve the following sub-problem.

\[ u_i = \arg\min_{u}\frac{1}{2}\sum_{\{j \,|\, (i,j,r) \in D\}}{(u^Tv_j - r)^2} \] Once we solve the above problem for all users i, we alternate over and solve the inverse problem for all items. This is as follows.

\[ v_j = \arg\min_{v}\frac{1}{2}\sum_{\{i \,|\, (i,j,r) \in D\}}{(u_i^Tv - r)^2} \] Same exact setup; this just alternates which side serves as data and which as parameters. For the user problem, the item vectors serve as the data X and u_i is the theta; on the item side, the user vectors serve as the data X and v_j is the theta. Let’s give it a shot and cause some trouble.

Learn the U matrix

# Solve for U
n = unique(D$user_id)
for(uid in as.numeric(n)){
  
  # The user id
  # cat(uid, "\n")
  
  # Get all the items now
  all_items <- D %>% dplyr::filter(user_id ==  uid) %>% dplyr::pull(item_id)
  m <- length(all_items)
  
  tmp_X <- matrix(0, nrow=m, ncol=K+2)
  tmp_y <- matrix(0, nrow=m, ncol=1)
  
  idx = 1
  for(iid in as.numeric(all_items)){
    
    # The item id
    # cat(" ", iid, " ")
    
    # Store away the item vector for this record
    tmp_X[idx, ] = V[,iid]
    
    # Store away the response element for this record
    r = D %>% dplyr::filter(user_id == uid, item_id == iid) %>% dplyr::pull(rating)
    tmp_y[idx] = r
    
    # Iterate to the next item
    idx = idx + 1
    
  }
  # cat("\n")
  # break
  
  U[uid,] = solve.ols(tmp_X, tmp_y)
}

U
            [,1]        [,2]        [,3]       [,4]      [,5]
 [1,] -1.6293749  0.82692416 -0.55330880  0.2547490 1.9262298
 [2,] -1.2218737 -1.74068943 -0.91970519 -0.1309888 1.7015915
 [3,] -1.7085888  0.71109522 -0.55937700 -0.7995063 1.1314406
 [4,] -1.4344886  2.12956316 -0.23031705  1.2422230 2.2592973
 [5,] -0.6834355  0.28443809 -0.22375080 -0.3198025 0.4525762
 [6,] -0.8218100 -1.30872541 -0.67801104  0.4704885 1.6596034
 [7,] -1.7616312  0.13835765 -0.71333716 -0.7238713 1.3886281
 [8,] -0.2728864  0.24286636 -0.10660255  0.7561736 0.9169049
 [9,] -0.6939606  0.17079081 -0.25430084 -0.3047944 0.5036095
[10,]  0.1380201  0.06594401  0.02871337  0.9080647 0.6111609

Learn the V matrix

We got the U matrix after performing i=1..n regressions, building each design matrix from the item vectors. Now let’s do the same thing for the item matrix V.

# Optionally re-pin the ones column before solving for V: U[,K+1] = 1

# Solve for V
m = unique(D$item_id)
for(iid in as.numeric(m)){
  
  # The item id
  # cat(iid, "\n")
  
  # Get all the users now
  all_users <- D %>% dplyr::filter(item_id ==  iid) %>% dplyr::pull(user_id)
  n <- length(all_users)
  
  tmp_X <- matrix(0, nrow=n, ncol=K+2)
  tmp_y <- matrix(0, nrow=n, ncol=1)
  
  idx = 1
  for(uid in as.numeric(all_users)){
    
    # The user id
    # cat(" ", uid, " ")
    
    # Store away the user vector for this record
    tmp_X[idx, ] = U[uid,]
    
    # Store away the response element for this record, same as last time
    r = D %>% dplyr::filter(user_id == uid, item_id == iid) %>% dplyr::pull(rating)
    tmp_y[idx] = r
    
    # Iterate to the next user
    idx = idx + 1
    
  }

  V[,iid] = solve.ols(tmp_X, tmp_y)
}

V
           [,1]       [,2]       [,3]
[1,] 0.22583268 -0.7298176 -1.5101003
[2,] 0.10789959 -1.0803082  0.6284866
[3,] 0.04698169 -0.5567730 -0.4943936
[4,] 1.48580304 -0.1077601 -0.7066269
[5,] 1.00000000  1.0000000  1.0000000
t(V)
           [,1]       [,2]        [,3]       [,4] [,5]
[1,]  0.2258327  0.1078996  0.04698169  1.4858030    1
[2,] -0.7298176 -1.0803082 -0.55677298 -0.1077601    1
[3,] -1.5101003  0.6284866 -0.49439360 -0.7066269    1
U
            [,1]        [,2]        [,3]       [,4]      [,5]
 [1,] -1.6293749  0.82692416 -0.55330880  0.2547490 1.9262298
 [2,] -1.2218737 -1.74068943 -0.91970519 -0.1309888 1.7015915
 [3,] -1.7085888  0.71109522 -0.55937700 -0.7995063 1.1314406
 [4,] -1.4344886  2.12956316 -0.23031705  1.2422230 2.2592973
 [5,] -0.6834355  0.28443809 -0.22375080 -0.3198025 0.4525762
 [6,] -0.8218100 -1.30872541 -0.67801104  0.4704885 1.6596034
 [7,] -1.7616312  0.13835765 -0.71333716 -0.7238713 1.3886281
 [8,] -0.2728864  0.24286636 -0.10660255  0.7561736 0.9169049
 [9,] -0.6939606  0.17079081 -0.25430084 -0.3047944 0.5036095
[10,]  0.1380201  0.06594401  0.02871337  0.9080647 0.6111609
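The two sweeps above are a single pass; in practice you repeat them until J stops improving. Here is a compact, self-contained sketch of that loop on a small made-up triplet set (not the post’s data), using ridge solves with a small lambda so every subproblem stays invertible without ginv:

```r
set.seed(4)
N <- 4; M <- 3; K <- 2; lambda <- 0.1
D <- data.frame(i = c(1, 1, 2, 2, 3, 3, 4),
                j = c(1, 2, 2, 3, 1, 3, 2),
                r = c(5, 3, 4, 1, 2, 5, 3))

U <- matrix(rnorm(N * (K + 2)), N)        # latent + bias slots, as in the post
V <- matrix(rnorm((K + 2) * M), K + 2)

# Regularized loss over the observed triples
J <- function(U, V) {
  pred <- rowSums(U[D$i, , drop = FALSE] * t(V)[D$j, , drop = FALSE])
  0.5 * sum((pred - D$r)^2) + (lambda / 2) * (sum(U^2) + sum(V^2))
}

# Each block update is the exact minimizer of its ridge subproblem
ridge <- function(X, y) solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

J_trace <- J(U, V)
for (sweep in 1:10) {
  for (u in 1:N) {                        # hold V fixed, solve each user
    idx <- D$i == u
    U[u, ] <- ridge(t(V[, D$j[idx], drop = FALSE]), D$r[idx])
  }
  for (v in 1:M) {                        # hold U fixed, solve each item
    idx <- D$j == v
    V[, v] <- ridge(U[D$i[idx], , drop = FALSE], D$r[idx])
  }
  J_trace <- c(J_trace, J(U, V))
}
J_trace   # non-increasing across sweeps
```

Because each row and column update exactly minimizes the regularized objective over its own block, J can only go down (or stay flat) from sweep to sweep.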

Learning From Users and Item Embeddings

Pretty sweet. Now we have two matrices: U for our users and V for our items. They are not smaller than R in this toy example, but at scale they become SUBSTANTIALLY smaller than the initial matrix:

U %>% dim
[1] 10  5
V %>% dim
[1] 5 3
R %>% dim
[1] 10  3
cat("\n")
nrow(U) * ncol(U)
[1] 50
nrow(V) * ncol(V)
[1] 15
nrow(R) * ncol(R)
[1] 30

We actually grew this tiny problem (65 entries in U and V versus 30 in R), but the storage scales as (n+m)(k+2) rather than n*m. Imagine if we had TONS of items (which most systems do) and TONS of users (likewise): with a million users and ten thousand items, R holds 10 billion entries while U and V together hold only about 5 million. There it is folks. Now let’s see what we’ve learned.

For users, the bias term sits in the last column. Let’s see what each user’s bias rating is.

We should be careful about interpreting that column as a plain intercept, though: it is fit jointly with the latent components, so it only absorbs whatever offset those components don’t already explain.

U_df <- U %>% 
  as.data.frame %>%
  dplyr::rename(user_bias = V5) %>%
  dplyr::mutate(user = paste("User_", row_number(), sep=""),
                user_actual_mean_rating = apply(R, 1, mean)) 

row.names(U_df) <- U_df$user
U_df
                V1          V2          V3         V4 user_bias
User_1  -1.6293749  0.82692416 -0.55330880  0.2547490 1.9262298
User_2  -1.2218737 -1.74068943 -0.91970519 -0.1309888 1.7015915
User_3  -1.7085888  0.71109522 -0.55937700 -0.7995063 1.1314406
User_4  -1.4344886  2.12956316 -0.23031705  1.2422230 2.2592973
User_5  -0.6834355  0.28443809 -0.22375080 -0.3198025 0.4525762
User_6  -0.8218100 -1.30872541 -0.67801104  0.4704885 1.6596034
User_7  -1.7616312  0.13835765 -0.71333716 -0.7238713 1.3886281
User_8  -0.2728864  0.24286636 -0.10660255  0.7561736 0.9169049
User_9  -0.6939606  0.17079081 -0.25430084 -0.3047944 0.5036095
User_10  0.1380201  0.06594401  0.02871337  0.9080647 0.6111609
           user user_actual_mean_rating
User_1   User_1               2.3333333
User_2   User_2               3.0000000
User_3   User_3               1.6666667
User_4   User_4               3.3333333
User_5   User_5               0.6666667
User_6   User_6               2.0000000
User_7   User_7               2.6666667
User_8   User_8               1.0000000
User_9   User_9               1.0000000
User_10 User_10               0.6666667
library(ggplot2)

ggplot(data = U_df) +
  geom_col(aes(x=user, y=user_actual_mean_rating, fill = "Average Rating"), alpha = 0.5) +
  geom_col(aes(x=user, y=user_bias, fill="User Bias"), alpha = 0.5) +
  ggtitle("Latent User Bias vs. Average Rating") +
  labs(fill = "User Centrality") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90)) +
  xlab("User") +
  ylab("Rating")

Since the iterative regression trains the users first, the column of ones in U gets overwritten by the regression, so we lose a clean bias interpretation for the items unless we re-pin that column (U[,K+1] = 1) before solving for V.

Recommendations

We can now reconstruct the rating matrix all at once, or piece by piece, since we have U and V.

R_reconstructed <- U %*% V

R_reconstructed
             [,1]      [,2]      [,3]
 [1,]  2.00000000 2.5026590  5.000000
 [2,]  1.00000000 5.0000000  3.000000
 [3,] -0.39187712 2.0077978  5.000000
 [4,]  4.00000000 1.0000000  5.000000
 [5,] -0.15675085 0.8031191  2.000000
 [6,]  2.00000000 4.0000000  2.080847
 [7,] -0.10332101 3.0000000  5.000000
 [8,]  2.00000000 0.8315597  1.000000
 [9,] -0.09949318 1.0000000  2.000000
[10,]  2.00000000 0.3253515 -0.211677
R
      [,1] [,2] [,3]
 [1,]    2    0    5
 [2,]    1    5    3
 [3,]    0    0    5
 [4,]    4    1    5
 [5,]    0    0    2
 [6,]    2    4    0
 [7,]    0    3    5
 [8,]    2    0    1
 [9,]    0    1    2
[10,]    2    0    0
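To finish the job described in the problem setup, we mask the items each user has already rated and argmax over what remains. A sketch with a hypothetical helper and tiny made-up matrices (top_k_recommendations is not from the post):

```r
# Return each user's top-k unrated items by predicted rating
top_k_recommendations <- function(R_hat, R_obs, k = 1) {
  scores <- R_hat
  scores[R_obs != 0] <- -Inf   # never re-recommend something already rated
  lapply(seq_len(nrow(scores)),
         function(i) order(scores[i, ], decreasing = TRUE)[seq_len(k)])
}

# Toy observed and reconstructed matrices: 2 users, 3 items
R_obs <- matrix(c(5, 0, 0,
                  0, 3, 0), nrow = 2, byrow = TRUE)
R_hat <- matrix(c(4.9, 3.1, 1.2,
                  0.5, 2.8, 4.0), nrow = 2, byrow = TRUE)

recs <- top_k_recommendations(R_hat, R_obs, k = 1)
recs   # user 1 -> item 2, user 2 -> item 3
```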