Recommender Systems and Collaborative Filtering - Walkthrough
This is a walkthrough of Collaborative Filtering that I wrote for a class.
Summary
Every user on the internet today is faced with an overwhelming set of choices on almost every website they visit. Be it Facebook, Spotify, Amazon or Google, there is a need to filter, rank and deliver relevant information quickly in order to alleviate the problem of information overload. Recommender systems are used on almost every major website these days to solve this problem by searching through a large volume of dynamically generated information to provide users with personalized content and services. See Figure 1.
After completing this walkthrough you should be able to:
- Understand what Recommender Systems do.
- Get a high-level overview of different approaches used in Recommender Systems.
- Understand the concept of collaborative filtering.
- Learn how to implement collaborative filtering using the low-dimensional matrix factorization method.
- Get hands-on experience building a collaborative filtering recommender system for a real-life dataset.
Background
Recommender systems have become extremely popular in recent years. They
are a subclass of information filtering systems that seek to predict the
“rating” or “preference” that a user would give to an item. Applications
of recommender systems include movies, music, news, books, research
articles, search queries, social tags, and products: for example,
recommending news articles to online newspaper readers.
Basic Concepts:
User: any individual who provides ratings to a system. This covers both users who provide ratings and users who receive recommendations.
Item: anything for which a human can provide a rating, e.g. art, books, CDs, journal articles, music, movies, or vacation destinations.
Ratings: a vote from a user for an item by means of some value. Ratings can be scalar/ordinal (a 5-point Likert scale), binary (like/dislike), or unary (only the presence or absence of a rating is observed).
Recommender systems typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering.
- Collaborative filtering: This approach builds a model from a user’s past behaviour (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in.
- Content-based filtering: This approach uses the features of an item in order to recommend additional items with similar features/properties.
The two approaches are sometimes combined, in what are termed Hybrid Recommender Systems.
The differences between collaborative and content-based filtering can be
demonstrated by comparing two popular music recommender systems –
Last.fm and Pandora Radio.
Last.fm creates a “station” of recommended songs by observing what bands
and individual tracks the user has listened to on a regular basis and
comparing those against the listening behavior of other users. Last.fm
will play tracks that do not appear in the user’s library, but are often
played by other users with similar interests. As this approach leverages
the behavior of users, it is an example of a collaborative filtering
technique.
Pandora uses the properties of a song or artist (a subset of the 400
attributes provided by the Music Genome Project) in order to seed a
“station” that plays music with similar properties. User feedback is
used to refine the station’s results, deemphasizing certain attributes
when a user “dislikes” a particular song and emphasizing other
attributes when a user “likes” a song. This is an example of a
content-based approach.
Each type of system has its own strengths and weaknesses. In the above
example, Last.fm requires a large amount of information on a user in
order to make accurate recommendations. This is an example of the cold
start problem, and is common in collaborative filtering systems. While
Pandora needs very little information to get started, it is far more
limited in scope (for example, it can only make recommendations that are
similar to the original seed).
Recommender systems are a useful alternative to search algorithms since they help users discover items they might not have found by themselves. Interestingly enough, recommender systems are often implemented using search engines indexing non-traditional data.
In this tutorial we will be exploring an implementation of Collaborative Filtering using matrix factorization.
The Problem
To illustrate an everyday application of Collaborative Filtering, let’s
consider building the “Recommended Movies” section of IMDB. Most
recommender systems have a set of users and a group of items, such as
videos, movies, books or music. A common insight that these recommenders
follow is that personal tastes are correlated: if Alice and Bob both
like X and Alice likes Y, then Bob is more likely to like Y, especially
(perhaps) if Bob knows Alice. Thus, given that the users have rated some
items in the system, we would like to predict how the users would rate
the items that they have not yet rated, so that we can make
recommendations to the users.
Assume each user gives a movie a rating between 1 and 5 stars and we
have 5 users and 7 movies. We can represent this user and movie rating
information as a matrix such as the one in the figure.
Our task now is to predict the ratings for all the entries marked with ?. We
will use the low-dimensional matrix factorization method to accomplish this.
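Such a partially observed rating matrix can be sketched in code. Below is a minimal NumPy illustration (the rating values are made up; the lab later uses MATLAB, but the idea is the same): the matrix holds the known ratings, and an indicator matrix marks which entries are observed.

```python
import numpy as np

# Toy 5-users x 7-movies rating matrix; 0 stands for an unknown rating ("?").
R = np.array([
    [5, 3, 0, 1, 4, 0, 2],
    [4, 0, 0, 1, 0, 5, 0],
    [1, 1, 0, 5, 0, 0, 4],
    [1, 0, 0, 4, 0, 3, 0],
    [0, 1, 5, 4, 0, 0, 1],
], dtype=float)

# Indicator matrix W: w_ij = 1 if rating r_ij is observed, 0 otherwise.
W = (R > 0).astype(float)
print(R.shape, int(W.sum()))  # (5, 7) and the number of observed ratings
```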
Low Dimensional Matrix Factorization
To get the intuition behind matrix factorization, let’s consider the
example in the figure. As discussed above, personal tastes are
correlated. For example, two users will both give high ratings to a
certain movie if they both like its actors or its genre. Hence, if we
can find these characteristics/features/properties of users and movies,
we can easily predict the ratings of unseen movies. Moreover, the number
of features is very small compared to the number of users and movies.
Now let’s see the math behind matrix factorization. We need to obtain
two matrices: users (P) and movies (Q).
We have an \(m \times n\) rating matrix R of users and items and we need to find two
matrices P (\(m \times k\)) and Q (\(n \times k\)) such that their product
approximates R. See Figure below. Note:
\(k \ll m,n\), \(R \approx P \times Q^T = \widehat{R}\)
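To make the dimensions concrete, here is a minimal NumPy sketch (the choice k = 2 and the random values are purely illustrative): P is \(m \times k\), Q is \(n \times k\), and their product \(PQ^T\) has the same \(m \times n\) shape as R.

```python
import numpy as np

m, n, k = 5, 7, 2                  # users, movies, latent features (k << m, n)
rng = np.random.default_rng(0)
P = rng.standard_normal((m, k))    # user-feature matrix
Q = rng.standard_normal((n, k))    # movie-feature matrix
R_hat = P @ Q.T                    # approximation of the m x n rating matrix
print(P.shape, Q.shape, R_hat.shape)  # (5, 2) (7, 2) (5, 7)
```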
Why not SVD?
You may be wondering why we are not using SVD for factorization. The
problem is that SVD is not defined when there are missing values in the
matrix. We could tackle this by filling the missing values with zeros,
or with the average rating that the user has given to the rest of the
videos. But if the matrix is very large, with millions of users and
items, this may not be a good idea.
Cold Start
A common problem faced by most recommender systems is the cold start
problem.
In collaborative filtering, the recommender system tries to infer the
rating an active user would give to an item from the choices like-minded
users would have made. This approach would fail when there are some
items which no user has rated previously. Specifically, it concerns the
issue that the system cannot draw any inferences for users or items
about which it has not yet gathered sufficient information.
Gradient Descent
How do we now find the matrices P and Q? The idea
behind finding the values for these two matrices is that their product
should be nearly the same as the matrix in the figure above, with
predicted values filled in for the missing entries.
To do this, we can start off by initializing P and Q
with random values and use gradient descent to find the local minima of
the difference between the actual matrix and the calculated matrix
iteratively.
We can write the error between the estimated rating and actual rating
for every user-video pair as given below. Our task here is to minimize
the error in each individual rating.
Note: Objective function is multiplied by a factor of 1/2 in order to
remove 2 during differentiation.
\(e_{ij}^2 = \frac{1}{2}(r_{ij} - \hat{r}_{ij})^2 = \frac{1}{2}(r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}^T})^2\\\)
Now we need to know the direction in which we should modify the values
of \(p_{ik}\) and \(q_{kj}\). This can be obtained by calculating the gradient
at the current value. This is done by differentiating the above equation
with respect to \(p_{ik}\) and \(q_{kj}\) separately:
\(\frac{\partial}{\partial p_{ik}}e_{ij}^2 = -(r_{ij} - \hat{r}_{ij})(q_{kj})\)
\(\frac{\partial}{\partial q_{kj}}e_{ij}^2 = -(r_{ij} - \hat{r}_{ij})(p_{ik})\)
Now we can write the update rule for \(p_{ik}\) and \(q_{kj}\) as follows :
\(p_{ik}^{t+1} = p_{ik}^t - \alpha \frac{\partial}{\partial p_{ik}^t}e_{ij}^2 = p_{ik}^t - \alpha (-r_{ij} + \hat{r}_{ij})q_{kj}^t\)
\(q_{kj}^{t+1} = q_{kj}^t - \alpha \frac{\partial}{\partial q_{kj}^t}e_{ij}^2 = q_{kj}^t - \alpha (-r_{ij} + \hat{r}_{ij})p_{ik}^t\)
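The per-entry update rules above can be sketched as follows. This is a NumPy illustration with made-up data and an arbitrary learning rate, applying the updates only over observed ratings (the weighting that the next section formalizes):

```python
import numpy as np

def sgd_epoch(R, W, P, Q, alpha=0.01):
    """One sweep of the update rules over all observed (i, j) pairs."""
    m, n = R.shape
    for i in range(m):
        for j in range(n):
            if W[i, j] == 1:                        # observed ratings only
                e_ij = R[i, j] - P[i, :] @ Q[j, :]  # r_ij - r_hat_ij
                P[i, :] += alpha * e_ij * Q[j, :]   # update p_ik
                Q[j, :] += alpha * e_ij * P[i, :]   # update q_kj
    return P, Q

R = np.array([[5., 3., 0.], [4., 0., 1.], [1., 1., 5.]])
W = (R > 0).astype(float)
rng = np.random.default_rng(1)
P, Q = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
for _ in range(200):
    P, Q = sgd_epoch(R, W, P, Q)
err = np.sum(W * (R - P @ Q.T) ** 2)  # squared error on observed entries
print(err)
```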
Weighted Objective Function
In order to penalize only the known ratings we will use a weighted objective function, where \(w_{ij} = 1\) if \(r_{ij}\) is observed and \(w_{ij} = 0\) otherwise: \(\min_{P,Q} \frac{1}{2}\times w_{ij}(r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}^T})^2\)
Regularization
The above algorithm can lead to overfitting. To avoid this, we use L2
regularization to penalize the magnitudes of the factors as follows:
\(e_{ij}^2 = \frac{1}{2}\times w_{ij}(r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}^T})^2 + \lambda {(||P||^2 + ||Q||^2)}\)
\(\lambda\) provides a knob on the magnitudes of the user-feature and
video-feature vectors. It ensures that P and Q would give a good
approximation of R without having to contain large numbers. Thus our
objective function now becomes:
\(\min_{P,Q} \frac{1}{2}||W\cdot(R-PQ^T)||^2 + \lambda(||P||^2 + ||Q||^2)\)
Here W is the indicator matrix i.e. \(w_{ij} = 1\) if \(r_{ij}\) is observed
and \(w_{ij} = 0\) otherwise.
The new update rules after calculating the
gradient are as follows:
\(\Delta P = W\cdot(PQ^T - R)Q +2\lambda P\)
\(\Delta Q = (W\cdot(PQ^T - R))^TP +2\lambda Q\)
\(P^{t+1} = P^t - \alpha \Delta P^t = P^t - \alpha(W\cdot(P^tQ^{T,t} - R)Q^t +\lambda P^t)\)
\(Q^{t+1} = Q^t - \alpha \Delta Q^t = Q^t - \alpha((W\cdot(P^tQ^{T,t} - R))^TP^t +\lambda Q^t)\)
Note: Multiplication factor 2 can be consumed in \(\lambda\).
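The matrix-form updates can be written in a few vectorized lines. Below is a NumPy sketch (toy data; \(\alpha\), \(\lambda\), k and the iteration count are illustrative, and the factor 2 is absorbed into \(\lambda\) as noted above):

```python
import numpy as np

def factorize(R, W, k=2, alpha=0.01, lam=0.01, max_iter=2000):
    """Gradient descent on 1/2 ||W.(R - P Q^T)||^2 + lam (||P||^2 + ||Q||^2)."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((m, k))
    Q = 0.1 * rng.standard_normal((n, k))
    for _ in range(max_iter):
        E = W * (P @ Q.T - R)        # masked residual W.(PQ^T - R)
        dP = E @ Q + lam * P         # gradient w.r.t. P (2 absorbed into lam)
        dQ = E.T @ P + lam * Q       # gradient w.r.t. Q
        P -= alpha * dP
        Q -= alpha * dQ
    return P, Q

R = np.array([[5., 3., 0.], [4., 0., 1.], [1., 1., 5.]])
W = (R > 0).astype(float)
P, Q = factorize(R, W)
loss = 0.5 * np.sum((W * (R - P @ Q.T)) ** 2)
print(loss)  # small: observed entries are reconstructed closely
```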
Biases
The variation in rating values is most often associated with either the item being rated or the user who is rating: certain users consistently give higher ratings, and certain items consistently receive higher ratings from all users. We use a first-order approximation of these biases as follows: \(b_{ij} = \mu + b_i + b_j\), where \(\mu\) is the overall average rating, and \(b_i\) and \(b_j\) are the observed deviations of the user and the item from that average.
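A minimal sketch of estimating such bias terms from an observed rating matrix (NumPy, with made-up data; here \(b_i\) is computed per user and \(b_j\) per item as mean deviations from the global average \(\mu\)):

```python
import numpy as np

R = np.array([[5., 3., 0.], [4., 0., 1.], [1., 1., 5.]])  # 0 = unobserved
W = (R > 0).astype(float)

mu = R[W == 1].mean()  # overall average rating mu
# Per-user and per-item mean deviations from mu, over observed entries only.
b_user = R.sum(axis=1) / np.maximum(W.sum(axis=1), 1) - mu
b_item = R.sum(axis=0) / np.maximum(W.sum(axis=0), 1) - mu

# First-order baseline prediction: b_ij = mu + b_i + b_j
baseline = mu + b_user[:, None] + b_item[None, :]
print(np.round(baseline, 2))
```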
Lab
Now let’s perform a more thorough analysis of collaborative filtering
using the matrix factorization method. To do this we will use the dataset
available here: http://grouplens.org/datasets/movielens/100k/
The goal of this exercise is to build a recommendation system for IMDB.
More specifically:
- Construct the User-Item Matrix.
- Define a factorization model (cost function). We will use matrix factorization, regularization and gradient descent to obtain a model that minimizes the function below. \(\min_{M,U} \frac{1}{2}||R\cdot(Y - MU^T)||_F^2 + \lambda(||M||_F^2 + ||U||_F^2)\) Here M is the movie matrix, U is the user matrix, R is the indicator weight matrix, \(||\cdot||_F\) is the Frobenius norm, the operator \(\cdot\) denotes the element-wise product and \(\lambda\) is the regularization parameter.
- Understand the impact of increasing or decreasing the number of features on the accuracy of the model.
- Experiment with the convergence of the model by varying the learning rate and the regularization parameters.
- Understand the impact of adding bias.
Load the dataset
Download the dataset from this link:
http://pages.cs.wisc.edu/~ashenoy/CS532/ Since the original dataset has
many files and asks you to merge the files using specific commands, we
have merged the data for you and have created training and testing
datasets. The dataset in total has 100K ratings. We will be using 80K
for training and 20K for testing. train_all.mat has two matrices, each
of size \(1682 \times 943\). In Rating_train (Y), each entry (i,j) is the
rating given to the ith movie by the jth user, and L_train (R) is the
corresponding indicator matrix. Similarly, test_all.mat has two
matrices, each of size \(1682 \times 943\): test_Y is the corresponding
test rating matrix and test_R is the corresponding test indicator
matrix.
Write the MATLAB code to load train_all.mat and test_all.mat and verify
the above.
Initialize learning rate, regularization parameter and maximum number of iterations
Now that the dataset is loaded and we have the training and testing set,
we can initialize the following tuning parameters:
\(\alpha = 0.001\)
\(\lambda = 10\)
\(max\_iter = 500\)
Initialize the M(movies) and U(users) matrix
M and U matrices are the factors of the ratings matrix Y. Let’s start with a feature size of 10 and initialize these two matrices with appropriate dimensions and fill them with normally distributed random values.
Gradient Descent
Now that you have the M (movies) and U (users) matrices, write code to
update M and U using the gradient descent method:
\(M^{t+1} = M^t - \alpha(R\cdot(M^tU^{t,T}-Y)U^t +\lambda M^t)\)
\(U^{t+1} = U^t - \alpha((R\cdot(M^tU^{t,T}-Y))^TM^t +\lambda U^t)\)
This has to be performed \(max\_iter\) number of times.
Formulate the loss function
Every time after updating M and U, write code to check for
convergence. We can assume convergence if the error calculated using the
formula below is less than a threshold \(\epsilon = 0.0001\). If the
convergence condition is met, we should stop the gradient calculation.
\(\frac{||M^{t+1}-M^t||^2_F+||U^{t+1}-U^t||^2_F }{||M^t||^2_F+||U^t||^2_F} < \epsilon\)
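The convergence test can be written as a small helper. Here is a NumPy sketch (the matrices below are stand-ins for consecutive iterates of M and U; the lab itself is in MATLAB):

```python
import numpy as np

def converged(M_new, M_old, U_new, U_old, eps=1e-4):
    """Relative squared-Frobenius change of the factors between iterations."""
    num = np.linalg.norm(M_new - M_old, 'fro')**2 + np.linalg.norm(U_new - U_old, 'fro')**2
    den = np.linalg.norm(M_old, 'fro')**2 + np.linalg.norm(U_old, 'fro')**2
    return num / den < eps

M, U = np.ones((4, 2)), np.ones((3, 2))
print(converged(M + 1e-5, M, U + 1e-5, U))  # True: a tiny update
print(converged(M + 1.0,  M, U + 1.0,  U))  # False: a large update
```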
Predicted Ratings
After performing all of the above steps, you will end up with updated M and U matrices. Let’s call them \(M\_result\) and \(U\_result\) and assign them to these two new variables. Now we can obtain the predicted ratings matrix by computing the product \(M\_result \times U\_result^T\).
Calculate Error
Now that you have the predicted ratings matrix, let’s calculate the
error rate using the test dataset.
\(error\_rate = \frac{||test\_R \cdot (predicted\_matrix - test\_Y)||^2_F}{||test\_Y||^2_F}\)
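The error rate formula translates directly into code. A NumPy sketch with toy matrices standing in for test_R, test_Y and the predicted matrix:

```python
import numpy as np

def error_rate(pred, test_Y, test_R):
    """||test_R . (pred - test_Y)||_F^2 / ||test_Y||_F^2 (element-wise mask)."""
    num = np.linalg.norm(test_R * (pred - test_Y), 'fro') ** 2
    den = np.linalg.norm(test_Y, 'fro') ** 2
    return num / den

test_Y = np.array([[4., 0.], [0., 2.]])
test_R = (test_Y > 0).astype(float)
pred = np.array([[3.5, 1.0], [5.0, 2.5]])
print(error_rate(pred, test_Y, test_R))  # (0.25 + 0.25) / 20 = 0.025
```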
Varying number of features
Now let’s try to analyze how varying the number of features in the M and U matrices impacts the accuracy. Plot a graph with \(error\_rate\) on the Y axis and \(num\_of\_features\) on the X axis. What do you observe? What is the best number of features? Take at least 10 values ranging from the minimum to the maximum number of features.
Varying regularization parameter
Now let’s analyze how varying \(\lambda\) impacts the accuracy. Plot a graph with \(error\_rate\) on the Y axis and \(\lambda\) on the X axis. What do you observe? What is the best value of lambda? Take logarithmic values of lambda (i.e. 0.01, 0.1, etc.). Take at least 10 values.
Varying learning rate
Also analyze how varying \(\alpha\) impacts the convergence rate. What happens if you make alpha too small (like 0.0001 or 0.00001), keeping the number of iterations the same? And what if you make alpha too big (like 1 or 10)?
Predict missing ratings with best values of parameters
Now that you have tuned all the parameters and obtained the best value for each of them, let’s find the missing ratings in Y. Output the result (user, movie, rating) in a text file. Note: use the indicator matrix to find the missing entries, and round the predicted ratings before writing them to the text file.
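One way to produce the (user, movie, rating) output is sketched below (NumPy, with toy matrices; in the lab, Y and the indicator matrix come from train_all.mat, where rows index movies and columns index users):

```python
import numpy as np

Y = np.array([[5., 0.], [0., 3.]])              # toy rating matrix (0 = missing)
R = (Y > 0).astype(float)                        # indicator matrix
predicted = np.array([[4.8, 2.2], [3.9, 3.1]])   # stand-in for M_result * U_result^T

lines = []
movies, users = np.where(R == 0)                 # entries missing in Y
for i, j in zip(movies, users):
    rating = int(round(predicted[i, j]))         # round before writing out
    lines.append(f"{j + 1}\t{i + 1}\t{rating}")  # user, movie, rating (1-based)
print("\n".join(lines))
```

In the lab, the joined `lines` would be written to a text file instead of printed.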
If you found this work useful please cite as:
@misc{AshishShenoy_Rec_Sys,
author = {Shenoy, Ashish},
title = {Recommender Systems and Collaborative Filtering - Walkthrough},
howpublished = {\url{https://www.ashishvs.in}},
year = {2017},
note = {Accessed: 2017-03-21},
url = {www.ashishvs.in}
}