{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(probabilistic_matrix_factorization)=\n", "# Probabilistic Matrix Factorization for Making Personalized Recommendations\n", "\n", ":::{post} June 3, 2022\n", ":tags: case study, product recommendation, matrix factorization\n", ":category: intermediate\n", ":author: Ruslan Salakhutdinov, Andriy Mnih, Mack Sweeney, Colin Carroll, Rob Zinkov\n", ":::" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import arviz as az\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import xarray as xr" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%config InlineBackend.figure_format = 'retina'\n", "RANDOM_SEED = 8927\n", "rng = np.random.default_rng(RANDOM_SEED)\n", "az.style.use(\"arviz-darkgrid\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation\n", "\n", "So you are browsing for something to watch on Netflix and just not liking the suggestions. You just know you can do better. All you need to do is collect some ratings data from yourself and friends and build a recommendation algorithm. This notebook will guide you in doing just that!\n", "\n", "We'll start out by getting some intuition for how our model will work. Then we'll formalize our intuition. Afterwards, we'll examine the dataset we are going to use. Once we have some notion of what our data looks like, we'll define some baseline methods for predicting preferences for movies. Following that, we'll look at Probabilistic Matrix Factorization (PMF), which is a more sophisticated Bayesian method for predicting preferences. Having detailed the PMF model, we'll use PyMC for MAP estimation and MCMC inference. Finally, we'll compare the results obtained with PMF to those obtained from our baseline methods and discuss the outcome.\n", "\n", "## Intuition\n", "\n", "Normally if we want recommendations for something, we try to find people who are similar to us and ask their opinions. If Bob, Alice, and Monty are all similar to me, and they all like crime dramas, I'll probably like crime dramas. Now this isn't always true. It depends on what we consider to be \"similar\". In order to get the best bang for our buck, we really want to look for people who have the most similar taste. Taste being a complex beast, we'd probably like to break it down into something more understandable. We might try to characterize each movie in terms of various factors. Perhaps films can be moody, light-hearted, cinematic, dialogue-heavy, big-budget, etc. Now imagine we go through IMDB and assign each movie a rating in each of the categories. How moody is it? How much dialogue does it have? What's its budget? Perhaps we use numbers between 0 and 1 for each category. Intuitively, we might call this the film's profile.\n", "\n", "Now let's suppose we go back to those 5 movies we rated. At this point, we can get a richer picture of our own preferences by looking at the film profiles of each of the movies we liked and didn't like. Perhaps we take the averages across the 5 film profiles and call this our ideal type of film. In other words, we have computed some notion of our inherent _preferences_ for various types of movies. Suppose Bob, Alice, and Monty all do the same. Now we can compare our preferences and determine how similar each of us really are. I might find that Bob is the most similar and the other two are still more similar than other people, but not as much as Bob. So I want recommendations from all three people, but when I make my final decision, I'm going to put more weight on Bob's recommendation than those I get from Alice and Monty.\n", "\n", "While the above procedure sounds fairly effective as is, it also reveals an unexpected additional source of information. If we rated a particular movie highly, and we know its film profile, we can compare with the profiles of other movies. If we find one with very close numbers, it is probable we'll also enjoy this movie. Both this approach and the one above are commonly known as _neighborhood approaches_. Techniques that leverage both of these approaches simultaneously are often called _collaborative filtering_ {cite:p}`koren2009matrixfactorization`. The first approach we talked about uses user-user similarity, while the second uses item-item similarity. Ideally, we'd like to use both sources of information. The idea is we have a lot of items available to us, and we'd like to work together with others to filter the list of items down to those we'll each like best. My list should have the items I'll like best at the top and those I'll like least at the bottom. Everyone else wants the same. If I get together with a bunch of other people, we all watch 5 movies, and we have some efficient computational process to determine similarity, we can very quickly order the movies to our liking.\n", "\n", "## Formalization\n", "\n", "Let's take some time to make the intuitive notions we've been discussing more concrete. We have a set of $M$ movies, or _items_ ($M = 100$ in our example above). We also have $N$ people, whom we'll call _users_ of our recommender system. For each item, we'd like to find a $D$ dimensional factor composition (film profile above) to describe the item. Ideally, we'd like to do this without actually going through and manually labeling all of the movies. Manual labeling would be both slow and error-prone, as different people will likely label movies differently. So we model each movie as a $D$ dimensional vector, which is its latent factor composition. Furthermore, we expect each user to have some preferences, but without our manual labeling and averaging procedure, we have to rely on the latent factor compositions to learn $D$ dimensional latent preference vectors for each user. The only thing we get to observe is the $N \\times M$ ratings matrix $R$ provided by the users. Entry $R_{ij}$ is the rating user $i$ gave to item $j$. Many of these entries may be missing, since most users will not have rated all 100 movies. Our goal is to fill in the missing values with predicted ratings based on the latent variables $U$ and $V$. We denote the predicted ratings by $R_{ij}^*$. We also define an indicator matrix $I$, with entry $I_{ij} = 0$ if $R_{ij}$ is missing and $I_{ij} = 1$ otherwise.\n", "\n", "So we have an $N \\times D$ matrix of user preferences which we'll call $U$ and an $M \\times D$ factor composition matrix we'll call $V$. We also have a $N \\times M$ rating matrix we'll call $R$. We can think of each row $U_i$ as indications of how much each user prefers each of the $D$ latent factors. Each row $V_j$ can be thought of as how much each item can be described by each of the latent factors. In order to make a recommendation, we need a suitable prediction function which maps a user preference vector $U_i$ and an item latent factor vector $V_j$ to a predicted ranking. The choice of this prediction function is an important modeling decision, and a variety of prediction functions have been used. Perhaps the most common is the dot product of the two vectors, $U_i \\cdot V_j$ {cite:p}`koren2009matrixfactorization`.\n", "\n", "To better understand CF techniques, let us explore a particular example. Imagine we are seeking to recommend movies using a model which infers five latent factors, $V_j$, for $j = 1,2,3,4,5$. In reality, the latent factors are often unexplainable in a straightforward manner, and most models make no attempt to understand what information is being captured by each factor. However, for the purposes of explanation, let us assume the five latent factors might end up capturing the film profile we were discussing above. So our five latent factors are: moody, light-hearted, cinematic, dialogue, and budget. Then for a particular user $i$, imagine we infer a preference vector $U_i = <0.5, 0.1, 1.5, 1.1, 0.3>$. Also, for a particular item $j$, we infer these values for the latent factors: $V_j = <0.5, 1.5, 1.25, 0.8, 0.9>$. Using the dot product as the prediction function, we would calculate 3.425 as the ranking for that item, which is more or less a neutral preference given our 1 to 5 rating scale.\n", "\n", "$$ 0.5 \\times 0.5 + 0.1 \\times 1.5 + 1.5 \\times 1.25 + 1.1 \\times 0.8 + 0.3 \\times 0.9 = 3.425 $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "The MovieLens 100k dataset {cite:p}`harper2015movielens` was collected by the GroupLens Research Project at the University of Minnesota. This data set consists of 100,000 ratings (1-5) from 943 users on 1682 movies. Each user rated at least 20 movies, and be have basic information on the users (age, gender, occupation, zip). Each movie includes basic information like title, release date, video release date, and genre. We will implement a model that is suitable for collaborative filtering on this data and evaluate it in terms of root mean squared error (RMSE) to validate the results.\n", "\n", "The data was collected through the [MovieLens website](https://movielens.org/) during the seven-month period from September 19th,\n", "1997 through April 22nd, 1998. This data has been cleaned up - users\n", "who had less than 20 ratings or did not have complete demographic\n", "information were removed from this data set.\n", "\n", "\n", "Let's begin by exploring our data. We want to get a general feel for what it looks like and a sense for what sort of patterns it might contain. Here are the user rating data:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | userid | \n", "itemid | \n", "rating | \n", "timestamp | \n", "
---|---|---|---|---|
0 | \n", "196 | \n", "242 | \n", "3 | \n", "881250949 | \n", "
1 | \n", "186 | \n", "302 | \n", "3 | \n", "891717742 | \n", "
2 | \n", "22 | \n", "377 | \n", "1 | \n", "878887116 | \n", "
3 | \n", "244 | \n", "51 | \n", "2 | \n", "880606923 | \n", "
4 | \n", "166 | \n", "346 | \n", "1 | \n", "886397596 | \n", "
\n", " | movie title | \n", "release date | \n", "video release date | \n", "IMDb URL | \n", "unknown | \n", "Action | \n", "Adventure | \n", "Animation | \n", "Children's | \n", "Comedy | \n", "... | \n", "Fantasy | \n", "Film-Noir | \n", "Horror | \n", "Musical | \n", "Mystery | \n", "Romance | \n", "Sci-Fi | \n", "Thriller | \n", "War | \n", "Western | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
movie id | \n", "\n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " | \n", " |
1 | \n", "Toy Story (1995) | \n", "1995-01-01 | \n", "NaN | \n", "http://us.imdb.com/M/title-exact?Toy%20Story%2... | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "1 | \n", "1 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
2 | \n", "GoldenEye (1995) | \n", "1995-01-01 | \n", "NaN | \n", "http://us.imdb.com/M/title-exact?GoldenEye%20(... | \n", "0 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "
3 | \n", "Four Rooms (1995) | \n", "1995-01-01 | \n", "NaN | \n", "http://us.imdb.com/M/title-exact?Four%20Rooms%... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "
4 | \n", "Get Shorty (1995) | \n", "1995-01-01 | \n", "NaN | \n", "http://us.imdb.com/M/title-exact?Get%20Shorty%... | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
5 | \n", "Copycat (1995) | \n", "1995-01-01 | \n", "NaN | \n", "http://us.imdb.com/M/title-exact?Copycat%20(1995) | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "... | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "
5 rows × 23 columns
\n", "