In the previous article, we learned about recommender systems, which give users recommendations based on various techniques. We distinguished the two main families of recommendation systems: model-based and memory-based.
In this article, we look at collaborative filtering, a type of memory-based recommender system. There are two types of collaborative filtering: item-based and user-based. We discuss in detail how they work, how to implement them in Python, and the techniques used alongside them, such as correlation, the alternating least squares method, and matrix factorization with SVD.
Memory-based models come in three types: collaborative filtering, content-based filtering, and hybrid methods. Let's have a look at collaborative filtering.
When it comes to developing intelligent recommender systems that can learn to provide better recommendations as more knowledge about users is collected, collaborative filtering is one of the most commonly used techniques.
Collaborative filtering is used by many large websites, including Amazon, YouTube, and Netflix, as part of their sophisticated recommendation systems. This technique creates recommenders that make recommendations to a user based on other users' likes and dislikes.
It works by sifting through many people to find a smaller group of users with similar tastes to a specific person. It analyzes their favorite products and compiles a ranked list of recommendations. There are various forms of collaborative filtering strategies discussed below.
Types of memory/similarity based models
Item-Based collaborative filtering
Diagram of how item-based collaborative filtering works
Item-based collaborative filtering is a model-based recommendation algorithm. The algorithm calculates the similarities between the different items in the dataset using one of several similarity measures. It then uses these similarity values to predict ratings for user-item pairs that are not in the dataset.
Calculate the similarity among the items:
The similarity between two items is calculated from the ratings of users who have rated both of them.
Many different mathematical formulations can be used to measure the similarity of items. Each formula includes terms summed over the set U of common users, that is, the users who rated both items, as shown in the formulas below.
Cosine-based similarity
This formulation, also known as vector-based similarity, treats two items and their ratings as vectors and defines their similarity as the cosine of the angle between them:
Cosine-Based similarity formula
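As a rough illustration of the cosine-based measure, here is a minimal sketch. It assumes a small dense ratings array in which rows are users, columns are items, and 0 marks a missing rating. The array and the function name `cosine_item_similarity` are made up for this example and do not come from any library.

```python
import numpy as np

def cosine_item_similarity(ratings, i, j):
    """Cosine similarity between items i and j over users who rated both."""
    ri, rj = ratings[:, i], ratings[:, j]
    common = (ri > 0) & (rj > 0)  # users who rated both items
    if not common.any():
        return 0.0
    a, b = ri[common], rj[common]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings: 4 users x 3 items, 0 = not rated
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
    [2, 1, 4],
], dtype=float)

sim_01 = cosine_item_similarity(ratings, 0, 1)
sim_02 = cosine_item_similarity(ratings, 0, 2)
print(round(sim_01, 3), round(sim_02, 3))
```

In this toy data, items 0 and 1 are rated alike by the same users, so their similarity comes out higher than that of items 0 and 2.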
Pearson (correlation)-based similarity
This similarity measures how closely the ratings given by common users to a pair of items vary together, taken as deviations from each item's average rating:
Pearson correlation formula
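The sketch below shows one way to compute a Pearson-based item similarity. It assumes the same layout as before (rows are users, columns are items, 0 means not rated) and, as a simplification, takes each item's average over the common users only. The function name and data are illustrative.

```python
import numpy as np

def pearson_item_similarity(ratings, i, j):
    """Pearson correlation between items i and j over users who rated both."""
    ri, rj = ratings[:, i], ratings[:, j]
    common = (ri > 0) & (rj > 0)
    if common.sum() < 2:
        return 0.0
    # Deviations from each item's mean rating (mean taken over common users)
    a = ri[common] - ri[common].mean()
    b = rj[common] - rj[common].mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical ratings: 4 users x 3 items, 0 = not rated
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
    [2, 1, 4],
], dtype=float)

sim_01 = pearson_item_similarity(ratings, 0, 1)
sim_02 = pearson_item_similarity(ratings, 0, 2)
print(round(sim_01, 3), round(sim_02, 3))
```

Unlike cosine similarity on raw ratings, this measure can be negative, which happens here for items 0 and 2 because users who like one tend to dislike the other.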
Adjusted cosine similarity
Adjusted cosine similarity is a modified version of vector-based similarity. It accounts for the fact that different users have different rating scales; in other words, some users tend to rate items highly in general, while others tend to rate items lower. To overcome this limitation of vector-based similarity, we subtract each user's average rating from their ratings for the pair of items in question.
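A minimal sketch of this adjustment follows, under the same assumptions as the earlier examples (rows are users, columns are items, 0 means not rated). Each user's average is taken over the items that user actually rated. The function name and data are illustrative.

```python
import numpy as np

def adjusted_cosine_similarity(ratings, i, j):
    """Cosine similarity of items i and j after removing each user's mean rating."""
    rated = ratings > 0
    # Per-user mean over the items each user actually rated
    user_means = ratings.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)
    common = rated[:, i] & rated[:, j]
    if not common.any():
        return 0.0
    a = ratings[common, i] - user_means[common]
    b = ratings[common, j] - user_means[common]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical ratings: 4 users x 3 items, 0 = not rated
ratings = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
    [2, 1, 4],
], dtype=float)

sim_01 = adjusted_cosine_similarity(ratings, 0, 1)
sim_02 = adjusted_cosine_similarity(ratings, 0, 2)
print(round(sim_01, 3), round(sim_02, 3))
```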
From model to predictions
Once we have built a model using one of the similarity measures above, we can predict the rating for any user-item pair using a weighted sum. We start by collecting the items most similar to the target item, then keep those that the active user has rated. Each of the user's ratings is weighted by the similarity between that item and the target item. Finally, to keep the predicted rating within the rating range, we scale the prediction by the sum of the similarities:
Formula to calculate the predicted rating
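The weighted sum can be sketched as follows. This is a simplified version that uses every item the user has rated rather than only the closest neighbours, and the similarity values are invented for the example.

```python
import numpy as np

def predict_rating(user_ratings, item_sims):
    """Weighted-sum prediction for one user and one target item.

    user_ratings: the user's ratings for all items (0 = not rated)
    item_sims:    similarity of every item to the target item
    """
    rated = user_ratings > 0
    weights = item_sims[rated]
    norm = np.abs(weights).sum()  # sum of similarities used for scaling
    if norm == 0:
        return 0.0
    return float(weights @ user_ratings[rated] / norm)

# Hypothetical user who has not rated item 2 (the target item)
user_ratings = np.array([5.0, 3.0, 0.0, 1.0])
item_sims = np.array([0.9, 0.5, 1.0, 0.1])  # similarity of each item to item 2

predicted = predict_rating(user_ratings, item_sims)
print(round(predicted, 2))
```

Because the item most similar to the target was rated 5, the prediction is pulled toward the upper end of the scale.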
Alternating least square method
An explicit dataset contains an explicit rating, count, or category for a particular item or event. A rating of 4 out of 5 for a film is an explicit data point. In contrast, with an implicit dataset we need to understand users' interactions and events before we can determine the rating or category. Consider a person who only watches one genre of film: the preference is never stated, but it can be inferred from behavior. Such data is called implicit. We would miss out on a lot of hidden insights if we ignored implicit datasets.
The implicit dataset consists solely of interactions between users and items.
Alternating least squares (ALS) is a matrix factorization algorithm. As seen in the diagram below, a matrix is factorized into two smaller matrices. Think of the first matrix as the set of interactions between users and items. The factorized matrices then hold the user features and the item features.
Matrix factorization diagram
The values of the interactions matrix are events, each with a specific preference and confidence. Take an e-commerce dataset, for example, with three events: view, add-to-cart, and transaction. The preference is considered positive when there is any interaction between the user and item pair.
Preference calculation formula
Confidence is the strength or worth of the interaction. For example, when a user purchases item X (a transaction event), we increase the interaction weight, whereas user A viewing item Z is weighted less than a purchase interaction.
Confidence calculation formula
Confidence: r is the interaction between user u and item i. The more interactions, the higher the confidence, scaled by the value of α. An item bought 5 times has more confidence than an item bought twice. We add 1 so that the confidence is nonzero even when r is 0 for an interaction. The paper recommends a value of 40 for α.
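To make the two quantities concrete, the sketch below derives preference and confidence from a small interaction matrix, following the description above (preference is 1 when there is any interaction, confidence is 1 plus α times the interaction). The interaction counts are hypothetical.

```python
import numpy as np

alpha = 40  # value of alpha mentioned in the text

# r[u, i]: hypothetical interaction counts, rows = users, columns = items
r = np.array([
    [0, 2, 5],
    [1, 0, 0],
], dtype=float)

preference = (r > 0).astype(float)  # 1 if the user interacted with the item, else 0
confidence = 1 + alpha * r          # grows with the number of interactions

print(preference)
print(confidence)
```

The item bought 5 times gets a confidence of 201, compared with 81 for the item bought twice and 1 for an item with no interaction.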
The paper describes the following cost function for finding the user-factor and item-factor matrices:
The formula for alternating least square method
Here, λ is the regularization parameter; its value is chosen by cross-validation.
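As a sketch, the cost can be evaluated as shown below, assuming the usual form of this objective: a confidence-weighted squared error between the preference and the dot product of the user and item factors, plus λ times the squared norms of all factors. The function name `als_cost` and the data are illustrative.

```python
import numpy as np

def als_cost(r, X, Y, alpha=40.0, lam=0.1):
    """Confidence-weighted squared error plus L2 regularization.

    r: interaction matrix (users x items)
    X: user factors (users x f), Y: item factors (items x f)
    """
    p = (r > 0).astype(float)  # preference
    c = 1 + alpha * r          # confidence
    err = p - X @ Y.T
    return float((c * err ** 2).sum() + lam * ((X ** 2).sum() + (Y ** 2).sum()))

# Hypothetical interactions: 2 users x 2 items
r = np.array([[0, 2], [1, 0]], dtype=float)

# With all-zero factors every observed preference is missed entirely
zero_cost = als_cost(r, np.zeros((2, 3)), np.zeros((2, 3)))
print(zero_cost)
```

Note that the sum runs over every user-item pair, including pairs with no interaction, which is what makes the cost expensive to evaluate on large datasets.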
The Essence of Alternating Least Squares
The cost function contains m · n terms, where m is the number of users and n is the number of items. For typical datasets, m · n can easily reach a few billion. At that scale, direct optimization methods such as stochastic gradient descent become impractical. The paper therefore introduces an alternative optimization technique.
Note that when either the user-factors or the item-factors are held fixed, the cost function becomes quadratic, so its global minimum can be computed directly. This leads to an alternating least squares optimization process, where we alternate between recomputing the user-factors and the item-factors. Each step is guaranteed to lower the value of the cost function.
The user vector (x) and the item vector (y) are obtained by differentiating the cost function above with respect to x and y and setting the result to zero.
Cost function of X and Y
User and Item Vector
To find the preference score for a user-item pair, we use the following:
Preference score formula
The items with the largest predicted preference values p are recommended to the user.
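The alternating updates and the preference score described above can be sketched in a few lines of NumPy. This is a dense, unoptimized illustration that assumes the closed-form update x = (YᵀCY + λI)⁻¹ YᵀC p for each user (and the symmetric update for each item). The helper names and data are hypothetical, and a production library uses a much more efficient formulation.

```python
import numpy as np

def solve_factors(fixed, conf, pref, lam):
    """Recompute one side's factors while the other side (`fixed`) is held constant."""
    n_factors = fixed.shape[1]
    solved = np.zeros((conf.shape[0], n_factors))
    for k in range(conf.shape[0]):
        C = np.diag(conf[k])  # confidence of row k as a diagonal matrix
        A = fixed.T @ C @ fixed + lam * np.eye(n_factors)
        b = fixed.T @ C @ pref[k]
        solved[k] = np.linalg.solve(A, b)
    return solved

def cost(conf, pref, X, Y, lam):
    err = pref - X @ Y.T
    return float((conf * err ** 2).sum() + lam * ((X ** 2).sum() + (Y ** 2).sum()))

# Hypothetical interactions: 3 users x 4 items
r = np.array([[0, 2, 5, 0], [1, 0, 0, 3], [0, 1, 4, 0]], dtype=float)
pref = (r > 0).astype(float)
conf = 1 + 40 * r
lam = 0.1

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(3, 2))  # user factors
Y = rng.normal(scale=0.1, size=(4, 2))  # item factors

costs = [cost(conf, pref, X, Y, lam)]
for _ in range(10):
    X = solve_factors(Y, conf, pref, lam)      # fix items, solve for users
    Y = solve_factors(X, conf.T, pref.T, lam)  # fix users, solve for items
    costs.append(cost(conf, pref, X, Y, lam))

scores = X @ Y.T  # predicted preference for every user-item pair
print(np.round(scores, 2))
```

Tracking `costs` across iterations is a simple way to check that each alternating step does not increase the cost.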
Implementation of alternating least square method
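import pandas as pd
import numpy as np
import scipy.sparse as sparse
import implicit
from datetime import datetime

def load_events(datapath):  # illustrative name; the original function header is not shown
    df = pd.read_csv(datapath)
    df = df.assign(date=pd.Series(datetime.fromtimestamp(a / 1000).date() for a in df.timestamp))  # millisecond timestamps to dates
    df = df.sort_values(by='date').reset_index(drop=True)  # for some reason RetailRocket did NOT sort the data by date
    return df

datapath = './input/events.csv'
data = load_events(datapath)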
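# Encode visitor and item ids as contiguous integer codes to use as matrix indices
data['visitorid'] = data['visitorid'].astype("category")
data['itemid'] = data['itemid'].astype("category")
data['visitor_id'] = data['visitorid'].cat.codes
data['item_id'] = data['itemid'].cat.codes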
Creating Interaction Matrices
Because the data is sparse, we create a sparse item-user matrix as the input to the model. A user-item matrix is then used to make the recommendations.
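# Assumption: the event column holds event names, so they are mapped to illustrative numeric weights first
event_weights = {'view': 1.0, 'addtocart': 2.0, 'transaction': 3.0}
weights = data['event'].map(event_weights).astype(float)
sparse_item_user = sparse.csr_matrix((weights, (data['item_id'], data['visitor_id'])))
sparse_user_item = sparse.csr_matrix((weights, (data['visitor_id'], data['item_id'])))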
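# Building the model
model = implicit.als.AlternatingLeastSquares(factors=20, regularization=0.1, iterations=20)
alpha_val = 40
# Scale the interactions by alpha to obtain confidence values
data_conf = (sparse_item_user * alpha_val).astype('double')
# Older implicit releases (before 0.5) expect an item-user matrix here; newer releases expect a user-item matrix
model.fit(data_conf)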
Using the Model
Getting the recommendations using the inbuilt library function
# Get recommendations
user_id = 14
recommended = model.recommend(user_id, sparse_user_item)
print(recommended)
We can also use the following function to have a list of similar items
# Get similar items
item_id = 7
n_similar = 3
similar = model.similar_items(item_id, n_similar)
print(similar)
Implementation of item-based collaborative filtering
For this example, we use a movie dataset to make recommendations using item-based collaborative filtering.
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
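Attributes = ['user_id', 'object_id', 'voting', 'timevoted']
# ratings_path and films_path are placeholders; the original file paths are not shown
df = pd.read_csv(ratings_path, sep='\t', names=Attributes)  # assumes a tab-separated ratings file
# Assumes the first two columns of the '|'-separated film file are the id and the title
Films = pd.read_csv(films_path, sep='|', names=['object_id', 'film title'], usecols=[0, 1], encoding='latin-1')
Film_names = Films[['object_id', 'film title']]
combined_Film_data = pd.merge(df, Film_names, on='object_id')
# Users as rows and film titles as columns; X is the transpose, with one row per film
rating_crosstab = combined_Film_data.pivot_table(values='voting', index='user_id', columns='film title', fill_value=0)
X = rating_crosstab.T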
Using scikit-learn, we can run the SVD efficiently.
SVD = TruncatedSVD(n_components=12, random_state=5)
resultant_matrix = SVD.fit_transform(X)
This gives a matrix with 1564 rows (one for each distinct movie) and twelve columns, which are the latent variables.
We can use various similarity measures, such as Pearson correlation or cosine similarity. Here we work with the Pearson correlation. Let's create a matrix of correlations:
corr_mat = np.corrcoef(resultant_matrix)
Find Similar Movies
Let's search for films similar to Star Wars (1977).
Similar Movies to Star Wars (1977)
col_idx = rating_crosstab.columns.get_loc("Star Wars (1977)")
corr_specific = corr_mat[col_idx]
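# Pair each film title with its correlation to Star Wars and sort from most to least similar
similar_films = pd.DataFrame({'correlation': corr_specific, 'film': rating_crosstab.columns})
print(similar_films.sort_values('correlation', ascending=False))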
|correlation|film|
|1.988090|The hulk (1990)|
|0.974499|Iron man (2010)|
|0.799799|Justice League (2005)|
User-based collaborative filtering
User-based collaborative filtering is a method of predicting which items a user would enjoy based on the ratings given to those items by other users who have similar tastes to the target user.
Steps for User-Based Collaborative Filtering:
Step 1: Find the similarity of the other users to the target user U.
The similarity for any two users, A and B, can be calculated using the formula below.
Formula to find similarity
Step 2: Prediction of an item’s missing rating
The target user may be very similar to some users and not very similar to others. Therefore, the ratings given to a particular item by more similar users should carry more weight than those given by less similar users. This is handled with a weighted average approach: you multiply each user's rating by the similarity factor calculated with the formula mentioned above, then divide by the sum of the similarity factors.
The missing rating is calculated as:
Formula to find the missing rating
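The two steps can be sketched as follows. Since the similarity formula itself is not reproduced here, this example assumes cosine similarity over the items both users have rated. The function names and ratings are illustrative.

```python
import numpy as np

def user_similarity(ratings, u, v):
    """Cosine similarity between users u and v over items both have rated (an assumed choice)."""
    common = (ratings[u] > 0) & (ratings[v] > 0)
    if not common.any():
        return 0.0
    a, b = ratings[u, common], ratings[v, common]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_missing_rating(ratings, u, i):
    """Similarity-weighted average of other users' ratings for item i."""
    num, den = 0.0, 0.0
    for v in range(ratings.shape[0]):
        if v == u or ratings[v, i] == 0:
            continue  # skip the target user and users who did not rate item i
        s = user_similarity(ratings, u, v)
        num += s * ratings[v, i]
        den += abs(s)
    return num / den if den > 0 else 0.0

# Hypothetical ratings: 3 users x 4 items, 0 = not rated
ratings = np.array([
    [5, 4, 0, 1],
    [5, 5, 4, 1],
    [1, 2, 1, 5],
], dtype=float)

predicted = predict_missing_rating(ratings, u=0, i=2)
print(round(predicted, 2))
```

User 0 is much more similar to user 1 than to user 2, so the prediction for item 2 lands closer to user 1's rating of 4 than to user 2's rating of 1.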
Collaborative filtering is used by many large websites, including Amazon, YouTube, and Netflix. This technique creates recommenders that make recommendations to a user by sifting through a large number of users to find those with similar tastes.
Item-based collaborative filtering is a model-based recommendation algorithm. The algorithm calculates the similarities between the different items in the dataset.
User-based collaborative filtering is a method for predicting which items a user would like based on the ratings of similar users. Content-based filtering, by contrast, uses item features to suggest other products that are close to what the user wants.
In this article, we have looked at how we can use collaborative filtering to recommend products to a user based on how similar other products are to a given product, and on what similar users like according to their ratings.
In the next article, we shall look at how we can use additional information such as content and context to build more robust recommender systems. We shall also look at recommender systems that use both content and collaborative features.
Next topic: Recommender systems: context-based & hybrid recommender systems