1. Introduction

As I mentioned in the last blog, collaborative filtering is a popular algorithm in recommender systems. This type of method makes automatic predictions about a user's interests by collecting and comparing preferences information from many users. The underlying assumption of the collaborative filtering approach is that if person A has the same opinion as person B on an item, A is more likely to have B's opinion on a different item than that of a randomly selected person.

2. Memory-based methods

Memory-based methods, also called neighbor-based methods, build a user-item co-rated matrix from the database, where each entry within the matrix represents the rating item given by the user. Each row of the matrix can be viewed as a user vector representing a user's historical rating behaviour. Each matrix column can be seen as an item vector representing the users rated the item.

Example of the user-item rating matrix

Then, we can compare user or item vectors with a similarity metric to get a similarity score. For the user-based method, we can find the nearest user with the highest similarity score and feed that user's interest. For the item-based method, we can find the nearest item to the past rated items by a user.

3. Symbols and Notations Declaration

In this section, I will define some symbols and notations used in mathmatical representations.

The upper bold letters are used to represent matrics or sets(e.g., R: User-item rating matrix; U: A set of users in the database) The lower bold letters are used to represent vectors (e.g., u: A user vector)

4. Common Similariy Measurements

The table below lists the common similarity measurements:

</tr>
Similarity Name Experssion
Non-normalized Cosine \(cos(a,b)=\frac{a * b}{||a|| * ||b||}\)
Jaccard Similarity \(JC(i,j)=\frac{|N(i) \cup N(j)|}{\sqrt{|N(i)||N(j)|}}\)
ItemIUF (Item Inverse-User Frequency) \(ItemIUF(i,j)=\frac{\sum_{u \in N(i) \cup N(j) }{\frac{1}{log(1+|N(u)|)}}}{\sqrt{|N(i)||N(j)|}}\)

Non-normalized Cosine and Jaccard Similarity are the most common ways to measure the similarity between two vectors. Item Inverse-User Frequency inferences from the theory of Term-frequency Inverse Document Frequency (TF-IDF). In this formula, N(x) represents the number of users rated by the item (x). Both measurements can be conducted in either a user-based or item-based way. Both measurements will give accurate similarity scores between two vectors.

5. Recommendation strategy

I can get a symmetric similarity matrix (W) after comparing all pairs of users or items. This matrix is usually called user-model (user-based CF) or item-model (item-based CF). I will use the item-model as an example to show how to generate the recommendation result for a user (u).

\[P_{u,j}={\sum}_{i \in N(u) \cap S(j,K)}{W_{j,i}r_{u,i}}\]

In the formula above, \(P_{u,j}\) represents the prediction score of item (j) given by the user (u). This formula shows the weighted sum of the similarities and ratings among top-K most similar items for the user (u). In some cases, I need to try different neighbour-size (K) to verify the recommendation result to get the optimal one.

6. Discussion

User-based or item-based collaborative filtering provides an effective and straightforward way to compute the recommendations. However, it suffers from the issues of data sparsity. For example, given an item, I need to compare the similarities with all items in the database. Assuming that the database consists of millions of items, the execution will be very time-consuming. In addition, the items with lower similarity scores will mislead the recommendation results. Therefore, some low-rated items need to be dropped off to reduce the number of neighbours.