Recommender systems (Part 2) – Collaborative filtering

8 minute read


  1. Collaborative filtering
    1.1 Step 1 - Represent input data
    1.2 Step 2 - Find nearest neighbours
    1.3 Step 3 - Predictions/ recommendations
  2. Pros and cons of user-user CF
  3. Item-item collaborative filtering
    3.1. E.g. Amazon item-item collaborative filtering - to address scalability
  4. Collaborative filtering Summary

Collaborative filtering

  • Examples
    • Google – assumes you are looking for flights, hotels and related things
      • Uses your location too
    • Spotify – content-based filtering – based on what you listen to, it gives you more of that
    • YouTube – content-based filtering – based on what you watch, gives more of that
  • Same problem – too much content and too many choices for the user – how do we narrow the choices and show users the right things?
  • In collaborative filtering:
    • Need to look at other people’s consumption.
    • Also need to know what the target user rates and likes
  • → give me what people similar to me would like

Collaborative filtering - CF

  • Based on user similarity – it finds similar users and recommends items that they liked
    • "You may also like this", "people like you preferred this"
  • Key challenge – the rating, where does this come from?
  • How to find the ratings or preferences of the user
    • Explicit methods – explicitly ask the user to rate items, e.g. Netflix, Amazon
    • Implicit methods – observe what the user is doing, e.g. what they buy, click, visit – this gives additional information about the user's preferences
  • We are observing every user in the system based on either explicit or implicit methods
    • This is the input of the system

CF – how does it work?

  • (figure: the collaborative filtering pipeline)
  • Steps
    1. Create the user profile and collect ratings for items
    2. The user modelling component, based on the user profile, identifies which users are similar to my user
    3. Take the items which those similar users liked – and recommend them
    4. The user gets the recommendation back
  • A CF built on user similarity → user-user filtering
    • The question is how to process the input data to represent each user

Step 1 - Represent input data

  • (figure: user-item rating matrix)
  • First step is to prepare the input data – usually a matrix.
    • Rows are all the users and columns are all the possible items of the system – each matrix value indicates the vote of that user for that item (e.g. user 1 on item 1)
  • You end up with a large matrix where you indicate for the ones you know, whether they liked it or not
  • Numerical values in the matrix – you can decide on the scale as well, e.g. 0-1 or 0-5
  • Biggest challenge
    • Creating the input matrix in advance – identifying what data you have about the user to fill in the numbers – if you only ask users to rate, a lot of cells will be empty because they don't rate everything
    • Think about other ways of collecting info
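
A minimal sketch of what this input matrix can look like in Python/NumPy. The 0-5 scale follows the worked example later in these notes; u1 and u3 use those example values, while u2 and the use of NaN to mark unknown cells are illustrative assumptions:

```python
import numpy as np

# Rows = users, columns = items; np.nan marks a rating we do not know.
# u1 and u3 follow the worked example below; u2 is made up for illustration.
ratings = np.array([
    [5, 3, np.nan, 1, 3, 4, 0],   # u1
    [1, 0, 2,      0, 4, 0, 3],   # u2 (illustrative only)
    [5, 1, np.nan, 4, 5, 0, 5],   # u3 - the target user
])
print(ratings.shape)  # (3 users, 7 items)
```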

Example

  • Identifying what we are going to recommend by rating each item for each user
  • They've liked items 1, 5 and 7, hated 2 and 6 → which should I recommend?
  • How do we apply the user-user filtering to come up with the rating for an unknown cell?

Step 2 - Find nearest neighbours

  • We need to find similarities across the users and then define the neighbours that are the most similar
  • How to find similarity → cosine similarity
  • Sometimes referred to as: K-nearest-neighbour user-user collaborative filtering

Step 2.1 – Calculate similarity

  • (figure: similarity of u3 to each of the other users)
  • Compute the similarity between our user and all the other users
  • E.g.
    • U3 = (5,1,?,4,5,0,5)
    • U1 = (5,3,?,1,3,4,0)
    • We are looking for the ? part
    • Sim(u3, u1) = (u3 · u1) / (|u3| × |u1|) – the cosine of the angle between the two rating vectors
    • Sim = (25+3+4+15+0+0) / (sqrt(25+1+16+25+0+25) * sqrt(25+9+1+9+16+0)) = 0.63
  • Question – if u3 = (5,0,0,0,0) and u1 = (0,1,0,0,0), the dot product is 0 and so is the similarity – what do we do? → we rely on the other neighbours
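
A minimal sketch of this cosine similarity, assuming unknown ratings are marked with NaN and simply skipped (as the '?' entries are in the worked example):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity over the positions that are known in both vectors."""
    mask = ~np.isnan(a) & ~np.isnan(b)      # skip the '?' entries
    a, b = a[mask], b[mask]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

u3 = np.array([5, 1, np.nan, 4, 5, 0, 5])
u1 = np.array([5, 3, np.nan, 1, 3, 4, 0])
print(round(cosine_sim(u3, u1), 2))  # 0.63, matching the calculation above
```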

Step 2.2. Define neighbourhood

  • (figure: similarity scores between u3 and the other users)
  • How to choose the neighbourhood given the users and the similarities between them
  • Centre-based neighbourhood (size n)
  • → Sort them and take top n
  • K = 3, so top 3 in this case.
    • 0.93, 0.71 and 0.63 are the most relevant → U6, U4, U1
  • The problem is when there are too many 0s and no meaningful similarity can be computed
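
Choosing the neighbourhood is then just a top-k sort of the similarity scores. The values 0.93, 0.71 and 0.63 come from the example above; the remaining similarities are made-up placeholders:

```python
# Similarities of u3 to the other users (u6, u4, u1 from the example above;
# the rest are illustrative placeholders).
sims = {"u1": 0.63, "u2": 0.15, "u4": 0.71, "u5": 0.08, "u6": 0.93, "u7": 0.22}

k = 3
neighbourhood = sorted(sims, key=sims.get, reverse=True)[:k]
print(neighbourhood)  # ['u6', 'u4', 'u1']
```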

Step 3 - Predictions/ recommendations

  • Now that we know who the relevant neighbours are, we decide on the value – i.e. make predictions for the target user
  • Q – we take the neighbour ratings 0, 2 and 4 (from the previous slide) – do we take the weighted value or the plain average?
  • Weighted sum
    • prediction(u, i) = Σ sim(u, v) × rating(v, i) / Σ sim(u, v), summed over the neighbours v
    • Scans the neighbourhood and calculates the frequency for each item
    • Can be combined with the rating value
    • We calculate the weighted sum of the neighbours' ratings and divide by the SUM of the three similarities, i.e. (0.63 + 0.71 + 0.93)
    • Based on this we decide whether to recommend the item or not
  • User-user prediction = 2.26 (see the sketch after this list)

  • Association rule recommendation
    • Expands the set of candidate items by applying association rules to what the neighbours have recommended
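
A minimal sketch of the weighted-sum prediction, using the neighbour ratings 0, 2 and 4 and the similarities 0.63, 0.71 and 0.93 from the worked example:

```python
# Neighbours u1, u4, u6: their similarities to u3 and their ratings for the
# unknown item (0, 2 and 4, as read off the slide).
sims    = [0.63, 0.71, 0.93]
ratings = [0, 2, 4]

prediction = sum(s * r for s, r in zip(sims, ratings)) / sum(sims)
print(round(prediction, 2))  # 2.26 - the user-user answer quoted above
```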

Pros and cons of user-user CF

  • Limitation
    • Data sparsity – a user who hasn't shopped or interacted much gives little indication, so their vector has a lot of 0s; the usable neighbour base then becomes very small
    • Individual characteristics are not catered for – the user is just vector of numbers
    • Tends to recommend popular items – the system is biased: items that have had a lot of attention dominate, while under-represented items have mostly zeros and are pushed away
      • This way we end up converging around a few items
    • Privacy and trust become important issue
      • Netflix uses user-user collaborative filtering – when they exposed data to test their algorithm, it turned out you could guess who a user is, even though users are just numbers, once the data becomes niche
      • We can mitigate this by getting rid of the user dimension and working just with items – the next topic
  • Pros
    • It allows diversifying the user experience by getting out of the filter bubble
    • Fairly easy to implement
    • Widely applicable – popular in social media applications
    • Starts looking at social interaction data
    • Can recommend items that are not linked to the user’s earlier choices → Useful for promotion – promoting new things and surprises the user
    • Considers the opinions of a wide spectrum of users

Item-item collaborative filtering

  • (figure: item-item collaborative filtering pipeline)
    1. Get the user and other users.
    2. Look at the items and find similarities between the items (not about the users)
    3. Based on the items, find the neighbourhood of the items
    4. Then recommend
    • Now we are looking for items, rather than the user
    • We prepare the matrix with user and item the same way.

Item-Item collaborative filtering

  • (figure: item rating vectors taken from the user-item matrix)
  • The item we need to predict for = I3 = (0,1,?,2,1,4)
    • I1 = (5,1,?,4,0,3)
    • Sim(I3, I1) = 0.62 → cosine similarity
    • We do this for item 3 against every other item – I1, I2, I4, etc.
  • Find the 3 most similar items – then take the weighted sum
  • The most similar are I4, I5 and I7 – then we come up with the weighted sum
    • (figure: weighted sum over the most similar items)
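
The same machinery works for item-item CF – the vectors are now item columns (one rating per user). A self-contained sketch of the item similarity, assuming the '?' cell is marked as NaN and skipped (rounding differs slightly from the slide's 0.62):

```python
import numpy as np

# Each item is described by the ratings it received from all users; '?' -> NaN.
i3 = np.array([0, 1, np.nan, 2, 1, 4])
i1 = np.array([5, 1, np.nan, 4, 0, 3])

mask = ~np.isnan(i3) & ~np.isnan(i1)     # skip the unknown cell
sim = (i3[mask] @ i1[mask]) / (np.linalg.norm(i3[mask]) * np.linalg.norm(i1[mask]))
print(round(float(sim), 2))  # ~0.63 (the slide rounds this to 0.62)

# The prediction is then the same weighted sum as in the user-user case,
# only over the k most similar items (I4, I5 and I7 in the example).
```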

Discuss the results: 2.26 vs 4.62

  • We got 2 different values from user-user and item-item
  • Q - Which do you trust more?
    • You can't decide in the abstract – different set-ups give different recommendations and different values
      • If we change the neighbourhood size to k = 4, the value will differ
      • If we use a measure other than cosine similarity, it will also differ
  • Can we trust these values? Evaluation is important
    • We take several algorithms and benchmark them to find out which gives the best results
  • It is not enough to just come up with a number – we need to evaluate the algorithms, and we may need hybridisation of algorithms
    • Precision, recall, etc.
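
For instance, a minimal precision/recall sketch for a single user's recommendation list (the item IDs and lists are made up for illustration):

```python
def precision_recall(recommended, relevant):
    """Precision and recall of one recommendation list against the items the
    user actually turned out to like."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative values only: 2 of the 3 recommended items were relevant.
p, r = precision_recall(["i2", "i5", "i7"], ["i5", "i7", "i9", "i1"])
print(round(p, 2), round(r, 2))  # 0.67 0.5
```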

Scalability problem

  • Calculating the similarity of everyone to everyone else is the most computationally heavy part.
  • Identify where the problem is → over-calculation of similarities in this case. Two remedies:
    1. Reduce the space of calculation – e.g. categorise the items and compute similarities only within a category
    2. Offline calculation – pre-calculate the similarities and store them; when a user comes, the stored values are used for comparison (see the sketch below)
      • − Problem is keeping the stored database up to date; ratings collected between refreshes (e.g. during a week) may be missed
      • + when the user-item matrix is big, this is very fast
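
A minimal sketch of the offline option, assuming item-item similarities are precomputed in one pass and stored (the file name and the choice to treat missing ratings as 0 are assumptions, not the lecture's exact procedure):

```python
import numpy as np

def precompute_item_similarities(ratings, path="item_sims.npy"):
    """Offline step: compute the full item-item cosine similarity matrix once."""
    R = np.nan_to_num(ratings)              # treat unknown ratings as 0 (assumption)
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0                  # avoid division by zero for empty items
    sims = (R.T @ R) / np.outer(norms, norms)
    np.save(path, sims)                      # stored, then reused at request time
    return sims
```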

E.g. Amazon item-item collaborative filtering - to address scalability

  • (figure: sparse user-item purchase matrix)
  • With this sparse data we avoid doing the whole calculation
  • There is no need to do multiplications by 0

Step 1. Find customers who have purchased the items that u3 has purchased

  • We narrow down to the other users who have bought the same items – THEN we only consider THEIR purchases (the 1s) instead of all users
    • user 3 bought item 1 – u1 and u4 also bought it
    • user 3 bought item 4 – u2 and u4 also bought it
  • → The algorithm is trying to reduce the item space
  • With thousands of items this significantly reduces the items of interest – in this small example the effect is hardly visible because i3 is the only irrelevant item

Step 2. Find items bought by these identified customers and register pairs of items

  • (figure: item pairs registered from the identified customers)
  • Now we start to look at similarities – we need to reduce the number of similarity calculation
    • We register pairs of items based on common purchases
  • Take user 3. U3 overlaps with u1 because they bought one item in common → so by comparing u1 and u3, we calculate:
    • (figure: similarity calculation for the registered pair)
  • By reducing the search space, we only need to calculate 4 similarities to find the most similar
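
A minimal sketch of Steps 1-2, assuming purchases are stored as a user → set-of-items mapping. The data mirrors the example above (u3 bought items 1 and 4; u1 and u4 also bought item 1; u2 and u4 also bought item 4); everything else is illustrative:

```python
# Step 1: customers who bought anything the target user bought.
purchases = {
    "u1": {"i1", "i2"},
    "u2": {"i4", "i5"},
    "u3": {"i1", "i4"},          # target user
    "u4": {"i1", "i4", "i5"},
}
target = "u3"
co_buyers = {u for u, items in purchases.items()
             if u != target and items & purchases[target]}

# Step 2: register only the items bought by those customers as candidate pairs.
candidates = set().union(*(purchases[u] for u in co_buyers)) - purchases[target]

print(co_buyers)   # {'u1', 'u2', 'u4'}
print(candidates)  # {'i2', 'i5'} - only these pairs need a similarity calculation
```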

Step 3. Calculate similarity between user items in pairs & recommend the most similar

  • (figure: similarities of the candidate pairs and the resulting recommendation)
  • Compute the similarity between u3's items and the remaining candidate items, and find out which is the most similar – the similarities are calculated only for the registered pairs
  • We have reduced the space
  • Clever – we look at the user, narrow the search space AND narrow the comparison
    • Fewer items and fewer pairs to compare!

Collaborative filtering Summary

  • Pros
    • Fairly simple to implement
    • Widely used in recommender systems
    • Works successfully if large dataset is used
    • Facilitates the exploration of the long tail, which is not otherwise addressed
  • Challenges
    • New items, new users – cold start
    • If an item has only zeros (no ratings) it will never be recommended
    • Sparsity – sparse data, and the need for ways of reducing it
    • Scalability – large datasets, but we can reduce the calculation
    • Reliability of ratings – they are prone to subjectivity and bias
    • Lack of transparency – how did you come up with this rating? It is hard to explain to the user
    • Lack of control – limited user influence on the algorithms

Applying collaborative filtering

  1. Represent data
  2. Define neighbourhood
  3. Make predictions or recommendations
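
Putting the three steps together, a compact end-to-end sketch of user-user CF (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def predict_rating(R, user, item, k=3):
    """1) R is the user-item matrix (NaN = unknown), 2) find the k nearest
    neighbours that rated the item, 3) return the weighted-sum prediction."""
    target = R[user]
    sims = []
    for v in range(R.shape[0]):
        if v == user or np.isnan(R[v, item]):
            continue                                  # neighbour must have rated the item
        mask = ~np.isnan(target) & ~np.isnan(R[v])
        denom = np.linalg.norm(target[mask]) * np.linalg.norm(R[v][mask])
        sim = 0.0 if denom == 0 else float(target[mask] @ R[v][mask]) / denom
        sims.append((sim, v))
    top = sorted(sims, reverse=True)[:k]              # k most similar users
    total = sum(s for s, _ in top)
    return np.nan if total == 0 else sum(s * R[v, item] for s, v in top) / total
```

Each call recomputes the similarities from scratch; in practice they would be precomputed offline, as discussed in the scalability section.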
