Recommender systems (Part 2) – Collaborative filtering
Updated:
- Collaborative filtering
  - Step 1 - Represent input data
  - Step 2 - Find nearest neighbours
  - Step 3 - Predictions/recommendations
  - Pros and cons of user-user CF
- Item-item collaborative filtering
  - E.g. Amazon item-item collaborative filtering - to address scalability
- Collaborative filtering summary
Collaborative filtering
- Examples
- Google – it assumes you are looking for flights, hotels and so on
- Uses location too
- Spotify – content-based filtering – based on what you listen to, it gives you more of that
- YouTube – content-based filtering – based on what you watch, gives more of that
- Same problem – too much content and the user has too many choices – how do we narrow the choices and give users the right things
- In collaborative filtering:
- Need to look at other people’s consumption.
- Also need to know what the user rates and likes
- → give me what people similar to me would like
Collaborative filtering - CF
- Based on user similarity – it tries to find similar users and recommends items based on what they preferred.
- "You may also like this – people like you preferred this"
- Key challenge – the rating, where does this come from?
- How to find the ratings or preferences of the user
- Explicit methods – can explicitly ask the user to rate – Netflix, Amazon
- Implicit methods – observe what the user is doing, e.g. what they buy, click, visit – gives additional info beyond explicit ratings
- We are observing every user in the system based on either
explicit or implicit methods
- This is the input of the system
CF – how does it work?
- Steps
- Create user profiles and collect ratings for items
- The user-modelling component, based on the user profile, identifies which users are similar to my user
- Take the items those similar users liked – and recommend them
- The user gets the recommendation back
- If we want to build a CF which is based on user similarity →
user-user filtering
- How you process the input data to come up with the user profile
Step 1 - Represent input data
- First step is to prepare the input data – usually a matrix.
- Have all the possible items and users of the system – matrix values indicate the rating, e.g. the vote of user 1 on item 1
- You end up with a large matrix where, for the entries you know, you indicate whether they liked it or not
- Numerical values in the matrix – can decide on the scale as well, e.g. 0–1 or 0–5
- Biggest challenge
- Creating the input matrix in advance – identifying what data you have about the user to come up with the numbers – if you ask users to rate, a lot of cells will be empty because they don't rate everything
- Think about other ways of collecting info
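As a concrete sketch of this step, the input can be held as a users × items matrix; all values and dimensions below are made up for illustration:

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix on a 0-5 scale.
# A 0 may mean "disliked" or simply "not rated" - one reason the
# matrix ends up sparse when users don't rate everything.
ratings = np.array([
    [5, 3, 0, 1, 4],   # user 1
    [4, 0, 0, 1, 0],   # user 2
    [1, 1, 0, 5, 0],   # user 3
    [0, 0, 5, 4, 0],   # user 4
])

print(ratings.shape)  # (4, 5): users x items
```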
Example
- Identifying what we are going to recommend by rating each item for each user
- They’ve liked items 1, 5, 7 and hated 2, 6 → which should I recommend?
- How do we apply the user-user filtering to come up with the rating for an unknown cell?
Step 2 - Find nearest neighbours
- We need to find similarities across the users and then define the neighbours that are the most similar
- How to find similarity → cosine similarity
- Sometimes referred to as: k-nearest-neighbour user-user collaborative filtering
Step 2.1 – Calculate similarity
- Similarity between our user and all the other users
- E.g.
- U3 = (5,1,?,4,5,0,5)
- U1 = (5,3,?,1,3,4,0)
- We are looking for the ? part
- Sim(u3, u1) = (25 + 3 + 4 + 15 + 0 + 0) / (√(25+1+16+25+0+25) × √(25+9+1+9+16+0)) ≈ 0.63
- Question – if u3 = (5,0,0,0,0) and u1 = (0,1,0,0,0), the similarity is 0 – what do we do? → we rely on the other neighbours
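The cosine-similarity calculation above can be reproduced in a few lines of Python; the vectors are the ones from the example, with the unknown '?' position dropped:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0          # an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Vectors from the example, with the unknown '?' position dropped:
u3 = [5, 1, 4, 5, 0, 5]
u1 = [5, 3, 1, 3, 4, 0]
print(round(cosine_sim(u3, u1), 2))  # 0.63
```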
Step 2.2. Define neighbourhood
- How to choose the neighbourhood given the users and the similarities between them
- Centre-based neighbourhood (size n)
- → Sort the users by similarity and take the top n
- K = 3, so top 3 in this case.
- 0.93, 0.71 and 0.63 are the most relevant: U6, U4, U1
- The problem is when there are too many 0s and there is no similarity
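A minimal sketch of picking the top-n neighbourhood; the scores for U6, U4, U1 are the ones quoted in the notes, the values for U2 and U5 are made up:

```python
# Similarities of our user to every other user (U2 and U5 are
# hypothetical; the rest match the values quoted in the notes).
sims = {"u1": 0.63, "u2": 0.12, "u4": 0.71, "u5": 0.05, "u6": 0.93}

# Centre-based neighbourhood: sort by similarity, take the top k.
k = 3
neighbours = sorted(sims, key=sims.get, reverse=True)[:k]
print(neighbours)  # ['u6', 'u4', 'u1']
```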
Step 3 - Predictions/ recommendations
- Now I know who the relevant users are, and we decide on the value – make predictions for the user
- Q – given that we take the neighbours' ratings 0, 2 and 4 (from the prev. slide), do we take the weighted value or the average?
- Weighted sum
- Scans the neighbourhood and calculates the frequency for each item
- Can be combined with the rating value
- We calculate the weighted sum and divide by the SUM of the 3 similarities, i.e. (0.63 + 0.71 + 0.93)
- Based on this we decide whether to recommend this or not
- Predicted value for the user ≈ 2.26
- Association rule recommendation
- Expands the number of items based on association rules applied to what the neighbours have recommended
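The weighted-sum prediction can be sketched as follows, using the neighbour similarities from Step 2 and the ratings 0, 2 and 4 quoted above:

```python
# Each neighbour contributes (similarity, rating for the target item).
neighbours = [
    (0.63, 0),   # u1
    (0.71, 2),   # u4
    (0.93, 4),   # u6
]

# Weighted sum of ratings, normalised by the sum of similarities.
prediction = sum(s * r for s, r in neighbours) / sum(s for s, _ in neighbours)
print(round(prediction, 2))  # 2.26
```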
Pros and cons of user-user CF
- Limitation
- Data sparsity – a user who hasn't shopped much gives little indication and will have a lot of 0s; we are then left with a very small usable user base
- Individual characteristics are not catered for – the user is just a vector of numbers
- Tends to recommend popular items – the algorithm is biased because items that have had a lot of attention dominate; under-represented items have a lot of zeros → they get pushed away
- This way we end up converging around a few items
- Privacy and trust become important issue
- Netflix uses user-user collaborative filtering – when they exposed data for testing their algorithm, you could guess who a user was, even though users are just numbers, once the data becomes niche
- We can mitigate this by getting rid of the user dimension and just working with items! next topic
- Pros
- It allows diversifying the user experience by getting users out of the filter bubble
- Fairly easy to implement
- Widely applicable – popular in social media applications
- Starts looking at social interaction data
- Can recommend items that are not linked to the user’s earlier choices → Useful for promotion – promoting new things and surprises the user
- Considers the opinions of wide spectrum of users
Item-item collaborative filtering
- Get the user and the other users.
- Look at the items and find similarities between the items (not between the users)
- Based on the items, find the neighbourhood of the items
- Then recommend
- Now we are looking at the items, rather than the users
- We prepare the matrix with user and item the same way.
Item-Item collaborative filtering
- Item that I need to think about = I3 = (0,1,?,2,1,4)
- I1 = (5,1,?,4,0,3)
- Sim(I3, I1) = 0.62 → cosine similarity
- We do this for item 3 against every other item – I1, I2, I4, etc.
- Find the 3 most similar items – then take the weighted sum
- Most similar are I4, I5, I7 – then we come up with the weighted sum
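A quick check of the item-item similarity above, using the same cosine formula with the '?' position dropped (the exact value is ≈0.627, which the notes quote as 0.62):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two item vectors (one rating per user)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Item vectors from the example, with the unknown '?' position dropped:
i3 = [0, 1, 2, 1, 4]
i1 = [5, 1, 4, 0, 3]
print(cosine_sim(i3, i1))  # ~0.627
```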
Discuss the results: 2.62 vs 4.62
- We got 2 different values from user-user and item-item
- Q - Which do you trust more?
- Can’t decide. Can have different recommendations and you will come up
with different values.
- If we change the neighbourhood to k = 4, the value will be different
- If we don't use cosine similarity, it will be different too
- Can we trust these values? Evaluation is important.
- We take several algorithms and benchmark them to find out which gives the best result
- It's not just about coming up with the number – we need to evaluate the algorithms; we may need hybridization of algorithms
- Precision, recall etc
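A minimal sketch of the precision/recall evaluation mentioned above; the recommendation list and the set of items the user actually liked are both hypothetical:

```python
def precision_recall(recommended, relevant):
    """Precision and recall of a recommendation list against the
    set of items the user actually liked."""
    hits = len(set(recommended) & set(relevant))
    return hits / len(recommended), hits / len(relevant)

# Hypothetical run: 3 recommendations, user liked 4 items, 2 hits.
p, r = precision_recall(["i1", "i5", "i7"], ["i1", "i5", "i2", "i9"])
print(p, r)  # precision ~0.67, recall 0.5
```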
Scalability problem
- Calculating the similarity of every user (or item) to every other one is the most computationally heavy step.
- Identify where the problem is → over-calculation of similarity in this case.
- Reducing the space of calculation – may try to categorise the items and compute similarity within categories
- Offline calculation – we pre-calculate the similarities and store them, and when a user comes in, that stored data is used for comparison.
- (−) Problem is updating the db. Could lose some data during a week or so
- (+) When the user-item matrix is big, it's very fast
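A sketch of the offline pre-calculation idea, assuming a NumPy users × items matrix (data made up): the item-item similarity matrix is computed once, stored, and only looked up at request time:

```python
import numpy as np

def precompute_item_similarities(ratings):
    """Offline step: item-item cosine similarity matrix.
    ratings: users x items array. In a real system this would be
    recomputed periodically (e.g. nightly) and stored."""
    norms = np.linalg.norm(ratings, axis=0)   # one norm per item column
    norms[norms == 0] = 1.0                   # avoid division by zero
    normalised = ratings / norms
    return normalised.T @ normalised          # items x items

ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

sim = precompute_item_similarities(ratings)
print(sim.shape)  # (4, 4): one similarity per item pair
```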
E.g. Amazon item-item collaborative filtering - to address scalability
- With this sparse data, instead of doing the whole calculation
- There is no need to do multiplications by 0.
Step 1. Find customers who have purchased the items that u3 has purchased
- We narrow down to the other users who have bought those items – THEN we calculate with them (using the 1s) instead of all users
- User 3 bought item 1 – u1 and u4 also bought this
- User 3 bought item 4 – u2 and u4 bought this
- → The algorithm is trying to reduce the item space
- With thousands of items, we significantly reduce the items of interest – in this e.g. it doesn't show much because i3 is the only irrelevant item
Step 2. Find items bought by these identified customers and register pairs of items
- Now we start to look at similarities – we need to reduce the number of similarity calculations
- We register pairs of items based on common purchases
- We have user 3. U3 has similarity with u1 as they bought one item in common → so by comparing u1 and u3, we calculate the pairs
- By reducing the search space, we only need to calculate 4 similarities to find the most similar
Step 3. Calculate similarity between user items in pairs & recommend the most similar
- Similarity between u3's items and the rest of the items, to find out what's most similar. Then calculate the similarity from the registered pairs
- We have reduced the space
- Clever – we look at the user, narrow the search space AND narrow the comparison
- The reduction means fewer items and fewer pairs for comparison!
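The three Amazon-style steps above can be sketched as follows; the purchase data is hypothetical and mirrors the u1–u4 example:

```python
from itertools import combinations

# Hypothetical purchase history: user -> set of items bought.
purchases = {
    "u1": {"i1", "i2"},
    "u2": {"i4", "i5"},
    "u3": {"i1", "i4"},
    "u4": {"i1", "i4", "i5"},
}
target = "u3"

# Step 1: customers who bought any item the target bought.
co_buyers = {u for u, items in purchases.items()
             if u != target and items & purchases[target]}

# Step 2: register pairs of items from those customers' baskets.
pairs = set()
for u in co_buyers:
    for a, b in combinations(sorted(purchases[u]), 2):
        pairs.add((a, b))

# Step 3 would then compute cosine similarity only for these pairs.
print(sorted(co_buyers))  # ['u1', 'u2', 'u4']
print(len(pairs))         # 4 pairs instead of all 10 item pairs
```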
Collaborative filtering Summary
- Pros
- Fairly simple to implement
- Widely used in recommender systems
- Works successfully if a large dataset is used
- Facilitates the exploration of the long tail, which is not otherwise addressed
- Challenges
- New items, new users – cold start
- If an item's ratings are all 0s, it will never be recommended
- Sparsity – sparseness of the data and ways of reducing it
- Scalability – large dataset, but we can reduce it
- Reliability of ratings – it is prone to subjectivity and bias!
- Lack of transparency – how did you come up with this rating? Need to explain to the user
- Lack of control – limited user influence on the algorithms
Applying collaborative filtering
- Represent data
- Define neighbourhood
- Make predictions or recommendations