Recommender Systems (Part 4): Evaluation


Evaluation questions

  • Single algorithm
    • Which parameter setting is better?
      • E.g. k-nearest neighbours: 2 or 10 neighbours?
    • In which cases does the algorithm perform best/worst?
      • E.g. Netflix and "gray sheep" items such as the movie Napoleon Dynamite.
      • This is very important: when analysing an algorithm, know its strengths and weaknesses so you can think of ways to hybridise it or minimise the weaknesses.
  • Several algorithms
    • Which algorithm performs best in the current circumstances? How do we select or hybridise algorithms based on their strengths and weaknesses?
  • Recommender system
    • What are the benefits and drawbacks of the recommendations for the user experience and for the business?
    • We look at the impact of the system on the user
      • E.g. in e-commerce, a system where the user gets a recommendation, finds the item and leaves is good for the user, but not for the business: it can be more profitable if the user is slightly "confused" and ends up buying items they didn't know they needed

Data driven evaluation - Rating Accuracy

  • For a single algorithm, or to compare several, we use data-driven evaluation
    • Predicted ratings – p1, p2, p3, …
    • Actual ratings – a1, a2, a3, …
  • If we look at accuracy, we are looking at the difference between these two vectors
  • MAE – mean absolute error
    • Take the difference between the actual and predicted rating, take its absolute value, then take the average
    • MAE = (1/n) Σ |aᵢ − pᵢ|
  • RMSE – root mean square error
    • Slightly more advantageous than MAE, because it penalises large errors more (see the comparison below)
    • Square each error aᵢ − pᵢ, average the squares, then take the square root
    • RMSE = √( (1/n) Σ (aᵢ − pᵢ)² )
  • Correlation coefficient
    • Correlation between the two vectors p and a
    • Pearson correlation measures how closely the two vectors vary together – i.e. how "close" the predictions are to the actual ratings
    • Alternatively, we can rank the items within each vector and compare the two rankings (rank correlation)
    • A sketch of all three metrics follows this list
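
A minimal Python sketch of the three metrics above, using made-up rating vectors purely for illustration:

```python
# Rating-accuracy metrics: MAE, RMSE and Pearson correlation.
# The actual/predicted values below are invented for illustration.
import math

def mae(actual, predicted):
    """Mean absolute error: average of |a_i - p_i|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson(actual, predicted):
    """Pearson correlation between the actual and predicted rating vectors."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    norm_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (norm_a * norm_p)

actual    = [4, 3, 5, 2, 1]   # ratings the users actually gave
predicted = [3, 3, 4, 3, 2]   # ratings the recommender predicted

print(mae(actual, predicted), rmse(actual, predicted), pearson(actual, predicted))
```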

Calculate MAE and RMSE

  • How do we know if the resulting value is good or bad?
    • We need a baseline algorithm, and we try to get a value that is better than the baseline (see the sketch below)
    • We will also look at other metrics of the algorithm's performance
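
To make "better than the baseline" concrete, here is a small sketch assuming a naive baseline that always predicts the global mean of the training ratings; all numbers are invented:

```python
# Sketch: compare an algorithm's MAE against a naive global-mean baseline.
# All ratings here are illustrative, not the lecture's example.

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

train_ratings = [4, 5, 3, 4, 2, 5, 3]   # ratings seen during training
actual        = [4, 3, 5, 2, 1]         # held-out actual ratings
algo_pred     = [3, 3, 4, 3, 2]         # the algorithm's predictions

global_mean   = sum(train_ratings) / len(train_ratings)
baseline_pred = [global_mean] * len(actual)   # baseline: always predict the mean

print("baseline MAE :", round(mae(actual, baseline_pred), 3))   # ~1.343
print("algorithm MAE:", round(mae(actual, algo_pred), 3))       # 0.8
# The algorithm is only worth having if its error is clearly below the baseline's.
```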

Compare recommender algorithms

  • When we compare more than one algorithm, the metrics become more meaningful
  • For each recommendation list we have calculated the absolute error and the squared error
  • A bad prediction is one where the prediction and the actual rating differ by a lot
  • Recommendation 1 is better than recommendation 2
  • Recommendation 3 is not as bad as recommendation 2, but if we only look at the absolute error, R1 and R3 perform the same. BUT R3 contains one extreme error on a single item, and this dominates the error once we square it
    • This is why R3's RMSE is higher
    • RMSE is hence preferred‼ because it weights the damaging, extreme errors more heavily (see the sketch below)
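
A small illustration of this point: two recommendation lists with identical MAE, where the one containing a single extreme error gets a much higher RMSE (the ratings are invented, not the slide's numbers):

```python
# Same MAE, different RMSE: one extreme miss dominates once errors are squared.
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [5, 4, 3, 2]
rec_1  = [4, 3, 2, 1]   # four small errors of 1 each
rec_3  = [1, 4, 3, 2]   # one extreme error of 4, the rest perfect

print(mae(actual, rec_1), mae(actual, rec_3))     # 1.0 and 1.0  -> look identical
print(rmse(actual, rec_1), rmse(actual, rec_3))   # 1.0 and 2.0  -> RMSE flags rec_3
```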

Data driven evaluation - Usage Prediction

  • Accuracy measures how close the predicted ratings are overall, but it doesn't tell us where the algorithm performed well. E.g. what are its precision and recall?
    • We need to build a confusion matrix
    • TP – true positive: the algorithm recommended the item and the user used it
    • FP – false positive: the algorithm recommended the item but the user did not use it
    • TN – true negative: the algorithm did not recommend the item and the user did not use it
    • FN – false negative: the algorithm didn't recommend the item, but the user used it
  • Precision = TP / (TP + FP); Recall = TP / (TP + FN)
  • Given the values in the example, the algorithm did better on precision than on recall (see the sketch below)
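
A sketch of precision and recall computed from confusion-matrix counts; the counts are invented stand-ins for the ones in the lecture's example:

```python
# Usage-prediction metrics from confusion-matrix counts (invented numbers).
tp = 30   # recommended and the user used it
fp = 10   # recommended but the user did not use it
tn = 50   # not recommended and not used
fn = 20   # not recommended but the user used it

precision = tp / (tp + fp)   # of what we recommended, how much was actually used?
recall    = tp / (tp + fn)   # of what the user used, how much did we recommend?

print(round(precision, 2), round(recall, 2))   # 0.75 vs 0.6: better on precision
```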

Compare recommender algorithms

  • How do we use these metrics to compare algorithms?
  • Recall – e.g. one algorithm recommended everything the user actually used, so its recall is 1
    • The other algorithm's recall of 0.5 < 1, so the first does better on recall
  • This is how we can compare: compute the metric for each individual user and take the average (see the sketch below)
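
A sketch of that comparison, averaging a per-user metric over users; the per-user recall values are invented:

```python
# Compare two algorithms by averaging per-user recall (invented values).
algo_a_recall = {"user1": 1.0, "user2": 0.5, "user3": 0.75}
algo_b_recall = {"user1": 0.5, "user2": 0.5, "user3": 0.25}

avg_a = sum(algo_a_recall.values()) / len(algo_a_recall)
avg_b = sum(algo_b_recall.values()) / len(algo_b_recall)

print(round(avg_a, 2), round(avg_b, 2))   # 0.75 vs 0.42: A recalls more of what users used
```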

Beyond accuracy: coverage

  • Accuracy is not the only measure of how good an algorithm is – so we look at other values when assessing
  • Coverage – what percentage of items can the recommender form predictions for?
  • Coverage is measured over the WHOLE pool of items. If algorithm A scores well but only recommends "easy-to-recommend" items, that isn't good, because it is only recommending popular items
  • Long tail – does the recommender reach the niche items, or does it stay with the popular ones? (see the sketch below)
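
A sketch of coverage over the whole item pool: both the share of items the algorithm can form predictions for, and the share that ever appears in a recommendation list (item IDs are made up):

```python
# Coverage over the WHOLE catalogue (all identifiers are illustrative).
catalogue   = {"i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9", "i10"}

predictable = {"i1", "i2", "i3", "i4", "i5", "i6"}   # items the algorithm can score
recommended = {"i1", "i2", "i3"}                     # items that ever made a top-N list

prediction_coverage = len(predictable) / len(catalogue)   # 0.6
catalogue_coverage  = len(recommended) / len(catalogue)   # 0.3 - the long tail is untouched

print(prediction_coverage, catalogue_coverage)
```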

Beyond accuracy: Novelty/ Serendipity

  • It is easy to recommend items that are common or that the user is used to, but what about something the user is not used to?
  • Novelty – something that is not within the user's usual items, but is close to them, and hence broadens the user's experience
  • Serendipity – something new, yet relevant: a pleasant surprise
  • In collaborative filtering, novelty and serendipity are embedded, because the approach is about broadening the user's horizon with the help of the crowd
  • In content-based and knowledge-based approaches, they can be estimated within the algorithm by looking at similarities and differences in the item metrics – a more sophisticated way, because it allows reasoning
  • → Ultimately they are subjective metrics, because we need to check whether the user was indeed pleasantly surprised

Beyond accuracy: Diversity

  • In data-driven evaluation, how diverse is the list of items we recommend?
  • Variety – do we have items from all content categories?
  • Balance – are the recommended items from each category proportional to the number of items in that category?
  • Disparity – how far apart are the recommended items from each other? (note the relation to clustering; see the sketch below)
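
A rough sketch of the three aspects for a single recommendation list, assuming we have item categories and simple feature vectors to measure distances with (all values are invented):

```python
# Variety, balance and disparity of one recommendation list (invented data).
import math
from collections import Counter

recommended    = ["i1", "i2", "i3", "i4"]
category       = {"i1": "action", "i2": "action", "i3": "comedy", "i4": "drama"}
features       = {"i1": (1.0, 0.0), "i2": (0.9, 0.1),   # toy 2-d item vectors
                  "i3": (0.0, 1.0), "i4": (0.5, 0.5)}
all_categories = {"action", "comedy", "drama", "horror"}

# Variety: how many of the catalogue's categories are represented in the list?
variety = len({category[i] for i in recommended}) / len(all_categories)

# Balance: how the list is distributed over categories
# (to be compared with the category proportions in the whole catalogue).
balance = Counter(category[i] for i in recommended)

# Disparity: average pairwise distance between the recommended items.
pairs = [(a, b) for idx, a in enumerate(recommended) for b in recommended[idx + 1:]]
disparity = sum(math.dist(features[a], features[b]) for a, b in pairs) / len(pairs)

print(variety, balance, round(disparity, 2))
```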

Experimental studies with users

  • To evaluate the whole recommender system, we look at the impact of the system on the user
  • A/B testing – we run two systems, with and without recommendations, and compare them on pre-defined metrics
    • Performance metrics based on user interaction – we can use MAE/RMSE, precision, recall, coverage, diversity, etc. as performance metrics
    • Utility metrics – what was the benefit of the algorithm for the user/business?
      • User satisfaction, how long the user stayed, user loyalty, profit, etc.
  • Then we compare the values of the metrics – mean, median and mode
  • We also look at statistical significance – the system may perform better in some situations, which doesn't mean it will perform better every time we run the test
    • Large dataset – we can use parametric tests (t-test, ANOVA)
    • Not large – non-parametric methods (see the sketch below)
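
A sketch of the significance check, assuming a per-user utility metric (e.g. minutes per session) collected for the two A/B groups; it uses SciPy's t-test and the non-parametric Mann–Whitney U test (values are invented):

```python
# Is the A/B difference statistically significant, or just luck? (invented data)
from scipy import stats

with_recs    = [12.1, 9.8, 14.3, 11.0, 13.5, 10.2, 12.8, 11.9]   # minutes per session
without_recs = [10.4, 9.1, 11.2, 10.8, 9.9, 10.1, 11.5, 10.0]

# Large, roughly normal samples: parametric test (t-test; ANOVA for >2 groups).
t_stat, p_parametric = stats.ttest_ind(with_recs, without_recs)

# Small or skewed samples: non-parametric alternative.
u_stat, p_nonparametric = stats.mannwhitneyu(with_recs, without_recs)

print(p_parametric, p_nonparametric)   # p < 0.05 suggests a real difference
```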

Summary

  • Evaluation is crucial
    • Consider the performance metrics and be clear about what you are evaluating (one aspect of the system, or the whole system)
  • Data-driven assessment
    • We mostly do data-driven evaluation – so do we have access to appropriate data?
    • Be careful – is there any potential bias or overfitting?
    • A small dataset can lead to overfitting
    • There is a similarity with the evaluation of machine-learning classifiers – we can treat the predicted value as a classifier's output
  • Experimental studies with users
    • To evaluate the system as a whole – A/B testing
    • We can consider variations (e.g. one system that serves offers from two different recommender algorithms, and we check which performs better)
