Recommender Systems (Part 4): Evaluation


Evaluation questions

  • Single algorithm
    • Which parameter setting is better?
      • E.g. k-nearest neighbours: 2 or 10 neighbours?
    • In which cases does the algorithm perform best/worst?
      • E.g. Netflix and "gray sheep" items such as the movie Napoleon Dynamite.
      • This is very important: when analysing an algorithm, know its strengths and weaknesses so you can think of ways to hybridise it or minimise the weaknesses.
  • Several algorithms
    • Which algorithm performs best in the current circumstances? How do we select or hybridise algorithms based on their strengths and weaknesses?
  • Recommender system
    • What are the benefits and drawbacks of the recommendations for the user experience and for the business?
    • We look at the impact of the system on the user
      • E.g. in e-commerce, a system where the user gets a recommendation, finds the item and leaves is good for the user, but not for the business: it can be more profitable if the user is slightly "confused" and ends up buying items they didn't know they needed

Data driven evaluation - Rating Accuracy

  • For a single algorithm, or to compare several, we use data-driven evaluation
    • Predicted ratings – p1, p2, p3, …
    • Actual ratings – a1, a2, a3, …
  • If we look at accuracy, we are looking at the difference between these two vectors
  • MAE – mean absolute error
    • Take the difference between the actual and predicted rating, take its absolute value, then take the average
    • MAE = (1/n) Σ |aᵢ − pᵢ|
  • RMSE – root mean square error
    • Slightly more advantageous than MAE, because it penalises large errors more (see the comparison below)
    • Square each error aᵢ − pᵢ, average the squares, then take the square root
    • RMSE = √( (1/n) Σ (aᵢ − pᵢ)² )
  • Correlation coefficient
    • Correlation between the two vectors p and a
    • Pearson correlation measures how closely the two vectors vary together – i.e. how "close" the predictions are to the actual ratings
    • Alternatively, we can rank the items within each vector and compare the two rankings (rank correlation)
    • A sketch of all three metrics follows this list
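
A minimal Python sketch of the three metrics above, using made-up rating vectors purely for illustration:

```python
# Rating-accuracy metrics: MAE, RMSE and Pearson correlation.
# The actual/predicted values below are invented for illustration.
import math

def mae(actual, predicted):
    """Mean absolute error: average of |a_i - p_i|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson(actual, predicted):
    """Pearson correlation between the actual and predicted rating vectors."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    norm_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (norm_a * norm_p)

actual    = [4, 3, 5, 2, 1]   # ratings the users actually gave
predicted = [3, 3, 4, 3, 2]   # ratings the recommender predicted

print(mae(actual, predicted), rmse(actual, predicted), pearson(actual, predicted))
```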

Calculate MAE and RMSE

  • How do we know if the resulting value is good or bad?
    • We need a baseline algorithm, and we try to get a value that is better than the baseline (see the sketch below)
    • We will also look at other metrics of the algorithm's performance
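
To make "better than the baseline" concrete, here is a small sketch assuming a naive baseline that always predicts the global mean of the training ratings; all numbers are invented:

```python
# Sketch: compare an algorithm's MAE against a naive global-mean baseline.
# All ratings here are illustrative, not the lecture's example.

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

train_ratings = [4, 5, 3, 4, 2, 5, 3]   # ratings seen during training
actual        = [4, 3, 5, 2, 1]         # held-out actual ratings
algo_pred     = [3, 3, 4, 3, 2]         # the algorithm's predictions

global_mean   = sum(train_ratings) / len(train_ratings)
baseline_pred = [global_mean] * len(actual)   # baseline: always predict the mean

print("baseline MAE :", round(mae(actual, baseline_pred), 3))   # ~1.343
print("algorithm MAE:", round(mae(actual, algo_pred), 3))       # 0.8
# The algorithm is only worth having if its error is clearly below the baseline's.
```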

Compare recommender algorithms

  • When we compare more than one algorithm, the metrics become more meaningful
  • For each recommendation list we have calculated the absolute error and the squared error
  • A bad prediction is one where the prediction and the actual rating differ by a lot
  • Recommendation 1 is better than recommendation 2
  • Recommendation 3 is not as bad as recommendation 2, but if we only look at the absolute error, R1 and R3 perform the same. BUT R3 contains one extreme error on a single item, and this dominates the error once we square it
    • This is why R3's RMSE is higher
    • RMSE is hence preferred‼ because it weights the damaging, extreme errors more heavily (see the sketch below)
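
A small illustration of this point: two recommendation lists with identical MAE, where the one containing a single extreme error gets a much higher RMSE (the ratings are invented, not the slide's numbers):

```python
# Same MAE, different RMSE: one extreme miss dominates once errors are squared.
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [5, 4, 3, 2]
rec_1  = [4, 3, 2, 1]   # four small errors of 1 each
rec_3  = [1, 4, 3, 2]   # one extreme error of 4, the rest perfect

print(mae(actual, rec_1), mae(actual, rec_3))     # 1.0 and 1.0  -> look identical
print(rmse(actual, rec_1), rmse(actual, rec_3))   # 1.0 and 2.0  -> RMSE flags rec_3
```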

Data driven evaluation - Usage Prediction

  • Accuracy measures how close the predicted ratings are overall, but it doesn't tell us where the algorithm performed well. E.g. what are its precision and recall?
    • We need to build a confusion matrix
    • TP – true positive: the algorithm recommended the item and the user used it
    • FP – false positive: the algorithm recommended the item but the user did not use it
    • TN – true negative: the algorithm did not recommend the item and the user did not use it
    • FN – false negative: the algorithm didn't recommend the item, but the user used it
  • Precision = TP / (TP + FP); Recall = TP / (TP + FN)
  • Given the values in the example, the algorithm did better on precision than on recall (see the sketch below)
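
A sketch of precision and recall computed from confusion-matrix counts; the counts are invented stand-ins for the ones in the lecture's example:

```python
# Usage-prediction metrics from confusion-matrix counts (invented numbers).
tp = 30   # recommended and the user used it
fp = 10   # recommended but the user did not use it
tn = 50   # not recommended and not used
fn = 20   # not recommended but the user used it

precision = tp / (tp + fp)   # of what we recommended, how much was actually used?
recall    = tp / (tp + fn)   # of what the user used, how much did we recommend?

print(round(precision, 2), round(recall, 2))   # 0.75 vs 0.6: better on precision
```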

Compare recommender algorithms

  • How do we use these metrics to compare algorithms?
  • Recall – e.g. one algorithm recommended everything the user actually used, so its recall is 1
    • The other algorithm's recall of 0.5 < 1, so the first does better on recall
  • This is how we can compare: compute the metric for each individual user and take the average (see the sketch below)
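
A sketch of that comparison, averaging a per-user metric over users; the per-user recall values are invented:

```python
# Compare two algorithms by averaging per-user recall (invented values).
algo_a_recall = {"user1": 1.0, "user2": 0.5, "user3": 0.75}
algo_b_recall = {"user1": 0.5, "user2": 0.5, "user3": 0.25}

avg_a = sum(algo_a_recall.values()) / len(algo_a_recall)
avg_b = sum(algo_b_recall.values()) / len(algo_b_recall)

print(round(avg_a, 2), round(avg_b, 2))   # 0.75 vs 0.42: A recalls more of what users used
```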

Beyond accuracy: coverage

  • Accuracy is not the only measure of how good an algorithm is – so we look at other values when assessing
  • Coverage – what percentage of items can the recommender form predictions for?
  • Coverage is measured over the WHOLE pool of items. If algorithm A scores well but only recommends "easy-to-recommend" items, that isn't good, because it is only recommending popular items
  • Long tail – does the recommender reach the niche items, or does it stay with the popular ones? (see the sketch below)
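
A sketch of coverage over the whole item pool: both the share of items the algorithm can form predictions for, and the share that ever appears in a recommendation list (item IDs are made up):

```python
# Coverage over the WHOLE catalogue (all identifiers are illustrative).
catalogue   = {"i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8", "i9", "i10"}

predictable = {"i1", "i2", "i3", "i4", "i5", "i6"}   # items the algorithm can score
recommended = {"i1", "i2", "i3"}                     # items that ever made a top-N list

prediction_coverage = len(predictable) / len(catalogue)   # 0.6
catalogue_coverage  = len(recommended) / len(catalogue)   # 0.3 - the long tail is untouched

print(prediction_coverage, catalogue_coverage)
```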

Beyond accuracy: Novelty/ Serendipity

  • It is easy to recommend items that are common or that the user is used to, but what about something the user is not used to?
  • Novelty – something that is not within the user's usual items, but is close to them, and hence broadens the user's experience
  • Serendipity – something new, yet relevant: a pleasant surprise
  • In collaborative filtering, novelty and serendipity are embedded, because the approach is about broadening the user's horizon with the help of the crowd
  • In content-based and knowledge-based approaches, they can be estimated within the algorithm by looking at similarities and differences in the item metrics – a more sophisticated way, because it allows reasoning
  • → Ultimately they are subjective metrics, because we need to check whether the user was indeed pleasantly surprised

Beyond accuracy: Diversity

  • In data-driven evaluation, how diverse is the list of items we recommend?
  • Variety – do we have items from all content categories?
  • Balance – are the recommended items from each category proportional to the number of items in that category?
  • Disparity – how far apart are the recommended items from each other? (note the relation to clustering; see the sketch below)
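
A rough sketch of the three aspects for a single recommendation list, assuming we have item categories and simple feature vectors to measure distances with (all values are invented):

```python
# Variety, balance and disparity of one recommendation list (invented data).
import math
from collections import Counter

recommended    = ["i1", "i2", "i3", "i4"]
category       = {"i1": "action", "i2": "action", "i3": "comedy", "i4": "drama"}
features       = {"i1": (1.0, 0.0), "i2": (0.9, 0.1),   # toy 2-d item vectors
                  "i3": (0.0, 1.0), "i4": (0.5, 0.5)}
all_categories = {"action", "comedy", "drama", "horror"}

# Variety: how many of the catalogue's categories are represented in the list?
variety = len({category[i] for i in recommended}) / len(all_categories)

# Balance: how the list is distributed over categories
# (to be compared with the category proportions in the whole catalogue).
balance = Counter(category[i] for i in recommended)

# Disparity: average pairwise distance between the recommended items.
pairs = [(a, b) for idx, a in enumerate(recommended) for b in recommended[idx + 1:]]
disparity = sum(math.dist(features[a], features[b]) for a, b in pairs) / len(pairs)

print(variety, balance, round(disparity, 2))
```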

Experimental studies with users

  • To evaluate the whole recommender system, we look at the impact of the system on the user
  • A/B testing – we run two systems, with and without recommendations, and compare them on pre-defined metrics
    • Performance metrics based on user interaction – we can use MAE/RMSE, precision, recall, coverage, diversity, etc. as performance metrics
    • Utility metrics – what was the benefit of the algorithm for the user/business?
      • User satisfaction, how long the user stayed, user loyalty, profit, etc.
  • Then we compare the values of the metrics – mean, median and mode
  • We also look at statistical significance – the system may perform better in some situations, which doesn't mean it will perform better every time we run the test
    • Large dataset – we can use parametric tests (t-test, ANOVA)
    • Not large – non-parametric methods (see the sketch below)
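
A sketch of the significance check, assuming a per-user utility metric (e.g. minutes per session) collected for the two A/B groups; it uses SciPy's t-test and the non-parametric Mann–Whitney U test (values are invented):

```python
# Is the A/B difference statistically significant, or just luck? (invented data)
from scipy import stats

with_recs    = [12.1, 9.8, 14.3, 11.0, 13.5, 10.2, 12.8, 11.9]   # minutes per session
without_recs = [10.4, 9.1, 11.2, 10.8, 9.9, 10.1, 11.5, 10.0]

# Large, roughly normal samples: parametric test (t-test; ANOVA for >2 groups).
t_stat, p_parametric = stats.ttest_ind(with_recs, without_recs)

# Small or skewed samples: non-parametric alternative.
u_stat, p_nonparametric = stats.mannwhitneyu(with_recs, without_recs)

print(p_parametric, p_nonparametric)   # p < 0.05 suggests a real difference
```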

Summary

  • Evaluation is crucial
    • Consider the performance metrics and be clear about what you are evaluating (one aspect of the system, or the whole system)
  • Data-driven assessment
    • We mostly do data-driven evaluation – so do we have access to appropriate data?
    • Be careful – is there any potential bias or overfitting?
    • A small dataset can lead to overfitting
    • There is a similarity with the evaluation of machine-learning classifiers – we can treat the predicted value as a classifier's output
  • Experimental studies with users
    • To evaluate the system as a whole – A/B testing
    • We can consider variations (e.g. one system that serves offers from two different recommender algorithms, and we check which performs better)
