The Problem With Averaging Ratings

Personal rating scales

Here’s a common situation: We want to compare a number of items, like movies, restaurants, or job applicants, to see which ones we collectively prefer. So we have people provide a numerical rating for each item, say a number between 1 and 10, and we compare the items by their average ratings. This approach goes back almost a hundred years, but it has a fundamental issue: people rate according to different personal rating scales.

For example, suppose we have two people, Alice and Bob, rating movies.

Alice tends to give ratings between 2 and 9 (out of 10), and Bob tends to give ratings between 8 and 10.

If Alice and Bob both rate a movie 8, then these numbers mean very different things! Alice loved the movie, but Bob hated it—and the average treats these two ratings as the same.

Averaging takes the numbers at face value. A score of 8 is treated as literally twice as good as a 4, and the difference between 7 and 8 is assumed to be the same as the difference between 9 and 10. But this doesn't capture the difference in meaning between Alice's and Bob's ratings.

Personal rating scales can lead to unfair comparisons, as well. If Alice and Bob rate different job applicants, Bob’s high ratings give an unfair advantage to anyone he rates. Simply averaging ratings forces us to interpret these numbers in a way which is divorced from their context.
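
To make this concrete, here's a tiny Python sketch (my own toy numbers, mirroring the Alice and Bob example above) of what a plain per-item average does:

```python
import numpy as np

# Rows are raters (Alice, Bob), columns are items.
ratings = np.array([
    [8.0, 2.0, 9.0],   # Alice: spreads her ratings across the scale
    [8.0, 9.0, 10.0],  # Bob: compresses everything into 8-10
])

# The naive per-item average treats Alice's 8 and Bob's 8 as interchangeable,
# even though 8 is near Alice's ceiling and at Bob's floor.
print(ratings.mean(axis=0))  # [8.  5.5 9.5]
```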

Looking at the data

If we take a closer look at the data, we can get a hint as to where we're going wrong. When we average the ratings for a particular item, we only look at one column of the rating matrix at a time.

However, we have much more information available. Looking at each row of the rating matrix provides context for that person's ratings: when Alice rates multiple items, we can learn her personal rating scale via her empirical rating distribution.
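
For example, here's a small Python sketch (my own toy code with made-up rating histories, not the implementation linked at the end) of how a row of the rating matrix gives us a person's empirical CDF, and of how differently an 8 sits on Alice's and Bob's scales:

```python
import numpy as np

def empirical_cdf(row):
    """F_k(r): the fraction of this person's ratings that are <= r."""
    arr = np.asarray(row, dtype=float)
    return lambda r: np.mean(arr <= r)

# Hypothetical rating histories (one row of the rating matrix per person).
alice = [2, 5, 7, 8, 9]
bob   = [8, 9, 9, 10, 10]

print(empirical_cdf(alice)(8))  # 0.8 -- an 8 sits near the top of Alice's scale
print(empirical_cdf(bob)(8))    # 0.2 -- an 8 sits at the bottom of Bob's scale
```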

Now let's see how we can use this insight to average the ratings in a smarter way.

A common scale

If we can't average Alice's and Bob's ratings because they correspond to different rating scales, then we should first convert each rating to a corresponding value on a common consensus scale to make them comparable. In our example, Alice's 8 might be converted to a consensus 9, while Bob's 8 might be converted to a consensus 2. But how do we determine a common scale? Well, if we take a step back, we notice that converting all of Alice's scores to the common scale induces a transformation of her personal rating scale.

This transformation is known as an optimal transport map, and we can measure how much it alters Alice's rating distribution. A natural consensus scale is then the distribution that minimizes the total amount by which we alter everyone's rating distributions; this average of the personal rating scales is known as a Wasserstein barycenter.

In mathematical terms, the Wasserstein barycenter has a simple formula. If everyone's personal rating distributions have cumulative distribution functions \( F_1, \dots, F_n \) (so the quantile functions are \( F_1^{-1},\dots,F_n^{-1} \)), then the consensus scale \( \widehat \mu \) has quantile function \[ F_{\widehat \mu}^{-1} = \frac{1}{n} \sum_{i=1}^n F_i^{-1}, \] and the optimal transport map converting person \( k \)'s ratings to the common scale is \[F_{\widehat \mu}^{-1} \circ F_k = \frac{1}{n} \sum_{i=1}^n F_i^{-1} \circ F_k. \]
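
Here's a quick Python sketch of these formulas for empirical distributions, using the same made-up Alice and Bob histories as above and one standard convention for the empirical quantile function. Note how it sends Alice's 8 and Bob's 8 to very different values on the consensus scale:

```python
import numpy as np

def quantile_fn(row):
    """Generalized inverse F_k^{-1} of one person's empirical CDF."""
    srt = np.sort(np.asarray(row, dtype=float))
    n = len(srt)
    # Smallest observed rating r with F_k(r) >= q.
    return lambda q: srt[np.clip(int(np.ceil(q * n)) - 1, 0, n - 1)]

def barycenter_quantile(rows):
    """Consensus quantile function: F_muhat^{-1}(q) = (1/n) sum_i F_i^{-1}(q)."""
    fns = [quantile_fn(row) for row in rows]
    return lambda q: np.mean([f(q) for f in fns])

alice = [2, 5, 7, 8, 9]       # hypothetical rating histories, as before
bob   = [8, 9, 9, 10, 10]
consensus_inv = barycenter_quantile([alice, bob])

# Alice's 8 sits at quantile 0.8 of her scale; Bob's 8 sits at quantile 0.2.
print(consensus_inv(0.8))  # 9.0 -- Alice's 8 maps high on the consensus scale
print(consensus_inv(0.2))  # 5.0 -- Bob's 8 maps much lower
```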

The rating estimator

Once we convert the ratings to a common scale, we can average them fairly, and we get the rating estimator. It has two steps, the first of which we've already described.

Step 1. (Primitive ratings): Given an item with ratings \( r_1, \dots, r_n \), define the primitive aggregate rating for that item as \[ \begin{align*} R_0(r_1,\dots,r_n) &:= \frac{1}{n} \sum_{k=1}^n F_{\widehat \mu}^{-1} \circ F_k(r_k) \\ &= \frac{1}{n^2} \sum_{i=1}^n \sum_{k=1}^n F_i^{-1} \circ F_k(r_k). \end{align*} \] Calculate the primitive rating for every item, and let \( \nu \) denote the resulting distribution of primitive ratings. (Both steps are sketched in code below.)

Step 2. (Final rescaling): Adjust each primitive rating \( r \) by reporting the aggregate rating for that item as \[ R(r) := F_{\widehat \mu}^{-1} \circ F_\nu(r) = \frac{1}{n} \sum_{i=1}^n F_i^{-1} \circ F_\nu(r). \]

Step 2, applied to the primitive rating of each item, yields the rating estimator's rating for each item. Why do we need this second step? It turns out that if people disagree on preferences, i.e. on which items are better than others, then \( R_0 \) outputs a distribution \( \nu \) of primitive ratings which is different from the consensus scale \( \widehat \mu \). Step 2 rescales the result to fit the consensus scale, without changing the ordering of the values that \( R_0 \) gave us.
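
Putting the two steps together, here's a toy Python sketch of the whole procedure on a small, fully observed rating matrix. This is my own illustration with made-up data and one particular convention for the empirical quantile function; for real use, see the Python implementation linked below.

```python
import numpy as np

def ecdf(values):
    """Empirical CDF: F(r) = fraction of the values that are <= r."""
    arr = np.asarray(values, dtype=float)
    return lambda r: np.mean(arr <= r)

def quantile_fn(values):
    """Generalized inverse F^{-1} of the empirical CDF."""
    srt = np.sort(np.asarray(values, dtype=float))
    n = len(srt)
    return lambda q: srt[np.clip(int(np.ceil(q * n)) - 1, 0, n - 1)]

def rating_estimator(rating_matrix):
    """rating_matrix[k][j] = rating that person k gave item j (no missing entries)."""
    rating_matrix = np.asarray(rating_matrix, dtype=float)
    cdfs = [ecdf(row) for row in rating_matrix]                    # F_k
    quantiles = [quantile_fn(row) for row in rating_matrix]        # F_k^{-1}
    consensus_inv = lambda q: np.mean([f(q) for f in quantiles])   # F_muhat^{-1}

    # Step 1: primitive rating R_0 for each item (column of the matrix).
    primitive = [
        np.mean([consensus_inv(cdfs[k](r)) for k, r in enumerate(col)])
        for col in rating_matrix.T
    ]

    # Step 2: rescale through F_nu and F_muhat^{-1}, preserving the ordering.
    F_nu = ecdf(primitive)
    return [consensus_inv(F_nu(r)) for r in primitive]

ratings = [[8, 5, 9, 2],      # Alice's ratings of four items
           [8, 9, 10, 9]]     # Bob's ratings of the same items
print([round(float(x), 2) for x in rating_estimator(ratings)])
# [7.0, 8.5, 9.5, 7.0] on this toy data
```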

For more details, including an application to anime ratings from MyAnimeList, check out my paper on the rating estimator! And if you want to try out the rating estimator on your own rating data, here's a Python implementation, as well. Here are some highlights from the paper: