# Universal metric for cross-comparison of spaced repetition algorithms

This text is part of a series of articles about SuperMemo, a pioneer of spaced repetition software since 1987.

Author: Piotr Wozniak

Date: June 2018

## Origins: universal metric

Around 2014, someone proposed that all developers of spaced repetition algorithms should open their actual forgetting index data to users to see if a given algorithm meets the retention criteria set by the learner. This openness is needed to avoid asymmetric information that may lead to a *market for lemons*.

The problem with that proposition is that if two algorithms make different retrievability predictions, they may still have the same average retention prediction. For example: for two repetitions we might measure: (1) R=0.8 and R=0.8, or (2) R=0.7 and R=0.9. Both sets would result in the same forgetting index of 20% (average R=0.8). From the two-component model of long-term memory we know that the stability increase differs at different levels of retrievability. Two algorithms that produce the same forgetting index may have a different effect on memory and long-term retention.

This is when we realized we needed a more universal metric for comparing spaced repetition algorithms.

In SuperMemo 17, we came up with an idea of running two algorithms concurrently: Algorithm SM-15 and Algorithm SM-17. Repetitions are scheduled by Algorithm SM-17, however, all computations of the older algorithm are available at hand for comparisons. As both algorithms produce retrievability predictions, it is possible to compare the computations without running two separate real-life learning processes. Predictions can then be confronted with actual recall data in SuperMemo.

Interestingly, when there are major differences between the algorithms, the user can override the new algorithm with the propositions of the older algorithm. This does not affect the calculations. Predictions are always made for both algorithms and real data is used to compare them. The actual interval used in learning has no effect on the results unless one of the algorithms performs better at certain levels of stability or retrievability or a combination thereof.

For each repetition, we have two predictions R15 and R17 and a score: 1 for recall, and 0 for a memory lapse. For example, if R15=0.8 and R17=0.9 then a successful recall of a memory marks the prediction *R17* as superior (0.9 is closer to 1.0 than 0.8).
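This per-repetition scoring can be sketched in a few lines of Python (a hypothetical illustration; the function name and structure are not SuperMemo's actual code):

```python
def closer_prediction(score, r15, r17):
    """Compare two retrievability bets against the actual repetition score.

    score: 1 for recall, 0 for a memory lapse; r15/r17: predictions in [0, 1].
    Returns the label of the algorithm whose bet was closer to the outcome.
    """
    e15, e17 = abs(score - r15), abs(score - r17)
    if e15 == e17:
        return "tie"
    return "SM-15" if e15 < e17 else "SM-17"

# Successful recall: R17=0.9 is closer to 1.0 than R15=0.8
print(closer_prediction(1, 0.8, 0.9))  # SM-17
```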

Side-by-side comparison of the algorithms had two main effects: (1) proving the ultimate superiority of Algorithm SM-17, and (2) showing the areas of weakness for both algorithms (e.g. in large collections, Algorithm SM-17 is slower in adapting the first review interval as it always takes into account all of the student's prior repetitions).

## Universal metric theory

If we have a set of recorded repetition scores (wins): W_{1}, W_{2}, W_{3}, …, W_{n}, we would like to measure the accuracy of retrievability predictions by a spaced repetition algorithm. The predictions (bets) that we want to assess are denoted as: B_{1}, B_{2}, B_{3}, …, B_{n}.

We need a universal metric, which is a single number that would express algorithmic accuracy.

The value of such a metric would be to compare different algorithms without the need to run an actual learning process (which may take decades). With a single set of repetition histories (set of wins), we can simulate the execution of any imaginable algorithm and generate the set of predictions (bets).

Ideally, the universal metric should be symmetric, transitive, and equal to zero for a perfectly predicting algorithm.

## Least squares deviation

In the wake of a lengthy discussion at SuperMemopedia.com, in SuperMemo 17, we used a simple RMS deviation. For a set of pairs Win (W) : Bet (B), each algorithm produces the square root of the average of sqr(B-W). The algorithmic comparison metric is the difference of those two metric values for the two algorithms. The algorithm with the lower deviation is marked as the winner for a given set of measurements. Naturally, the bigger the sample of repetitions, the more accurate the verdict of the metric.
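A minimal sketch of the RMS deviation described above (the predictions below are made-up numbers, not actual SuperMemo data):

```python
from math import sqrt

def rms_deviation(wins, bets):
    """Square root of the average of (B - W)^2 over a set of Win:Bet pairs."""
    return sqrt(sum((b - w) ** 2 for w, b in zip(wins, bets)) / len(wins))

# The algorithm with the lower deviation wins for a given set of repetitions.
wins = [1, 1, 0, 1]
algo_a = [0.9, 0.8, 0.3, 0.7]   # hypothetical per-repetition predictions
algo_b = [0.6, 0.5, 0.5, 0.5]
print(rms_deviation(wins, algo_a) < rms_deviation(wins, algo_b))  # True
```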

## Alternative statistics

Both Duolingo and Quizlet use statistical evaluations that make sense but are not ideal for comparing spaced repetition algorithms:

- mean absolute error (MAE), which can easily be gamed
- area under the ROC curve (AUC)
- Spearman rank correlation, which isn't appropriate with just two ranks of W

We have shown earlier that the (B-W)^{n} metric is valid only for n=2 in our case (for MAE, we have n=1).

## Murakowski's universal metric

In an elegant proof (private communication), Murakowski showed that the function f(a)=-log(1-a) is the best metric for comparisons in a set of (W,B) pairs where a=abs(B-W). It is the perfect metric function such that it is minimized when the predicted probability distribution (B_{1}, B_{2}, B_{3}, etc.) matches the actual probability distribution (W_{1}, W_{2}, W_{3}, etc.).
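For binary scores (W is 0 or 1), f(a)=-log(1-a) with a=abs(B-W) works out to the familiar logarithmic loss: -log(B) on recall and -log(1-B) on a lapse. A minimal sketch (the function name is hypothetical):

```python
from math import log

def murakowski_metric(w, b):
    """f(a) = -log(1 - a) with a = |B - W|; W in {0, 1}, B in (0, 1]."""
    a = abs(b - w)
    return -log(1 - a)

# Recall (W=1) with bet B=0.9 costs -log(0.9):
print(round(murakowski_metric(1, 0.9), 4))  # 0.1054
```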

## Expected penalty problem

SuperMemo 17 uses the least squares optimization in a number of contexts. One of those contexts is the recall prediction metric for comparing spaced repetition algorithms (see: R-metric). That metric can also be used in the overall optimization of the parameters of the spaced repetition algorithm to maximize correct recall predictions.

If SuperMemo predicts R=0.8 and the repetition is a success (grade>=3), we have W=1 ("W" stands for "win"), and the metric function is M=f(W,R)=f(1,0.8).

In the case of least squares optimization, we have *f=sqr(W-R)*, i.e. *f=(W-R)^{2}*. Let's name that metric *M=LSD()*.

We have tested numerous metrics, and discovered a serious flaw in the entire optimization approach. All metrics tested thus far show a bias against the mid-range of probabilities of recall (i.e. around R=0.5).

Let's define the expected metric:

`E=E(M)=M(1,R)*R+M(0,R)*(1-R)`

For example, expected metric E(LSD(R=0.5)) is 0.25, while the same expected metric E(LSD(R=0.9)) is just 0.09. For R=0.97, it is only E(LSD(R=0.97))=0.0291.
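These numbers follow directly from the definition: for M(W,R)=(W-R)^{2}, the expected metric simplifies to E = (1-R)^{2}*R + R^{2}*(1-R) = R*(1-R). A quick check:

```python
def expected_lsd(r):
    """Expected squared-error penalty for a perfectly calibrated bet r.

    E = M(1,r)*r + M(0,r)*(1-r) = (1-r)**2 * r + r**2 * (1-r) = r*(1-r)
    """
    return (1 - r) ** 2 * r + r ** 2 * (1 - r)

print(expected_lsd(0.5))              # 0.25
print(round(expected_lsd(0.9), 2))    # 0.09
print(round(expected_lsd(0.97), 4))   # 0.0291
```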

This makes it relatively easy to "cheat" the "universal" metric if it strongly penalizes mid-range bets. For example, one might implement Algorithm SM-2, i.e. an inferior variant of a SuperMemo algorithm, and always predict R=0.9. In other words, independent of the algorithm and the interval, we can always predict the chance of recall to be 90%.

Figure: When a spaced repetition algorithm makes a prediction on the probability of recall (i.e. retrievability), the metric penalty on that prediction may depend on the average level of retrievability even for a perfectly predicting algorithm. This shows to a similar degree in power((B-W),2) or -log(1-abs(B-W)) metrics. In SuperMemo, when recall often hovers at 90%, a perfectly predicting algorithm may be penalized for predictions around 50% and lose ground to a "cheating" algorithm that always bets 90%. The graph shows the expected metric penalty for several metrics: (1) *squared* corresponds with sqr(B-W) used in SuperMemo 17, (2) *-log(1-abs(B-W))* is the Murakowski metric for sets of probability predictions, (3) *power(B-W,4)* shows a minimum at R of 0.5, (4) *abs(B-W)* is a simple metric with poor discriminatory power, and (5) *B-W* is the simplest difference metric, distinguished by the fact that the expected metric is zero for all levels of retrievability (flat blue line along the horizontal axis)

Some expected metric graphs are presented in the figure above.

## Example of cheating

If the expected R is 50%, a perfectly predicting algorithm will get a win score 50% of the time, and a loss score 50% of the time. As a result, a perfect algorithm will always be penalized hard by the presented metrics. For example, in SuperMemo 17, the average penalty would stand at 0.25 (*LSD()* metric, as in the graph above). A cheating algorithm at R=0.5 will also be penalized to the exact same degree independent of the level of its bet.

If we take a cheating algorithm, which always predicts R=0.95 without doing any calculations, the average penalty stands at around 0.1-0.15 for a typical dataset taken from SuperMemo. This is a score that a reasonable, but imperfect, algorithm might struggle to beat, esp. each time it errs by making mid-range predictions of R.

Now the comparison between algorithms will depend on a specific collection and a specific set of numbers used in comparison. In SuperMemo, we always hover at high probability of recall, and we will predict R=0.95 pretty often. Each time we get a repetition at the level of R=0.5, the good algorithm gets penalized hard, even if the prediction is correct. In the meantime, the cheating algorithm will keep the average penalty low (by guessing R=0.95).

We have shown that all previously used metrics could thus be beaten by a cheating algorithm in some sets of data, esp. if the cheat would use the exact recall prediction as the average recall in the repetition set.

It is important to remember that forgetting is exponential and thus random. Each individual score is unpredictable and cannot be "measured by a metric" individually. We need to work on populations of scores. For example, if we take a SuperMemo collection and see that the average recall is R=0.91, why would a perfect algorithm be better than a cheating constant prediction of R=0.91? In a mass of predictive numbers, we have no way of knowing which ones should stand for forgetting, and which should stand for remembering. Only if we start fractionating populations and differentiating, e.g. between population R=0.9 and population R=0.92, can we show that 0.91 is always slightly wrong, while a perfect algorithm can differentiate between the two populations.

## Normalizing the metric

We can take any metric and normalize it by dividing its value by the expected value for a given level of retrievability. This should reduce penalties in the middle range of probabilities (around R=0.5).

To remedy the problem of the expected metric penalty, we attempted normalization for the expected metric deviation:

- Expected metric:
`E=E(M)=M(1,R)*R+M(0,R)*(1-R)`

- Metric deviation:
`D=M(W,R)`

- Normalized metric:
`NM(W,R)=D/(D+E)`

With the above normalization, NM is 0 for D=0, NM is 0.5 for D=E, and NM increases with D up to a maximum of 1/(1+E), reached at the maximal deviation D=1.

The above normalized metric is far more efficient in comparing algorithms. Normalization improves the performance of all metrics tested. It accurately sifts out cheats. However, normalized metrics still depend on the expected metric. Its maximum of 1/(1+E) still depends on E, which depends on R. Normalized metrics are strong, but they are not perfect.
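A minimal sketch of the normalization (the function name is hypothetical; the squared-error metric is assumed as M):

```python
def normalized_metric(w, r, metric=lambda w, r: (w - r) ** 2):
    """NM(W,R) = D / (D + E), where D = M(W,R) and E is the expected metric at R."""
    d = metric(w, r)
    e = metric(1, r) * r + metric(0, r) * (1 - r)
    return d / (d + e)

# NM is 0.5 when the deviation equals its expected value (D = E):
print(normalized_metric(1, 0.5))             # 0.5
# A lapse at a high bet stays below the cap 1/(1+E), about 0.917 for R=0.9:
print(round(normalized_metric(0, 0.9), 3))   # 0.9
```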

## Law of the perfect algorithm

Due to the random nature of forgetting, for a set of retrievability predictions, we can formulate **the law of the perfect spaced repetition algorithm**:

**perfect algorithm**

For a perfect algorithm, a perfect prediction of retrievability for a uniform population should produce a metric of zero.

In other words, we should not compute a metric on individual prediction:score pairs (B:W). All we need is to add up the scores.

If all we have is a set of retrievability predictions, a cheating algorithm that guesses the right level of recall is always as good as an algorithm that perfectly understands human memory.

## Circumventing the law

Naturally, SuperMemo can easily circumvent the law of the perfect algorithm. In addition to the set of retrievability predictions and a set of scores (i.e. grades), it also has the timing of repetitions and a model of memory. Using this data and the memory model, SuperMemo can classify repetitions into expected recall categories, and compete with other algorithms in all those categories. The random nature of forgetting makes it impossible for SuperMemo to be an accurate score predictor as judged by any reasonable score metric. However, SuperMemo is an excellent retrievability classifier. Instead of comparing individual score predictions, the universal metric should compare the ability of spaced repetition algorithms to classify memories into retrievability categories.

## Return of the forgetting index

In a set of retrievability predictions (B) and scores (W), i.e. the set of (B:W) pairs, we cannot employ any of the tested metrics over individual predictions because all metrics discriminate in favor of low and high probability predictions. At the same time, forgetting is entirely random, and there is no way to say which individual scores W=1 are correct predictions.

However, there is a perfect metric that can be employed for all individual pairs. This metric is simply the difference: B-W. For M=B-W, the expected metric is always zero for perfect predictions. The expected metric graph is perfectly flat and non-discriminatory (see above). Adding up the scores of such a metric is equivalent to computing the average retrievability. In SuperMemo lingo, it is not much different from finding out the measured forgetting index for a population of repetitions. The **B-W metric** is a good detector in the light of the law of the perfect algorithm: it is always 0 for a set with an accurate average recall prediction. If the B-W metric is indeed a perfect metric for uniform classes of memories, we can say that the good old forgetting index was a good optimization criterion for prior algorithms, starting with Algorithm SM-6. It only requires that an algorithm be a good retrievability classifier. For example, a good forgetting index in Algorithm SM-2 would indicate reasonably good performance due to the fact that the algorithm is a reasonably good classifier of memories. Today, we depart from the forgetting index criterion because of the need to optimize non-optimum schedules.
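Summed over a population, the B-W metric reduces to the difference between the average bet and the measured recall, i.e. it compares the predicted and the measured forgetting index. A sketch with made-up numbers:

```python
def bw_metric(wins, bets):
    """Average (B - W): zero when the average bet equals the measured recall."""
    n = len(wins)
    return sum(bets) / n - sum(wins) / n

# 10 repetitions, 9 recalled: measured recall 0.9 (forgetting index 10%).
wins = [1] * 9 + [0]
print(abs(bw_metric(wins, [0.9] * 10)) < 1e-9)   # True: calibrated bets
print(round(bw_metric(wins, [0.8] * 10), 6))     # -0.1: recall underestimated
```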

## Law of the perfect metric

Due to the random nature of forgetting, for a set of retrievability predictions and scores, we can formulate **the law of the perfect spaced repetition metric**:

A perfect algorithm that employs a perfect metric, in an infinite uniform population of repetitions will always produce a metric of zero, for its recall prediction is perfect.

## Cheating the perfect metric

For a population of uniform repetitions, any algorithm can cheat the perfect metric by guessing the correct average recall and betting on that recall at each prediction. If we cannot discriminate between populations of repetitions, we cannot discriminate between good and bad algorithms.

## Comparing algorithms

If an algorithm is a good classifier, and can discriminate between populations of repetitions, we can use this classifying power in comparisons with other algorithms. We can integrate the absolute value, or the squared value, of the B-W metric function over R from 0 to 1. In SuperMemo 18, with *M(R)=B-W*, this integration over the retrievability spectrum forms the basis of the universal metric.

A metric defined in this way is not only useful in comparing algorithms. It can also be used by the algorithm against its own performance as an optimization metric. In other words, we can run all imaginable hill-climbing procedures on algorithmic parameters to maximize the above metric and inch towards **maximizing the performance of a spaced repetition algorithm**.

In a discrete algorithm, we split the [0,1] spectrum into sufficiently many quantiles, and average metrics over the whole spectrum of R to provide a trapezoidal approximation of the metric. We can employ any reasonable statistical metric over the obtained set of results to penalize outliers (SuperMemo uses RMSD). The problem remains that we cannot perfectly classify repetition populations without having a perfect algorithm for all levels of retrievability.

Instead, we can use the trick of a cross-comparison by **playing the algorithms against each other**. Let Algorithm A classify repetitions by the predicted level of R, and produce a simple B-W metric for each retrievability quantile for both algorithms. Let Algorithm B then do the same in reverse. The algorithm that is a better classifier will score better (i.e. achieve a lower cumulative metric). Such a comparison is obviously reflexive. It is also easy to show that it is symmetric. It is less trivial to prove it is transitive.
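A sketch of such a play-off, under simplifying assumptions (fixed-width bins over predicted R instead of true quantiles; function names hypothetical):

```python
from math import sqrt

def binned_bw_rms(classifier_bets, scored_bets, wins, n_bins=11):
    """Bin repetitions by one algorithm's predicted R, then take the RMS of
    the other algorithm's average B - W deviation per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, b, w in zip(classifier_bets, scored_bets, wins):
        bins[min(int(c * n_bins), n_bins - 1)].append(b - w)
    per_bin = [sum(d) / len(d) for d in bins if d]
    return sqrt(sum(m * m for m in per_bin) / len(per_bin))

def cross_compare(bets_a, bets_b, wins):
    """Play two algorithms against each other; the lower score wins."""
    score_a = binned_bw_rms(bets_b, bets_a, wins)  # B classifies, A is scored
    score_b = binned_bw_rms(bets_a, bets_b, wins)  # A classifies, B is scored
    return score_a, score_b

# A discriminating algorithm vs. a constant-bet cheat on two item populations:
wins = [1] * 5 + [0] * 5 + [1] * 9 + [0]       # true recall 0.5 and 0.9
discriminating = [0.5] * 10 + [0.9] * 10
cheat = [0.7] * 20                              # bets the global average
score_good, score_cheat = cross_compare(discriminating, cheat, wins)
print(score_good < score_cheat)                 # True: the cheat is weeded out
```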

In tests conducted thus far, cross-comparisons based on the spectrum of R and the perfect B-W metric prove to be extremely accurate in weeding out poor algorithms.

## Algorithm SM-17 as a good classifier

Instead of using quantiles for algorithmic cross-comparisons, we can use simpler tools for determining the accuracy of a single algorithm. For example, we can also use a moving average over the retrievability spectrum to eliminate the noise of the randomness of forgetting. The graph below was generated using a bi-directional trailing average. It shows that Algorithm SM-17 is a good classifier, and can be used in algorithmic cross-comparison as an independent benchmark that could ensure a transitive metric. The same role could be played by any reasonably accurate algorithm. The graph below may suggest that the *area under the curve* approach might work in this case, however, it can be demonstrated that an AUC metric can easily be gamed. In SuperMemo 17, cross-comparison metrics use Algorithm SM-17 as a frame of reference.

Figure: Algorithm SM-17 is an accurate classifier of repetitions by the predicted retrievability level. The predicted R and the measured R align into a nearly straight line. A dataset with 100,000 repetitions was used to produce this graph. Due to its ability to accurately classify repetitions, Algorithm SM-17 alone may be used as a basis for an approximation of a transitive algorithmic metric for cross-algorithmic comparisons

## Algorithm SM-17 as a good predictor

If we group items by difficulty and stability, we can investigate individual predictions for each category. This task is harder for some categories, which may collect only a small number of repetition cases. Moreover, specific areas of algorithmic weakness can be detected. In the pictures below, we can see the areas of retrievability underestimates (cumulative B-W metric) and the area of weaker overall prediction (square root of mean square B-W metric):

Figure: A cumulative Bet-Win metric in Algorithm SM-17 shows the areas where predictions might be less accurate. The metric sums up individual metrics for all retrievability quantiles. This does not eliminate outliers, but allows spotting areas of underestimates and overestimates. Reddish colors indicate a retrievability underestimate (the student's recall beats the bet). Bluish colors show overestimates, where the student performs worse than predicted. Columns group items by difficulty, while rows correspond with stability categories (expressed in days). Averages on the right and at the bottom are not arithmetic correspondents of their rows and columns due to weighing by repetition counts. The ultimate metric for the entire data set can be found in the bottom-right corner, and shows that there is still room for improvement by parameter optimization. Even though low stabilities seem to be associated with a negative metric (good student performance), the picture is obscured by the fact that high stabilities carry far less data and are less accurate. The ultimate goal is to fine-tune the algorithm to get the B-W metric approaching zero for large datasets. In the optimization processes, the square root of the mean square needs to be taken to eliminate outliers. Otherwise, the metric might be brought to zero for a specific data set

Figure: A universal Bet-Win metric in SuperMemo shows the areas of stability and difficulty where predictions might be less accurate. The universal metric is the square root of the mean square of the B-W metrics for individual retrievability quantiles. Red color indicates poor estimates for small populations. White and bluish colors indicate accurate predictions. Columns group items by difficulty, while rows correspond with stability categories (expressed in days). Averages on the right and at the bottom are not arithmetic correspondents of their rows and columns due to weighing by repetition counts. The ultimate metric for the entire data set can be found in the bottom-right corner, and shows that there is still room for improvement by parameter optimization. The ultimate goal is to fine-tune the algorithm to get the universal B-W metric approaching zero for large datasets. Important! The actual metric in SuperMemo will be better than the one shown in the bottom-right corner. This is due to the fact that the presented metrics have been derived from the Recall[] matrix that is based on theoretical retrievability (theory vs. data), while the ultimate universal metric in SuperMemo will use the best weighted retrievability prediction that includes the Recall[] data (data-based prediction vs. grades)

## Algorithmic contest: SuperMemo 2 vs. SuperMemo 17

As an example of an algorithmic contest, consider a cross-comparison between two SuperMemo algorithms: Algorithm SM-2 (1987) and Algorithm SM-17 (2016). Each algorithm can be used as a classifier for the retrievability spectrum, and the performance of the other algorithm can be assessed along the spectrum of R. Poor classifiers will be penalized by less accurate recall predictions. For example, a cheating non-classifier algorithm will appear as a **perfect algorithm** by its own criteria, but will easily be debunked by a good classifier (e.g. SuperMemo). A perfect cheating non-classifier may even appear slightly superior to a good algorithm, for it will penalize even the slightest departure from accurate recall prediction. However, it will lose out to any reasonable classifier when its predictions remain constant over the entire spectrum of R.

The picture shows an exemplary contest between SuperMemo 2 and SuperMemo 17:

Figure: Both algorithms, Algorithm SM-2 and Algorithm SM-17, are reasonable retrievability classifiers. However, they clearly disagree on the B-W metric at different levels of R. SuperMemo 2 claims SuperMemo 17 uses excessively long intervals early in the learning process. SuperMemo 17 claims the same in reference to SuperMemo 2 for high stability levels. Which one can be trusted? We know that Algorithm SM-17 is a better classifier. However, even if we did not know that, we could use the RMS deviation over 11 quantiles presented in the graph and show that the cross-comparison metric gives Algorithm SM-2 a relatively poor score of 35. Metaphorically speaking, the accusations of SuperMemo 17 against SuperMemo 2 are stronger, as the blue curve falls to a lower level

## Comparison of 35 algorithms

We used the above cross-algorithmic metric based on B-W differences. We classified retrievability predictions in 11 quantiles (0, 0.1, 0.2, etc.). We ran cross-comparisons between 35 different algorithms. For each comparison, we looked at metrics per quantile for (1) retrievability predictions classified by Algorithm A and (2) actual scores compared with predictions of Algorithm B. Symmetrically, data was obtained for Algorithm A using Algorithm B as the prediction classifier. Pairs of RMS deviations for sets of 11 average per-quantile metrics were compared to determine the accuracy of each algorithm.

Here is a preliminary estimate for an exemplary set of 700,000 repetitions:

- Algorithm SM-17: 1.1
- Algorithm SM-2: 35.3
- Leitner (normalized): 47.7
- Pimsleur (1967): 60.6

The above numbers are dataset-dependent. For an accurate comparison, large and old datasets are recommended. In beginner collections, the differences between algorithms are less pronounced.

Leitner and Pimsleur are out-of-date approaches to spaced repetition, however, we list them here because they were used as benchmarks for measuring the performance of the algorithm used in Duolingo.

If *B:W* comparison metrics other than *B-W* are used in algorithmic play-offs, the differences between algorithms become diluted, and "cheating" algorithms score substantially better.

If you have SuperMemo 17 ver. 17.4, you will find selected metrics for your own dataset in the *\stats* folder (including SM-2, Leitner and Pimsleur). Note that those metrics make sense only for larger datasets (at least hundreds of repetitions) and older collections (to differentiate between cramming and long-term learning).

## Universal metric in algorithmic optimization

The presented metric can also be used in optimizing Algorithm SM-17 itself. Any optimization parameter can be inspected individually, and its impact on performance measured with the use of the metric. A more comprehensive approach is to take a multidimensional hill-climbing approach in which a set of mutually dependent parameters is analyzed as a group. Such procedures are computationally expensive and may require running unsupervised optimizations that may take a few days to produce a meaningful outcome, esp. if large datasets are processed. There are around 30 major parameters in the algorithm that should still be considered heuristic. This should not undermine the theoretical validity of the algorithm. For example, if we have three theoretically valid ways to measure or predict retrievability, we may want to weigh them up for the ultimate verdict. Weights used in such a compromise define a heuristic, which is superior to any of the non-heuristic measures.

Figure: Exemplary outcome of a single-parameter optimization with the use of the universal metric. While investigating various strategies for determining item difficulty, we considered a B-W metric on an individual repetition as a datapoint in determining the current difficulty estimate. It can be shown that difficulties change in the course of learning. Algorithm SM-17 uses estimates that change, but still looks for the ultimate difficulty level for the item. A better approach might be to let the difficulty estimate change even when taking post-factum repetition histories. If the B-W metric was to be used in such an approach, we would need to map the metric to difficulty, which is set in the [0..1] range in SuperMemo. The graph shows the impact of mapping on the performance of the algorithm. For the ratio of 9.4, the universal metric hits the minimum of 4.7%. The value of 9.4 may then be taken as optimum for the analyzed sample set. The impact of this tiny adjustment adds up to a difference in intervals of over 100 days in a decade. Other parameters may have a different impact

## Future

This is definitely not the end of the search for the universal metric. Initially, we hoped to just compare sets of probabilities (see Murakowski metric). In the end, it is impossible to escape the memory model that underlies recall. In the future, it may be necessary to enhance the metric with a further departure from the purity of mathematics. We may need to include the *cost to the student*. For example, the cost of forgetting is much greater for memories reviewed at large intervals. In addition, *cost to the student* is a multicriterial concept. The quest for the universal metric is far from over.