First data-driven spaced repetition algorithm: Algorithm SM-8
This text is part of: "History of spaced repetition" by Piotr Wozniak (June 2018)
Birth of Algorithm SM-8 in a mountain hut
In 1995, SuperMemo was rewritten from the ground up, and it was a great opportunity to implement a new spaced repetition algorithm based on the data collected over 4 years of use of Algorithm SM-6.
In March 1995, at CeBIT in Hannover, we saw a fantastic new development environment from Borland: Delphi. It lifted the old Borland Pascal to a new level and opened dozens of development opportunities for SuperMemo. We decided to redesign the program along the lines depicted in my PhD dissertation. In addition to spaced repetition, we wanted to have knowledge structure and hypermedia. Instead of a mass of items, the users would be able to build a knowledge tree. Instead of the old template made of a question, answer, picture, and sound, we wanted to have all possible component types that could be combined into new hypermedia forms for expressing knowledge. There was also a dream of programmable SuperMemo in which developers could write their own procedures for any form of training, incl. procedural training, touch typing, or solving quadratic equations. At the same time, we had collected a lot of data indicating that the algorithm used in SuperMemo could be improved. For example, the mathematical nature of the matrix of optimal factors had become pretty obvious.
In May 1995, I took my Pentium PC to a remote mountain hut in southern Poland to work on those ideas. That was a period of 100 days of total isolation interrupted only by a short visit from Krzysztof Biedalak during which we re-synchronized our vision for future SuperMemo. By September 1995, the new algorithm was ready, and I tested it on my own data. Back in Poznan, I started gradually moving all my learning process from multiple collections in SuperMemo 7 to the new environment nicknamed "Genius". Genius became SuperMemo 8 only two years later when the new program could finally include all functionality that was originally available in SuperMemo 7.
The most important data that helped develop Algorithm SM-8 were the forgetting curves and OF matrix data collected with SuperMemo 6 and SuperMemo 7. This data took away a great deal of guesswork from the algorithm. The work was pretty easy in comparison to Algorithm SM-17 (2014-2016), when I had mountains of repetition histories to process, and the requirements for precision and good metrics had tripled. While Algorithm SM-17 took two years to develop, Algorithm SM-8 was designed, implemented, and well-tested in a mere 100 days.
The main ideas behind Algorithm SM-8:
- precise mathematical determination of the OF matrix based on live approximations. Instead of the matrix smoothing known from SuperMemo 5, I wanted to know the exact mathematical function that could describe the matrix, and perform live updates. It was easy to determine that a negative power function would describe OF=f(RepNo) (which is an expression of SInc=f(S) in today's terms; see the sketch after this list). A bit more guesswork went into the impact of difficulty on SInc. I opted for a linear approximation of the function mapping difficulty (A-Factor) to the decay constant for SInc (D-Factor), which expressed the decline in stability increase with stability/interval. That linear bet has survived to this day. It was a good guess.
- with a good definition of the OF matrix, I could provide a precise definition of item difficulty: instead of a fluid E-Factor that could be manually controlled by grades at the whim of the user, I wanted to have an absolute difficulty A-Factor, which was defined as the stability increase after the first repetition timed for R=0.9. This made it possible for SuperMemo to adjust item difficulty with each repetition by correcting the fit of the item's performance to the expected performance based on the OF matrix
- faster determination of startup difficulty by correlating the first grade with the A-factor. This is a weak mechanism of little significance, as shown by the fact that even with multi-repetition histories, item difficulty is still a hazy concept. In that context, users should be reminded that the best approach is to formulate items well, and just keep them easy
- approximating the first post-lapse interval by an exponential fit based on the number of memory lapses. The biggest value of that approach was to abolish a myth that reducing the length of intervals in case of a memory lapse could speed up learning. Some authors of software based on Algorithm SM-2 opted for such a solution, which has been proven wrong. This erroneous approach is also known in variants of the Leitner system, which has been used, among others, in Duolingo
- the idea to correlate grades with the forgetting index was a failure and did not contribute to improving the algorithm. That truth transpired slowly. It took nearly a decade to come to the ultimate verdict: grades correlate poorly with the forgetting index. The intuition born with Algorithm SM-2 is only weakly correct
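The first two ideas above can be illustrated with a short sketch. The constants below are made up for illustration and are not the values used in SuperMemo; the sketch only assumes, as described above, a negative power decline of O-Factors with the repetition number (anchored at the A-Factor for the second repetition) and a linear mapping from the A-Factor to the decay constant (D-Factor):

def o_factor(rep_no, a_factor, decay):
    """Negative power approximation of OF=f(RepNo) for rep_no >= 2.

    The curve is anchored so that OF equals the A-Factor at the second
    repetition; the floor of 1.1 is arbitrary and only keeps intervals growing.
    """
    assert rep_no >= 2
    return max(1.1, a_factor * (rep_no - 1) ** -decay)

def d_factor(a_factor, slope=0.4, intercept=0.1):
    """Hypothetical linear mapping from item difficulty (A-Factor) to the
    decay constant (D-Factor); SuperMemo fits this relationship to the
    student's own data."""
    return intercept + slope * (a_factor - 1.2)

af = 1.5                                  # an item of moderate difficulty
d = d_factor(af)
for rep in range(2, 8):
    print(rep, round(o_factor(rep, af, d), 2))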
Interestingly, Algorithm SM-8 did not require full repetition history for elements. Full repetition histories were to be implemented only in February 1996. The advantage was an easier implementation. The disadvantage came with the fact that once the user intervened manually in the learning process, the algorithm had no record of that intervention and could not defend itself from a possible inflow of incorrect data. Naturally, only a full record of repetition history made it possible to implement Algorithm SM-17 two decades later.
My first "live" repetition in Algorithm SM-8 on my own data took place on Aug 16, 1995, Wed. For the test, I "sacrificed" a small 100 item collection with mnemonic peg list for memorizing numbers. Over the next two years, I gradually converted all my other collections to work with the new algorithm and in the new SuperMemo environment. In 1997, all my knowledge have finally been integrated into a single well-structured database. In 1995-1997, we called such a database a "knowledge system". Today we just call it a collection (as in a collection of pieces of knowledge).
To this day, the core of the algorithm born in 1995 runs in SuperMemo 17 in the background, and the user can still choose intervals based on that old algorithm in case he is unhappy with the propositions of Algorithm SM-17.
Absolute item difficulty
In SuperMemo 1.0 through SuperMemo 3.0, E-Factors were defined in the same way as O-Factors (i.e. the ratio of successive intervals). They were an approximate measure of item difficulty (the higher the E-Factor, the easier the item). However, the spaced repetition optimization would force E-Factors to correspond with the stability increase, which drops with stability. In other words, by definition, in Algorithm SM-2, items would be tagged as more and more "difficult" as they were subject to successive repetitions. This is a bit counter-intuitive, but users never seemed to notice.
Starting with SuperMemo 4.0, E-Factors were used to index the matrix of O-Factors. They were still used to reflect item difficulty and to compute O-Factors. However, they could differ from O-Factors and thus make for a better reflection of difficulty.
In SuperMemo 4 through SuperMemo 7, the difficulty of material in a given database would shape the relationship between O-Factors and E-Factors. For example, in an easy collection, the starting-point O-Factor (i.e. the one corresponding with the first repetition and the assumed starting difficulty) would be relatively high. As performance in repetitions determines E-Factors, items of the same difficulty in an easy collection would naturally have a lower E-Factor than exactly the same items in a difficult collection. This all changed in SuperMemo 8, where A-Factors were introduced. A-Factors are "bound" to the second row of the O-Factor matrix. This makes them an absolute measure of item difficulty. Their value does not depend on the content of the collection. For example, you know that if the A-Factor is 1.5, the third repetition will take place after an interval that is 50% longer than the first interval.
Post-lapse interval
Post-lapse interval approximation in Algorithm SM-8 abolished two myths:
- shortening intervals after a lapse is a good idea (this idea was advocated multiple times in the years 1991-2000)
- the first interval should always be 1 day (as in some older SuperMemo solutions)
In the graph presented below, we can see that with successive lapses, the optimum post-lapse interval keeps getting slightly shorter. This expresses nothing else but the fact that those high lapse counts are reached only by badly formulated items, or items that are really hard to remember due to their semantic nature or knowledge interference. For memories starting with Lapses=10, I suggested the term "toxic" to express their impact on the learning process. If the brain rejected a piece of information that many times, we should get a message: this knowledge is badly formulated or has become toxic for other reasons (e.g. stress associated with learning, e.g. at school).
Figure: In the graph above, which includes data from over 130,000 repetitions, newly memorized items are optimally repeated after seven days. However, the items that have been forgotten 10 times (which is rare in SuperMemo) will require an interval of two days. (Due to logarithmic scaling, the size of the circle is not linearly proportional to the data sample; the number of repetition cases for Lapses=0 is far larger than for Lapses=10, as can be seen in Distributions : Lapses)
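A toy illustration of the exponential fit mentioned above. The numbers below are invented, not the data from the graph; only the overall shape, about seven days at Lapses=0 down to about two days at Lapses=10, follows the description:

import numpy as np

lapses = np.arange(0, 11)            # 0..10 prior memory lapses
pli_days = np.array([7.0, 5.8, 5.0, 4.4, 3.9, 3.5, 3.1, 2.8, 2.5, 2.2, 2.0])

# Fit PLI = a * exp(b * lapses) by linear regression on the log of the data
# (b comes out negative, i.e. the post-lapse interval declines with lapses).
b, log_a = np.polyfit(lapses, np.log(pli_days), 1)
a = np.exp(log_a)

def post_lapse_interval(lapse_count):
    """Predicted first interval (in days) after the given number of lapses."""
    return a * np.exp(b * lapse_count)

print(round(post_lapse_interval(0), 1), round(post_lapse_interval(10), 1))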
First grade vs. A-Factor
Correlating the first grade with the estimated item difficulty was to help classify items by difficulty at the entry to the learning process. The correlation appears to be weak and is highly dependent on the user's grading system. For some users, there is virtually no correlation (picture #1). For others, the correlation is good enough to cover the full range of difficulty (A-Factor) (picture #2).
In addition, in Algorithm SM-11, derived from Algorithm SM-8, the user was allowed to execute premature repetitions. Those repetitions would account for the spacing effect; however, they would still contribute to the graph and overestimate the grade for difficult items. With extensive use of incremental reading, this would flatten the graph.
Algorithm SM-17 does not use the grade-difficulty correlation and derives difficulty from the entire repetition history. Practice shows that even then the estimate is hard to make, and the good practice of learning is to keep all items easy (i.e. formulated to fit mnemonically with the rest of the student's knowledge).
Grade vs. Forgetting Index
By correlating grades with the expected forgetting index (predicted retrievability), I hoped to be able to compute the estimated forgetting index (the post-repetition estimate of the actual retrievability). This correlation appeared to be weak because all users tend to deploy their own grading systems, which are often inconsistent. The grade-R correlation comes primarily from the fact that complex items get worse grades and tend to be forgotten faster (at least at the beginning). In that sense, grades provide a better reflection of complexity than of retrievability.
In the picture below, the entire range of the expected forgetting index seems to fall around the grade 3.
For Grade<=3 we can read the maximum estimated forgetting index, and for Grade>=4 we can read the minimum estimated forgetting index. In that light, a two-grade system would have the exact same effect on the algorithm as the six-grade system.
For other users, the curve might even peak at some levels of the expected forgetting index as if grading reflected a wish to remember items that are really hard to remember (lenient grading).
Algorithm SM-17 makes extensive use of retrievability estimated after the repetition; however, it derives it from sheer recall data and the expected retrievability. Grade-retrievability correlations are also collected; however, their weight is negligible.
The FI-G graph is updated after each repetition by using the expected forgetting index and actual grade scores. The expected forgetting index can easily be derived from the interval used between repetitions and the optimum interval computed from the OF matrix. The higher the value of the expected forgetting index, the lower the grade. From the grade and the FI-G graph, we can compute the estimated forgetting index which corresponds to the post-repetition estimation of the forgetting probability of the just-repeated item at the hypothetical pre-repetition stage. Because of the stochastic nature of forgetting and recall, the same item might or might not be recalled depending on the current overall cognitive status of the brain; even if the strength and retrievability of memories of all contributing synapses is/was identical! This way we can speak about the pre-repetition recall probability of an item that has just been recalled (or not). This probability is expressed by the estimated forgetting index.
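For illustration only, assuming the FI-G relationship can be summarized by a simple linear fit over invented data (the real FI-G graph is a correlation graph maintained by SuperMemo, not a linear model), the estimated forgetting index can be read off by inverting the fit:

import numpy as np

expected_fi = np.array([2, 5, 10, 15, 20, 30, 40], dtype=float)   # percent
avg_grade   = np.array([4.2, 4.0, 3.7, 3.5, 3.3, 3.0, 2.7])       # invented averages

# Grades decline with the expected forgetting index, so the slope is negative.
slope, intercept = np.polyfit(expected_fi, avg_grade, 1)

def estimated_forgetting_index(grade):
    """Invert the fitted grade = slope*FI + intercept relationship."""
    return (grade - intercept) / slope

print(round(estimated_forgetting_index(3.0), 1))   # roughly 30% with this toy data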
Algorithm SM-15
Algorithm SM-8 was improved over the years and evolved into Algorithm SM-11 (2002) and then Algorithm SM-15 (2011). Here I only present the latest version: Algorithm SM-15 (used in SuperMemo 15, SuperMemo 16, and as a backup in SuperMemo 17).
The key improvements added to Algorithm SM-8 over two decades were:
- improved stability indexing: instead of using repetition numbers, as of SuperMemo 8 (1997), the algorithm used the concept of "repetition category" which roughly translates to stability
- tolerance for advanced and delayed repetitions, as of SuperMemo 11 (2002): a heuristic has been added to account for the spacing effect
- extending the representation of time in U-Factors from 60 days to 15 years (2011)
- correcting forgetting curve data for repetition delay beyond the original U-Factor span (2011)
Important! Algorithm SM-15 is used only to compute the intervals between repetitions of items. Topics are reviewed at intervals computed with an entirely different algorithm (not described here). The timing of topic review is optimized with a view to managing the reading sequence and is not aimed at aiding memory. Long-term memories are formed in SuperMemo primarily with the help of items, which are reviewed along the schedule computed by Algorithm SM-15.
This is a more detailed description of Algorithm SM-15:
- Optimum interval: Inter-repetition intervals are computed using the following formula:
I(1)=OF[1,L+1]
I(n)=I(n-1)*OF[n,AF]
where:
- OF - matrix of optimal factors, which is modified in the course of repetitions
- OF[1,L+1] - value of the OF matrix entry taken from the first row and the L+1 column
- OF[n,AF] - value of the OF matrix entry that corresponds with the n-th repetition, and with item difficulty AF
- L - number of times a given item has been forgotten (from "memory Lapses")
- AF - number that reflects absolute difficulty of a given item (from "Absolute difficulty Factor")
- I(n) - n-th inter-repetition interval for a given item
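A minimal sketch of this formula, assuming a toy OF function in place of the real, continually updated OF matrix (the entries below are invented):

def of_entry(rep_no, af, lapses=0):
    # Invented values for illustration only.
    if rep_no == 1:
        # first row: first interval in days, indexed by the number of lapses
        return max(1.0, 4.0 * 0.7 ** lapses)
    # later rows: interval increase factor, declining with the repetition number
    return max(1.1, af * (rep_no - 1) ** -0.2)

def interval(n, af, lapses=0):
    """I(1)=OF[1,L+1]; I(n)=I(n-1)*OF[n,AF]"""
    if n == 1:
        return of_entry(1, af, lapses)
    return interval(n - 1, af, lapses) * of_entry(n, af)

for n in range(1, 6):
    print(n, round(interval(n, af=1.5), 1))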
- Advanced repetitions: Because of possible advancement in executing repetitions (e.g. forced review before an exam), the actual optimum factor (OF) used to compute the optimum interval is decremented by dOF using formulas that account for the spacing effect in learning:
dOF=dOFmax*a/(thalf+a)
dOFmax=(OF-1)*(OI+thalf-1)/(OI-1)
where:
- dOF - decrement to OF resulting from the spacing effect
- a - advancement of the repetition in days as compared with the optimum schedule (note that there is no change to OF if a=0, i.e. when the repetition takes place at the optimum time)
- dOFmax - asymptotic limit on dOF for infinite a (note that for a=OI-1 the decrement will be OF-1 which corresponds to no increase in inter-repetition interval)
- thalf - advancement at which there is half the expected increase to synaptic stability as a result of a repetition (presently this value corresponds roughly to 60% of the length of the optimum interval for well-structured material)
- OF - optimum factor (i.e. OF[n,AF] for the n-th interval and a given value of AF)
- OI - optimum interval (as derived from the OF matrix)
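The two formulas above transcribe directly into code; the example numbers below are arbitrary:

def spacing_effect_decrement(of, oi, advance_days, t_half):
    """Decrement dOF applied to the optimum factor for an advanced repetition.

    of           - optimum factor OF[n,AF]
    oi           - optimum interval in days (derived from the OF matrix)
    advance_days - how many days earlier than optimum the repetition is made
    t_half       - advancement at which half the stability increase is expected
    """
    d_of_max = (of - 1) * (oi + t_half - 1) / (oi - 1)
    return d_of_max * advance_days / (t_half + advance_days)

# Example: OF=1.8, OI=30 days, repetition executed 10 days early,
# thalf taken as roughly 60% of the optimum interval, as stated above.
print(round(spacing_effect_decrement(1.8, 30, 10, 0.6 * 30), 3))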
- Delayed repetitions: Because of possible delays in executing repetitions, the OF matrix is not actually indexed with repetitions but with repetition categories. For example, if the 5-th repetition is delayed, the OF matrix is used to compute the repetition category, i.e. the theoretical value of the repetition number that corresponds with the interval used before the repetition. The repetition category may, for example, assume the value 5.3, and we will arrive at I(5)=I(4)*OF[5.3,AF], where OF[5.3,AF] has an intermediate value derived from OF[5,AF] and OF[6,AF]
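The description above only says the value is "intermediate" between the two neighbouring rows; a plausible reading is linear interpolation, as sketched below with invented entries:

def of_fractional(of_lower, of_upper, category):
    """OF value for a fractional repetition category, e.g. 5.3 between rows 5 and 6.

    Linear interpolation is an assumption here, not a documented SuperMemo formula."""
    frac = category - int(category)
    return of_lower + frac * (of_upper - of_lower)

# Invented entries: OF[5,AF]=1.35, OF[6,AF]=1.30, repetition category 5.3
print(round(of_fractional(1.35, 1.30, 5.3), 3))   # 1.335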
- Matrix of optimum intervals: SuperMemo does not store the matrix of optimum intervals as some earlier versions did. Instead, it keeps a matrix of optimal factors that can be converted to the matrix of optimum intervals (as in the formula from Point 1). The matrix of optimal factors OF used in Point 1 has been derived from the mathematical model of forgetting and from similar matrices built on data collected over years of repetitions in collections created by a number of users. Its initial setting corresponds with values found for a less-than-average student. During repetitions, upon collecting more and more data about the student's memory, the matrix is gradually modified to make it closely approach the student's actual memory properties. After years of repetitions, new data can be fed back to generate a more accurate initial OF matrix. In SuperMemo 17, this matrix can be viewed in 3D with Tools : Statistics : Analysis : 3-D Graphs : O-Factor Matrix
- Item difficulty: The absolute item difficulty factor (A-Factor), denoted AF in Point 1, expresses the difficulty of an item (the higher it is, the easier the item). It is worth noting that AF=OF[2,AF]. In other words, AF denotes the optimum interval increase factor after the second repetition. This is also equivalent to the highest interval increase factor for a given item. Unlike E-Factors in Algorithm SM-6 employed in SuperMemo 6 and SuperMemo 7, A-Factors express absolute item difficulty and do not depend on the difficulty of other items in the same collection of study material
- Deriving OF matrix from RF matrix: Optimum values of the entries of the OF matrix are derived through a sequence of approximation procedures from the RF matrix, which is defined in the same way as the OF matrix (see Point 1), with the exception that its values are taken from the real learning process of the student for whom the optimization is run. Initially, the matrices OF and RF are identical; however, entries of the RF matrix are modified with each repetition, and a new value of the OF matrix is computed from the RF matrix by using approximation procedures. This effectively produces the OF matrix as a smoothed form of the RF matrix. In simple terms, the RF matrix at any given moment corresponds to its best-fit value derived from the learning process; however, each entry is considered a best-fit entry on its own, i.e. in abstraction from the values of other RF entries. At the same time, the OF matrix is considered a best fit as a whole. In other words, the RF matrix is computed entry by entry during repetitions, while the OF matrix is a smoothed copy of the RF matrix
- Forgetting curves: Individual entries of the RF matrix are computed from forgetting curves approximated for each entry individually. Each forgetting curve corresponds with a different value of the repetition number and a different value of A-Factor (or memory lapses in the case of the first repetition). The value of the RF matrix entry corresponds to the moment in time where the forgetting curve passes the knowledge retention point derived from the requested forgetting index. For example, for the first repetition of a new item, if the forgetting index equals 10%, and after four days the knowledge retention indicated by the forgetting curve drops below the 90% value, the value of RF[1,1] is taken as four. This means that all items entering the learning process will be repeated after four days (assuming that the matrices OF and RF do not differ at the first row of the first column). This satisfies the main premise of SuperMemo: that the repetition should take place at the moment when the probability of forgetting reaches the requested forgetting index, i.e. when knowledge retention drops to 100% minus the forgetting index stated as a percentage. In SuperMemo 17, forgetting curves can be viewed with Tools : Statistics : Analysis : Forgetting Curves (or in 3-D with Tools : Statistics : Analysis : 3-D Curves):
Figure: Tools : Statistics : Analysis : Forgetting Curves for 20 repetition number categories multiplied by 20 A-Factor categories. In the picture, blue circles represent data collected during repetitions. The larger the circle, the greater the number of repetitions recorded. The red curve corresponds with the best-fit forgetting curve obtained by exponential regression. For ill-structured material the forgetting curve is crooked, i.e. not exactly exponential. The horizontal aqua line corresponds with the requested forgetting index, while the vertical green line shows the moment in time in which the approximated forgetting curve intersects with the requested forgetting index line. This moment in time determines the value of the relevant R-Factor, and indirectly, the value of the optimum interval. For the first repetition, R-Factor corresponds with the first optimum interval. The values of O-Factor and R-Factor are displayed at the top of the graph. They are followed by the number of repetition cases used to plot the graph (i.e. 21,303). At the beginning of the learning process, there is no repetition history and no repetition data to compute R-Factors. It will take some time before your first forgetting curves are plotted. For that reason, the initial value of the RF matrix is taken from the model of a less-than-average student. The model of average student is not used because the convergence from poorer student parameters upwards is faster than the convergence in the opposite direction. The Deviation parameter displayed at the top tells you how well the negatively exponential curve fits the data. The lesser the deviation, the better the fit. The deviation is computed as a square root of the average of squared differences (as used in the method of least squares).
Figure: 3D representation of the family of forgetting curves for a single item difficulty and varying memory stability levels (normalized for U-Factor).
Figure: Cumulative forgetting curve for learning material of mixed complexity and mixed stability. The graph is obtained by superposition of 400 forgetting curves normalized for the decay constant of 0.003567, which corresponds with recall of 70% at 100% of the presented time span (i.e. R=70% on the right edge of the graph). 401,828 repetition cases have been included in the graph. Individual curves are represented by yellow data points. The cumulative curve is represented by blue data points that show the average recall for all 400 curves. The size of circles corresponds with the size of data samples.
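A rough sketch of how one RF entry is obtained from a forgetting curve, assuming a simple negatively exponential model fitted to invented recall data (the real procedure uses the regression shown in the figures above and the student's actual repetition records):

import numpy as np

# Invented recall measurements: days since the last repetition vs fraction recalled.
days   = np.array([1, 2, 3, 4, 5, 7, 10], dtype=float)
recall = np.array([0.97, 0.95, 0.92, 0.90, 0.87, 0.82, 0.74])

# Exponential regression: fit a straight line to (t, ln R);
# minus the slope serves as the decay constant k of R = exp(-k*t).
k = -np.polyfit(days, np.log(recall), 1)[0]

requested_fi = 0.10                        # requested forgetting index: 10%
rf_entry = np.log(1 - requested_fi) / -k   # time at which retention falls to 90%
print(round(k, 4), round(rf_entry, 1))     # about 3.5 days for this toy data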
- Deriving OF matrix from the forgetting curves: The OF matrix is derived from the RF matrix by:
- fixed-point power approximation of the R-Factor decline along the RF matrix columns (the fixed point corresponds to the second repetition, at which the approximation curve passes through the A-Factor value),
- for all columns, computing D-Factor which expresses the decay constant of the power approximation,
- linear regression of D-Factor change across the RF matrix columns, and
- deriving the entire OF matrix from the slope and intercept of the straight line that makes up the best fit in the D-Factor graph. The exact formulas used in this final step go beyond the scope of this illustration.
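A schematic rendering of these four steps with invented RF data; since the exact formulas of the final step are not published here, the last function below is only an indicative reconstruction:

import numpy as np

def fit_decay(rf_column, a_factor):
    """Fixed-point power fit RF[n] ~ AF * (n-1)^(-D) for n = 2, 3, ...

    The curve is forced through the A-Factor at the second repetition
    (the fixed point), so only the decay constant D is estimated."""
    n = np.arange(2, 2 + len(rf_column))
    x = np.log(n - 1)
    y = np.log(rf_column / a_factor)
    mask = x > 0                        # the point for n=2 carries no information
    return -np.sum(x[mask] * y[mask]) / np.sum(x[mask] ** 2)

# Invented RF columns (rows 2..6) for three difficulty columns (A-Factors).
a_factors = np.array([1.3, 1.6, 2.0])
rf_cols = [np.array([1.30, 1.22, 1.18, 1.15, 1.13]),
           np.array([1.60, 1.45, 1.37, 1.32, 1.28]),
           np.array([2.00, 1.75, 1.62, 1.53, 1.47])]

# Step 2: one D-Factor per column; step 3: linear regression across columns.
d_factors = np.array([fit_decay(col, af) for col, af in zip(rf_cols, a_factors)])
slope, intercept = np.polyfit(a_factors, d_factors, 1)

def of_entry(rep_no, af):
    """Step 4 (indicative only): rebuild an OF entry from the regression line."""
    d = slope * af + intercept
    return af * (rep_no - 1) ** -d

print(np.round(d_factors, 3), round(of_entry(5, 1.6), 3))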
- Item difficulty: The initial value of A-Factor is derived from the first grade obtained by the item and the correlation graph of the first grade and A-Factor (G-AF graph). This graph is updated after each repetition in which a new A-Factor value is estimated and correlated with the item's first grade. Subsequent approximations of the real A-Factor value are done after each repetition by using grades, the OF matrix, and a correlation graph that shows the correspondence of the grade with the expected forgetting index (FI-G graph). The grade used to compute the initial A-Factor is normalized, i.e. adjusted for the difference between the actually used interval and the optimum interval for the forgetting index equal to 10%
- Grades vs. expected forgetting index correlation: The FI-G graph is updated after each repetition by using the expected forgetting index and actual grade scores. The expected forgetting index can easily be derived from the interval used between repetitions and the optimum interval computed from the OF matrix. The higher the value of the expected forgetting index, the lower the grade. From the grade and the FI-G graph (see: FI-G graph in Tools : Statistics : Analysis : Graphs), we can compute the estimated forgetting index which corresponds to the post-repetition estimation of the forgetting probability of the just-repeated item at the hypothetical pre-repetition stage. Because of the stochastic nature of forgetting and recall, the same item might or might not be recalled depending on the current overall cognitive status of the brain; even if the strength and retrievability of memories of all contributing synapses is/was identical! This way we can speak about the pre-repetition recall probability of an item that has just been recalled (or not). This probability is expressed by the estimated forgetting index
- Computing A-Factors: From (1) the estimated forgetting index, (2) length of the interval and (3) the OF matrix, we can easily compute the most accurate value of A-Factor. Note that A-Factor serves as an index to the OF matrix, while the estimated forgetting index allows one to find the column of the OF matrix for which the optimum interval corresponds with the actually used interval corrected for the deviation of the estimated forgetting index from the requested forgetting index. At each repetition, a weighted average is taken of the old A-Factor and the new estimated value of the A-Factor. The newly obtained A-Factor is used in indexing the OF matrix when computing the new optimum inter-repetition interval
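The final blending step can be sketched as below; the weight is an assumption made for illustration, as the text above does not give SuperMemo's actual weighting:

def update_a_factor(old_af, estimated_af, weight_new=0.25):
    """Weighted average of the previous A-Factor and the A-Factor estimated
    from the last repetition; weight_new=0.25 is a hypothetical weight."""
    return (1 - weight_new) * old_af + weight_new * estimated_af

# Example: the repetition suggests the item is easier than previously thought.
print(round(update_a_factor(1.5, 1.8), 3))   # -> 1.575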