Employing forgetting curves in spaced repetition (1991)
This text is part of: "History of spaced repetition" by Piotr Wozniak (June 2018)
Painful birth of SuperMemo World (1991)
1991 was the most important year since the birth of SuperMemo. It was a year of big decisions, stress, drama, discovery, and hard work. At the start of the year, there were three greatest believers in SuperMemo: Biedalak, Murakowski and myself. We were all in the same spot in our lives: transitioning from the years of unconcern at the university to the uncertainty of independent adulthood. By default, we all dreamt of big science in the US: Biedalak dreamt of artificial intelligence, Murakowski of quantum physics, and I wanted to crack the secrets of molecular memory. In retrospect, graduate students from Eastern Bloc with good transcripts, good standardized test results, and rock-solid recommendations are pretty welcome in the US. Things get more complex if they demand full financial support. I did not have a penny. Moreover, eager Easterners were often treated as dutiful labor. Zeal for their own projects, and great ideas might have been less welcome. I will never know. The three believers had all different visions for SuperMemo.
On Jan 3, 1991, I started the implementation of the new spaced repetition algorithm for SuperMemo 6. On the same day, Murakowski left for London where he would pursue his educational dreams while trying to sell SuperMemo 2. He would not sell via a distribution channel or in a shop. He would just go from person to person, explain the merits of the program, and hopefully collect a few bucks to keep the hope going.
In the meantime, Biedalak and I met regularly for a 10 km jogging that would be combined with winter swimming and a brainstorming session on the way back home (we called it walktalking). We mostly spoke of studying in the US and selling SuperMemo. More and more frequently, the idea of our own company started coming up in discussions.
I started my work over the new spaced repetition algorithm with some ideas that would change SuperMemo for ever. Algorithm SM-6 used in SuperMemo 6 was a breakthrough that would power further development over the next 25 years. It would re-employ the simple experimental procedure that led to spaced repetition in 1985 but would do it in an automated manner. It would collect performance data, and choose the best time for review: it would plot the user's forgetting curve. This would also mean that the user would be able to decide the acceptable probability of forgetting for every single item (i.e. the optimum level of retention-workload tradeoff).
At that time, I was still bound to 360 kB floppy diskettes (5.25 in). For that reason, SuperMemo still could not keep all repetition histories that would fully replicate the 1985 approach on a massive scale. However, on Jan 6, 1990, I had a simple idea: I could just collect data about the forgetting curves for classes of items of different difficulty and stability. Instead of the full record, I would only update the approximation of how many items in a given class are retained in memory at a given time (i.e. at a given level of retrievability). That idea survives at the core of SuperMemo to this day. Even with the full record of repetition histories today, SuperMemo still instantly knows the expected retrievability of items in a given class.
At the crossroads in life, I was finally free from school. There is a powerful emotion that millions of teens and young adults face in their lives: a traumatic move from a slave called "pupil" or "student", to the freedom of becoming an "unemployed adult". The psychological shakeup can be even more dramatic if one turns from a "good student" to an "unemployed 28-year-old living with his mom". Like a lightswitch: the whole world seems to change from cheerful support to a gloomy-faced condemnation mixed up with pity.
I kept learning, and worked on new ideas for SuperMemo in the atmosphere of freedom mixed up with uncertainty. For me, uncertainty is an energizer. However, on Feb 12, 1991, I learned that my mom was diagnosed with terminal cancer. To the mix of freedom and uncertainty, it added the sense of gloom. For me again, gloom can also be an energizer. I tripled my learning with focus on cancer in hope of finding out some magic therapy on my own. This shows the positive impact of unreasonable optimism on productivity. It also shows that crazy optimism can help survive difficult times. By working harder, I could dispel the gloom. High productivity is a sure anti-depressant. My hard work left no room for dark thoughts. I was confident, I would cure my mom!
Incidentally, at the moment of my mom's diagnosis, I was also writing a program to simulate the optimum behavior of memory in response to the environment. This was to prove mathematically that the two component model of memory was optimum for survival. When I learned about my mom's diagnosis, I threw that effort out of my schedule to learn about cancer. I never completed that program, and that idea still lives in limbo pushed away by other projects.
On Mar 6, 1991, during one of our jogging-cum-brainstorming days with Biedalak, someone tossed up the name SuperMemo World. Little did we know that four months later that would be the name of our company that has survived 27 years today.
On Mar 12, 1991, I made my first repetitions with the new algorithm in SuperMemo 6 while my mom rested on her deathbed. A week later, she died peacefully in sleep at the young age of 70. In similar circumstances, the usual picture involves family meetings, mourning, funeral, and a whole host of traditions with roots in religion that I could never accept as rational. Instead, 9 hours after my mom's death, I worked on a better method for a fast approximation of the OF matrix. In that work, I capitalized on the job I once did for ZX Spectrum. I would employ linear regression along the difficulty columns of the matrix, and negative exponential regression along the repetition rows. Years later, I found that power regression is more appropriate for the latter. Only Algorithm SM-8 developed four years later would make a full use of those ideas. However, I mention it mostly to illustrate how hard work, and productivity can work great as a remedy against gloom and possible depression. At that time, I discovered that the impact of an emotional trauma follows a circadian curve. I would work hard in the morning, but the gloom would keep creeping back to my mind in the evening. Sleep would be the liberation, and the best anti-depressant. From those early days, I am a firm believer in the idea that sleep and learning carry a solution to the problem of depression, however, I never truly had a chance to work on it. It would help if I suffered a bit myself, but either I have some good resilience endowment, or, more likely, I instinctively employ the tools of good sleep and high productivity at hard times. Ever since Sapolsky called depression the "worst disease in the world", I wanted to find a prescription for preventing depression. I sense there is a simple formula. Perhaps that naive childlike optimism itself is part of the solution?
On Apr 13, 1991, we decided that SuperMemo 2 should be released as freeware. We hoped it might educate potential users abroad about the power of spaced repetition. However, initially, we had to send out diskettes with free SuperMemo at our own cost. Only in 1993, we uploaded SuperMemo to a local BBS called "Onkonet". It would take some more years before we could upload future versions to Simtel and freeware sites. That freeware idea had an interesting side effect: by the end of the year, it was clear: people would start using the program and then give up. This was a hint of an inherent problem with spaced repetition: poor motivation resulting from poor skills would produce a high drop out rate. We also heard that others would try to sell SuperMemo 2 as if this was a commercial product.
On May 2, 1991, I implemented the option for setting the requested forgetting index in SuperMemo 6. On July 5, 1991, SuperMemo World was born. One of the first investments was a PC with a hard disk that would finally help me move away from the slow era of floppy disks.
On Nov 23, 1991: SuperMemo was announced as the finalist of Software for Europe competition. This saved SuperMemo World. This also gave spaced repetition a good start.
Slow start of commercial SuperMemo
When we set up SuperMemo World with Krzysztof Biedalak on Jul 5, 1991, the future looked so bright we needed to buy shades. The earth is populated with highly intelligent people who all need to learn things. This whole human population was our market. The only problem was how to convince all those smart people that two poor students educated behind the Iron Curtain got anything of value to offer. We could not have used the web for that job. SuperMemo is older than the web itself. We could not afford advertising for lack of capital. There was no venture capital culture in Poland in 1991. All we could do is put the first few copies of SuperMemo in file folders and place them on shelves of nearby computer shops. As we aimed at global domination, we did not even have a manual in Polish. Instead of first sales, we had a long summer of silence and creeping doubts.
Figure: In 1991, we delivered the first copies of SuperMemo 5 for DOS to shops in Poznan (Poland) in pink folders with a sticker. The manual did not include a translation to Polish. Amazingly, we found a few buyers. The first sale took place some time between September 9 and 11, 1991 in the computer shop Axe Prim, which no longer exists (the picture is a reconstruction on the basis of original folders and stickers)
Why was it hard to sell the first copy? I can reconstruct the scenario from the words of one of our first customers who actually visited a shop and had a look at the first SuperMemo displayed in public. On a shelf with computer programs, along with shiny boxes from Microsoft, he noticed a shabby folder with enticing words: "Your breakthrough speed-learning software". He picked up the folder and opened a manual, which was a stack of poorly xeroxed pages in English. With lofty words, he read a story that defied belief. It was all too good to be true. Faster learning, great retention, new scientific method, a little cost in time, etc. He did not contemplate an investment, the package was pretty costly (around $100, which was a lot in Poland 1991), however, he approached the salesperson to find out who the people behind SuperMemo were. The owner of the shop knew SuperMemo pretty well and explained. The story started looking credible. The customer never forgot the episode. A few months later, he heard of SuperMemo from some local journal and became one of the first paying customers. His registration coupon arrived in January 1992, and the history of his upgrades says he stayed with SuperMemo for decades and now his son is one of the regular customers.
However, back in summer 1991, we had no sales and by fall, everyone except for myself started having serious doubts. Not about SuperMemo, but about the viability of the business.
It should help to know how we have met. With Biedalak, we were friends since forever. I attended a school with his brother, we lived 200 meters apart, and passed exams for the same year of computer science at the University of Technology in Poznan. I cannot say how I convinced Biedalak that SuperMemo is great. We have just been too close, and he has always been in the circle. This part was easy. Tomek Kuehn was one of the first great believers in SuperMemo. He was also a great programmer, and a great inspiration. He grasped the idea instantly. He wrote two versions of SuperMemo himself: for Atari 800 in 1988, and for Atari ST in 1989. In January 1989, he even sold 10 copies of SuperMemo 2 using an advert in one of the computer journals: Komputer. I presume, he did not recover the invested money or he would sure try the trick again. Upon graduation, Kuehn already had his own business: a computer shop. This shop was also one of the first to present SuperMemo to its customers. His partner and a friend was Marczello Georgiew who did not need much convincing either. Georgiew joined the team. Last but not least, I met Janusz Murakowski during GRE exams in Budapest in 1990. A great mathematical mind, he might be the fastest convert to SuperMemo ever. During our train trip back to Poland, I mentioned SuperMemo. He was instantly captivated. A few days later, he was already an enthusiastic user of SuperMemo 2 (as of Jun 13, 1990). In our company rap anthem, we sang "we are the guys who sell SuperMemo". It was very hard to convince people that SuperMemo works, but the guys on the team have always been enthusiastic.
By November 1991, the enthusiasm was thawing. If we continued without success, we would have gradually lost the team in proportion to their involvement and passion. With a few more months, the company might have died. SuperMemo would not have died. I would certainly look for a buyer, or continue one way or another. I was too tied to the product. I used it myself, and all my knowledge was invested in my databases. I might have thought of returning to the idea of a PhD in the US. In the same way as I was able to combine work at the university in the Netherlands in 1989 with programming "after hours", I would probably continue until some breakthrough, e.g. on the web. Perhaps it would be an open source product? Luckily, Dr Wojciech Makalowski of the Department of Biopolymer Biochemistry suggested we submit SuperMemo for Software for Europe competition. By some miraculous stroke of good luck, we qualified for the final, and this was instantly noticed by the Polish media, esp. computer journals. As of that point, SuperMemo had an easy ride with the Polish press that became more and more intrigued. Andrzej Horodenski was the first journalist to write about SuperMemo (Computer World 1992). Pawel Wimmer was the second. Wimmer remained faithful to this very day, and he actually used SuperMemo 2, which he probably received from Tomasz Kuehn at the time of his KOMPUTER journal advert in 1989.
1.5 years after being set up, SuperMemo World had finally become profitable. Not bad.
SuperMemo World was a fantastic set up from the getgo. We had no injection of venture capital in Poland in 1991, so we had to pull ourselves by our own bootstraps by selling, what others considered to be "snake oil". We might have easily failed, but we survived by the sheer power of passion, belief, and a big stroke of good luck.
Origins of Algorithm SM-6
Algorithm SM-6 was first used in SuperMemo 6 (1991), however, it kept evolving in SuperMemo 7 (1992). There has never been the SM-7 version despite multiple changes introduced in the Windows version. Most notably, as of 1994, the exponential function was used to approximate forgetting curves in SuperMemo 7 for Windows. OF matrix approximations have also been improved over time.
Figure: SuperMemo 7 for Windows was written in 1992. As of Sep 03, 1992, it was able to display user's forgetting curve graph. The horizontal axis labeled U-Factor corresponded with days in this particular graph. The kinks between days 14 and 20 were one of the reasons it was difficult to determine the nature of forgetting. Old erroneous hypotheses were hard to falsify. Until the day 13, forgetting seemed nearly linear and might also provide a good exponential fit. It took two more years of data collecting to find answers (source: SuperMemo 7: User's Guide)
Figure: SuperMemo 7 for Windows (1994) displaying a forgetting curve approximated with an exponential function. Vertical axis represents recall in percent. Horizontal axis corresponds with time represented by U-Factor. With 21,000 repetition cases, the curve finally looks regular enough to hypothesize about the underlying cause of forgetting
The most important new job of Algorithm SM-6 was to collect data on the rate of forgetting. Forgetting curves make it easy to accurately determine optimum intervals. This eliminated the need for a slow and inaccurate bang-bang approach of Algorithm SM-5:
In Algorithm SM-5, the process of determining the value of a single entry of the matrix of optimal factors looked as follows (see before):
- Set the initial value to an average optimal factor value (OF) obtained in previous experiments
- If the grade produced by the entry in question was (1) greater than the desired value then increase the value of OF, (2) less than the desired value then decrease OF, or (3) equal the desired value then do not change OF
The above approach shows that the optimum value of OF could be reached only after a great number of repetitions, and what is worst, the greater the ordinal number of a repetition, the longer it would take to execute the modification-verification cycle (i.e. the cycle in which an OF entry is changed, and verified upon scheduling another repetition with a correspondingly long interval).
Introducing the concept of the forgetting index
The novelty of Algorithm SM-6 is to approximate the slope of the forgetting curve corresponding to a given entry of the matrix of optimal factors, and compute the new value of the relevant optimal factor directly from the approximated curve. In other words, no modification-verification cycle is necessary in Algorithm SM-6 because of establishing the deterministic relationship between the forgetting curve and the optimum inter-repetition interval. The modification of the optimal factor occurs immediately after a repetition upon approximating the new forgetting curve derived from data that include the grade provided in the recent response. This modification not only made it possible to greatly accelerate the process of determining the optimum values of the matrix of optimal factors, but also provided a means for establishing the desired level of knowledge retention that will be reached in the course of the learning process (see an exemplary forgetting curve).
The desired level of knowledge retention is determined by the proportion of items that are not remembered at repetitions. This proportion is called the forgetting index (items are classified as remembered or forgotten on the basis of grades provided by the student in self-assessment of his or her progress).
Figure: An exemplary forgetting curve plotted by SuperMemo 8 in the course of repetitions (over 40,000 repetition cases recorded). R-Factor indicates that exponential regression points to the optimum interval of 7.146 days. This is then smoothed up via the OF matrix to 6.176 days. SuperMemo would use the first interval of 6 days in this case (if to ignore random dispersal)
In the figure presented above, the lapse of time is represented by the interval in days. The vertical axis represents knowledge retention stated as percentage. The horizontal line located at the retention level of 90% determines the requested forgetting index, i.e. the desired proportion of items that should be forgotten at the moment of repetition. The optimum interval will then naturally come at the cross-section of the requested forgetting index line with the forgetting curve. In the example above, the optimum interval equals seven days. The presented forgetting curve has been plotted on the basis of 40489 recorded repetition cases. See later in the text for explanation of the values R-Factor (RF), O-Factor (OF), etc.
Because of the highly irregular nature of the matrix of optimal factors computed directly from the forgetting curves, in Algorithm SM-6, the matrix used in spacing repetitions represents a smoothed version of the so-called matrix of retention factors (matrix RF), which is derived directly from forgetting curves corresponding to particular entries of the matrix OF. In other words, forgetting curves determine the value of entries of the matrix RF, and only the smoothed equivalent of the latter, the matrix OF is used in computing optimum intervalsAlgorithm SM-6
The description of the algorithm below is taken with some clarifications from my PhD Thesis, and refers to the status quo for 1994:
- The learned knowledge is split into smallest possible pieces called items
- Items are formulated in the question-answer form
- Items are memorized by means of a self-paced drop-out technique, i.e., by responding to the asked questions as long as it takes to provide all correct answers
- After memorizing an item, the first repetition is scheduled after an interval that is the same for all of the items. Its value is determined by the desired level of knowledge retention, which in turn can be converted into an interval by using an average forgetting curve taken from an average database of an average student (Wozniak 1994a). The desired retention is specified by means of the so-called forgetting index, which corresponds to the proportion of items forgotten at repetitions (to learn how to compute retention from the forgetting index, and vice versa). Note, that the first interval may be randomly shortened or lengthened for the sake of speeding up the optimization process (varying intervals increase the accuracy of approximating the forgetting curve).
- The first interval is computed as for an average student and an average database. However, as soon as the recorded value of the forgetting index deviates from the requested level, the length of the first interval is modified accordingly. The new value of the interval is derived from the approximation of the negatively exponential forgetting curve plotted in the course of repetition. With each repetition score recorded, the plot becomes more and more accurate and the used value of the optimum inter-repetition interval settles at the point that ensures the selected level of knowledge retention. After each repetition, the student produces a grade, which determines the accuracy and easiness of reproducing the correct answer.
- On the basis of the grades, items are classified into difficulty categories. Their difficulty is reestimated in each successive repetition. The difficulty of each item is characterized by the earlier mentioned E-factors (E stands for "easiness"). E-factors are equal to 2.5 for all items on the entry to the learning process, and modified after subsequent repetitions. For example, grades above four result in slightly increasing the E-factor (good grades indicate easy items), while grades below four reduce the E-factor. Historically, E-factors were used to determine how many times intervals should increase in successive repetitions of items of a given difficulty. At present, E-factors are only used to index the matrices of optimal factors and retention factors, and may bear little relevance to the actual interval increase.
- Different optimal intervals are applied to items of different difficulty.
- Different intervals are applied to items that have been repeated a different number of times.
- The function of optimal intervals is constantly modified in order to produce the desired knowledge retention determined by the forgetting index. In other words, the algorithm will detect how well the student copes with repetitions and adjust the length of inter-repetition intervals accordingly.
- The function of optimal intervals is represented as the matrix of optimal factors, OF-matrix in short, defined as follows:
- for n=1: I(n,EF)=OF(n,EF)
- for n>1: I(n,EF)=I(n-1,EF)*OF(n,EF)
- where:
- I(n,EF) - n-th interval for difficulty EF
- OF(n,EF) - optimal factor for the n-th repetition and the difficulty EF
- The entries of the matrix of optimal factors are modified in the course of repetitions to ensure the desired level of knowledge retention
- Matrix of optimal factors is produced by smoothing the so-called matrix of retention factors, RF matrix in short. Matrix of retention factors is defined in the same way as the matrix of optimal factors.
- Entries of the matrix of retention factors are intended to estimate the values of the entries of the matrix of optimal factors. Each optimal factor corresponds to an optimal interval that produces the desired retention at repetition (determined by the requested forgetting index). Each entry of the matrix of retention factors corresponds to a different value of E-factor and repetition number
- Entries of the matrix of retention factors, called R-factors, are computed from forgetting curves whose shape is sketched on the basis of the history of repetitions
- The lapse of time on the forgetting curve graph is measured by the so-called U-factor, which is the ratio of the current and the previous interval, except for the first repetition where U-factor equals the interval in days (as in Figure). The record of repetitions makes it possible to compute retention for different values of U-factor. The graph of the retention plotted versus the lapse of time (U-factor) represents a forgetting curve. The cross-section of the forgetting curve with the desired retention level determines the optimum R-factor, which, upon smoothing the matrix of retention factors, yields the optimum O-factor
- Each difficulty category and repetition number has its own record of repetitions used to sketch a separate forgetting curve. In other words, different intervals will be used for items of different difficulty, and for items repeated a different number of times.
- Intervals used in learning, including the first interval, are slightly dispersed round the optimal value in order to increase the accuracy of forgetting curve sketching, and consequently, to increase the convergence rate of the optimization procedure. By slightly dispersing intervals, the approximation of the forgetting curve will use a more scattered set of points on the graph