Exponential nature of forgetting


This text is part of: "History of spaced repetition" by Piotr Wozniak (June 2018)

Forgetting curve: power or exponential

The shape of the forgetting curve is vital for understanding memory. The math behind the curve may even weigh in on the understanding of the role of sleep (see later). When Ebbinghaus first determined the rate of forgetting, he got a pretty nice set of data with a good fit to the power function. However, today we know forgetting is exponential. The discrepancy is explained here.

Forgetting curve adapted from Hermann Ebbinghaus (1885). The curve has been rendered from original tabular data published by Ebbinghaus (Piotr Wozniak, 2017)

Wrong thinking helped spaced repetition

For many years, the actual shape of the curve did not play much of a role in spaced repetition. My early intuitions about the nature of forgetting were all over the place depending on the context. Back in 1982, I thought that evolution had designed forgetting to make sure the brain does not run out of memory space. The optimum time for forgetting would be determined by the statistical properties of the environment. Decay would be programmed to maximize survival. If a review did not take place in time, the memory would be deleted to provide space for new learning.

I was wrong in thinking that there might be an optimum time for forgetting, and this error was actually helpful in inventing spaced repetition. That "optimum time" intuition helped the first experiment in 1985. An optimum time for forgetting would imply a sigmoidal forgetting curve with a clear inflection point that determines optimality. Before the review, forgetting would be minimal. A delayed review would result in rapid forgetting. This is why finding the optimum interval seemed so critical. When data started pouring in later on, confirmation bias kept me from seeing my error. I wrote in my Master's Thesis about sigmoidal forgetting: "this follows directly from the observation that before the elapse of the optimal interval, the number of memory lapses is negligible". I must have forgotten my own forgetting curve plot produced in late 1984.

Today this sigmoidal proposition may seem preposterous, but even my model of intermittent learning provided some support for the notion. Exponential approximation yielded a particularly high deviation error for the data collected in my work on the model of intermittent learning, and the superposition of sigmoid curves for different E-Factors could easily mimic early linearity. A linear approximation seemed to fit the model of intermittent learning excellently within the recall range of the available data. No wonder: with whole pages of heterogeneous material, the exponential nature of forgetting remained well hidden.

Contradictory models

I did not ponder forgetting curves much. However, my biological model of memory, dating back to 1988, spoke of the exponential decay in retrievability. Apparently, in those days, the forgetting curve and retrievability could exist in my head as independent entities.

In my credit paper for a class in computer simulation (Dr Katulski, Jan 1988), my figures clearly show exponential forgetting curves:

Hypothetical mechanism involved in the process of optimal learning. (A) Molecular phenomena (B) Quantitative changes in the synapse

Figure: In my Master's Thesis titled "Optimization of learning" (1990), I presented some hypothetical concepts that might underlie the process of optimal learning based on spaced repetition. (A) Molecular phenomena (B) Quantitative changes in the synapse. Those ideas are a bit dated today, but the serrated curves representing memory retrievability came to be widely known in popular publications on spaced repetition. They are usually wrongly attributed to Hermann Ebbinghaus

By that time, I might have picked a better idea from the literature. In the years 1986-1987, I spent a lot of time in the university library looking for some good research on spaced repetition. I found none. I might have already been familiar with the forgetting curve determined by Ebbinghaus. It is mentioned in my Master's Thesis.

Collecting data

I collected data for my first forgetting curve plot in late 1984. As all the learning had been done for learning's sake over the course of 11 months, and the cost of producing the graph was minimal, I forgot about the graph, and it lay unused in my archives for 34 years:

The very first forgetting curve for the retention of English vocabulary plotted back in 1984, just a few months before designing SuperMemo on paper

Figure: My very first forgetting curve for the retention of English vocabulary, plotted back in 1984, i.e. a few months before designing SuperMemo on paper. This graph was not part of the experiment. It was simply a cumulative assessment of the results of intermittent learning of English vocabulary. The graph was soon forgotten. It was re-discovered 34 years later. After memorization, 49 pages of ~40 word pairs of English were reviewed at different intervals and the number of recall errors was recorded. After rejecting outliers and averaging, the curve appears to be far less steep than the curve obtained by Ebbinghaus (1885), in which he used nonsense syllables and a different measure of forgetting: saving on re-learning

My 1985 experiment could also be seen as a noisy attempt to collect forgetting curve data. However, the first SuperMemos did not care about the forgetting curve. The optimization was bang-bang in nature, even though today collecting retention data (as in the 1985 experiment) seems such an obvious solution.

Until I started collecting data with SuperMemo software, where each item could be scrutinized independently, I could not fully recover from my early erroneous ideas about forgetting.

SuperMemo 1 for DOS (1987) collected full repetition histories that would make it possible to determine the nature of forgetting. However, within 10 days (on Dec 23, 1987), I had to ditch the full record of repetitions. At that time, my disk space was 360KB. That's correct. I would run SuperMemo from old 5.25in diskettes. Full repetition history returned to SuperMemo only 8 long years later (Feb 15, 1996), after a hectic effort by Dr Janusz Murakowski, who considered every ticking minute a waste of valuable data that could power future algorithms and memory research. Two decades later, we have more data than we can effectively process.

Without repetition history, I could still investigate forgetting with the help of forgetting curve data collected independently. On Jan 6, 1991, I figured out how to record forgetting curves in a small file that would not bloat the size of the database (i.e. without the full record of repetition history).

It was only SuperMemo 6 (1991) that started collecting forgetting curve data to determine optimum intervals. It was doing the same thing as my first experiment, except it did it automatically, on a massive scale, and for memories separated into individual questions (this solved the heterogeneity problem). SuperMemo 6 initially used a binary chop to find the interval corresponding with the requested forgetting index. A good-fit approximation was still 3 years into the future.
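
The binary chop can be illustrated with a short sketch. This is a hypothetical reconstruction rather than the actual SuperMemo 6 code: it assumes the recorded forgetting curve is available as a recall(t) function and bisects for the interval at which recall falls to the requested level (e.g. 90% for a forgetting index of 10%).

  # Hypothetical sketch: bisection over a recorded forgetting curve
  # (not actual SuperMemo code). recall(t) is assumed to be a
  # monotonically decreasing estimate of the probability of recall
  # after an interval of t days.
  import math

  def optimum_interval(recall, target=0.9, lo=1.0, hi=365.0, tol=0.01):
      """Find the interval at which recall drops to `target`."""
      while hi - lo > tol:
          mid = (lo + hi) / 2
          if recall(mid) > target:
              lo = mid  # recall still above target: try longer intervals
          else:
              hi = mid  # recall below target: try shorter intervals
      return (lo + hi) / 2

  # Example with an exponential curve of stability S=38 days, k=ln(10/9):
  print(optimum_interval(lambda t: math.exp(-math.log(10/9) * t / 38)))
  # prints ~38.0, since R(S)=0.9 by construction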

First forgetting curve data

By May 1991, I had the first data to peek at, and it was a major disappointment. I had predicted I would need a year to see any regularity, yet every couple of months I kept noting down my disappointment with the minimal progress. The progress in collecting data was agonizingly slow and the wait was excruciating. A year later, I was no closer to the goal. If Ebbinghaus was able to plot a good curve with nonsense syllables, the pain of learning incoherent material must have been worth it. With meaningful material, the truth was very slow to emerge, even with the convenience of having it all done by a computer while having fun with learning.

On Sep 3, 1992, SuperMemo 7 for Windows made it possible to have a first nice peek at a real forgetting curve. The view was mesmerizing:

First peek at a pretty regular forgetting curve in SuperMemo 7 (1992)

Figure: SuperMemo 7 for Windows was written in 1992. As of Sep 03, 1992, it was able to display the user's forgetting curve graph. The horizontal axis, labeled U-Factor, corresponded with days in this particular graph. The kinks between days 14 and 20 were one of the reasons it was difficult to determine the nature of forgetting. Old erroneous hypotheses were hard to falsify. Until day 13, forgetting seemed nearly linear, though it might also provide a good exponential fit. It took two more years of data collecting to find answers (source: SuperMemo 7: User's Guide)

Forgetting curve approximations

By 1994, I was still not sure about the nature of forgetting. I took the data collected in the previous 3 years (1991-1994) and set out to figure out the curve once and for all. I focused on my own data from over 200,000 repetitions. However, it was not easy. If SuperMemo schedules a repetition at R=0.9, you can draw a straight line from R=1.0 to R=0.9 and still do great with noisy data:

Difficulty approximating forgetting curve

Figure: Difficulty approximating the forgetting curve. Back in 1994, it was difficult to understand the nature of forgetting in SuperMemo because most of the data was collected in the high recall range

My notes from May 6, 1994 illustrate the degree of uncertainty:

Personal anecdote. Why use anecdotes?
May 6, 1994: All day of crazy attempts to better approximate forgetting curves. First I tried R=1-i^n/(H^n+i^n) where i - interval, H - memory half-life, and n - cooperativity factor. Late in the evening, I had it work quite slowly, but ... it appeared that r=exp(-a*i) works not much worse! Even the old linear approximation was not very much worse (sigmoid: D=8.6%, exponential D=8.8%, and linear D=10.8%). Perhaps, forgetting curves are indeed exponential? Going to sleep at 2:50

It was not easy to separate linear, power, exponential, Zipf, Hill, and other functions. Exponential, power, and even linear approximations brought pretty good outcomes depending on circumstances that were hard to disentangle. Only when looking at forgetting curves well sorted for complexity at higher levels of stability, despite those graphs being data-poor, could I see the exponential nature of forgetting more clearly.
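
The model comparison from the note above is easy to replay on synthetic data. In this sketch (my own illustration, not the 1994 code), a noisy exponential curve restricted to the high recall range is fitted with the three candidates from the May 6, 1994 note; the deviations come out close to one another, which is exactly the ambiguity described here.

  # Illustrative comparison of the three models from the May 6, 1994 note:
  # sigmoid (Hill), exponential, and linear, fitted to synthetic noisy
  # exponential data restricted to the high recall range.
  import numpy as np
  from scipy.optimize import curve_fit

  rng = np.random.default_rng(0)
  t = np.linspace(1, 20, 40)
  r = np.exp(-0.005 * t) + rng.normal(0, 0.01, t.size)  # truth: exponential

  def hill(t, H, n): return 1 - t**n / (H**n + t**n)
  def expo(t, a): return np.exp(-a * t)
  def linear(t, a, b): return a - b * t

  for name, f, p0 in [("sigmoid", hill, (100.0, 2.0)),
                      ("exponential", expo, (0.01,)),
                      ("linear", linear, (1.0, 0.005))]:
      p, _ = curve_fit(f, t, r, p0=p0)
      dev = np.sqrt(np.mean((f(t, *p) - r) ** 2))
      print(f"{name}: RMS deviation = {dev:.4f}")
  # In the high recall range all three deviations come out nearly equal,
  # which is why the data alone could not settle the shape of the curve.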

One of the red herrings in 1994 was that, naturally, most of my data was collected for the first review. New items at the entry to the process form a heterogeneous group that obeys a power law of forgetting.

The first review forgetting curve for newly learned knowledge collected with SuperMemo

Later on, when items are sorted by complexity and stability, their forgetting curves become exponential. In Algorithm SM-6, complexity and stability were imperfectly expressed by E-Factors and repetition numbers respectively, which made for imperfect sorting. In addition, SuperMemo stays within the area of high retention, where forgetting is nearly linear.

By May 1994, the main first-review curve in my Advanced English database had collected 18,000 data points and seemed like the best analytical material. However, that curve encompasses all the learning material that enters the process, independent of its difficulty. Little did I know that this curve is governed by a power law. My best deviation was 2.0.

For a similar curve from 2018 see:

Forgetting curve obtained in 2018 with SuperMemo 17 for average difficulty (A-Factor=3.9)

Figure: Forgetting curve obtained in 2018 with SuperMemo 17 for average difficulty (A-Factor=3.9). At 19,315 repetitions and least squares deviation of 2.319, it is pretty similar to the curve from 1994, except it is best approximated with an exponential function (for the power function example see: forgetting curve).

Exponential forgetting prevails

By summer 1994, I was reasonably sure of the exponential nature of forgetting. In 1995, we published "2 components of memory" with the formula R=exp(-t/S). Our publication remains largely ignored by mainstream science but is all over the web when forgetting curves are discussed.

Interestingly, in 1966, Nobel Prize winner Herbert Simon had a peek at Jost's Law, derived in 1897 from Ebbinghaus's work. Simon noticed that the exponential nature of forgetting necessitates the existence of a memory property that today we call memory stability. Simon wrote a short paper and moved on to the hundreds of other projects he was busy with. His text was largely forgotten; however, it was prophetic. In 1988, similar reasoning led to the idea of the two component model of long-term memory.

Today we can add one more implication: if forgetting is exponential, it implies a constant probability of forgetting in unit time, which points to interference in a neural network, which in turn implies that sleep might build stability not by strengthening memories, but by simply removing the cause of interference: unnecessary synapses. Giulio Tononi might then be right about the net loss of synapses in sleep. However, he believes that the loss is homeostatic. Exponential forgetting indicates that it could be much more: a form of "intelligent forgetting" of things that interfere with key memories reinforced in waking.
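
The step from exponential decay to a constant probability of forgetting per unit time is worth making explicit. A short derivation (standard survival-analysis reasoning, not taken from the original text), written in LaTeX:

  % R(t) is the probability that a memory survives to time t.
  % The hazard rate (probability of forgetting per unit time) is:
  \lambda(t) = -\frac{d}{dt}\ln R(t)
  % For exponential forgetting, R(t) = e^{-kt/S}:
  \lambda(t) = \frac{k}{S} \qquad \text{(constant in time)}
  % For a power law, R(t) = t^{-a}, the hazard would instead be
  \lambda(t) = \frac{a}{t}
  % i.e. a forgetting rate that falls as the memory ages; only the
  % exponential corresponds to a memoryless, interference-like process.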

Negatively exponential forgetting curve

Only in 2005 did we write more extensively about the exponential nature of forgetting. In a paper presented by Dr Gorzelanczyk at a modelling conference in Poland, we wrote:

Archive warning: Why use literal archives?
Although it has always been suspected that forgetting is exponential in nature, proving this fact has never been simple. Exponential decay appears standardly in biological and physical systems from radioactive decay to drying wood. It occurs anywhere where expected decay rate is proportional to the size of the sample, and where the probability of a single particle decay is constant. The following problems have hampered the effort of modeling forgetting since the years of Ebbinghaus (Ebbinghaus, 1885):
  • small sample size
  • sample heterogeneity
  • confusion between forgetting curves, re-learning curves, practise curves, savings curves, trials to learn curves, error curves, and others in the family of learning curves

By employing SuperMemo, we can overcome all these obstacles to study the nature of memory decay. As a popular commercial application, SuperMemo provides virtually unlimited access to huge bodies of data collected from students all over the world. The forgetting curve graphs available to every user of the program (Tools : Statistics : Analysis : Forgetting curves) are plotted on relatively homogenous data samples and are a bona fide reflection of memory decay in time (as opposed to other forms of learning curves). The quest for heterogeneity significantly affects the sample size though. It is important to note that the forgetting curves for material with different memory stability and different knowledge difficulty differ. Whereas memory stability affects the decay rate, heterogeneous learning material produces a superposition of individual forgetting curves, each characterized by a different decay rate. Consequently, even in bodies with hundreds of thousands of individual pieces of information participating in the learning process, only relatively small homogeneous samples of data can be filtered out. These samples rarely exceed several thousands of repetition cases. Even then, these bodies of data go far beyond sample quality available to researchers studying the properties of memory in controlled conditions. Yet the stochastic nature of forgetting still makes it hard to make an ultimate stand on the mathematical nature of the decay function (see two examples below). Having analyzed several hundred thousand samples we have come closest yet to show that the forgetting is a form of exponential decay.

Exemplary forgetting curve sketched by SuperMemo

Figure: Exemplary forgetting curve sketched by SuperMemo. The database sample of nearly a million repetition cases has been sifted for average difficulty and low stability (A-Factor=3.9, S in [4,20]), resulting in 5850 repetition cases (less than 1% of the entire sample). The red line is a result of regression analysis with R=e^(-kt/S). Curve fitting with other elementary functions demonstrates that the exponential decay provides the best match to the data. The measure of time used in the graph is the so-called U-Factor defined as the quotient of the present and the previous inter-repetition interval. Note that the exponential decay in the range of R from 1 to 0.9 can plausibly be approximated with a straight line, which would not be the case had the decay been characterized by a power function.

Exemplary forgetting curve sketched by SuperMemo

Figure: Exemplary forgetting curve sketched by SuperMemo. The database sample of nearly a million repetition cases has been sifted for average difficulty and medium stability (A-Factor=3.3, S > 1 year) resulting in 1082 repetition cases. The red line is a result of regression analysis with R=e^(-kt/S).

Forgetting curve: Retrievability formula

In Algorithm SM-17, retrievability R corresponds with the probability of recall and represents the exponential forgetting curve. Retrievability is derived from stability and the interval:

R[n]:=exp(-k*t/S[n-1])

where:

  • R[n] - retrievability after the n-th repetition
  • k - decay constant
  • t - time elapsed since the previous repetition
  • S[n-1] - memory stability after the (n-1)-th repetition

That neat theoretical approach is made a bit more complex when we consider that forgetting may not be perfectly exponential if items are difficult or of mixed difficulty. In addition, forgetting curves in SuperMemo can be marred by user strategies.
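
A minimal sketch of the formula above, assuming (my assumption, not stated in the text) that k=ln(10/9), so that R=0.9 exactly when the interval t equals the stability S; the function name is mine, not Algorithm SM-17's:

  # Sketch of R[n] = exp(-k*t/S[n-1]). Assumption: k = ln(10/9), so that
  # R = 0.9 when t = S, matching the convention that stability is the
  # interval producing 90% recall.
  import math

  K = math.log(10 / 9)

  def retrievability(t: float, stability: float) -> float:
      """Probability of recall after t days for a memory of given stability."""
      return math.exp(-K * t / stability)

  print(retrievability(10, 10))  # 0.9  (interval equals stability)
  print(retrievability(20, 10))  # 0.81 (doubling the interval squares R)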

In Algorithm SM-8, we hoped that retrievability information might be derived from grades. This turned out to be false. There is very little correlation between grades and retrievability, and what little there is comes primarily from the fact that complex items get worse grades and tend to be forgotten faster (at least at the beginning).

Retention vs. the forgetting index

The exponential nature of forgetting implies that the relationship between the measured forgetting index and knowledge retention can be accurately expressed using the following formula:

Retention = -FI/ln(1-FI)

where:

  • Retention - overall knowledge retention expressed as a fraction (0..1),
  • FI - forgetting index expressed as a fraction (forgetting index equals 1 minus knowledge retention at repetitions).

For example, by default, well-executed spaced repetition should result in retention 0.949 (i.e. 94.9%) for the forgetting index of 0.1 (i.e. 10%). 94.9% illustrates how much exponential decay resembles a linear function at first. For linear forgetting, the figure would be 95.000% (i.e. 100% minus half the forgetting index).
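
The formula follows from averaging the exponential curve over the inter-repetition interval, and the 94.9% figure is easy to verify. A quick sketch (variable names are mine):

  # Average retention over an interval scheduled at forgetting index FI.
  # With R(t) = exp(-k*t) and the interval T chosen so that R(T) = 1 - FI,
  # the mean of R over [0, T] is (1 - exp(-k*T))/(k*T) = -FI/ln(1 - FI).
  import math

  def retention(fi: float) -> float:
      return -fi / math.log(1 - fi)

  print(retention(0.10))  # 0.9491..., i.e. ~94.9% for the default 10% index
  print(retention(0.05))  # ~0.9747, even closer to the linear estimate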

Forgetting curve for poorly formulated material

In 1994, I was lucky my learning collections were largely well-formulated. This often wasn't the case with users of SuperMemo. For badly-formulated items, the forgetting curve is flattened: it is not purely exponential (being a superposition of several exponential curves). SuperMemo can never predict the moment of forgetting of a single item. Forgetting is a stochastic process, and SuperMemo can only operate on averages. A frequently propagated fallacy about SuperMemo is that it predicts the exact moment of forgetting: this is not true, and this is not possible. What SuperMemo does is search for intervals at which items of given difficulty are likely to show a given probability of forgetting (e.g. 10%). Those flattened forgetting curves led to a paradox: neglecting complex items may lead to surprisingly good retention after long breaks from review. Even for a purely negatively exponential forgetting curve, a 10-fold deviation in interval estimation will result in a retention of R2=exp(10*ln(R1)). This is equivalent to a drop from 98% to 81%. For a flattened forgetting curve typical of badly-formulated items, this drop may be as little as 98%->95%. This leads to the conclusion that keeping complex material at lower priorities is a good learning strategy.
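
The arithmetic of the 10-fold deviation can be checked directly (a sketch; the 98%->95% figure for flattened curves is an empirical observation and is not reproduced here):

  # For exponential forgetting, a 10-fold longer interval multiplies the
  # decay exponent by 10: R2 = exp(10*ln(R1)) = R1**10.
  import math

  r1 = 0.98
  r2 = math.exp(10 * math.log(r1))  # identical to r1 ** 10
  print(round(r2, 3))               # 0.817, i.e. the 98% -> 81% drop above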

Power law emerges in superposition of exponential forgetting curves

To illustrate the importance of homogeneous samples in studying forgetting curves, let us see the effect of mixing difficult knowledge with easy knowledge on the shape of the forgetting curve. The figure below shows why heterogeneous samples may lead to wrong conclusions about the nature of forgetting: the heterogeneous sample in this demonstration is best approximated with a power function! The fact that power curves emerge through averaging of exponential forgetting curves has been reported earlier by others (Anderson & Tweney, 1997; Ritter & Schooler, 2002).

Heterogeneous forgetting index

Figure: Superposition of forgetting curves may result in obscuring the exponential nature of forgetting. A theoretical sample of two types of memory traces has been composed: 50% of the traces in the sample with stability S=1 (thin yellow line) and 50% of the traces in the sample with stability S=40 (thin violet line). The superimposed forgetting curve will, naturally, exhibit retrievability R=0.5*Ra+0.5*Rb=0.5*(e^(-k*t)+e^(-k*t/40)). The forgetting curve of such a composite sample is shown in granular black in the graph. The thick blue line shows the exponential approximation (R2=0.895), and the thick red line shows the power approximation of the same curve (R2=0.974). In this case, it is the power function that provides the best match to data, even though the forgetting of sample subsets is negatively exponential.
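
The composite curve from this figure is easy to reproduce and re-fit. A sketch under the figure's assumptions (k is not stated in the figure; k=ln(10/9) is assumed here for concreteness):

  # Superposition demo: 50% of traces with stability S=1, 50% with S=40.
  # Fit exponential and power models to the composite curve.
  import numpy as np
  from scipy.optimize import curve_fit

  k = np.log(10 / 9)  # assumed decay constant; the figure leaves k open
  t = np.linspace(1, 60, 120)
  composite = 0.5 * (np.exp(-k * t) + np.exp(-k * t / 40))

  def expo(t, a): return np.exp(-a * t)
  def power(t, a, b): return a * t**b

  for name, f, p0 in [("exponential", expo, (0.05,)),
                      ("power", power, (1.0, -0.3))]:
      p, _ = curve_fit(f, t, composite, p0=p0)
      ss_res = np.sum((f(t, *p) - composite) ** 2)
      ss_tot = np.sum((composite - composite.mean()) ** 2)
      print(f"{name}: R^2 = {1 - ss_res/ss_tot:.3f}")
  # The power function scores the higher R^2, even though both
  # components of the composite are purely exponential.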

SuperMemo 17 also includes a single forgetting curve that is best approximated by a power function: the first forgetting curve, registered right after memorizing new items. At the time of memorization, item complexity is not yet known. This is why the material is heterogeneous and yields a power curve of forgetting.

The first review forgetting curve for newly learned knowledge collected with SuperMemo

Figure: The first forgetting curve for newly learned knowledge collected with SuperMemo. Power approximation is used in this case due to the heterogeneity of the learning material freshly introduced in the learning process. Lack of separation by memory complexity results in a superposition of exponential forgetting curves with different decay constants. On a semi-log graph, the power regression curve is logarithmic (in yellow) and appears almost straight. The curve shows that in the presented case recall drops merely to 58% in four years, which can be explained by a high reuse of memorized knowledge in real life. The first optimum interval for review at retrievability of 90% is 3.96 days. The forgetting curve can be described with the formula R=0.9906*power(interval,-0.07), where 0.9906 is the recall after one day, while -0.07 is the decay constant. In this case, the formula yields 90% recall after 4 days. 80,399 repetition cases were used to plot the presented graph. A steeper drop in recall will occur if the material contains a higher proportion of difficult knowledge (esp. poorly formulated knowledge), or in new students with lesser mnemonic skills. Curve irregularity at intervals 15-20 comes from a smaller sample of repetitions (later interval categories on a log scale encompass a wider range of intervals)
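
The figure's numbers can be checked against its own formula; a quick sketch:

  # Check the figure's formula R = 0.9906 * interval ** -0.07:
  print(0.9906 * 4 ** -0.07)            # ~0.899: about 90% recall at 4 days
  print((0.9 / 0.9906) ** (1 / -0.07))  # ~3.94 days: near the quoted 3.96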