The Research Brief: What's New in Learning Science - October 2025
New studies on retrieval practice, explicit instruction, interleaving, sleep, erroneous examples, AI, worked examples, print vs digital, adaptive learning
This month features new evidence on retrieval practice as application rehearsal, the surprising power of studying mistakes, and why explicit instruction narrows learning gaps. We’ll look at how student evaluations misjudge teaching quality, why eight hours’ sleep is optimal for maths and science, and how meaningful noise derails recall. Other highlights include mixed results for interleaving and self-explanation, the limits of adaptive learning and AI detection, and studies on how children judge their own knowledge, what bilingual readers rely on, and how spiral curricula and causal structure help memory endure.
New study on retrieval practice answers a big question: what’s the point of learning facts if you don’t know where or how to use them?
Across three experiments, students first studied short slides on research-methods ideas (e.g., third variable, reverse causation), then used different follow-ups: short-answer quizzes with feedback (retrieval), multiple-choice with feedback, simply re-studying the slides, or studying a “Q&A sheet” (quiz-study). In Experiment 1 (8-minute delay), retrieval beat restudy for repeated items but not for new application questions.
In Experiments 2 and 3 (three rounds of retrieval; one-week delay), retrieval reliably outperformed restudy, and often quiz-study, on both repeated and novel application questions. A conditional analysis in Experiment 3 showed that, even when students remembered the base fact, those who had done retrieval were still better at answering the application version, suggesting retrieval helps learners notice when a concept is relevant (recognition), not just remember it. (See the example item formats on p.6 and the bar charts on pp. 11 & 13 showing retrieval’s edge, especially for application items.)
Takeaways: instead of only testing “What is X?”, also ask “Which situation best illustrates X?” or “Where would this apply?” Teachers should also think of retrieval as “application rehearsal,” not just a way of checking memory.
Across maths, medicine and a handful of sciences, pupils can learn well from worked examples that include mistakes, and often even better when the incorrect solution is placed side-by-side with the correct one. In the review’s tally (see Table 2), contrasting erroneous examples beat correct examples in most head-to-heads (9/12); plain erroneous examples were mixed against correct ones; and both example types often matched or outperformed problem-solving, with some gains only showing up on delayed tests. The mechanism is twofold: pupils build “negative knowledge” (what not to do) while also shoring up the right procedure or concept.
The most effective approach appears to be contrasting erroneous examples, where students see both incorrect and correct solutions side by side, rather than just incorrect examples alone.
The researchers ran randomised controlled trials in disadvantaged schools in Martinique, comparing explicit direct instruction with constructivist approaches. In both subtraction (younger pupils) and area (older pupils), all children made progress, but those taught with explicit instruction made far larger gains, with effect sizes that are unusually high in education research. For younger pupils, explicit teaching even helped narrow the attainment gap between stronger and weaker children.
This reinforces a message already seen in other meta-analyses: when pupils are at risk of falling behind, clear explanations, guided practice, and structured feedback provide the most reliable route to mastery of foundational skills. That doesn’t mean abandoning collaboration or discussion altogether, but it suggests that for concepts like subtraction and area, disadvantaged children benefit most from strong teacher guidance before being asked to explore independently.
Good commentary here from Pedro De Bruyckere
Across 89 studies (~5.4m students), the pooled link between SET scores and later performance is essentially nil (r≈.04), turning slightly negative once you control for grades; in random-assignment settings, “popular” instructors sometimes leave students less prepared for the next course. In short: SETs capture how much students liked the course, not how much they learned and retained. The diagrams reinforce this: the conceptual model on page 8 shows how institutional emphasis on SETs lifts “leniency cues” (easy grading, entertainment) and immediate affect, while rigorous, difficulty-laden teaching deepens learning yet depresses SETs; the simulation on page 12 shows GPA rising as SET weight rises, while value-added learning stays flat; the coefficient plot on page 15 shows high-SET instructors clustering around zero or negative value-added.
Using standardised tests and self-reported sleep from 54k+ adolescents across 717 schools, the authors model an inverted-U link between sleep duration and attainment: performance rises as sleep increases up to ~8 hours (8–9 for maths), then tails off; the effect is largest in cognitively demanding subjects and for students in the lower–middle of the attainment distribution. Homework time and evening device use are both linked with shorter sleep. The figures (e.g., the peak in maths at 8–9 hours and the decline beyond) and threshold models converge on a practical “optimal window” of roughly 7–9 hours, centred near eight.
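If you want to see the shape of the model they describe, here is a toy Python sketch of the inverted-U idea (the numbers are simulated for illustration, not the study’s data): fit a quadratic of attainment on sleep hours and read the “optimal” duration off the turning point.

```python
# Toy illustration of the inverted-U (quadratic) model described above:
# fit score = a + b*sleep + c*sleep^2, then the peak sits at -b / (2c).
# All numbers below are simulated for illustration, not the study's data.
import numpy as np

rng = np.random.default_rng(1)
sleep = rng.uniform(5, 11, 1000)                                   # hours slept
score = 60 + 8 * sleep - 0.5 * sleep**2 + rng.normal(0, 2, 1000)   # true peak at 8 hours

c, b, a = np.polyfit(sleep, score, deg=2)   # polyfit returns the highest power first
optimal_sleep = -b / (2 * c)
print(f"estimated optimal sleep ≈ {optimal_sleep:.1f} hours")
```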
The researchers ran four experiments testing how different kinds of distractors (meaningless noise versus meaningful words) interfere with memory retrieval. They found that when people were asked to retrieve information under more effortful, controlled conditions, meaningful distractors (especially ones closely related to the retrieval cue) caused far more disruption than meaningless sounds. The authors argue this is because the brain automatically processes the meaning of distractors, which sparks task-irrelevant ideas that compete with what you are trying to recall. This makes retrieval less efficient, especially if you lack strong inhibitory control.
Background noise that contains meaning (like other students’ chatter, music with lyrics, or overlapping classroom talk) can be far more harmful to learning and recall than non-verbal sounds (like rain outside or ambient hum). This is particularly critical when students are doing controlled retrieval tasks, such as recalling specific vocabulary, solving word problems, or writing essays. It suggests that creating a quiet, language-free environment during demanding cognitive work is not just about reducing distractions, but about preventing semantic interference that actively undermines retrieval.
I’ve written a post about this study here: Is a Noisy Classroom a Thinking Classroom?
Across 1,992 human passages matched with 1,992 AI passages (news, blogs, consumer reviews, novels, résumés, restaurant reviews), the authors benchmark Pangram, GPTZero, OriginalityAI and an open-source RoBERTa model. Pangram tops the table on core metrics (AUROC ≈ 1; very low false positive/negative rates), is relatively robust to “humanizer” rewrites, and even works on ultra-short “stubs” (<50 words). Open-source RoBERTa performs poorly and is not suitable for high-stakes use.
For schools, the big idea is to treat AI-detection as a policy decision, not a magic bullet. The paper introduces “policy caps”, e.g., “our system must keep false positives under 0.5%”, and then tunes detector thresholds to meet that cap. Education contexts may prefer very low false positives to avoid accusing students incorrectly, accepting that some AI use will go undetected; detectors should be paired with assessment design and provenance checks, not used in isolation.
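To make the “policy cap” idea concrete, here is a minimal Python sketch using made-up detector scores (not the paper’s data or any vendor’s API): choose the threshold so that at most 0.5% of known-human writing gets flagged, then see how much AI text that threshold still catches.

```python
# Minimal sketch of a "policy cap": choose the detector threshold so that at most
# 0.5% of known-human texts get flagged as AI, then check what that costs in recall.
# The scores below are simulated stand-ins, not real detector output.
import numpy as np

rng = np.random.default_rng(0)
human_scores = rng.beta(2, 8, 5000)   # hypothetical detector scores for human writing
ai_scores = rng.beta(8, 2, 5000)      # hypothetical detector scores for AI writing

fp_cap = 0.005                        # policy cap: flag at most 0.5% of human texts

# The threshold is the (1 - cap) quantile of human scores: only the top 0.5% of
# human texts score above it, so the false-positive rate stays within the cap.
threshold = np.quantile(human_scores, 1 - fp_cap)

print(f"threshold: {threshold:.3f}")
print(f"false-positive rate on human texts: {(human_scores >= threshold).mean():.4f}")
print(f"share of AI texts still caught:     {(ai_scores >= threshold).mean():.3f}")
```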
The study tested whether year-long use of worksheets with worked examples and self-explanation prompts would improve children’s foundational algebra knowledge and ability to learn new algebra concepts later on. Overall, the intervention didn’t produce a broad, significant effect compared with business-as-usual teaching. However, when students attempted more self-explanation prompts, their learning improved, particularly if they had average or higher prior knowledge and if they completed a high number of worksheets. For students with weaker prior knowledge, benefits were less clear, and too much practice even seemed counterproductive for stronger students.
Adapting interleaved practice to each learner’s confusion patterns did not improve learning beyond ordinary random interleaving.
In a large online experiment (n=259), adults learned to spot six artists’ styles under three study orders: blocked (all of one artist, then the next), random interleaved (mixed, no two in a row from the same artist), and adaptive interleaved (the next item came from whichever artist the learner had just confused) (see the flow in Fig. 1, p. 6; results in Table 3, p. 6). As usual, blocking looked easiest during practice, but both interleaved versions led to better test performance straight after and one week later; crucially, adaptive ≈ random on both tests. Learners felt worse while interleaving (lower category-learning judgements) even when they actually learned more.
The study reinforces that simple interleaving (mixing problem types or examples rather than blocking them) remains a powerful instructional strategy that works across different working memory capacities. However, educators must address the motivational challenge: learners consistently rated interleaved practice as more difficult and felt less confident during learning, despite achieving superior outcomes. Not a good result for personalised learning and adaptive learning algorithms.
The study examines how children judge what they know, either in absolute terms (“Do you know this?”) or relative terms (“Do you know this better than that?”) and how the phrasing of these prompts affects their self-assessment. It finds that subtle differences in how questions are framed can sway children’s confidence and performance judgments.
For educators, this has practical implications: the way we ask children to reflect on their understanding can shape how they perceive their knowledge and how confidently they respond. Being intentional in phrasing, for example, clarifying whether you're asking for a comparison or a standalone evaluation, can help foster more accurate self-assessment and guide more effective feedback.
This study had 99 caregiver–child pairs each read one custom story in print and one on an iPad; target words (e.g., okapi, chisel) were woven into the stories and later tested via naming (production), explaining (definition), and picking the right picture (comprehension). Overall, print and digital came out about the same for word learning—but who the child is mattered a lot: children with bigger vocabularies learned more words on every measure; boys outperformed girls on definition and comprehension; and executive functions (attention/working memory/self-control) predicted definition scores. Importantly, format × executive functions interacted for comprehension: the digital book helped children with higher executive functions but hindered those with lower ones.
Study reveals adaptive learning field dominated by system-delivered solutions that sideline teacher expertise and pedagogical knowledge.
The review exposes a concerning narrowness in adaptive learning research that mirrors broader educational technology trends. The overwhelming focus on mathematics (50% of studies) and performance outcomes (65% of studies) creates a reductive view of learning that neglects the holistic development of learners. This performance-centric approach, whilst pragmatically driven by ease of measurement, risks perpetuating an instrumental view of education where cognitive load, emotional engagement, and cultural responsiveness receive insufficient attention. The dominance of system-delivered adaptivity (59% of studies) over teacher-delivered approaches suggests a techno-solutionist bias that may undervalue pedagogical expertise and the relational aspects of learning.
AI uncovers thousands of fake academic journals.
Here is a good use case of AI in education: Estimating the predictability of questionable open-access journals. This recent study shows how machine learning can be used not to replace expert judgement, but to triage and flag risk in an area where the sheer scale of information overwhelms human capacity. By translating DOAJ’s best-practice criteria into machine-readable signals, such as whether a journal has a visible ethics policy, a transparent editorial board, or unusually high levels of self-citation, the model can highlight journals that warrant closer scrutiny.
The system still produces false positives (e.g. small society titles or discontinued journals) and depends on noisy ground truth data, so human librarians and subject experts remain essential. For me, the promise here isn’t automation replacing judgement, it’s better targeting of our limited attention.
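As a rough illustration of how best-practice criteria become “machine-readable signals”, here is a toy Python sketch (the features, labels and 0.5 threshold are my own invented examples, not the authors’ pipeline): encode a few checks as features, train a simple classifier on noisy labels, and surface high-risk journals for a librarian to review.

```python
# Toy version of the triage idea: turn best-practice criteria into features,
# train a simple classifier, and flag high-risk journals for human review.
# The features, labels and threshold are illustrative assumptions only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

journals = pd.DataFrame({
    "has_ethics_policy":      [1, 0, 1, 0, 1, 0],
    "editorial_board_listed": [1, 0, 1, 1, 1, 0],
    "self_citation_rate":     [0.05, 0.60, 0.10, 0.45, 0.08, 0.70],
    "questionable":           [0, 1, 0, 1, 0, 1],   # noisy "ground truth" label
})

X = journals.drop(columns="questionable")
y = journals["questionable"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Journals whose predicted risk crosses the threshold go to a librarian or
# subject expert for the final judgement; the model only targets attention.
journals["risk"] = model.predict_proba(X)[:, 1]
print(journals[journals["risk"] > 0.5][["self_citation_rate", "risk"]])
```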
The paper proposes a practical way to measure what LLMs truly memorise (vs what they genuinely generalise), estimates model capacity at roughly 3.5–3.8 bits per parameter, and shows memorisation plateaus as models “fill up”, after which generalisation and double-descent kick in.
The authors redefine memorisation using compression: if a model helps you compress a string of text into fewer bits, it “knows” something about that exact string. They separate unintended memorisation (unique, sample-specific details) from generalisation (real patterns in language). Training hundreds of GPT-style models on random bitstrings and on de-duplicated web text, they find a clear capacity limit of ≈3.6 bits per parameter (bfloat16) and ~3.8 (fp32). Once data size exceeds capacity, memorisation stops rising, generalisation increases, and double descent appears precisely at that crossover. Membership-inference attacks (guessing if a text was in training) follow a clean scaling law: with very large datasets, average-case membership inference becomes near random; however, rare, distinctive texts are more likely to be memorised.
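A quick back-of-the-envelope sketch of what that capacity figure implies (the model sizes and the one-byte-per-character assumption are mine, not the paper’s):

```python
# Rough arithmetic with the reported ~3.6 bits-per-parameter capacity estimate:
# how much raw text could a model of a given size memorise verbatim?
# Model sizes and the 8-bits-per-character assumption are illustrative only.
BITS_PER_PARAM = 3.6   # reported estimate for bfloat16 models
BITS_PER_CHAR = 8      # crude assumption: ~1 byte per character of plain text

for n_params in (125e6, 1.3e9, 7e9):          # example model sizes, not from the paper
    capacity_bits = BITS_PER_PARAM * n_params
    capacity_mb = capacity_bits / 8 / 1e6
    capacity_chars = capacity_bits / BITS_PER_CHAR
    print(f"{n_params / 1e9:>5.2f}B params ≈ {capacity_mb:>6,.0f} MB "
          f"(~{capacity_chars / 1e6:,.0f}M characters) before memorisation plateaus")
```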
People who know less about AI are paradoxically more enthusiastic and receptive to using AI, often because they view it as magical and awe-inspiring rather than technical.
The research overturns the usual assumption that deeper knowledge leads to wider adoption. With AI, those who understand it the least are often the quickest to embrace it, precisely because they are dazzled rather than cautious. The studies reveal that when people see AI doing tasks we normally associate with uniquely human traits, such as writing poems or generating recipes, they feel awe and wonder, which increases their willingness to use the tools. By contrast, those with higher literacy are more sceptical, aware of limitations and ethical questions, and therefore more restrained.
Across three periodontal courses (Years 1–3), 113 students sat the same 20-item MCQ test before and after each course and rated how confident they were in every answer. Scores jumped after each course (≈43–55% to ≈76–81% in Year 1; ≈66–70% to ≈84–89% in Year 2), dipped after a four-month gap, then recovered and settled into a high plateau by Year 3 despite a nine-month gap (≈78–81% pre, ≈80–83% post). Confidence rose in lock-step with performance (r = .87), while two “control” items on untaught facts stayed flat on performance even as confidence crept up—an overconfidence red flag.
So spaced reinforcement of the same big ideas in progressively richer contexts (here, cases and clinical rotations) appears to counter the forgetting curve, and adding quick confidence ratings gives useful calibration data (where pupils feel sure but are wrong). The authors are careful to note limits (practice effects from reusing the same items; single site), but the overall picture favours cumulative, confidence-aware assessment designs over one-off, “teach-then-test” blocks.
We don’t just remember what we saw; we remember what caused what. Coherent sequences build memory.
Across three online experiments using short, unfamiliar, CGI videos, the authors manipulated whether clips formed a coherent causal sequence (each clip followed naturally from the last) or a fragmented one (states “reset” between clips). Participants then completed cued-recall judgements (e.g., did this still come before or after the cue; was it part of the same episode?). Memory for order was reliably better for coherent sequences; scrambling or reversing coherent clips removed the advantage, indicating the benefit really was about causal structure, not surface predictability. Longer coherent sequences didn’t overwhelm memory—if anything, performance held up or slightly improved—consistent with the idea that causality helps “compress” an event into a single organised memory. Finally, people were better at rejecting doctored images that violated the causal backbone (e.g., swapping the positions of focal objects) when the original video had been coherent, suggesting causality sharpens memory for causally relevant details.
The researchers compared how 6–10-year-olds rated their absolute knowledge (how much they knew on a topic) versus their relative knowledge (how much they knew compared with an expert). Children tended to overrate their absolute knowledge, particularly on familiar topics like common animals, but they also acknowledged that experts usually knew more. Interestingly, as children grew older, their relative self-assessments became less inflated, suggesting increasing intellectual humility. Between ages 6 and 10, children become more realistic in judging their own knowledge.
Third-graders (n=172; 86 monolingual, 86 bilingual) completed measures of reading comprehension, word/pseudoword fluency, receptive/expressive vocabulary, and three executive functions (working memory, inhibition, cognitive flexibility). Monolinguals outperformed bilinguals on vocabulary, real-word fluency, and comprehension; no group differences appeared on executive function tasks. Across all pupils, real-word fluency and both kinds of vocabulary explained a large chunk of comprehension, with small extra contributions from working memory and inhibition; pseudoword fluency did not help.
When groups were analysed separately, none of the EF measures predicted comprehension for monolinguals, but inhibition did for bilinguals, with a weight similar to vocabulary. The authors interpret this as bilingual children drawing more on cognitive control, possibly to manage interference from the other language and to compensate for smaller vocabulary and slower text reading. For classroom practice, this reinforces the core: keep building automatic word reading and rich vocabulary for everyone, and consider reducing executive-function load (e.g., clearer prompts, fewer simultaneous demands) especially for bilingual pupils during comprehension tasks.
That’s it for this month. Sign up below for one of these every month, and please add any interesting studies in the comments. For any work-related enquiries, please contact me here.
Also, please check out this e-learning course based on How Learning Happens and How Teaching Happens, which I made with Jim Heal for Academica University.


If a rich-get-richer model of vocabulary learning is true, isn’t it surprising that young boys learned more vocabulary than the girls in the storybook reading study?
I'm particularly interested in the emerging research on intentional errors. This is something I added to shaeda flashcards, as it seems to be a *perfect* domain for this. What's interesting is that I had begun to implement it prior to even knowing that it was a 'thing'. The beauty of it is that it can be used for both language learning and general academic learning.
PS: There is no link to your e-learning course. All other links work. But clicking on this one does not load anything.