Half of social-science studies fail replication test in years-long project
Summary
The SCORE (Systematizing Confidence in Open Research and Evidence) project spent seven years evaluating 3,900 social-science papers to test reproducibility, robustness and replicability. Teams of researchers reviewed papers across economics, education, psychology and sociology to see whether results held up when analyses were rerun, when reasonable alternative analyses were applied, and when experiments were repeated from scratch.
Key outcomes: many papers lacked the data or methodological detail needed to reproduce analyses. Of the 600 papers examined for reproducibility, only 145 contained enough information, and just 53% of those reproduced exactly. In robustness checks, roughly 75% of 100 papers held up, though about 2% produced opposite conclusions under alternative analyses. Full replication attempts produced a statistically significant result in only 49% of 164 studies.
The team highlights common causes such as missing data or code and incomplete methods descriptions, while noting that legitimate methodological updates can produce different, but still valid, results. The project, coordinated in part by the Center for Open Science and funded by DARPA, aims to support the development of automated confidence-scoring tools for social-science findings.
Key Points
- The SCORE initiative reviewed 3,900 papers across multiple social-science fields over seven years.
- Reproducibility: only 145 of 600 papers contained enough detail to attempt reproduction; 53% of those reproduced precisely.
- Robustness: about 75% of 100 papers remained consistent under alternate reasonable analyses; ~2% reversed conclusions.
- Replicability: only 49% of 164 full replication attempts yielded a statistically significant result.
- Main problems include missing data/code and incomplete methodological detail; proposed fixes include open data, clearer methods, preregistration and automated multiverse-style checks (a minimal example is sketched below).
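For intuition, here is a minimal sketch of what a multiverse-style check looks like in practice: the same effect is re-estimated under every defensible combination of analysis choices, and the conclusion is checked for consistency across those "universes". The data, covariates and specification grid below are invented for illustration; they are not from the SCORE project.

```python
# Hypothetical multiverse-style robustness check (illustrative only):
# re-estimate one effect under several reasonable analysis choices and
# see whether the direction and significance of the conclusion hold up.
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
age = rng.normal(40, 10, n)           # covariate 1 (invented)
income = rng.normal(50, 15, n)        # covariate 2 (invented)
treat = rng.integers(0, 2, n)         # the "treatment" whose effect we test
y = 0.3 * treat + 0.02 * age + rng.normal(0, 1, n)  # true effect = 0.3

# Each universe is one defensible combination of choices:
# which covariates to adjust for, and whether to trim outliers in y.
covariate_sets = [[], ["age"], ["income"], ["age", "income"]]
trim_choices = [False, True]
covs = {"age": age, "income": income}

results = []
for cov_names, trim in itertools.product(covariate_sets, trim_choices):
    keep = np.abs(y - y.mean()) < 2.5 * y.std() if trim else np.ones(n, bool)
    X = np.column_stack([treat] + [covs[c] for c in cov_names])[keep]
    model = sm.OLS(y[keep], sm.add_constant(X)).fit()
    coef, p = model.params[1], model.pvalues[1]  # treatment is column 1
    results.append((cov_names, trim, coef, p))
    print(f"covs={cov_names!s:20} trim={trim!s:5} beta={coef:+.3f} p={p:.4f}")

# A robust finding keeps the same sign (and ideally significance) in every
# universe; a sign flip in some universes is the ~2% failure mode above.
signs = {np.sign(r[2]) for r in results}
print("consistent direction across all universes:", len(signs) == 1)
```

The automated checks SCORE envisions would run grids like this across whole literatures; the toy version only shows why a result can look significant under one set of analysis choices yet weaken or reverse under another.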
Why should I read this?
Short version, no fluff: half the social-science findings tested didn't hold up. If you rely on research for policy, practice, or to make sense of the world, this explains the gap and what might actually fix it. Worth five minutes if you care about reliable evidence.
Context and relevance
This is a landmark, large-scale confirmation of earlier reproducibility concerns across disciplines. It reinforces the push for better disclosure norms (data, code, methods), preregistration and new tools such as multiverse analyses and automated checks. Notably, independent work on more recent papers (2022–23) shows improved reproducibility, suggesting that norms are shifting. The overall message, though, is clear: treat a single study as one piece of a larger evidence puzzle, not the final word.
