Other articles in research week: RFK vs CDC, Benefits of Basic Research, and Gain of Function Research.
Imagine a celebrated Harvard scientist who built her career studying honesty and ethical behavior. In 2012, Francesca Gino and colleagues published a headline-grabbing study claiming that getting people to sign an honesty pledge at the top of a form (rather than the bottom) made them less likely to cheat. The intuitive idea turned heads at government agencies and companies eager for ways to encourage integrity. Yet nearly a decade later, this feel-good finding fell apart. In 2021, the paper was retracted after investigators discovered that key data in one of its experiments had been fabricated. Then, even more shockingly, Gino herself was accused in 2023 of falsifying data in at least four of her published studies (with suspicions that dozens more may be tainted). Harvard placed the prominent professor on administrative leave amid the allegations. The irony was glaring: a dishonesty expert potentially caught in dishonesty.

This scandal is not just a quirky one-off; it illustrates a deeper crisis in research. Across many fields of science, especially in the United States, too many findings turn out to be unreliable or impossible to reproduce – a situation so widespread that scientists have dubbed it a “reproducibility crisis”.
Cracks in the Social Sciences
The reproducibility crisis first drew public attention through issues in the social sciences like psychology and economics. In 2015, a landmark project led by the Open Science Collaboration tried to replicate 100 recent psychology experiments published in top journals. The sobering result: only 36% of those replications succeeded in getting results consistent with the original studies. In other words, well over half of the influential findings they tested failed to hold up the second time around. This mass replication effort (involving 270 researchers) made headlines and raised uncomfortable questions about how much published psychology research one can trust. (For comparison, a similar analysis in economics found a somewhat higher replication rate – about 61% of 18 major studies could be confirmed – but still left a large chunk of findings in doubt.)
Notably, most of these discrepancies were not due to fraud. High-profile fraud cases have occurred – for example, a Dutch social psychologist, Diederik Stapel, infamously fabricated data in at least 50 studies, making him “perhaps the biggest con man in academic science”. And in the United States, psychology has seen its share of scandals (from plagiarism to made-up survey results). But experts say outright fakery is just the tip of the iceberg. Much more common are subtle, systemic problems with how research is conducted and published. Many psychology studies in past decades were underpowered – using small sample sizes that can produce fluke results. Researchers faced intense pressure to publish exciting findings, which in some cases led to p-hacking (tweaking analyses until something becomes statistically significant) and cherry-picking only the most favorable data. Over time, these practices can flood the literature with enticing but fragile results that collapse under replication attempts.
A telling indicator of the social sciences’ credibility problem came in a Nature survey of researchers in 2016: more than 70% of scientists admitted they had tried and failed to reproduce another scientist’s experiment. (Over half even said they couldn’t replicate their own experiments at times.) Findings like these have driven a “crisis of confidence” in fields like psychology – but as we’ll see, the issue extends well beyond psychology departments.
False Starts in Biological Research
Reproducibility challenges are just as worrying in biomedical and life sciences, where the stakes include human health. In the early 2010s, pharmaceutical companies began voicing alarm that many basic research results in cancer biology could not be replicated. C. Glenn Begley, then head of cancer research at Amgen, revealed that his team tried to reproduce 53 “landmark” cancer studies from academic labs – but 47 of the 53 (nearly 90%) failed to replicate. “It was shocking,” Begley said, because these were influential papers that industry was relying on for new drug targets. Another pharmaceutical company, Bayer AG, reported similar troubles, managing to duplicate only about 25% of published findings in its own internal checks. In one analysis, Begley and a colleague flatly concluded that more than 90% of claimed preclinical cancer breakthroughs were not reproducible – essentially, “just plain wrong.”
The consequences here are concrete: if companies base multi-million-dollar drug development programs on research that doesn’t hold up, it means wasted time, money, and lost opportunities to find real cures. “These are the studies the pharmaceutical industry relies on to identify new targets for drug development,” Begley noted, adding that “if you’re going to place a $1 million or $2 million or $5 million bet on an observation, you need to be sure it’s true”. After his experiences, Begley lamented, “we became convinced you can’t take anything at face value” in the published literature.
Importantly, the failures in biomedical research aren’t usually due to scientists intentionally faking data. More often, they stem from flawed methodologies and biases that give false signals. A 2015 study estimated that around 50% of all preclinical research in the U.S. is not reproducible, representing about $28 billion in wasted expenditures each year on experiments that don’t pan out. The biggest culprits included mundane but serious issues like faulty reagents and contaminated cell lines used in lab experiments. In other cases, animal studies weren’t designed rigorously enough (for instance, lacking proper randomization or blinding of researchers to avoid bias), or statistical results were overstated. As in the social sciences, publication bias plays a role too – positive, novel findings in biology get published more readily than “null” results, skewing the literature toward optimistic conclusions that might not hold up under stricter testing.
Why So Many Findings Fail to Hold Up
Why did science end up with so many results that can’t be replicated? Researchers have identified a mix of cultural and methodological problems that have plagued the research enterprise for years:
Publish or perish pressure: Scientists advance their careers through publications, and journals prefer exciting, positive findings. This creates a strong incentive to publish quickly and frequently, sometimes at the expense of careful, confirmatory work. With jobs and grants on the line, researchers may consciously or unconsciously cut corners, from formulating hypotheses after seeing the data to glossing over null results. As one analysis put it, the high-pressure environment can lead to scientists “leaving out data... that doesn’t support their conclusions” or “massaging results” until a clear story emerges.
P-hacking and selective reporting: A lot of irreproducible science comes from statistical shenanigans rather than outright fraud. P-hacking refers to trying multiple analyses or data exclusions until you find a statistically significant result (usually p < 0.05). For example, a psychologist might test 20 different variables but only report the one that “worked.” Similarly, many studies suffer from the file drawer effect – if an experiment finds nothing interesting, it often never sees the light of publication. The outcome is a literature cluttered with false positives: intriguing correlations or effects that were really just flukes. (The short simulation after this list shows how easily this kind of multiple testing manufactures “significant” results out of pure noise.)
Small studies with big claims: Especially in fields like psychology, economics, and early-stage biology, it’s common to see studies with very small sample sizes or trivial effect sizes being trumpeted as breakthroughs. A study on 30 people in a lab might get published because it found a dramatic-sounding effect, but such a flimsy result is likely to fall apart in a larger follow-up study. Underpowered studies not only miss real effects; they also make it easy to find apparent effects that are actually statistical mirages.
Lack of replication culture: Until recently, attempting to replicate someone else’s work was not a prestigious endeavor. Few journals would publish a straightforward replication study, and funding agencies didn’t typically grant money just to double-check previous findings. This meant that errors or false results could sit unchallenged for years. Confirmatory research – the slow, careful process of validating discoveries – was undervalued. The result was that self-correction in science got delayed; questionable findings weren’t scrutinized until they had already done damage by influencing further research or policy.
Journal and peer review biases: Academic journals, competing for impact, often favor “interesting” and novel results. A recent study found that papers which later fail to replicate tend to be cited far more than those that hold up – likely because flashy findings get more attention. Reviewers and editors, being human, can be seduced by a surprising result and may apply lower standards of evidence if a claim is extraordinary or newsworthy. This means some shaky studies get waved through peer review when they shouldn’t. Moreover, methods sections can be vague or missing details, making it hard for others to repeat the experiment exactly. All of this contributes to a literature where lots of things look true until someone tries to replicate them.
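To make the p-hacking point concrete, here is a minimal simulation sketch in Python (using numpy and scipy; the 15-per-group samples and the 20-variable setup are illustrative assumptions, not taken from any particular study). It mimics a researcher who measures 20 unrelated outcomes on two small groups where no real effect exists, then reports whichever comparison happens to cross p < 0.05.

```python
# Illustrative simulation of p-hacking via multiple testing (hypothetical setup,
# not from any cited study). Two groups, no true difference on any of 20 outcome
# variables; a "researcher" who reports whichever variable crosses p < 0.05 will
# find a "result" most of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group = 15      # small sample, typical of underpowered studies
n_outcomes = 20       # number of variables the hypothetical researcher tries
n_studies = 5_000     # simulated studies, all with zero real effect

false_positive_studies = 0
for _ in range(n_studies):
    treatment = rng.normal(size=(n_outcomes, n_per_group))
    control = rng.normal(size=(n_outcomes, n_per_group))
    # t-test each outcome separately; keep the smallest p-value, as a p-hacker would
    pvals = [stats.ttest_ind(treatment[i], control[i]).pvalue for i in range(n_outcomes)]
    if min(pvals) < 0.05:
        false_positive_studies += 1

print(f"Studies reporting a 'significant' effect despite no true effect: "
      f"{false_positive_studies / n_studies:.0%}")
```

Because each of the 20 tests has roughly a 5% false-positive rate on its own, the chance that at least one of them looks “significant” is about 1 - 0.95^20, or roughly 64% – which is exactly why a single flashy result from a flexible analysis deserves skepticism.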
It’s important to note that the vast majority of scientists are not fraudsters. Rather, they are working within a system that for a long time rewarded novel findings over rigor. The cumulative effect of all these factors has been a flood of published papers that may not actually tell us the truth about nature or human behavior. As the saying (and a famous 2005 paper) goes, “most published research findings are false” – or at least, most are not as solid as they appear.
Why It Matters: Trust and Progress at Stake
Why should non-scientists care if some psychology experiments or lab studies don’t replicate perfectly? For one, public trust in science is on the line. If headlines constantly yo-yo between “Coffee causes cancer” and “Coffee extends lifespan,” people understandably grow cynical about research. When high-profile studies turn out to be wrong, it feeds a narrative (often exploited by science skeptics) that “scientists can’t get their story straight.” In extreme cases, an irreproducible or fraudulent study can do lasting damage. Perhaps the most infamous example is the 1998 vaccine-autism study, a small British case series that implied vaccines cause autism. The study was later exposed as fraudulent and retracted 12 years after publication – but by then it had sown seeds of fear that grew into the modern anti-vaccine movement. The result has been lasting harm: countless parents still reject vaccines due to a debunked result, leading to resurgences of diseases like measles. This illustrates how a single bad study can undermine public health and trust in medical science for decades.
Policy decisions, too, can be led astray by irreproducible research. Governments and businesses have poured resources into programs inspired by exciting academic findings – from educational interventions to economic policies – only to later discover those findings were unreliable. In the 2000s, for instance, behavioral science produced many “nudge” strategies (small tweaks to influence people’s behavior) that were eagerly adopted by companies and policymakers. But many of those headline-grabbing ideas did not hold up when independent researchers tried to replicate them. As noted, the “sign at the top” honesty intervention was one such idea that even got attention from the White House; it seemed to offer a cheap tool for increasing honesty in settings like tax forms or insurance claims. Had the fraud in that research not been caught, organizations might have invested in a practice that simply doesn’t work. The same goes for economic research: consider that for years, policymakers leaned on a prominent economic study suggesting high national debt stunts growth – until a replication found spreadsheet errors and methodological flaws in that study, undermining the rationale for certain austerity measures. When research claims guide real-world policy, getting the science right is critical. Otherwise, we risk spending money on ineffective or harmful policies.
Finally, irreproducible science drags down scientific progress itself. Future research builds on past findings; if those findings are shaky, researchers can waste years (and millions of dollars) following false leads. In drug development, as Begley observed, basing huge projects on unverified results is a recipe for failure. Young scientists can become discouraged if they try to build on a published study and nothing works, not realizing the original might have been a fluke. Over time, that frustration can drive talent away and slow the pace of discovery. Science is self-correcting in principle – errors should eventually be discovered – but the reproducibility crisis shows that self-correction can be painfully slow and costly when the system isn’t set up to encourage it. As one meta-researcher quipped, “replication is supposed to be a hallmark of scientific integrity”, yet for too long it has been neglected.
Fixing the Problem: Toward More Reproducible Research
The good news is that over the past decade, awareness of the reproducibility crisis has spurred a flurry of reforms and new practices aimed at strengthening the reliability of research. Scientists, journals, and funding agencies are experimenting with ways to raise standards and reward transparency. Some key efforts and ideas include:
Preregistration of studies: More researchers now preregister their experimental plans and hypotheses in advance (for example, on the Open Science Framework website). By time-stamping a study design publicly, they commit to an analysis plan before seeing the data, which helps prevent p-hacking and data dredging. Journals are increasingly encouraging this practice, and some will even accept papers in advance based on the design (a format called Registered Reports), to remove the pressure of needing flashy results.
Open data and methods: There’s a growing push for open science, where scientists make their raw data, analysis code, and methods available to others. Many journals now require or reward data sharing. This transparency allows independent experts to double-check findings and try to reproduce analyses. When mistakes or inconsistencies are spotted, they can be corrected more quickly, and other researchers can build off the data instead of having to trust a paper’s summary. Openness also deters misconduct, since faked or cherry-picked data are harder to hide when everything must be uploaded for scrutiny.
Replication initiatives: The research community has launched numerous large-scale replication projects to systematically test important findings. We saw this with psychology’s 100-study effort, and similar projects have been organized in other fields (from economics to cancer biology). For example, the Center for Open Science coordinated an effort to replicate key results in cancer research – painstakingly repeating experiments from landmark papers. Such projects not only tell us which discoveries are solid; they also send a message that replication is valued. In addition, new journals and funding streams are appearing that specifically support replication studies and the reporting of negative results, helping to counteract the old publication bias.
Changing incentives and culture: Perhaps the hardest fix is altering the incentive structure in science. But there are signs of change. Universities and grant agencies are discussing how to reward quality over quantity – for instance, hiring and promotion committees placing more weight on rigorous methodology and reproducibility records, not just number of publications. Top-tier journals have begun awarding “open science” badges for practices like data sharing or preregistration, signaling that trustworthiness matters. There are calls for journals to accept more papers that attempt to replicate or nullify earlier findings. By making replication an expected part of the scientific process (rather than an act of doubt or hostility), the hope is that researchers will routinely check each other’s work. As one paper suggested, journals and institutions may need to “apply higher standards” and be less seduced by striking results, focusing instead on whether studies were well-designed.
Better training in methods: Another reform is improving how scientists are trained in statistics and experimental design. In response to the crisis, many PhD programs have updated their curricula to emphasize rigorous methods, reproducibility, and ethics. Young researchers learn about common pitfalls (like p-hacking and confirmation bias) so they can avoid them. There’s also more discussion of topics like power analysis (to ensure adequate sample sizes – a short sketch after this list shows what such a calculation looks like) and proper use of statistics. The aim is to foster a new generation of scientists who prioritize getting it right over getting it published.
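As an illustration of the power-analysis idea mentioned above, here is a minimal sketch in Python using the statsmodels library (the effect sizes and the conventional 80%-power target are illustrative choices, not tied to any specific study). It asks how many participants per group a simple two-group comparison needs before it has a reasonable chance of detecting an effect of a given size.

```python
# Minimal power-analysis sketch (illustrative numbers only): how many participants
# per group does a two-sample t-test need to detect a given effect size with
# 80% power at alpha = 0.05?
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect_size in (0.2, 0.5, 0.8):  # Cohen's d: small, medium, large
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
    print(f"Cohen's d = {effect_size}: about {n:.0f} participants per group")
# A small effect (d = 0.2) needs roughly 400 people per group; a large one (d = 0.8)
# needs only a few dozen.
```

The takeaway matches the “small studies with big claims” problem above: detecting a small effect reliably takes hundreds of participants per group, so a dramatic finding from a 15- or 30-person study should be treated as provisional at best.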
While there is no overnight solution, these efforts are starting to bear fruit. Already, fields like psychology have seen a turn toward greater skepticism and self-correction – questionable past findings are being re-tested and, if needed, tossed out. The fact that Francesca Gino’s alleged misconduct was uncovered by vigilant peers (a trio of data-savvy researchers who run a blog on research integrity) is evidence that the scientific community is paying closer attention. Science has always progressed by trial and error, and some level of false starts is inevitable. The reproducibility crisis, however, has been a wake-up call that too many errors slipped through for too long. By reforming how research is done – making it more transparent, thorough, and truth-seeking – scientists hope to rebuild trust and ensure that the published discoveries we celebrate today are still standing strong tomorrow.
Note: This is my own opinion and not the opinion of my employer, State Street, or any other organization. This is not a solicitation to buy or sell any stock. My team and I use a Large Language Model (LLM) aided workflow. This allows us to test 5-10 ideas and curate the best 2-4 a week for you to read. Rest easy that we fact check, edit, and reorganize the writing so that the output is more engaging, more reliable, and more informative than vanilla LLM output. We are always looking for feedback to improve this process.
Additionally, if you would like updates more frequently, follow me on x: https://x.com/cameronfen1. Also, feel free to send me corrections, new ideas for articles, or anything else you think I would like: cameronfen at gmail dot com.