Unofficially, biology researchers often complain about how frequently they fail to replicate results that others have achieved. But of course, such failed experiments are underreported.
Is there data, or at least well-reasoned guesses, on the success rate of experiment replication in preclinical models? I don't mean translational research and the impossibility of replicating a mouse result in a human. My question is about failing to reproduce results within the same model system, replicated at the level of detail available in the original publication (e.g. same animal strain, same drug dosages, etc.).
The Cochrane Collaboration has a great deal of this type of analysis. One of my favourite features of a Cochrane Review is the routine use of Funnel Plots, where there are sufficient data to produce them. Sensible interpretation of funnel plots can give some fairly strong hints about the reproducibility of published literature.
Funnel plots show datapoints of individual published studies which investigated the effectiveness of a technique when used for a specific purpose. The datapoints are plotted by estimated effect size (on the x axis) and an assessment of the quality of the trial, incorporating things like sample size and experimental design (on the y axis). A study with a small sample size and/or poor design (low on the y axis) would be expected to provide a relatively inaccurate estimate of the method's true effect size. As sample sizes increase and experimental methodology gets better (travelling up the y axis), you expect the studies to zero in on the true effect size, giving a pyramid or 'funnel' shape to the overall plot.
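The geometry described above can be reproduced with a small simulation, here sketched in Python with NumPy. This is an illustration, not anything from the Cochrane material: the true effect size, the sample sizes and the number of studies are all arbitrary choices. Small studies scatter widely around the true effect (the wide base of the funnel), while large studies cluster tightly near it (the narrow tip):

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_EFFECT = 0.5  # hypothetical true standardized effect size

def simulate_study(n_per_arm):
    """Simulate one two-arm study; return its estimated effect and standard error."""
    treated = rng.normal(TRUE_EFFECT, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    effect = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n_per_arm + control.var(ddof=1) / n_per_arm)
    return effect, se

# Small studies form the wide base of the funnel, large studies the narrow tip
small_effects = np.array([simulate_study(10)[0] for _ in range(200)])
large_effects = np.array([simulate_study(200)[0] for _ in range(200)])

print(f"spread of small-study estimates: {small_effects.std():.3f}")
print(f"spread of large-study estimates: {large_effects.std():.3f}")
```

Plotting each study's effect estimate against its precision (1/SE) for a mix of study sizes produces exactly the pyramid shape described above.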
The reproducibility of trials affects where the points are in the pyramid in a couple of ways.
If there were a bunch of studies which looked for an effect of the method, and didn't find one, they might not be published because null results are considered boring by journal editors - the so-called 'file-drawer effect'. This would show up in a funnel plot as a pyramid that has a suspiciously low number of datapoints on the side of the pyramid that is close to the zero effect-size. The 'missing' datapoints are an indicator that this technique probably did not always reproduce an effect size as strong as the published effect sizes.
The other way that funnel plots can inform on the reproducibility of published studies is simply by looking at the width of the base of the pyramid. If it's wide, then with the experimental design methods of the studies at the bottom, the results aren't highly reproducible - different studies give very different results when looking at the same question.
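One way to make the 'missing corner' of the funnel quantitative is Egger's regression test, which regresses each study's standardized effect (effect/SE) on its precision (1/SE); an intercept far from zero signals asymmetry. Below is a minimal sketch with simulated studies and a crude publication filter that suppresses non-significant results; all parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_EFFECT = 0.2  # hypothetical small true effect

# Simulate studies of varying size: known SE, observed effect with sampling noise
n = rng.integers(5, 200, size=400)
se = np.sqrt(2.0 / n)
effect = rng.normal(TRUE_EFFECT, se)

def egger_intercept(effect, se):
    """Intercept of the regression of effect/SE on 1/SE (Egger's test statistic)."""
    slope, intercept = np.polyfit(1.0 / se, effect / se, 1)
    return intercept

# 'File-drawer' filter: only significant studies (z > 1.96) reach publication
published = effect / se > 1.96

print(f"intercept, all studies:      {egger_intercept(effect, se):.2f}")
print(f"intercept, 'published' only: {egger_intercept(effect[published], se[published]):.2f}")
```

With all studies included, the intercept sits near zero; once the small null studies are filed away, the surviving small studies are only the lucky overestimates, and the intercept drifts well above zero.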
It's a big job to synthesise this into a global assessment of 'the reproducibility of preclinical trials', but if you're interested, I'd recommend you spend some time trawling the Cochrane Library to get a sense of the sorts of variance you're dealing with.
Design preclinical studies for reproducibility
To boost the research pipeline of medicines and therapies, the required preclinical work should be designed to facilitate reproducibility.
The minute fraction of published preclinical studies of medicines and therapies that have been tested in case studies or clinical trials is inescapable evidence that clinical translation is complex and costly. Yet another factor is also critical: difficulties in the reproducibility (by a different team using a different set-up), and even replicability (by a different team using the same set-up), of protocols, procedures and results. The few reproducibility efforts that have been carried out in biomedicine (mostly in cancer biology) have reported that only 10–25% of the small sets of preclinical studies considered could be successfully reproduced.
Improper design, implementation and reporting of preclinical studies of medicines and therapies can undermine their reproducibility and translatability. Figure reproduced from Nature 529, 456–458 (2016); illustration by Chris Ryan/Nature, Springer Nature Ltd.
Preclinical findings can be difficult to replicate and reproduce, owing to hurdles associated with biological heterogeneity and complexity, or with the use of non-standard methods or of costly materials or technology. But many published studies unfortunately also suffer from poor or biased study design, insufficient statistical power, or lack of adherence to reporting standards. These flaws, which can result in wasted time and resources, in reductions in funding and investment, and occasionally in the halting of promising research, are entirely avoidable.
How can preclinical biomedical studies be designed to maximize the chances of clinical translation? A Perspective by John Ioannidis, Betty Kim and Alan Trounson, published in this issue, provides specific advice for studies in nanomedicine and in cell therapy. For instance, to minimize the effects of biological heterogeneity when designing a study to test a new nanomedicine or cell product, investigators should adopt or develop standardized specimens, materials and protocols, and use multiple disease-relevant in vitro models and animal models. For example, immunocompetent mouse models and models using patient-derived cells typically better recapitulate critical aspects of human disease than immunocompromised animals and cell-line-derived xenografts. Still, all models have limitations; to avoid hype, these and any other limitations, assumptions and caveats of the study should be reported. And to reduce biological variability and experimentation (and experimenter) biases, replication by different investigators (ideally in a multi-site collaboration project) across independent biological specimens or systems is often necessary. Also, when designing new medicines or therapies, early attempts to reduce complexities in product synthesis or manufacturing and in material costs should pay off. Too often, promising nanomedicines and cell products are not translatable because their synthesis cannot be scaled up (or producing them with consistent homogeneity or purity would be unfeasible or too expensive), or because their safety profiles do not offer satisfactory assurances for testing in humans.
Tools that protect investigators from biases and statistically poor results can act as effective safeguards against reproducibility failures in preclinical studies testing medicines or therapies. In particular, randomization and blinding, when feasible, reduce bias. And large, adequately powered studies boost trust in the claims. Yet studies that lack negative controls when claiming a positive result, that lack positive controls when claiming a negative finding, that claim statistical significance on the basis of P values barely below the arbitrary threshold of 0.05, or that selectively report the outcomes of exhaustive post-hoc data analyses carried out to find statistically significant patterns in the data, collectively pollute the biomedical literature and harm biomedical research. Biases, which are rarely intentional, usually result from inadequate training or from skewed incentives and rewards, such as the pressure to publish significant results. Hence, study design, including well-informed statistical analyses, should ideally be in place in advance of the implementation of the study (except for work of an exploratory nature), and be registered on a publicly accessible database (such as preclinicaltrials.eu).
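As an illustration of fixing power at the design stage, the per-group sample size for a two-sample comparison of means can be estimated with the standard normal-approximation formula n ≈ 2((z₁₋α/2 + z₁₋β)/d)². The sketch below uses only the Python standard library; the effect size d = 0.5 and the conventional α = 0.05, power = 0.8 are illustrative choices, not values from the text:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sample comparison of means
    (normal approximation to the two-sided two-sample t-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# A 'medium' standardized effect (d = 0.5) already needs roughly 63 animals per group;
# a small one (d = 0.2) needs several hundred
print(n_per_group(0.5), n_per_group(0.2))
```

The steep growth of the required n as the effect shrinks is exactly why underpowered single-lab studies so often produce findings that cannot be reproduced.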
Even when preclinical studies have been properly designed, the lack of sufficiently thorough and clear reporting can hamper their reproducibility. In this respect, guidelines such as ARRIVE (Animal Research: Reporting of In Vivo Experiments) and reporting checklists (such as the Nature Research Reporting Summary, which is included in every research article published by Nature Biomedical Engineering) prompt authors to provide details about study design, methods, protocols, materials, animal models, data acquisition and processing, and software used.
Still, proper study design and clear and thorough reporting, although necessary for reproducibility and translatability, are not sufficient. Without an adequate understanding of the biological mechanisms underlying the action and performance of a particular medicine or therapy, reproducibility and translatability attempts can be handicapped. Naturally, unveiling sufficient mechanistic understanding requires exploratory research, which sometimes is painstakingly difficult and may require new animal models or in vitro microphysiological models that better recapitulate the biology of human disease.
Good practices in study design, the standardization of methods, protocols, models and reporting, and increased mechanistic understanding can facilitate the reproducibility of promising medicines and therapies when the data, protocols, and unique materials and disease models are readily shared among researchers via community-recognized databases and repositories. Only when all of these measures are pursued collectively, alongside calls for the funding of reproducibility studies, will the translational pipeline be widened and translational efforts accelerated.
The 'gold standard' of preclinical research yields 'highly unreliable' results, study finds

A rack of mouse cages in an animal facility, where animals are kept under highly standardized conditions. (Image: Hanno Würbel)
According to the study, increasing study sample diversity can “significantly” improve the reproducibility of experimental results.
However, standardization is the gold standard in animal research – if not a dogma, said lead author Professor Hanno Würbel, director of the Division of Animal Welfare at the University of Bern, who has long hypothesized that standardization may be a cause of poor reproducibility, rather than the antidote.
“The more you standardize the animals and the environment, the more you risk obtaining results that are specific to these standardized conditions but would not be reproducible with different animals or under different conditions,” he explained.
The aim of the study, which was published recently in PLOS Biology, was to examine the extent to which standardization may be a problem in preclinical animal research, and whether a multi-laboratory approach could be a solution, said Würbel.
Researchers from the Universities of Bern and Edinburgh used computer simulations based on 440 preclinical studies across 13 different treatments in animal models. The reproducibility of results was then compared between single-laboratory and multi-laboratory studies.
“We found that the gold standard of single-laboratory studies conducted under standardized conditions yields highly unreliable results,” said Würbel.
Conversely, the researchers found that using the same number of animals distributed across up to four laboratories yielded “much more accurate and better reproducible” results.
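The intuition behind the multi-laboratory result can be reproduced with a toy simulation (not the actual model from the PLOS Biology study): give each laboratory a random systematic offset, then compare a single-lab design with the same total number of animals split across four labs. All variance parameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
TRUE_EFFECT = 1.0   # hypothetical treatment effect
LAB_SD = 0.5        # between-laboratory variation (invented)
ANIMAL_SD = 1.0     # between-animal variation within a lab (invented)
N_ANIMALS = 40      # total animals per study

def run_study(n_labs):
    """Estimate the effect using N_ANIMALS animals spread over n_labs laboratories."""
    per_lab = N_ANIMALS // n_labs
    estimates = []
    for _ in range(n_labs):
        lab_offset = rng.normal(0.0, LAB_SD)  # lab-specific bias shared by its animals
        animals = rng.normal(TRUE_EFFECT + lab_offset, ANIMAL_SD, per_lab)
        estimates.append(animals.mean())
    return np.mean(estimates)

single_lab = np.array([run_study(1) for _ in range(2000)])
multi_lab = np.array([run_study(4) for _ in range(2000)])

# The multi-lab design averages out lab-specific biases, so replicate studies agree better
print(f"single-lab spread: {single_lab.std():.3f}")
print(f"multi-lab spread:  {multi_lab.std():.3f}")
```

A single-lab study inherits its lab's entire bias, whereas spreading the same animals over four labs averages four independent biases, shrinking the between-replicate spread without any extra animals.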
Reproducibility of academic preclinical translational research: lessons from the development of Hedgehog pathway inhibitors to treat cancer
Academic translational research is growing at a great pace at a time in which questions have been raised about the reproducibility of preclinical findings. The development of Hedgehog (HH) pathway inhibitors for the treatment of cancer over the past two decades offers a case study for understanding the root causes of failure to predict clinical outcomes arising from academic preclinical translational research. Although such inhibitors were once hoped to be efficacious in up to 25% of human cancers, clinical studies showed responses only in basal cell carcinoma and the HH subtype of medulloblastoma. Close examination of the published studies reveals limitations in the models used, lack of quantitative standards, utilization of high drug concentrations associated with non-specific toxicities and improper use of cell line and mouse models. In part, these issues arise from scientific complexity, for example, the failure of tumour cell lines to maintain HH pathway activity in vitro, but a greater contributing factor appears to be the influence of unconscious bias. There was a strong expectation that HH pathway inhibitors would make a profound impact on human cancer, and experiments were designed with this assumption in mind.
Over the past two decades, much of the burden for preclinical research, and even early clinical trials, has shifted from industrial to academic laboratories, as companies conserve their resources. There is now a great demand by industry for academic laboratories to de-risk projects, by providing strong indications of efficacy, before partnerships are formed. However, increasingly, questions are being raised about the lack of reproducibility of academic translational research findings in industrial laboratories. This has led to calls for greater education in statistical methodology, strategies to verify reproducibility and the establishment of publication watchdog websites, such as PubPeer and Retraction Watch, dedicated to pointing out deficiencies in the published literature. However, each of these initiatives fails to deal with the root problem. Rather than provide a theoretical discourse on the issues, I will review key steps involved in the development of small molecule inhibitors of the Hedgehog (HH) pathway for the treatment of cancer, to illustrate pragmatic lessons in preclinical translational research. Although the development of HH pathway inhibitors for the treatment of cancer was a successful translational journey, leading to drug approvals, there were many instructive failures along the way. These missteps point to flawed reasoning, improper use of models, lack of quantitative scientific approaches and self-delusion, each of which stems from pervasive unconscious bias in academic translational research. To illuminate these issues, it is necessary to dig into the details of the science to understand, and hopefully avoid, similar problems in the future.
2. The difference between translational and basic research
There is a fundamental difference between translational research and basic research. In basic research, investigators may favour a particular hypothesis, but they strive to maintain disinterest while testing their ideas through critical experimentation that advances knowledge regardless of the outcome. In translational research, the goal is to generate positive results that support the development of a product, or intervention strategy, to improve patient outcomes. In experimental therapeutics, the stakes can be very high during the preclinical research stage, carried out to support the decision to advance a product into the clinic. Similarly, in phase I/II clinical trials, there can be enormous negative consequences if a product progresses only to fail at the later phase III stage, or beyond. In industry, great efforts are made to challenge projects as early as possible in the pipeline because of the financial and lost opportunity costs of late-stage failure. Indeed, company scientists are often rewarded for uncovering scientific flaws, off-target effects or unexpected toxicities that result in shutting down a project before it consumes inordinate resources. However, this incentive does not exist in academia and therein lies the problem.
The currency of academic research is publications and grants. Notoriously, it is very challenging to publish negative data and grants are not awarded, or renewed, when a project fails to show promising results. This means that embarking on a multi-year translational study is a high-risk endeavour for the entire research team. Students need publications to graduate, fellows require high-impact studies to open the door to an independent career and laboratory heads must constantly generate grant revenue to survive. In this climate, it is almost impossible to maintain disinterest. Even if this perspective can be maintained at the conscious level, wishful thinking creeps in because of unconscious bias.
3. Target identification
The first step in any drug development project is to identify a target. In the case of the HH pathway, the target was uncovered as a result of gene mapping studies demonstrating that basal cell nevus carcinoma syndrome (BCNS), also known as Gorlin syndrome, is a consequence of mutations in Patched-1 (PTCH1), the receptor for HH ligands [2,3]. In addition to developmental defects, BCNS is associated with a high frequency of basal cell carcinoma (BCC) and an elevated incidence of the paediatric brain tumour medulloblastoma (MB). Germline loss of one copy of PTCH1 is complemented by somatic mutation, or silencing, of the remaining allele in tumour cells, revealing that PTCH1 acts like a classic tumour suppressor gene in BCC and MB. These initial reports were followed quickly by studies identifying PTCH1 mutations in sporadic BCC and MB. It is now clear that other mutations in the pathway, including gain-of-function mutations in Smoothened (SMO) and loss of Suppressor-of-Fused (SUFU), also result in BCC and MB. In all cases, these mutations result in high levels of HH pathway activity, independent of ligands, leading to elevated expression of downstream transcriptional target genes, including GLI1 and GLI2.
The work progressed rapidly, largely owing to the prior decades of outstanding basic research, initiated by pioneering work in Drosophila genetics that was recognized by the award of a Nobel Prize to Christiane Nüsslein-Volhard, Eric Wieschaus and Edward Lewis. This work was extended by high-quality developmental biology studies that elucidated the critical role of the HH pathway in a broad range of developing tissues. It is important to stress that quality translational research is built on quality basic research, and we must continue to interpret translational research findings in the context of detailed knowledge of the biology of the target. This was particularly important in the development of SMO inhibitors, as fears of developmental bone toxicities, because of the well-known role of the HH pathway in the bone growth plate, were borne out in the clinic [10,11]. This resulted in a US Food and Drug Administration restriction on the use of SMO inhibitors in young children prior to completion of bone growth that, unfortunately, was only put in place after bone malformations, first described in young mice [12,13], were recapitulated in children.
The identification of PTCH1 loss as a therapeutic target presented a conundrum: because it is deleted from tumour cells, how does one target an absence? The solution was revealed by a remarkable series of observations that, when tied together, read like clues from a detective novel (figure 1). The first clue was the observation of holoprosencephaly in lambs, caused by ingestion of corn lilies (Veratrum californicum) by pregnant ewes. Ultimately, the teratogens were identified as plant steroidal alkaloids, and one of these, termed cyclopamine because of its ability to induce holoprosencephaly (the cyclops-like phenotype known from ancient times), was also shown to induce limb malformations. Similar defects occur as a consequence of genetic mutations in Sonic Hedgehog (SHH) and other components of the HH pathway in a range of species, including humans. Cyclopamine was then shown to function as an inhibitor of the HH pathway [18,19]. Thus, the naturally occurring teratogen, cyclopamine, induces a phenocopy of certain HH pathway mutations. The key finding that tied the threads of the story together was the observation from the Beachy Laboratory that cyclopamine functions by direct binding to SMO to shut down HH pathway activity.
Figure 1. Flies, sheep, corn lilies, cyclops and Sonic the Hedgehog point the way to novel cancer therapeutics. Early genetic studies in Drosophila (a) described the role of the HH pathway in development and later this was extended to mammals. Mutations in the HH gene cause holoprosencephaly (cyclops-like phenotype) in mice, sheep (b), humans and mythical beasts (c). The same phenotype arises in lambs born to ewes that ingest the corn lily plant (V. californicum) (d) in the first trimester. This was the clue that ultimately led to the identification of small molecule inhibitors of the HH pathway. The mammalian HH gene family includes SHH, a name inspired by the eponymous video game character designed by Sega Inc. (e). Illustrations by T. Curran.
The complexity of the HH pathway continues to evolve to the present day. Briefly, HH ligands (Sonic, Indian and Desert) bind and inhibit PTCH1 (figure 2). PTCH1 is a negative regulator of SMO. Thus, in tumour cells lacking PTCH1, SMO is constitutively active and HH pathway activity remains elevated. Cyclopamine, and similarly acting compounds, binds to SMO and blocks its function, thereby shutting off HH pathway activity. With the discovery that cyclopamine is an inhibitor of SMO, the focus of many investigators in the field transitioned from a basic research perspective to preclinical translational research. These groups shared the common goal of developing proof-of-concept data, using cell culture and animal models, to support the use of HH pathway inhibitors as anticancer agents in clinical trials in humans. However, although this sounded simple, challenges were encountered immediately that led to overinterpretation of results and unrealistic expectations.
Figure 2. The HH signalling pathway. HH ligands bind to the membrane-associated protein PTCH1 and inhibit its function. PTCH1 inhibits SMO by preventing it from translocating to the primary cilium. SMO inhibits SUFU which, in turn, inhibits the activation and translocation of GLI1 and GLI2 to the nucleus. SUFU also activates GLI3, which is processed by proteolytic cleavage to become a transcriptional repressor. GLI1 and GLI2 activate transcription of several target genes, including themselves and PTCH1. The loss of PTCH leads to constitutive activation of the pathway. Small molecule inhibitors bind and inhibit SMO.
4. The use of cancer models
Traditionally, the first step in testing potential anticancer agents is to determine whether they can inhibit the growth of cultured tumour cells. The issue that confronted investigators studying SMO inhibitors was that there were no tumour cell lines available in which activating HH pathway mutations had been documented and elevated HH pathway activity had been demonstrated. Following initial studies in cell culture, the next step would normally be to test the drugs in xenograft models of human tumours, but there were no models in which the status of the HH pathway had been validated. An alternative approach was provided by the Ptch1+/− mouse strain generated by the Scott Laboratory. These mice develop tumours resembling the desmoplastic subtype of human MB, and these tumours harbour an activated HH pathway. The low frequency and sporadic appearance of the tumours were addressed by crossing the mice into a p53−/− background to generate a strain, Ptch1+/− p53−/−, that exhibited a 100% incidence of MB within two weeks of age. The mice were also used to generate a model for BCC by exposing their skin to ultraviolet or ionizing radiation.
The first published report investigating the efficacy of the SMO inhibitor cyclopamine as an anticancer agent used cultured tumour cells from mice and humans, as well as allograft tumours established from mouse tumour cell lines. However, it was shown subsequently that HH pathway activity is rapidly suppressed when MB cells from Ptch1+/− mice are cultured in vitro. Recently, this was revealed to be a consequence of the loss of tumour-associated astrocytes, which maintain HH pathway activity in tumour cells by secreting SHH. Allograft tumours derived from cultured mouse MB cells do not harbour an active HH pathway and they fail to respond to SMO inhibitors. So, how was it possible to obtain supportive efficacy data for cyclopamine if the target was not active in the models used?
The well-known problem with cyclopamine is that the concentration of drug required to block the HH pathway is close to the concentration that induces cell death independently of the HH pathway. Culturing mouse and human MB cells in the presence of 3–5 µM cyclopamine (or, in the case of the more potent variant KAAD-cyclopamine, 1 µM) for 48–72 h reduced the growth of tumour cells. However, this concentration of cyclopamine is toxic for many cell types. In fact, it has now been demonstrated that cyclopamine promotes apoptosis in the human MB cell line DAOY by inducing expression of neutral sphingomyelin phosphodiesterase 3, which increases ceramide production and induces cell death, independently of the HH pathway. Thus, studies using cultured mouse MB tumour cell lines are hampered by the fact that cyclopamine has strong, off-target toxic effects and that the HH pathway is no longer active in these cells. Initial reports documenting the effects of cyclopamine on embryonic development and inhibition of the HH pathway used drug concentrations as low as 120–130 nM to achieve specific biological effects [18,19]. By contrast, the majority of tumour cell line studies used 5–10 µM, and sometimes up to 20–30 µM, cyclopamine to inhibit growth. This means that most studies were carried out under conditions in which cyclopamine promotes ceramide-induced cell death independently of SMO.
Initial attempts to generate mouse MB tumour cell lines that retain HH pathway activity in vitro failed. Although the tumour cells grew readily in vitro, they no longer exhibited the HH pathway gene expression signature. Some cell lines did express GLI1, which increased when they were propagated as allografts; however, this turned out to be a consequence of Gli1 gene amplification, not of SMO signalling. Subsequent efforts claimed greater success, but the lines were established at a low frequency (20%) and they exhibited only partial sensitivity to high doses of the SMO inhibitor LDE225. Recent studies have revealed that mouse MB tumour cells require the presence of tumour-associated astrocytes to maintain an active HH pathway in culture. These cells are lost when tumour tissue is placed into cell culture, and this is why the HH pathway is suppressed in vitro.
The other critical experiment in the initial report on the use of cyclopamine as an anticancer agent was the treatment of mice carrying allograft tumours derived from mouse MB cells. Cyclopamine was shown to cause tumour regression in this model. However, because these allografts were derived from mouse MB cells propagated in culture, they should not have harboured an active HH pathway. The method chosen for drug delivery in this study was subcutaneous inoculation of 0.1 ml of cyclopamine suspended in a 4 : 1 mixture of triolein/ethanol. Others reported this approach to cause lesions at the site of injection, forcing premature termination of experiments [31,32]. One concern is that, as the lesions spread due to daily treatment, this may ultimately have led to inoculation of the drug near, or even directly into, the tumour mass, potentially inhibiting tumour cell growth not because of suppression of the HH pathway, but as a result of the off-target toxic effects of the high drug concentrations. In a number of cases, alcohol was present in the carrier at a level of 20%, which causes necrosis at the site of inoculation into the tumours. Recognizing this problem, alternative carriers lacking ethanol were developed for cyclopamine. However, the practice of injecting the drug subcutaneously near, or directly into, the tumour was widely adopted [34–36]. A common strategy for xenograft treatment was described in the following way: ‘soon as the tumour was palpable, cyclodextrin-conjugated cyclopamine or cyclodextrin carrier alone (Sigma) at 10 mg kg−1 was injected in the immediate vicinity, or intratumorally when possible, twice daily’ (supplemental data p. S6). This process continues to be used today, even though concerns about the procedure have been pointed out, including the fact that the drug concentration at the site of injection is extremely high and above the level that induces off-target toxicity.
In addition, because the tumour is being treated before it becomes established, the assay is really measuring inhibition of tumour establishment rather than inhibition of tumour growth. Finally, the physical and hydrostatic pressure damage resulting from twice-daily inoculations into small tumour volumes may by itself be enough to prevent growth. This approach led to an overestimation of the range of tumour types that appear to respond to SMO inhibitors. The factors discussed above, including the off-target toxicity of cyclopamine, the use of excessively high drug concentrations, the lack of common gene expression markers to define HH pathway activation and the direct injection of cyclopamine into tumours, continue to affect the field today.
In general, the use of a systemic dosing route is recommended when testing anticancer agents in animal models, for example, by oral gavage or intraperitoneal inoculation. When the oral gavage route was tested for cyclopamine, it was not possible to reach a dose that completely suppressed HH pathway activity, in mice carrying a Gli-luciferase reporter transgene, due to toxicity. It is also important to treat established tumours, usually greater than 150 mm³, to obtain a reliable measure of tumour regression. Treating transplanted tumour cells before the tumour has become established does not provide reliable data on tumour growth. Rather, it provides information on whether the treatment can prevent engraftment in the host. Although we now know that SMO inhibitors are efficacious in treating a subset of human BCC and MB, it is important to re-examine the initial reports carefully, as these studies established methodological practices that were adopted by the field and are still employed today.
5. Transitioning preclinical research into clinical trials
Several compounds from a range of structural classes, with a much better therapeutic index than cyclopamine, were generated in a high-throughput small molecule screen conducted by Curis Inc. These compounds inhibited HH pathway activity at nanomolar levels, and they demonstrated efficacy in the ex vivo skin punch mouse BCC model from Ptch1+/− mice described above. However, it was their ability to eliminate even large spontaneous mouse MB in Ptch1+/− p53−/− mice that attracted great interest in their potential as anticancer agents. In this case, the tested compounds were delivered by oral gavage and shown to cross the blood–brain barrier to block HH pathway activity in brain tumour tissues. Two weeks of twice-daily treatment with 100 mg kg−1 of one of the compounds, termed HhAntag, completely eliminated large MB tumours. Subsequently, it was found that MB from Ptch1+/− and Ptch1+/− p53−/− mice, grafted onto the flank of immunosuppressed mice, retained this high level of sensitivity to SMO inhibitors, so that even large tumour masses could be eradicated in less than 5 days of treatment. Because of its ease of use, this direct allograft system became the model of choice for many preclinical studies, including those used to support the launch of successful clinical trials and the subsequent approvals of vismodegib (Genentech Inc.) and sonidegib (Novartis Inc.) by the US Food and Drug Administration. Importantly, no human xenograft studies exhibited this level of response, and there remains no validated human xenograft model for HH pathway tumours even now. A large number of companies conducted their own successful small molecule screens, as SMO turned out to be a highly druggable target, leading to the testing of 10 different SMO inhibitors in 86 clinical trials listed on clinicaltrials.gov.
In contrast with the use of genetically engineered mouse (GEM) models to develop SMO inhibitors for the treatment of BCC and MB, numerous groups employed human tumour cell lines and xenograft models in preclinical studies to extend the potential application of SMO inhibitors to a broad range of human cancers. Most of the initial studies used cyclopamine as the SMO inhibitor, but, more recently, a broader range of SMO inhibitors has also been employed. These studies followed a familiar path, usually starting by testing the ability of the inhibitors to block cell proliferation and induce apoptosis in a collection of tumour cell lines before transitioning into xenograft models. Although none of these other tumours harboured mutations in the HH pathway, evidence of expression of HH pathway genes was interpreted as an indication that the pathway was activated. Invariably, these studies reported evidence of preclinical efficacy, thus paving the way for the 86 clinical trials mentioned above. Several hundred such studies have been published and the following are representative examples of tumour-specific analyses: small cell lung cancer, pancreatic cancer, colorectal cancer, prostate cancer, breast cancer, hepatocellular carcinoma, ovarian carcinoma and glioma. Each of these studies used cyclopamine, at various concentrations up to 20 µM, to induce cell death in tumour cell cultures. Based on these results and others, it was claimed that the HH pathway contributes to approximately 25% of human cancer deaths. Therefore, it was a great disappointment to learn that, despite these positive preclinical studies, clinical responses were reported only in BCC and MB [11,49–53]. Several studies documented the lack of efficacy of SMO inhibitors, alone or in combination with chemotherapy, in a range of tumours [54–58], even though, in some cases, a reduction in GLI1 expression levels was observed in tumour tissues. Many other negative results have yet to be reported.
The simple explanation for the widespread failure of SMO inhibitors in clinical trials is that the preclinical data used to support the transition into the clinic represented false-positive results.
6. Why preclinical studies of SMO inhibitors failed to predict responses in the clinic
Unlike BCC and MB, no activating mutations in HH pathway genes have been reported in the other tumours proposed to be treated with SMO inhibitors [31–38]. Evidence of an HH pathway gene expression signature was used to determine that the pathway was active in these tumours. However, there was no established standard for the level of gene expression required, no definition of which genes should be used to represent the authentic HH pathway signature, and no agreed upon methodology for documenting HH pathway gene expression. This led to each investigator defining their own standards, ultimately leading to confusion over the definition of ‘activated HH pathway’ and how this should be measured. In MB, tumour subsets were identified initially by supervised hierarchical clustering analysis using probes specific for genes whose expression increased in the presence of HH ligands . This approach readily identified the HH-MB subtype and a similar strategy applied to the WNT pathway distinguished tumours with β-catenin mutations . Subsequently, these groups were refined using non-supervised clustering approaches and the genes in the signature were not necessarily transcriptional targets of HH pathway signalling [60,61]. However, for the preclinical studies of SMO inhibitors, investigators generally relied on a select number of HH pathway target genes. In many cases, the level of GLI1 expression was employed as a quantitative measure of HH pathway activity. However, this can be misleading as GLI1 can be regulated independently of the HH pathway , and it can exhibit an increased copy number in tumour DNA which influences its expression level independently of the HH pathway. In fact, the name GLI was coined based on its discovery as an amplified gene in glioma .
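The supervised signature approach described above can be sketched in a few lines of code. This is a toy illustration only: the expression values are fabricated, the three-gene list (GLI1, PTCH1, HHIP) is an illustrative stand-in for the published probe sets, and a simple largest-gap split stands in for the hierarchical clustering step:

```python
import numpy as np

# Toy expression matrix: rows = tumour samples, columns = putative HH target genes.
# Gene list and values are illustrative only, not the published signature.
genes = ["GLI1", "PTCH1", "HHIP"]
expr = np.array([
    [9.1, 8.7, 8.9],   # samples 0-2: high target-gene expression
    [8.8, 9.0, 8.5],
    [9.3, 8.4, 9.1],
    [2.1, 2.4, 1.9],   # samples 3-5: low target-gene expression
    [1.8, 2.2, 2.3],
    [2.5, 1.9, 2.0],
])

# z-score each gene across samples, then average into a per-sample signature score
z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
score = z.mean(axis=1)

# stand-in for the clustering step: split samples at the largest gap in sorted scores
order = np.argsort(score)
gaps = np.diff(score[order])
cut = np.argmax(gaps) + 1
hh_low, hh_high = set(order[:cut]), set(order[cut:])
```

The sketch also captures the core problem discussed above: without an agreed threshold for the signature score, the line between 'HH-high' and 'HH-low' is wherever each investigator chooses to cut.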
Although GLI1 is a transcriptional target gene in the HH pathway, it is not an essential gene for development and MB can arise in the absence of Gli1 in Ptch1 +/− mice, albeit at a reduced level. These tumours express high levels of Gli2, which has overlapping functions with Gli1. Thus, the presence and level of GLI1 expression may not reflect the level of HH pathway activity in tumours. In C3H10T1/2 cells, GLI1 and GLI1 reporter constructs are very sensitive indicators of HH pathway activity over a broad dynamic range. SMO inhibitors are capable of essentially eliminating GLI1 expression and blocking luciferase activity driven by GLI1-binding sites over several orders of magnitude. Therefore, observations of only a modest 50% drop in the level of GLI1 RNA in tumour cells treated with high doses of cyclopamine do not support the argument that the HH pathway is driving both GLI1 expression and the growth of such cells [66,67]. Instead, these results indicate that cyclopamine may be inhibiting cell growth independently of its effect on SMO and HH pathway activity. Nevertheless, it is important to point out that GLI1 is a bona fide target in some tumours regardless of its role in the HH pathway. Toftgard and co-workers developed a GLI1 inhibitor, GANT61, that is effective at inhibiting tumour cell growth in vitro and in vivo. In addition, arsenic trioxide (ATO), an FDA-approved drug, was shown to bind and inhibit GLI1, causing reduced tumour cell growth in vitro and in vivo [69,70]. GLI1 inhibitors, like GANT61, may have a much broader indication than SMO inhibitors because GLI1 is expressed in many tumour cells that do not have an active HH pathway. In the latter case, these tumours would not be expected to respond to SMO inhibitors. Similarly, tumours with an activating mutation in the HH pathway downstream of SMO, such as SUFU loss, or tumours that develop resistance to SMO inhibitors, may be responsive to GANT61 or similar inhibitors that target GLI1 [71–74].
As GLI1 is neither necessary, nor sufficient, for HH pathway activity, it cannot be relied upon, by itself, to provide a biomarker for the pathway. The lack of a defined, quantitative standard biomarker of the activated HH pathway, coupled with the broader role of GLI1 in tumours, means that the number and diversity of HH pathway tumours has been vastly overestimated.
7. Paracrine and autocrine Hedgehog pathway activity
The absence of HH pathway mutations in tumours proposed to have an activated HH pathway led to the suggestion that autocrine signalling by HH ligands was responsible for driving HH pathway activity and tumour growth [43,44,75]. Evidence supporting the overlapping expression of HH pathway genes, including ligands, in tumour cells depended on immunohistochemistry performed using commercially available antibodies that are notoriously unreliable due to their lack of specificity. Analysis of RNA from prostate tumours revealed the expression of the HH pathway genes SHH, PTCH1 and GLI1 at elevated levels compared with surrounding normal tissues, in the broad range of 1.5–300-fold. However, there was no correlation among the expression levels of each of these genes, implying that there was no coordination between the level of ligand and that of two different HH pathway transcriptional target genes. There was also significant variation among cell line responses to a high concentration, 10 µM, of cyclopamine [43,44]. Other groups who investigated the same or similar prostate cancer cell lines could not confirm these results, which led them to conclude that there was no evidence of autocrine signalling, although there was evidence that HH ligands secreted by tumour cells were acting on the tumour stroma [76,77].
The controversy was addressed in an exhaustive analysis carried out by de Sauvage and co-workers at Genentech Inc., using a more potent, and less toxic, SMO inhibitor termed HhAntag, in addition to cyclopamine, to investigate paracrine and autocrine HH pathway signalling in tumour cells. This was a remarkable study for a number of reasons. To set the scene, at that time, the role of the HH pathway in cancer was the subject of lively discussions at many scientific meetings. In fact, this author participated in an organized debate, with one of the proponents of the paracrine signalling model, at the American Association for Cancer Research annual meeting in 2007. The stakes were relatively high as several major pharmaceutical companies and numerous biotech companies had active programmes designed to develop small molecule inhibitors of SMO. The investigators at Genentech were in an unusual position. As the first major company to develop an HH pathway programme, in collaboration with Curis Inc., it would have been in their interest to find evidence supporting the optimistic view that SMO inhibitors would be efficacious in 25% of human cancers. However, they found no evidence of autocrine signalling in any tumour cell line studied. They did confirm that high concentrations of cyclopamine or HhAntag could inhibit cell growth, but this did not correlate with HH pathway activity and, in the case of HhAntag, they needed to use approximately 400 times higher concentrations than those required to block HH pathway activity. This analysis employed 122 tumour cell lines, many of which were the same lines used by other investigators to reach the opposite conclusion. The authors concluded that the previously reported effects of SMO inhibitors on growth and HH pathway activity in tumour cells were a consequence of off-target effects resulting from the use of the inhibitors at high, non-physiological concentrations.
One cannot help wondering whether this resounding failure to reproduce academic preclinical translational studies, and the fact that 0/122 human cancer cell lines supported a role for the HH pathway in tumour cell growth may have caused corporate leadership some pause before continuing on the path to clinical development.
In addition to studying tumour cell lines, de Sauvage and co-workers also investigated a large collection of xenograft models. They noted that some of the cell lines, and certain human xenografts, did express HH ligands and they demonstrated that the ligands secreted by tumour cells were capable of stimulating HH pathway activity in stromal cells. They went on to show that SMO inhibitors were capable of retarding the growth of some xenograft models by inhibiting HH pathway activity in the stromal environment. However, although this effect was statistically significant, it was relatively modest, and no tumour regression was observed. This low level of growth retardation in xenograft models rarely, if ever, predicts responses in the clinic. The authors remained cautious in their interpretation; while acknowledging that the exact mechanism whereby increased HH pathway activity in stromal cells supports tumour growth remained to be determined, they noted that the observation raised the potential for the use of SMO inhibitors to target the tumour microenvironment.
The proposed effect of paracrine HH signalling on tumour stroma was also investigated in a GEM model of pancreatic ductal adenocarcinoma . This model was generated by conditional expression of mutant K-ras and p53 genes in pancreatic progenitor cells . The authors hypothesized that HH signalling from tumour cells may support the maintenance of the stromal compartment and that disruption of HH signalling might facilitate the delivery of chemotherapeutic agents. They reported that the combined treatment of mice with gemcitabine and the SMO inhibitor IPI-926  resulted in increased survival from 11 to 25 days . While this result was statistically significant, it represents a relatively modest effect, as all the treated mice died within a few days. Furthermore, the effect was even less compelling when comparing the survival of mice treated with gemcitabine alone with those treated with the drug combination. In addition, not all tumours exhibited a transient decrease in size during the course of treatment. This may be a case of wishful thinking—that such a modest delay in tumour growth would translate into a clinical response in patients. A phase II clinical trial sponsored by Infinity Pharmaceuticals (IPI-926-03), in which patients received either the combination of gemcitabine plus IPI-926 or placebo, was halted early due to poor outcomes. A similar trial, using the combination of gemcitabine and the SMO inhibitor vismodegib, also failed to show improved survival . In a follow-up study of the mouse model, deletion of Shh from pancreatic epithelial cells resulted in earlier tumour growth and decreased survival of mice . This study also failed to reproduce prior results obtained by treating tumours with IPI-926 plus gemcitabine. In fact, the investigators found that IPI-926 treatment caused more aggressive tumour growth . These findings indicate that, in this tumour model, SHH is not just dispensable for tumorigenesis, but it actually constrains tumour progression.
8. The non-canonical Hedgehog pathway
Canon, meaning rule or accepted principle, is derived from the ancient Greek kanon—a measuring rod or standard. The standard, or canonical, HH pathway in mammalian cells refers to the genetically and biochemically defined signalling process involving HH ligands, PTCH1, SMO, SUFU and GLI proteins. There are, of course, many other proteins involved in HH signalling, but these are mostly believed to modulate the core components listed above. The lack of correlation between the effects of SMO inhibitors on tumour cell growth and the levels of HH pathway target genes led some investigators to propose that ‘non-canonical’ HH pathway signalling, potentially involving cross-talk with the RAS and TGFβ pathways , androgen receptor signalling , the mTOR pathway  and the WNT pathway  among others, contributed to tumour progression. Given the range of contributing factors involved, it has become challenging to distinguish among several possible non-canonical HH pathways. While it is clear that HH signalling influences, and is influenced by, numerous other signalling pathways, it is not easy to determine if these truly constitute non-canonical pathways. The studies referenced above suffer from some of the shortcomings already discussed, including the point that the presence of GLI1 by itself does not imply an activated HH pathway. Therefore, the fact that GLI1 inhibitors like GANT61 may inhibit tumour cell growth, whereas SMO inhibitors have no effect, cannot be used as evidence for a non-canonical HH pathway [68,84]. Instead, these studies indicate that expression of GLI1 does not always require SMO and that GLI1 has intrinsic growth and oncogenic properties, independently of the canonical HH pathway. The other common feature among non-canonical HH pathway studies is the use of SMO inhibitors at high concentrations where they inhibit cell growth through off-target effects.
Recently, signalling by exogenous SHH ligand was shown to occur in MB cells lacking PTCH1. This effect still required SMO and it was blocked by SMO inhibitors. The source of SHH in vivo was shown to be tumour-associated astrocytes. In contrast with the canonical HH pathway, in this case, SHH induced expression of nestin in mouse MB cells (figure 4). This effect, which could be blocked by SMO inhibitors, resulted in sequestration and inhibition of GLI3, thus abrogating its inhibitory effect on the HH pathway. While the SHH receptor in cells lacking PTCH1 has not yet been identified, PTCH2 has been shown to modulate tumorigenesis in Ptch1 +/− mice. In other cells, in the absence of PTCH1, PTCH2 mediates the response to SHH. The induction of nestin expression by SHH in Ptch1 −/− cells appears to be independent of GLI1 as it was not promoted by exogenous overexpression of GLI1. The induction of nestin in MB cells by SHH secreted from astrocytes clearly involves a paracrine mechanism. Thus, in the case of mouse MB, tumour progression is associated with a gradual acquisition of nestin expression that abrogates a negative constraint on HH pathway activity. It seems that SMO inhibitors may be particularly effective in this mouse MB model as they simultaneously block the intrinsic activity of the HH pathway resulting from Ptch1 loss as well as the extrinsic effect of SHH secreted by astrocytes. As the induction of nestin by SHH signals through SMO, it may not be accurate to refer to this process as a non-canonical pathway. However, its lack of dependence on PTCH1 and GLI1 makes it different from the canonical pathway; it has therefore been referred to as a paradoxical HH pathway. Whatever the name, it appears to play a significant role in the growth of mouse MB and, potentially, human MB.
9. The curious case of rhabdomyosarcoma
Rhabdomyosarcoma (RMS) is a collection of soft tissue sarcomas, derived from skeletal muscle, that primarily occur in children. They can be very challenging to treat, depending on the subtype. The loss of PTCH1 in Gorlin syndrome is associated with a range of rare tumours in addition to BCC and MB, including fetal rhabdomyoma. In mice, heterozygous loss of Ptch1 on a CD1 background is associated with a 9% incidence of tumours resembling the embryonic subtype of RMS (ERMS). The incidence of these tumours is influenced by genetic modifiers, as they are only rarely encountered on a C57BL/6 background. Parenthetically, Balmain and co-workers demonstrated that a polymorphic variant in Ptch1, present in FVB/N mice but absent in C57BL/6 mice, functions as a genetic modifier to promote RAS-induced squamous cell carcinoma. ERMS tumours from Ptch1 +/− mice display high levels of some HH pathway target genes, including Gli1, Ptch1 and Igf2. Several groups have performed detailed preclinical studies on RMS in an effort to determine whether these tumours are good candidates for treatment with HH pathway inhibitors; however, this work has proved challenging and several questions remain [31,92,95–98]. The emerging consensus seems to be that, while such tumours may be targeted by agents that inhibit GLI1, such as GANT61, they are not sensitive to treatment with physiological levels of SMO inhibitors.
ERMS in humans has not been linked to mutations in the HH pathway and while there is frequent loss of heterozygosity of chromosome 11p15, the chromosomal translocations indicative of alveolar RMS are not present . Often, HH pathway activity is reported to be elevated in ERMS based primarily on the detection of high levels of GLI1 and PTCH1 RNA expression [99,100]. However, this is not the standard gene expression profile that defines an activated HH pathway. The complication is that PTCH1 is a negative regulator of the HH pathway and, as a transcriptional target of the pathway, it participates in a negative feedback loop . This leads to the contradictory observations that high levels of normal PTCH1 RNA indicate an activated HH pathway but, conversely, the presence of high levels of PTCH1 protein implies that the pathway is inhibited. In BCC, coincidental high expression of PTCH1 and GLI1 is only seen in tumours expressing a mutated PTCH1 allele that is ineffective at suppressing the HH pathway . This is not the case in RMS as PTCH1 mutations are not present in these tumours . In MB with an activated HH pathway, high GLI1 expression is associated with low levels of expression of the normal PTCH1 allele . Thus, the presence of elevated levels of both PTCH1 and GLI1 may not imply that the HH pathway is active in ERMS. Hahn and co-workers  reported that a series of four SMO inhibitors were ineffective at inhibiting GLI1 expression in RMS cells and, in some cases, treatment even resulted in elevated expression levels. The change in expression levels detected was relatively modest, mostly in the 0.5–1.5-fold range. This contrasts with the approximately 50-fold inhibition of GLI1 and other HH pathway target genes seen in MB treated with SMO inhibitors . The SMO inhibitor concentrations used by Hahn and co-workers (up to 30 µM) were in great excess of the levels required to completely block SMO activity (0.1 µM) [41,97]. 
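The seemingly contradictory readouts described above (high PTCH1 RNA indicating an active pathway, but high PTCH1 protein implying an inhibited one) follow directly from the negative feedback loop. A toy model, with arbitrary rate constants chosen purely for illustration and not fitted to any data, shows that high PTCH1 and high GLI1 transcripts coincide at steady state only when the PTCH1 protein is non-functional, as with the mutated allele in BCC:

```python
def simulate(ptch1_functional, steps=20000, dt=0.01):
    """Toy model of the PTCH1 negative feedback loop.

    Pathway activity s is repressed by functional PTCH1 protein,
    while PTCH1 and GLI1 transcripts are transcriptional targets of s.
    All rate constants are arbitrary illustrative values.
    """
    p_rna = gli1 = 0.0
    for _ in range(steps):
        # mutant PTCH1: the transcript is made but the protein cannot repress SMO
        protein = p_rna if ptch1_functional else 0.0
        s = 1.0 / (1.0 + protein)          # pathway (SMO/GLI) activity
        p_rna += dt * (s - 0.1 * p_rna)    # target-gene transcription minus decay
        gli1 += dt * (s - 0.1 * gli1)
    return s, p_rna, gli1

s_wt, ptch1_wt, gli1_wt = simulate(ptch1_functional=True)
s_mut, ptch1_mut, gli1_mut = simulate(ptch1_functional=False)
# the wild-type feedback caps pathway activity and target-gene levels;
# the mutant reaches full activity with high PTCH1 RNA and high GLI1 together
```

In the wild-type case, the feedback settles at intermediate PTCH1 and GLI1 levels, whereas the mutant reaches maximal pathway activity with both transcripts high, mirroring the observation that coincident high PTCH1 and GLI1 expression in BCC is seen only with a mutated PTCH1 allele.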
These results indicate that SMO-dependent HH pathway activity is absent from RMS tumour cell lines. While the high concentrations of SMO inhibitors used did affect cell growth, the effects observed were variable across the cell lines, they could be positive or negative, and they did not correlate with the effect on GLI1 expression. As concluded by Hahn and co-workers , the observed effects may represent off-target effects of the compounds. This is the most likely explanation for observations of different responses among a class of inhibitors that share the same mechanism of action—inhibition of SMO. As discussed previously, cyclopamine has a narrow therapeutic index, so it can be challenging to distinguish on-target from off-target effects. By contrast, the other SMO inhibitors only exhibited inhibitory effects on tumour cell growth when used at concentrations several hundredfold higher than those required to achieve target inhibition. These very high doses are unlikely to be achieved in patients, and, if they could be achieved, they may well be accompanied by off-target toxicities. Thus, these studies do not support the use of SMO inhibitors for the treatment of RMS.
The fact that heterozygous loss of Ptch1 in mice is associated with ERMS, depending on the genetic background, demonstrates that HH pathway activation is capable of driving the initiation of RMS, even if it is not prevalent in human cancer . The insensitivity of these tumours to treatment with cyclopamine led to the proposal that, while initiation of ERMS may be HH pathway-dependent, during tumour progression dependency on HH pathway signalling is lost . Interestingly, while expression of an activated Smo gene was shown to drive BCC and MB formation in cell lineages that are both HH-expressing and HH-responsive, ERMS arises from cell lineages in which the HH pathway is not active . Taken together, the results indicate that ERMS is in a distinct category from BCC and MB, which clearly harbour an activated HH pathway. Nevertheless, the presence of high levels of GLI1 in ERMS may provide an opportunity for agents such as GANT61, even if SMO inhibitors are not recommended . A similar situation may exist in rhabdoid tumours that express GLI1 but do not harbour HH pathway mutations . These tumours arise as a consequence of loss of SNF5, a chromatin remodelling component, which can directly bind to GLI1 . Similar to ERMS, cyclopamine was not able to inhibit tumour growth, whereas ATO did show some activity .
10. Resistance mechanisms
The Achilles heel of precision therapeutics is the ease with which drug resistance can develop. This was evident in the first clinical trial of SMO inhibitors, in which a patient with advanced metastatic MB relapsed after an initial dramatic response to treatment. Molecular analysis of a recurrent tumour biopsy revealed the presence of a point mutation in SMO (D473H) that reduced the affinity for vismodegib. Interestingly, the same mutation was observed in a study designed to model drug resistance in mouse allograft tumours. The finding that the amino acid substitution effectively blocked drug binding essentially proved the mechanism of action of vismodegib and it was a harbinger of things to come (figure 3). In BCC, drug resistance arises primarily as a consequence of activating mutations in SMO [72,107], whereas no SMO mutations were detected in three cases of HH-MB that had acquired resistance to vismodegib. Mutations in the HH pathway that lie downstream of SMO, including loss of SUFU and amplification of GLI1/2, also confer resistance to SMO inhibitors [72,107]. This was also predicted from GEM model studies [25,71]. Thus, acquired resistance to SMO inhibitors in BCC and MB develops, at least in part, through alterations that bypass the role of SMO in driving HH pathway activity (figure 3).
Figure 4. The paradoxical HH pathway. SHH, secreted by tumour-associated astrocytes, induces nestin expression in tumour cells lacking PTCH1, through a signalling mechanism that requires SMO. Nestin accumulates in tumour cells and binds to GLI3, thereby abrogating a negative feedback loop, leading to increased HH pathway activity. Both the induction of GLI1 expression in the absence of PTCH1 and SHH-induced nestin expression require the presence of cholesterol. Statins and vismodegib synergize in the inhibition of HH pathway activity.
An additional hypothesis, first proposed by investigators from Novartis Inc., suggested that upregulation of IGF-1/PI3K signalling compensates for loss of HH pathway activity in MB that acquire resistance to SMO inhibitors. However, the authors reported that while PI3K inhibitors appeared to prevent the development of resistance to the SMO inhibitor LDE-225, they were not able to inhibit the growth of established tumours. In addition, while loss of PTEN in Ptch1 +/− mice results in MB that respond to SMO inhibitors by stopping growth, the treated tumours fail to regress. The PI3K pathway has long been proposed to contribute to MB growth, PTEN mutations have been reported in human MB and heterozygous ablation of PTEN in mice carrying a SmoA1 transgene promotes MB formation. Recently, the effect of PTEN loss on MB formation was shown to occur in cells within the postnatal perivascular progenitor niche. Taken together, these results support a role for the PI3K pathway in MB, but it remains unclear whether this represents a mechanism for driving resistance to SMO inhibitors. Nevertheless, the observations point to the need to consider additional signalling pathways as potential targets for co-treatment of HH pathway MB and BCC with SMO inhibitors. In both cases, numerous additional genetic lesions are present in the tumours and these may represent potential drivers that could be targeted using precision therapeutics [60,114].
The persistence of the HH pathway gene expression signature in resistant tumours indicates that, at least in part, the resistance mechanism involves pathway activation at, or downstream of, SMO. Recently, Oro and co-workers investigated the mechanism whereby GLI1 transcription activity is increased in mouse allograft BCCs selected for resistance to SMO inhibitors. In resistant tumours lacking activating mutations in SMO, they found that GLI1 participates in a complex with SRF and MKL1 which promotes enhanced transcription of HH pathway target genes. They also showed that MKL1 accumulates in the nucleus as a consequence of cytoskeletal activation of RHO. The findings suggest that MKL1 inhibitors may be effective in treating a subset of HH pathway tumours that are resistant to SMO inhibitors and they imply that the combined use of SMO inhibitors and MKL1 inhibitors should be considered for the treatment of naive tumours. These tumours would also be predicted to be sensitive to agents that target GLI proteins, similar to GANT61.
11. Cholesterol and the Hedgehog pathway
Cholesterol plays several roles in the HH pathway. It is a necessary modification of HH ligands that is required for long-range signalling. Initially, it was speculated that abnormal cholesterol metabolism could affect SHH function, thereby leading to holoprosencephaly. However, the recent identification of cholesterol as the endogenous ligand for SMO [119,120] provides a compelling alternative mechanism. The first clue regarding a direct effect of sterols on SMO was the observation that certain cholesterol derivatives, oxysterols, could stimulate HH pathway activity. However, it was not clear that the abundance and affinity of these compounds were sufficient to allow binding in vivo. Structural studies confirmed the interaction of cholesterol itself with the extracellular domain of SMO. These findings immediately raised the question of whether the clinically approved inhibitors of cholesterol metabolism, statins, would be effective in treating HH pathway tumours.
Attempts had already been made to investigate the potential of statins alone, or in combination with the SMO inhibitor cyclopamine, for the treatment of MB and other tumours thought to be dependent on HH pathway activity [89,123–126]. These studies, for the most part, used tumour cell lines which, as discussed above, fail to maintain an active HH pathway . The effects observed required the use of high concentrations of statins, around 1000-fold higher than those required to block cholesterol biosynthesis. Similarly, cyclopamine was used at a concentration that causes off-target toxicity . The limitations of these experimental approaches mean that the data did not provide proof-of-concept support for the use of statins in the treatment of HH pathway tumours.
Recently, Yang and co-workers demonstrated that statins do indeed function synergistically with SMO inhibitors in the treatment of MB. They observed that cholesterol biosynthesis is upregulated in HH-MB from both mice and humans. Inhibition of cholesterol biosynthesis, using physiological levels of simvastatin, atorvastatin or triparanol, reduced HH pathway activity and the proliferation of tumour cells. Simvastatin or atorvastatin alone reduced the growth of allograft mouse MB, and they functioned synergistically with low doses of vismodegib to prevent tumour growth. These findings support the use of statins in the treatment of HH pathway tumours in conjunction with SMO inhibitors, and they suggest that some resistant tumours may still be sensitive to statin treatment. Currently, SMO inhibitors are not recommended for use in young children because of their effects on bone growth [10,11,13]. Potentially, the combined use of statins may allow the use of lower doses of SMO inhibitors to avoid bone toxicity while still being effective as antitumour agents (figure 4).
Figure 3. Mechanisms of resistance to SMO inhibitors. Three mechanisms of acquired resistance to SMO inhibitors have been identified in tumours, all of which result in downstream activation of the pathway. (a) Mutation of SMO in the drug-binding site leads to loss of inhibition. (b) Loss of SUFU leads to downstream activation of the pathway. (c) Amplification of GLI1, or GLI2, leads to increased expression and downstream activation of the HH pathway.
BCC and MB cancers that harbour an activated HH pathway should be considered as a distinct class from other tumours in which the HH pathway has been proposed as a therapeutic target. Even in BCC and MB, the decision to include a SMO inhibitor among the therapeutic options depends on the exact nature of the HH pathway activating mutation and also, potentially, the epigenetic mechanisms responsible for maintaining pathway activity. The most significant take-home message from this review is that there is a lack of compelling preclinical evidence supporting the use of SMO inhibitors in a broad class of tumours lacking mutations in HH pathway genes. Several factors, including limitations in the model systems used, the experimental designs and overinterpretation of marginal results, conspired to present unrealistic expectations regarding the potential impact of SMO inhibitors on human cancer. It remains possible that agents that bind GLI1/2 directly could be used in the tumours that express high levels of these proteins regardless of the presence of activating mutations in the HH pathway. In addition, GLI1/2 inhibitors may also be effective in tumours harbouring activating mutations in SMO, or downstream mutations in the HH pathway, as well as in tumours that have acquired resistance to SMO inhibitors.
Data obtained using human tumour cell lines treated with SMO inhibitors were not predictive in the clinic because the HH pathway was suppressed in cell culture. In the case of mouse MB, the tumour microenvironment, specifically the astrocytes that secrete SHH, was not present in the cell culture conditions employed. Potentially, this could be addressed by defining appropriate co-culture systems to preserve the tumour microenvironment and HH pathway activity in vitro [25,26]. Xenograft models, including patient-derived xenograft models, also failed to provide reliable preclinical data. While allografts of mouse MB closely resemble the original spontaneous tumours, as they re-create a supportive tumour microenvironment in the flank of immunosuppressed mice, this has not yet been successful in the case of human xenografts. Therefore, in HH BCC and MB, xenograft models are not recommended for preclinical studies.
The use of GEM mice, particularly mice with a loss-of-function Ptch1 mutation, was critical for the development of SMO inhibitors. The strategy developed for the successful use of this class of models for the development of SMO inhibitors can be applied more broadly (figure 5). Briefly, the first step is to determine if the drug can bind and inhibit the target in vivo. This initial step should use a physiological route of delivery (e.g. oral gavage) and the experiments should be conducted quantitatively to determine the dosing regimen that maintains target suppression. To be successful, this initial in vivo study requires the use of robust biomarkers to monitor pathway activity in tumour tissues. The second step is to determine the effect of target suppression on tumour growth. This requires a combination of molecular, pharmacological and histopathological methodology. While understanding the mechanism of action of the agent under consideration is not required, it is extremely useful to support future development. In the case of SMO inhibitors, the primary mechanism appeared to be inhibition of tumour cell proliferation, leading to abortive differentiation and ultimately cell death. By contrast, the off-target effects of SMO inhibitors used at high, non-physiological concentrations resulted in direct induction of apoptosis. The third step is to determine whether the effects of the agent on tumour growth result in increased survival. This step requires prolonged dosing, which should be carried out using a physiological route that models the intended delivery system anticipated in clinical studies. The models can also be used for follow-up studies that optimize dosing regimens, identify resistance mechanisms, explore drug combinations and analyse potential side effects of prolonged treatment. Using these approaches, a series of predictions made on the basis of the GEM models was recapitulated in clinical trials of SMO inhibitors (figure 6).
Figure 5. The use of GEM models. The use of GEM models for the development of SMO inhibitors followed a multi-step path that serves as a guideline for future drug development. Step I involves determining the level of inhibitor required to block the target in spontaneous tumours in vivo. This required the use of quantitative biomarkers as a read-out of pathway activity. Step II measures the consequences of target inhibition in terms of tumour growth. Step III assesses whether inhibition of tumour growth results in increased lifespan. Step IV represents follow-up studies to refine dosing regimens, identify drug-resistance mechanisms, explore potential co-treatments and investigate side effects.
Figure 6. SMO inhibitor predictions from mouse GEM models. A number of observations were made in GEM models that paralleled the experience with SMO inhibitors in the clinic. There was an initial dramatic elimination of large tumour mass. The tumour genotype accurately predicted the response to treatment. There was a rapid acquisition of resistance due to downstream activating mutations in the HH pathway. Developmental toxicities were observed in bone growth in young mice and young children.
Many of the issues involving the lack of reproducibility of preclinical research carried out using SMO inhibitors appear to stem from the fact that several inhibitors were used at high concentrations at which they caused toxic off-target effects, resulting in false-positive data. This was particularly challenging in the case of the naturally occurring SMO inhibitor cyclopamine, because it exhibits a very narrow therapeutic index. In the case of some of the new, highly potent SMO inhibitors, these toxic effects were seen only when the drugs were used in vast excess (several hundredfold) over the levels required to suppress HH pathway activity. Figure 7 illustrates the relative differences between specific and non-specific dose–response curves. This is a very important, fundamental principle in pharmacology: when the dose of a drug reaches a level that saturates the target, adding more drug does not increase the specific response. This is a particular concern in cancer research because the assays used to determine drug responses involve inhibition of cell proliferation or induction of cell death, both of which can readily arise from off-target, non-specific effects. In addition, classic chemotherapeutic agents were traditionally developed to be used at the maximum tolerated dose. This concept was developed for broadly active chemotherapeutics, for example those that induce DNA damage, to cause as much damage as possible to cancer cells while preserving life. However, this strategy is not appropriate for targeted therapies, which should be used at the dose specifically required to block the target in the tumour.
Figure 7. Dose–response analysis. This illustration is based on reported data. Using a GLI1-luciferase assay in NIH3T3 cells, the doses of HhAntag (a SMO inhibitor) and cyclopamine required to inhibit 50% of SHH-induced HH pathway activity were determined to be 30 and 300 nM, respectively. By contrast, the dose of each agent required to block proliferation of tumour cell lines by 50% was approximately 10 µM. These data, by themselves, demonstrate that there is no relationship between the effect of the agents on HH pathway activity and tumour cell proliferation, because the doses are so discrepant. Although simple, this key lesson was widely ignored because of unconscious bias.
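The separation between on-target and antiproliferative doses can be made concrete with a standard Hill-equation dose–response model. The sketch below is illustrative only: the IC50 values are those quoted for HhAntag above (approximately 30 nM for pathway suppression, approximately 10 µM for blocking proliferation), and the Hill coefficient of 1 is an assumption made for simplicity.

```python
# Illustrative sketch: specific (on-target) versus non-specific (off-target)
# dose-response curves modelled with a simple Hill equation.
# IC50 values are taken from the text; the Hill coefficient of 1 is assumed.

def fraction_inhibited(dose_nM, ic50_nM, hill=1.0):
    """Fractional inhibition at a given dose under a simple Hill model."""
    return dose_nM**hill / (dose_nM**hill + ic50_nM**hill)

IC50_PATHWAY_nM = 30       # specific effect: HH pathway suppression (HhAntag)
IC50_PROLIF_nM = 10_000    # non-specific effect: block of cell proliferation

for dose in (30, 300, 3_000, 30_000):  # doses in nM
    pathway = fraction_inhibited(dose, IC50_PATHWAY_nM)
    prolif = fraction_inhibited(dose, IC50_PROLIF_nM)
    print(f"{dose:>6} nM: pathway {pathway:.0%}, proliferation {prolif:.0%}")
# At 300 nM the pathway is ~91% suppressed while the non-specific
# antiproliferative effect is only ~3%.
```

The point of figure 7 falls out of the arithmetic: once the specific target is saturated (around 3 µM in this sketch, roughly 99% pathway suppression), further dose escalation adds essentially no specific effect, so any additional antiproliferative activity must be off-target.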
The route of drug administration is also key. It is rarely, if ever, valid to test the cytotoxic potential of anticancer drugs by direct inoculation into small allografts or xenografts. In this case, the concentration of drug at the site of inoculation is extremely high, and it is not really possible to conduct a dose–response analysis. In addition, because small allografts and xenografts are not yet established as transplanted tumours, the assay really only addresses the ability of the tested agent to inhibit the establishment of a graft, rather than its ability to cause tumour regression. The biological processes involved in the establishment of a tumour graft may well be very different from those responsible for maintaining tumour growth; therefore, the results obtained are not relevant to the treatment of human cancer.
Despite the approval of SMO inhibitors for the treatment of BCC in the USA, the National Institute for Health and Care Excellence (NICE) in the UK did not recommend the use of vismodegib for symptomatic metastatic, or locally advanced, BCC. This decision was based on overall survival data and an estimate that the cost-effectiveness of vismodegib, compared with best supportive care, is much higher than £30 000 per quality-adjusted life-year gained. The analysis did not include the use of molecular diagnostics to select for treatment only those tumours predicted to respond. The committee that made the recommendation acknowledged that the overall survival data were immature, that previous trials were not applicable to the UK population and that, while clinically relevant benefits are plausible, it did not find the evidence presented substantial. The main issue appears to be the insistence on overall survival data as the primary criterion for recommendation, without due consideration of impacts on health-related quality of life. At the same time, the committee acknowledged that mortality is rarely attributed to locally advanced BCC in the general UK population. There remains some hope, as data accumulate from sites outside the UK supporting the benefits of SMO inhibitors in specific patient populations, that this decision will be revisited. It will also be important to include companion diagnostics to rule out patients who would not be predicted to benefit from SMO inhibitors, either initially or after the appearance of drug-resistance mutations.
13. Future directions
The translational journey of SMO inhibitors teaches us that there is no single, ideal approach or perfect model. Human cancer is diverse, and it is not possible to represent all of that diversity in collections of cell lines, transplanted tumours or GEM models. However, specific models can be used very effectively to address defined proof-of-concept questions. Cell lines with reporter constructs allowed precise determination of the drug concentrations required to fully suppress HH pathway activity. Spontaneous tumours in GEM models allowed analysis of the ability of compounds to penetrate the blood–brain barrier, inhibit the HH pathway in tumour cells and show antitumour efficacy only in the appropriate molecular subtype of tumour. Allograft models permitted analyses of the mechanisms involved in tumour elimination, and they predicted mechanisms of drug resistance that were recapitulated in the clinic. These same models are now being used to investigate drug combinations aimed at different targets in the HH pathway, or at complementary targets that could be used together with SMO inhibitors. In the case of BCC, there is great interest in developing topical applications, and mice may not provide a suitable model because mouse skin is more permeable than human skin [128–130]. Recent advances in the ability to determine tumour genetic profiles from circulating tumour DNA suggest that the molecular subtype of paediatric MB can be determined from blood biospecimens, opening the possibility that SMO inhibitors could be used even prior to surgery, radiation or chemotherapy, to avoid the side effects of these traditional approaches. As these studies progress, it is important that investigators maintain constant vigilance against unconscious bias. We also need to investigate our failures in the context of tumour biology to understand their root causes. As we all know, if we fail to learn the lessons of history, we may well be doomed to repeat it.
Perceptions of research irreproducibility in the preclinical sciences are based on limited, frequently anecdotal data from select therapeutic areas of cell and molecular biology.
Identified irreproducibility issues in the area of oncology research, while not well characterized, have been broadly extrapolated to all therapeutic areas as causal factors for the lack of productivity in drug discovery and development.
Deficiencies in published research that include hypothesis generation, experimental design, control and execution, reagent validation and/or authentication, statistical analysis, and reporting are contributory factors to the lack of reproducibility in preclinical research.
Many thoughtful and authoritative research guidelines and/or publication checklists designed to enhance transparency and reproducibility have emerged as corrective measures.
In the context of increased therapeutic and technological specialization in the biomedical sciences, the implementation of these guidelines and checklists can provide an opportunity to harness best practices and bridge knowledge gaps between established and emerging scientific disciplines.
Concerns regarding the reliability of biomedical research outcomes were precipitated by two independent reports from the pharmaceutical industry that documented a lack of reproducibility in preclinical research in the areas of oncology, endocrinology, and hematology. Given their potential impact on public health, these concerns have been extensively covered in the media. Assessment of the magnitude and scope of irreproducibility is limited by the anecdotal nature of the initial reports and a lack of quantitative data on specific failures to reproduce published research. Nevertheless, remediation activities have focused on needed enhancements in the transparency and consistency of reporting of experimental methodologies and results. While such initiatives can effectively bridge knowledge gaps and facilitate best practices across established and emerging research disciplines and therapeutic areas, concerns remain about how they improve on the historical process of independent replication for validating research findings, and about their potential to inhibit scientific innovation.
Rethinking science in the context of the reproducibility crisis.
In 2015, a crisis of reproducibility left the scientific community in a state of disorientation very similar to the one health officials found themselves in during the pre-vaccine phase of COVID. This article, inspired by my own experience in the lab and the reproducibility crisis mentioned above, was originally posted in 2018 and provides a framework for understanding scientific anomalies, pointing to a convergence of science and Jewish sources.
In light of the current reproducibility crisis, this article proposes a different way to look at what scientific contributions represent. It proposes that the requirement for coherence in biological scientific publications masks the ambiguity and indetermination inherent to the practice of science. Recognition of that ambiguity is important as it allows for the coexistence of contrasting perspectives, and the development of mutually complementary models. As the publication process tends to filter out what does not fit into a linear logical narrative, it builds an artificial self-constructed sense of certainty. Instead, scientific publications should be assumed as useful particular ways of structuring order allowing for a multiplicity of viewpoints consistent with experimental observation.
“The Aleph, the only place on earth where all places are — seen from every angle, each standing clear, without any confusion or blending.”
“To signify the godhead, one Persian speaks of a bird that somehow is all birds; Alanus de Insulis, of a sphere whose center is everywhere and circumference nowhere; Ezekiel, of a four-faced angel who at one and the same time moves east and west, north and south.”
“I saw the Aleph from every point and angle, and in the Aleph I saw the earth, and in the earth the Aleph, and in the Aleph the earth; I saw my own face and my own bowels; I saw your face; and I felt dizzy and wept, for my eyes had seen that secret and conjectured object whose name is common to all men but which no man has looked upon — the unimaginable universe.” Jorge Luis Borges, The Aleph (1)
In recent years, a crisis of reproducibility has put the scientific community in a state of perplexity. In response to this crisis, much emphasis has been placed on performing research with rigor. Certainly, there are cases of research conducted in a sloppy manner, and cases of misconduct, but even if one could purge the scientific literature of those cases, there would still be an intrinsic problem with the way science is practiced, the way the scientific system is organized, and the limitations of science in dealing with ambiguity. The root of the problem should be sought not in what is published but in what is not.
One of the pillars on which experimental science stands is the idea that hypotheses are either true or false, that the scientific method can discern the truthfulness of any hypothesis and, if wrong, correct itself over time. On that assumption, two seemingly incompatible hypotheses cannot both be true at the same time. One has to be true and the other false, and that assumption sets the basis for the whole ethical system that governs the practice of science.
In reality, science's reductionist approach makes researchers operate on the basis of partial information that reveals partial truths, sometimes contradictory, each a reflection of a higher truth.
In some instances, these partial truths can be reconciled with new experimental information, leading to a unifying model; in other cases, they cannot. Imagine you are a scientist trying to answer the following question: “Can I park my car on the 3rd floor of Boston Logan Airport’s Terminal B garage tonight at 9 PM?”. To answer the question, you do the following experiment: you drive your car to the garage at 9 PM, and as you come up the ramp to the third floor, you look to your right and see a red light sign that says “Full”. You conclude that it is not possible to park the car at the said place and time, and you publish accordingly. Now a second scientist comes behind you and performs a different experiment: instead of looking to the right, he or she looks to the left and sees an empty parking spot, arriving at the opposite conclusion, that it is possible to park the car at the said place and time. An argument ensues, with the second scientist probably experiencing a higher barrier to publication, whether self-imposed or imposed by the community (i.e. reviewers). Imagine now a third scientist who, being more thoughtful, performed both experiments. In light of the contradictory data, this scientist will find the results inconclusive and will not publish them. Ambiguity has been masked.
Eventually, by weighing some other piece of information (e.g., in the cultural context where the garage is situated, whether a sign saying you cannot park carries more weight than the physical availability of an empty parking spot), a decision is made to park the car or continue to the fourth floor, ignoring one of the discrepant pieces of information. A choice is made, leaving behind the ambiguity and indetermination that allowed for that choice.
There is an indeterminacy underlying the way scientific knowledge is generated. Even though ambiguity presents itself in the lab on a daily basis, it is consciously or unconsciously disregarded. The end product of the discovery process, the published paper, is linear and logical, but the discovery process itself is chaotic and involves many conscious or unconscious decisions along the way that affect the end result. Each of those decisions sets a deterministic course in an otherwise indeterministic landscape. Perhaps the decision involves using protocol A instead of protocol B, or simply using protocol A without considering other options. We may assign ambiguous results some rational explanation, for instance sample variability. To test a hypothesis, proof is commonly sought by three, four or five different techniques. Often one of those techniques produces a result that differs from the rest, and, as we expect results to be consistent, we assume that something went wrong with the discrepant technique. In doing so we mask ambiguity.
Reviewers and journals require self-coherent stories, and scientists produce them, generating a self-constructed certainty that leaves out anything that does not fit into a linear narrative. In the face of ambiguity, scientists either do not publish, because the results are deemed inconclusive, or are biased in what they publish or in how the results are interpreted. Fortunately, this has begun to change, with an increasing number of journals now accepting the inclusion of negative data. However, negative data, rather than being regarded as an accident, should be seen as the natural manifestation of the ambiguous nature of science.
Shedding light with Light
Probably the first and best-established example in science of one phenomenon being described by two alternative models is light. So let’s shed light with light. Light can be described both as a wave and as a particle. The model describing light’s behavior as a wave has yielded applications like X-ray crystallography and other diffraction-based techniques, while the model describing light as a particle, through the photoelectric effect, has yielded applications like the photovoltaic cell, solar energy, and X-ray imaging. The experiments that led to the description of light as a particle, however, do not fit the narrative that explains light’s behavior as a wave, and vice versa. Arriving at the coexistence of two alternative models was neither easy nor free of conflict; it took more than a century to reach that point.
The indeterminacy underlying the dual behavior of light is likely, in light of the current reproducibility crisis, a more common phenomenon in science than we think; it is only that our scientific way of thinking conditions us to disregard it, in favor of a self-conceived certainty.
Paradigm shifts and Experimental Biology
Paradigm shifts are seen as the result of science's self-correction over time. When the reproducibility crisis started, the NIH published on its web page a statement to the effect that science corrects itself over time, but that this is no longer true in the short run. An alternative view is that paradigm shifts are the manifestation, at different points in time, of otherwise coexisting truthful alternative models, theories or perspectives. With a larger scientific enterprise today than a century or two ago, data that support alternative views are now produced synchronously. Perhaps because it is an older discipline than experimental biology, and is both experimental and theoretical, physics has reached a point that allows alternative models or theories to coexist. Examples are Newtonian and relativistic physics, light's dual wave/particle behavior, or theories still in the process of validation, such as the multiverse or string theories.
The time now seems to have come for experimental biology to travel a similar path.
A number of replication efforts conducted in a very controlled manner have revealed reproducibility problems with published experiments in psychology and cancer biology. In psychology, a replication study of an effect called “verbal overshadowing”, pooling the results of 22 groups, succeeded in replicating the results of a previously doubted study, but replication was sensitive to protocol parameters (2). In cancer, a highly cited study published in Cancer Cell on a mutation in the gene IDH, whose product can be detected in blood and could potentially be used as a leukemia biomarker, failed to replicate in a first batch of studies but succeeded in a more recent round, factoring in that 20% of leukemia patients carry the mutation (3). A second study, on BET inhibitors, succeeded in replicating the in vitro cell experiments but failed to replicate the in vivo mouse experiments (4). Four other studies in the context of the Reproducibility Project: Cancer Biology have also revealed reproducibility problems with previously published results or their statistical significance (5,6,7,8). A poll conducted by the journal Nature with 1,500 scientists revealed that
70% of the scientists polled in the fields of biology and medicine failed to reproduce someone else’s experiments, and
55-60% failed to reproduce their own (9).
With the increasing number of reproducibility problems and of cases where the scientific puzzle cannot be fully solved, we should call for measures to increase rigor in the practice of science, as has been done. We can assume that the acquisition of scientific knowledge, as an incremental process, will eventually fill the gaps. Concomitantly, a radical scientific approach demands that, when the data say so, we also question some of the assumptions on which the practice of science stands.
What should scientists do in the light of ambiguity? First of all, acknowledge it.
By accepting that the one universal truth science seeks can harbor contrasting perspectives, scientists should be able to publish contradictory models and results without necessarily engaging in conflict or fearing damage to their own or other researchers' reputations.
The acknowledgment of ambiguity should lead scientists to develop a “YES, AND” attitude as opposed to an “EITHER/OR” attitude. At the laboratory level, in the face of ambiguous results, scientists should first try to disambiguate them experimentally. If those efforts are unsuccessful, however, groups should consider the possibility of the coexistence of alternative models. Since our minds operate through linear reasoning, those lines of research should probably be put in the hands of separate lab members or left to be developed by independent groups. As in the case of light, knowledge generated through alternative, discrepant models could lead to more diversified technological and medical applications than a single model would.
At the level of the scientific community, recognition of ambiguity is a moral necessity. In the presence of opposing or discrepant reports, scientists' natural tendency is to assume one of them wrong, often doubting the skills or integrity of those proposing it. Instead, ambiguity should be factored in to make the system of professional advancement, rewards and punishments more just.
While science is no religion, scientists could draw lessons from traditions that see oneness in diversity and opposites. For instance, says Rabbi Jonathan Sacks, “truth on earth is not nor can aspire to be the whole truth. It is limited, not comprehensive, particular, not universal. When two propositions conflict it is not necessarily because one is true and the other false. It may be, and often is, that each represents a different perspective on reality, an alternative way of structuring order….In heaven there is Truth, on earth there are truths. Therefore each culture has something to contribute. Each person knows something no one else does.” (10).
In the Talmud, the written record of an intergenerational dialogue among rabbis interpreting the Torah to establish laws of observance, for every opinion sustaining that something should be done in a certain way there is a multiplicity of well-founded opinions on why it cannot be done that way and should be done differently. Not all opinions make it into the law, but all opinions are recorded. In an often-quoted passage (Eruvin 13b), two houses of study, the House of Hillel and the House of Shammai, were in constant disagreement. Whenever the House of Hillel would propose a certain way of observance, the House of Shammai would propose a different and more stringent way. The disagreement generated a dispute about which house's opinion should dictate the law. Ultimately, the story goes, a Divine voice emerged and proclaimed that both these and those are the words of the living God, but that the law should be dictated by the House of Hillel. The reason for following the opinion of the House of Hillel, however, was not that its opinion was more righteous but that the House of Hillel was more humble, always presenting the point of view of Shammai before making its own case (11).
The idea of indetermination, ambiguity, and opposites in science is not new; it is only that the reproducibility crisis is now providing empirical evidence to support it.
Albert Einstein maintained that it is the theory that decides what can be observed (12).
Jacques Monod maintained that value-free knowledge is the result not of evidence, but of a choice that precedes the collection of evidence (13).
Back in 1996, before any reproducibility crisis, Frederick Grinnell called attention to the issue of ambiguity in the practice of science (14). There, among other references, he relates a passage in Nobel laureate Rita Levi-Montalcini's autobiography where she refers to Alexander Luria's “law of disregard of negative information… facts that fit into a preconceived hypothesis attract attention, are singled out, and remembered. Facts that are contrary to it are disregarded, treated as an exception, and forgotten” (15). In this regard, Rita Levi-Montalcini refers to the tendency to overlook information that could be self-destructive. No amount of science education can make clear the difference between facts to be remembered and facts to be ignored. Grinnell also refers to the presentation of science as a historically reconstructed, self-consistent, logical process that, in the words of François Jacob, replaces with order the disorder that takes place in the lab (16), and that led Nobel laureate Sir Peter Medawar to write an essay entitled “Is the scientific paper a fraud?” (17), in which, among other things, he says: “all scientific work of an experimental or exploratory character starts with some expectation about the outcome of the inquiry…. It is in light of this expectation that some observations are held relevant and others not, that some methods are chosen and others discarded, that some experiments are done rather than others”.
It would be a mistake, however, to assume that everything in science is ambiguous. Some things are, and some things are not. A paradoxical example of something non-ambiguous is the genetic code, in which 64 possible codons, sequences of 3 positions each covered by four possible nucleotides (A, T, C and G), code for only 20 amino acids and a signal to stop translation. The codon possibilities for each amino acid are even larger because the third base is less stringent, due to what is known as the wobble effect. Still, the engagement preferences of that third base are such that the genetic code is redundant, with more than one codon coding for a particular amino acid, but non-ambiguous. It could also be argued, of course, that we do not know, during the discovery process, what pieces of discrepant information may have been consciously or unconsciously ignored in order to arrive at a coherent model.
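The redundancy-without-ambiguity of the genetic code can be checked mechanically: the codon-to-product map is a function (every codon has exactly one reading), while its inverse is one-to-many. A minimal sketch over a handful of codons from the standard genetic code:

```python
# A small excerpt of the standard genetic code (DNA codons).
# Forward map: every codon has exactly one product -> non-ambiguous.
# Inverse map: one product can come from several codons -> redundant.
from collections import defaultdict

CODON_TABLE = {
    "TTA": "Leu", "TTG": "Leu", "CTT": "Leu", "CTC": "Leu",
    "CTA": "Leu", "CTG": "Leu",          # six codons, one amino acid
    "ATG": "Met",                        # a single codon
    "TAA": "STOP", "TAG": "STOP", "TGA": "STOP",
}

# Invert the map and count the codons per product.
codons_for = defaultdict(list)
for codon, product in CODON_TABLE.items():
    codons_for[product].append(codon)

print({product: len(codons) for product, codons in codons_for.items()})
# prints {'Leu': 6, 'Met': 1, 'STOP': 3}
```

A dict key can map to only one value by construction, so the forward reading is determinate, mirroring the non-ambiguity of translation; the inverted map makes the redundancy (six codons for leucine, one for methionine) explicit.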
Rather than escaping ambiguity, scientists should embrace it. There is much to gain from it: from the possibility of a broader menu of technological and medical applications, to a more ethical and just way of practicing science, to the revelation of a universal framework in which opposites coexist and integrate, opening the path for the confluence of two approaches to seeking truth that have long been at odds, those of science and religion.
1) Jorge Luis Borges, 1949, “The Aleph” in “The Aleph and other stories”.
2) Jonathan W. Schooler, Metascience could rescue the ‘replication crisis’, 2014, Nature 515, 9
3) Showalter MR, Hatakeyama J, Cajka T, VanderVorst K, Carraway KL, Fiehn O; Reproducibility Project: Cancer Biology (Collaborators: Iorns E, Denis A, Perfito N, Errington TM). Replication Study: The common feature of leukemia-associated IDH1 and IDH2 mutations is a neomorphic enzyme activity converting alpha-ketoglutarate to 2-hydroxyglutarate. eLife. 2017;6:e26030. doi: 10.7554/eLife.26030.
4) Aird F, Kandela I, Mantis C; Reproducibility Project: Cancer Biology. Replication Study: BET bromodomain inhibition as a therapeutic strategy to target c-Myc. eLife. 2017;6:e21253. doi: 10.7554/eLife.21253.
5) Horrigan SK, Reproducibility Project: Cancer Biology. 2017a. Replication Study: The CD47-signal regulatory protein alpha (SIRPa) interaction is a therapeutic target for human solid tumors. eLife 6:e18173.
6) Horrigan SK, Courville P, Sampey D, Zhou F, Cai S, Reproducibility Project: Cancer Biology. 2017b. Replication Study: Melanoma genome sequencing reveals frequent PREX2 mutations. eLife 6:e21634.
7) Kandela I, Aird F, Reproducibility Project: Cancer Biology. 2017. Replication Study: Discovery and preclinical validation of drug indications using compendia of public gene expression data. eLife 6:e17044.
8) Mantis C, Kandela I, Aird F, Reproducibility Project: Cancer Biology. 2017. Replication Study: Coadministration of a tumor-penetrating peptide enhances the efficacy of cancer drugs. eLife 6:e17584
9) Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016 May 26;533(7604):452-454. doi: 10.1038/533452a.
10) Jonathan Sacks, ed Hava Tirosh-Samuelson and Aaron W. Hughes, 2013, “Universalizing Particularity.” p53. Brill.
12) Holton G. Werner Heisenberg and Albert Einstein, 2000, “Creating Copenhagen”. Symposium at the Graduate Center of the City University of New York.
13) Jacques Monod, 1970, “Chance and Necessity”.
14) Grinnell F. Ambiguity in the practice of science. Science. 1996;272(5260):333.
15) Rita Levi-Montalcini, translation Luigi Attardi, 1988, “In Praise of Imperfection”, Basic Books, New York.
16) François Jacob, transl. F. Philip, 1987, “The Statue Within: An Autobiography”, Basic Books, New York.
17) Sir Peter Medawar, “Is the Scientific Paper a Fraud?”, The Listener, 12 September 1963, p. 377.
Medical and scientific advances are predicated on new knowledge that is robust and reliable and that serves as a solid foundation on which further advances can be built. In biomedical research, we are in the midst of a revolution, with new data and scientific publications generated at an unprecedented rate. Unfortunately, however, there is compelling evidence that the majority of these discoveries will not stand the test of time. To a large extent, this reproducibility crisis in basic and preclinical research may result from a failure to adhere to good scientific practice and from the pressure to publish or perish. This is a multifaceted, multistakeholder problem. No single party is solely responsible, and no single solution will suffice. Here we review the reproducibility problems in basic and preclinical biomedical research, highlight some of the complexities, and discuss potential solutions that may help improve research quality and reproducibility.
As physicians and scientists, we want to make a contribution that alters the course of human health. We all want to make our mark by discovering something that really makes a difference. Yet most of the common diseases that we study are exceedingly complex. Our knowledge is fragmentary. Many, perhaps most of our models are naive and our constructs are often rough approximations of the truth. Eventually, we may see in our own research efforts and results what we want to see: promising findings and nice discoveries. This seeming success may not be reproducible. Yet robust, reproducible research is the foundation on which advances are built.
Against the reality of the diseases we want to cure, there is a real opportunity that we can more readily address. Although this would involve widespread changes and demand a critical re-evaluation of our processes, this is an opportunity that could have substantial benefit. The opportunity is to introduce, demand, and reward a level of rigor and robustness in designing, conducting, reporting, interpreting, validating, and disseminating research that is currently lacking from many areas of biomedical research.
Over recent years, there has been increasing recognition of the weaknesses that pervade our current system of basic and preclinical research. This has been highlighted empirically in preclinical research by the inability to replicate the majority of findings presented in high-profile journals. 1–3 The estimates for irreproducibility based on these empirical observations range from 75% to 90%. These estimates fit remarkably well with estimates of 85% for the proportion of biomedical research that is wasted at large. 4–9 This irreproducibility is not unique to preclinical studies. It is seen across the spectrum of biomedical research. For example, similar concerns have been expressed for observational research, where zero of 52 predictions from observational studies were confirmed in randomized clinical trials. 10–12 At the heart of this irreproducibility lie some common, fundamental flaws in currently adopted research practices. Although disappointing, this experience should probably not be surprising: it is also what one would expect theoretically for many biomedical research fields, given how research efforts are conducted. 13
Basic and preclinical research is particularly important because it forms the foundation on which future studies are built. It is preclinical research that provides the exciting, new ideas that will eventually find their way into clinical studies and new drugs that provide benefit to humankind. Yet this preclinical research is poorly predictive of ultimate success in the clinic. 14,15 And it is observational research that both attracts immediate public attention and often provides the hypothesis on which interventional studies are based.
Given the controversy that has surrounded this issue, it is important to note at the outset that these concerns about the irreproducibility of science do not undermine the validity or legitimacy of the scientific method. Rather, it is the rigorous, careful application of the scientific method that has translated into genuine improvements in human health and provided the substantial benefits we have enjoyed over recent decades: the investment in research has yielded dramatic improvement in outcomes in infectious diseases, cardiovascular disease, oncology, rheumatic diseases, and many other conditions.
At the outset, it is also important to acknowledge that not all biomedical research is assumed or expected to result in findings that have application for human health. Excellent ideas and excellent research can lead to a dead-end. That is simply the nature of research. Efficiency of 100% and waste of 0% is unlikely to be achievable. However, there is probably substantial room for improvement.
Finally, the fact that this self-critical debate is occurring openly within the scientific community reinforces the strength of our system.
Why Is This Important?
The key challenge is to ensure the most efficient, effective use of precious research funds. In the preclinical arena, it has become increasingly clear that the majority of preclinical research cannot be reproduced, even by the original authors themselves.
This has long been known to be the case by large biotechnology and pharmaceutical companies that routinely seek to reproduce a scientific claim before commencing a research project. The issue is more problematic for the smaller companies and those who provide funding for these earlier stage companies, where, in proportion to their resources, the cost of attempting to reproduce a study can be substantial both in terms of money and time. 16 Early, preclinical studies are important because they potentially form the foundation on which later studies are built to evaluate specific drugs, and other interventions, and they can also provide the framework for biomarker analyses that can be used to focus treatments more precisely.
What is difficult to quantify is the opportunity cost associated with studies that fail to replicate. Investigators who pursue a provocative, exciting idea that is not based on robust data will waste countless hours on an idea that is ultimately discarded. For ideas that proceed into late stages of clinical evaluation, or are even adopted in clinical practice only to be discarded subsequently, 17 the cost can be enormous.
Addressing this concern is central to maintaining the ongoing confidence and support of the public. It is key that the public, who directly and indirectly provide the money to fund our research efforts and who are our most important ambassadors in advocating for the importance of scientific investigation, are confident in the processes we have in place. They need to know there is real value in the data that ultimately emerge. They, after all, are the ultimate beneficiaries of the biomedical research enterprise.
What Constitutes Reproducibility?
There is no clear consensus as to what constitutes a reproducible study. The inherent variability in biological systems means there is no expectation that results will necessarily be precisely replicated. So it is not reasonable to expect that each component of a research report will be replicated in perfect detail. However, it seems completely reasonable that the one or two big ideas or major conclusions that emerge from a scientific report should be validated and withstand close interrogation.
What has shaken many in the field is not that investigators are unable to precisely reproduce an experiment. That is to be expected. What is shocking is that in many cases, the big idea or major conclusion was not confirmed when the experiments were simply repeated by the same investigators after they were blinded to test versus control samples. 2 The explanation for this was evident when the precise methodology of the experiments was reviewed. Investigators typically performed their experiments in a nonblinded fashion, so they were able to see what they were anticipating to see, and their research bias was thus confirmed. 18 Observer bias has long been recognized to be a problem in preclinical studies and beyond, so this result should not be surprising. 19 Confirmation bias in scientific investigation unavoidably makes even the best scientists prone to find results or interpretations that fit their preconceived ideas and theories. 20,21
In addition, empirical assessments of preclinical studies revealed an array of problems, including, but not limited to, failure to repeat experiments, to use legitimate controls, to validate reagents, and to use appropriate statistical tests. On top of that, investigators often selected the best experiment rather than reporting the entire data set. These practices conspired to ensure that not only could individual experiments not be replicated, but the main conclusion of the article was not substantiated. 18
Furthermore, it was notable that several nonreproducible publications had been cited many hundreds of times. Clinical studies were initiated based on that work. 2 But the authors of the secondary publications had not sought to actually reproduce or falsify the findings of the original papers.
There is an acknowledged tension here. It is reasonable to expect that there will be some level of uncertainty and irreproducibility as investigators genuinely push the boundaries of current knowledge and extend into the unknown. But it is also reasonable to expect that standard research procedures, such as experimenter-blinding, will be used and that irreproducibility will be seen in the minority rather than the majority of cases.
Has Much Changed?
The inability to reproduce research findings is probably a long-standing problem. In the past, investigators would share privately the work that others could not reproduce. This was a key part of scientific exchange that typically took place in the corridors or bars at scientific meetings.
This is an issue across the breadth of biomedical and social sciences (for illustrative examples, see Tables 1 and 2). However, the burgeoning number of high-profile journals has given voice to many papers that, in the past, would not have received the same level of profile and exposure. Now the sheer magnitude of high-profile studies creates a challenge for any investigator to remain current. The number of publishing scientists has grown over the years, with over 15 million scientists publishing ≥1 article that was indexed in Scopus in the period 1996–2011. 49 Biomedical research is the most prolific scientific field in this regard. It is practically impossible for even the most knowledgeable expert to maintain direct knowledge of the work done by so many other scientists, even when it comes to his/her core discipline of interest.
Table 1. Examples of Some Reported Reproducibility Concerns in Preclinical Studies
ALS indicates amyotrophic lateral sclerosis; MIAME, minimum information about a microarray experiment; NGS, next generation sequencing; and VPA, valproic acid (model of autism).
Table 2. Additional Basic Science Fields Where Concerns Regarding Reproducibility Have Been Raised
This issue is also amplified by the public appetite for new advances. Over recent years, an increasingly informed, vocal and vigilant patient–advocate community has emerged. These groups have played an important role in helping further raise the profile of medical research. They have effectively argued for ongoing research investment. At the same time, they have maintained a clear focus on research productivity and expect to see breakthrough discoveries effectively translated into improved medicines. And a hungry media is happy to trumpet these breakthroughs to a waiting public.
Why Is This Happening?
The vast majority of academic investigators, industry researchers, journal editors, investors, government, and certainly patients want and expect to see valuable research dollars translated into new therapies. The expectation is well founded: the evidence that biomedical research has had a profound effect on human health is undeniable. To be clear, the problem under discussion here is not one of scientific fraud. It is not a failure of the scientific method. It is a consequence of a system that is willing to overlook and ignore lack of scientific rigor and instead reward flashy results that generate scientific buzz or excitement. It is mostly an issue of how priorities are balanced and a failure to rigorously apply standard scientific methods in an increasingly competitive research environment when scientists are scrambling to get their share of a dwindling national research budget. 50 Although fraud is rare, use of questionable research practices seems to affect the majority of researchers. 51
In the preclinical arena, there seems to be a widespread conscious or unconscious belief that a rigorous research process that follows what most would consider standard scientific methodology (blinding, repeating experiments, inclusion of positive and negative controls, use of validated reagents, etc.) may stifle the creative, innovative act of discovery. That is clearly not the case. An unexpected, unexplained observation that serves as the first hint of something new should be tested rigorously. It should be repeated and carefully confirmed before it is announced to the world. 52 Unfortunately, that is currently seldom the case. Instead, it would seem that many investigators feel the need to rush immediately into print with an unconfirmed, and unconfirmable, finding.
It is probably relevant to note the evidence that investigators who are assured of more consistent, stable laboratory funding are less likely to succumb to the pressures of fraud. 53,54 It seems reasonable to extrapolate that investigators who enjoy similarly stable research funding may also be less inclined to take shortcuts with their experiments and to use questionable or substandard research practices. Perhaps somewhat paradoxically, then, it is reasonable to expect that if research funding per capita (per investigator) decreases further, this problem will become even more evident: the need to be first, with its concomitant rewards, will trump the need to be right.
What Can be Done?
Addressing this challenge will require a multipronged approach; there is no single solution, but several initiatives underway may have a positive effect. What emerges repeatedly is the failure of individual researchers, reviewers, and editors to comply with agreed, well-established guidelines for the conduct of experimental research. Because there is no direct incentive for investigators to comply, it seems that generally they do not.
Agreed Recommendations for Pharmacology Studies Are Overlooked
There have been calls for a more rigorous approach to experimental design, with some disciplines devoting considerable effort to addressing the inability to translate preclinical research into clinical success. For example, in the field of pharmacology, in an effort to improve the quality of animal studies, investigators have argued for a focus on hypothesis-testing research, 40 a prospective, rigorous research plan for preclinical studies, 38 training and more appropriate use of statistics, 41 and performance of requisite studies, such as pharmacokinetic analyses, before any evaluation of potential efficacy. 39 These recommendations are widely endorsed and supported. Henderson et al 37 came to similar conclusions after their review of the literature. Among the most common recommendations were blinding of outcome assessment, randomized allocation of animals, power calculations to determine sample size, use of positive and negative controls, determination of dose-response, replication in different models, and independent replication.
Although these are key to the appropriate interpretation of animal studies, they are still widely neglected. One particularly troubling study 28 found that only 9% of studies used the correct experimental unit in their analyses and, further, that the behavioral variation between litters of mice was greater than the reported treatment effect. This failure to adopt well-recognized, widely supported guidelines raises the question: how does the community ensure that key recommendations are integrated into the research effort? How is this monitored?
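The experimental-unit error described above (treating each mouse, rather than each litter, as an independent observation) is easy to demonstrate in simulation. The following is a minimal, illustrative sketch in standard-library Python, not a reanalysis of the cited study; the group sizes, litter effect, and significance threshold are assumptions chosen for illustration. It generates litters whose pups share a random litter effect, applies no treatment effect at all, and compares the false-positive rate of a naive mouse-level t test against an analysis of litter means.

```python
import math
import random
import statistics

random.seed(1)

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def false_positive_rates(n_litters=4, pups=8, reps=500, thresh=2.0):
    """Simulate a null experiment (no treatment effect) with litter effects.

    Returns (mouse_level_fpr, litter_level_fpr): how often |t| > thresh
    when each pup vs each litter mean is treated as the experimental unit.
    """
    mouse_fp = litter_fp = 0
    for _ in range(reps):
        groups = []
        for _group in range(2):
            litters = []
            for _l in range(n_litters):
                litter_effect = random.gauss(0, 1)   # shared within a litter
                litters.append([litter_effect + random.gauss(0, 1)
                                for _p in range(pups)])
            groups.append(litters)
        # Wrong unit: pool all pups, as if they were independent animals.
        pups_a = [p for lit in groups[0] for p in lit]
        pups_b = [p for lit in groups[1] for p in lit]
        if abs(welch_t(pups_a, pups_b)) > thresh:
            mouse_fp += 1
        # Correct unit: one summary value (the mean) per litter.
        means_a = [statistics.mean(lit) for lit in groups[0]]
        means_b = [statistics.mean(lit) for lit in groups[1]]
        if abs(welch_t(means_a, means_b)) > thresh:
            litter_fp += 1
    return mouse_fp / reps, litter_fp / reps

mouse_fpr, litter_fpr = false_positive_rates()
print(f"mouse-level false positives:  {mouse_fpr:.0%}")
print(f"litter-level false positives: {litter_fpr:.0%}")
```

With these hypothetical parameters, the mouse-level analysis declares a nonexistent effect several times more often than the nominal rate, because littermates are correlated and pooling them overstates the effective sample size; the litter-level analysis stays near the nominal rate.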
Data Disclosure and Omics Research: Guidelines Are Ignored
Another area that has received attention is the analysis of large data sets and the computing tools that are required to analyze them. There are several common issues with omics research.
It was recognized over a decade ago that experiments involving array technology were potentially fraught with problems. To address this, mandatory full data deposition was recommended in 2001 and quickly adopted by many journals. This minimum information about a microarray experiment (MIAME) standard has been widely accepted by investigators and journals. However, despite being accepted as the desired norm over a decade ago, compliance remains a problem. An analysis of data from high-profile microarray publications that appeared between 2005 and 2006 in Nature Genetics reproduced the results of only 2 of 18 reanalyzed papers. The principal reason for failure was lack of availability of original raw data. 22 A study of 127 microarray articles published between July 2011 and April 2012 revealed that ≈75% were still not MIAME compliant. Furthermore, reanalysis of data often did not support the original conclusions. 30 Similar observations have been made by others. 23,25 An analysis of 500 papers in the 50 top journals across scientific fields (those with the highest impact factors) revealed that only 9% deposited full primary data online. 55
It is noteworthy that both the National Science Foundation and the National Institutes of Health have strongly worded statements requiring data set disclosure and software sharing, but neither is consistently enforced: compliance still remains at the discretion of the investigators.
In addition, the majority of journals surveyed (105 of 170 journals) in a recent evaluation failed to comply with the 2003 National Academies report requiring that journals clearly and prominently state (in the instructions for authors and on their Web sites) their policies for distribution of publication-related materials, data, and other information. 33 Similar calls have come from many directions, with a plea to support only those journals that support reproducible research. 56–62 These studies reinforce the importance of access to primary data. Moreover, the data suggest that a reluctance to provide such access is a marker of a lower quality study.
There are additional issues with omics research beyond access to the data and specific recommendations have been proposed to address these, in particular regarding the need to perform independent validations of all findings that emerge from these analyses. 63
Substantive issues still persist in studies involving large data sets. As a recent example, when the activity of 15 oncology drugs was directly compared in large pharmacogenomics studies from 2 independent data sets, one generated at the Dana-Farber Cancer Institute and the other at the Massachusetts General Hospital, there was little or no consistency, even though the same 471 cell lines were examined. 29,64 Although the lack of consistency might have been attributed to differences in the cell lines that were used, this seems unlikely. Rather, it probably represents a lack of standardization of the experimental assays and methods of analysis. 29
We Get What We Incentivize
It is interesting to observe what has happened in the psychology literature. Some of the most provocative psychological experiments could not be replicated. For example, the alleged effect of positive or negative priming was not confirmed when experiments were rigorously repeated. 65 The publication of this negative result is unusual. Instead, investigators have noted that the rate of positive results in psychological science (as in many biomedical fields) approaches 90% to 100%. 66 Most hypotheses are confirmed, and skeptics have concluded that we are either “approaching omniscience or our journals are publishing an unrepresentative sample of completed research.” 48 It is likely that this bias toward studies that confirm a favorite hypothesis spans the scientific spectrum. 2,67–69 An analysis of 525 preclinical stroke studies revealed that only 2% reported a lack of effect on stroke, which led the authors to conclude that serious publication bias may exist. 24 In another analysis of 4445 data sets involving animal studies of neurological diseases, it was similarly concluded that much of the data had been either suppressed or recast so that truly negative studies were published as positive results: there were simply too many positive results published to be true. 27 Correspondingly, although the large majority of interventions for neurological diseases seemed effective in animal studies, few had favorable results when tested in humans.
Further, the recognition that several small, underpowered studies are more likely to yield a positive result (through selective analysis and outcome reporting) than a single larger, appropriately powered study 70 also applies to other biomedical fields. 71 Similarly, the powerful incentives that favor novelty over replication are certainly not unique to psychological studies. 72 In an attempt to proactively address these concerns, the Reproducibility Project has been created. This is a large-scale, cooperative effort to examine reproducibility in psychology studies 73 and could serve as a template for other disciplines.
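The arithmetic behind the small-studies problem can be checked with a short simulation. The sketch below (standard-library Python; the study sizes and t thresholds are illustrative assumptions, not values from the cited papers) draws all data from a null distribution with no real effect, then compares one adequately sized study, reported regardless of outcome, against the strategy of running five small studies and reporting whichever one comes out "significant."

```python
import math
import random
import statistics

random.seed(2)

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(statistics.variance(a) / len(a) +
                   statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def significant(n_per_group, thresh):
    """One null two-group study: True if |t| crosses the threshold."""
    a = [random.gauss(0, 1) for _ in range(n_per_group)]
    b = [random.gauss(0, 1) for _ in range(n_per_group)]
    return abs(welch_t(a, b)) > thresh

reps = 1000
# Strategy A: one larger study (n = 50 per group), reported whatever it shows.
one_large = sum(significant(50, 2.0) for _ in range(reps)) / reps
# Strategy B: five small studies (n = 10 per group); report if ANY "works".
best_of_five = sum(any(significant(10, 2.1) for _ in range(5))
                   for _ in range(reps)) / reps

print(f"one adequately powered study: {one_large:.0%} false positives")
print(f"best of five small studies:   {best_of_five:.0%} false positives")
```

Each individual small study keeps roughly its nominal 5% error rate; it is the freedom to run several and report only the best one that multiplies the chance of publishing a spurious positive toward 1 − 0.95⁵ ≈ 23%.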
Several thoughtful suggestions have been made by many investigators, some of which are listed in Table 3. Although a systematic review of the literature was not performed, there is substantial concordance in terms of these recommendations, with a commonality across diverse disciplines and research stage. The fundamental problem with most, if not all, of these proposals is the requirement for investigators, institutions, and journals to willingly comply: it is not at all clear how reasonable recommendations will be implemented or monitored while they remain voluntary. Conversely, were they to be mandatory, then one has to examine carefully how they would be enforced and by whom. The details of how to make these changes work can have a major effect on their efficiency.
Table 3. Some Proposals to Improve Experimental Rigor and Quality in Preclinical Research
There are several other initiatives underway to try to aid this process. We list here a few examples. The Neuroscience Information Framework was created to define resources available to the research community to further neuroscience research. 79 Investigators studying spinal cord injury have defined the minimal information about a spinal cord injury experiment in an attempt to improve the outcomes of spinal cord injury research. 80 They have proposed the inclusion of specific experimental details in addition to more general elements of study design, such as experimenter blinding, randomization of animals, cohorts of sufficient size, and inclusion of controls. 18,81
The Global Biological Sciences Institute aims to establish a set of harmonized, consensus-based standards that can be applied to commonly used research tools. An agreed set of validated tools would represent a significant advance. But even with validated tools, having them accepted will be a challenge. This is illustrated by the MDA-MB-435 cell line, which has hundreds of citations falsely identifying it as a breast cell line. Despite definitive reports identifying it as a melanoma cell line, 82,83 at the time of writing, there have been over 170 reports since 2010 continuing to falsely identify it as a breast cell line and only 47 reports correctly identifying it as a melanoma cell line (Begley unpublished). A curated, publicly available data set with a list of readily searchable reagents could go a long way toward addressing this type of problem.
Another recent project is the Reproducibility Initiative, which seeks to examine the replicability of the top 50 most impactful cancer biology studies published between 2010 and 2012. 84 This initiative is exploratory and may offer useful insights. It potentially offers the opportunity to debate and clarify within the scientific community what constitutes adequate and appropriate replication, who should perform these replication studies, and how they are best performed. 52
Academia.edu is taking a different approach, attempting to build an alternative publication system where papers are made available promptly and with the peer review process taking place postpublication. They also hope to make research freely available.
These efforts are all still at an embryonic stage, and although their value remains to be determined, it is heartening to see the broad-based approach that is being adopted by the scientific community.
The National Institutes of Health has indicated that it too will take several steps, including additional training within the National Institutes of Health intramural program, a more systematic review of grant applications, more transparent access to data, and an online forum for discussion of published papers, along with other measures under consideration. 50 Of these, perhaps the most important is the focus on education. This is a crucial initiative. It is essential that the next generation of researchers is better trained in the design, execution, and interpretation of experiments. A key component of this effort will need to be instruction in the appropriate selection and cleaning of data for presentation. This initiative will help address the concern that junior researchers may not receive adequate personalized training or mentoring if they are in laboratories where the principal investigator is unable to provide individual attention on a regular basis.
Realistic Prospect for Change?
Despite the clear challenges that we face, it is worth recalling that substantive changes have taken place over recent decades in clinical trials processes, and similar progress may also be achieved for preclinical research. Although it has taken a long time to reach wider consensus about the importance of proper randomization, blinding (when feasible and appropriate), and registration in clinical trials, these are now widely accepted as the norm, and failure to comply (eg, nonregistration) is penalized (eg, by inability to publish in most major journals). When these changes were being introduced, some physicians expressed concern, at least privately, that these changes would stifle clinical research. In fact, this has introduced a new level of reliability and confidence in the clinical research effort. There are still multiple challenges, of course. The public availability of clinical trial data remains suboptimal. 85,86 Research protocols and full study reports remain difficult to access. Transparent access to key data that underpin clinical conclusions remains to be broadly addressed. 9 Furthermore, many clinical trials remain nonrandomized without any good reason, especially in phase I and II research. 87 However, despite these real concerns, the current status of planning, execution, and interpretation of clinical trials is probably substantially improved over what it was several decades ago and vastly superior to the situation in preclinical research.
At this time, the level of control, rigor, and accountability seen in clinical trials (even with their residual deficiencies) is currently unimaginable in the preclinical arena 88 where much still remains at the discretion of the individual investigator. Although it seems unlikely that this pattern can continue, the notion of increased oversight of preclinical studies has long been proposed. 89,90 The current model of investigator self-regulation and self-censoring does not seem to be serving the scientific community well enough. Although clinical and preclinical research do have major differences, many of the lessons learnt from clinical research may also be considered for more systematic application in preclinical research.
In addition to the changes that have taken place in clinical practice, it is important to recall the changes that have been successfully introduced into basic and preclinical research. Many of these are now integrated into standard research practice and are accepted as routine procedures. As a result, there are now appropriate limits on handling of radioactive materials, recommendations regarding DNA cloning, committees to ensure appropriate research access to human tissues, safeguards to protect investigators when using human samples, guidance regarding use of retroviruses, regulations on experiments with embryonic stem cells, and committees to ensure appropriate treatment of animals in research. The introduction of each of these was associated with some increase in bureaucracy and significant reluctance on the part of some investigators: the introduction of regulations and controls for DNA cloning was hotly debated in the early 1970s. At that time, Watson strongly opposed regulatory protocols, 91 whereas other scientists, including Singer and Berg, took the lead on promoting a responsible approach to the young field. 92,93 There are obvious parallels in the current debate regarding the extent and importance of reproducible science. 64
Once guidelines and regulations are introduced, these measures have typically remained in place. However, at least for some (eg, controls on DNA cloning and embryonic stem cells), controls have become more relaxed in the light of new scientific information and a changing social and biomedical environment. 92–95
Although it is not possible to demonstrate a direct improvement in the research enterprise as a result of these changes, for those who recall the time before their introduction, it seems self-evident that there is improved investigator safety as a result of training in use of radioactive materials and safeguards regarding potential infectious agents. It seems intuitively obvious that these changes have resulted in improved animal care and better protection for patients’ rights. But there is no doubt that this came at a cost in terms of increased bureaucracy, increased demands on investigators’ time, and increased institutional responsibility.
Role of the Editors, Reviewers, and Journals
It has been argued that editors, reviewers, and the journals must take substantial responsibility for the current situation. 96 Many of the perverse incentives that drive scientific publications, and thus build and sustain careers, could be more closely regulated at this level. 75
Several journals have introduced new Guidelines for Authors and made specific proposals that attempt to address the problems in preclinical scientific methodology identified above. 62,97,98 These Guidelines attempt to increase the focus on replication, reagent validation, statistical methods, and so on. This should make it more difficult for papers to be accepted as the journals will require additional controls and validation. Complying with these requirements will increase the effort of investigators, but the literature may become more reliable. New standards should hopefully improve the quality of scientific publications. However, it is again worth noting the difference between having standards in place and ensuring those standards are met: this continues to prove to be a problem. 34,35
Monitoring the effect of policy changes is warranted: there is no value in simply increasing the burden of bureaucracy. It is possible that some new policies may have the best intentions but may result in inadvertent adverse consequences. There is also the potential of normative responses where investigators focus their attention on satisfying whatever new checklist item is asked of them, but fail to improve the overall agenda of their research.
Given that most initiatives for making changes are done within single journals or scientific subfields, there is fragmentation and lack of consistency in the messages given and in the policies that are adopted. For example, despite the improvements in Guidelines to Authors, and despite the decades-long recognition of the need for investigator blinding, 19 there seems to still be a general reluctance to demand blinding of investigators even when subjective end points are assessed. As with this journal, even the journals that have reworked their Guidelines to Authors still do not demand blinding of investigators in the evaluation of subjective preclinical data. In contrast to clinical trials where blinding is not always feasible or practical, in most preclinical investigations, blinding should be straightforward to adopt (with few exceptions). This oversight is difficult to comprehend, but presumably it reflects a reluctance and resistance from the investigator community.
There have also been specific editorial proposals to focus on rewarding investigators for scientific quality rather than quantity, reward confirmatory experiments, and publishing negative data, perhaps even preferentially so. 99 The proposal to reconsider the need to find clinical relevance in early stage research is particularly intriguing. 75 Although these recommendations make intuitive sense, a major challenge is how they would be introduced and how their effect would be evaluated. Another suggestion is that editors could openly solicit replication studies for findings of particular relevance, with a bidding-process and requirement for rigorous experimental approach. 74 However, the laboratories that are likely to respond to such a request may not be the ideal laboratories to perform such studies. 52
Although it is easy to point to the journal editors and reviewers as the principal area for improvement, this ignores the fact that the publication of data is the final step in a lengthy process. Reviewers can have no knowledge as to what data investigators have chosen to exclude. Reviewers cannot know whether data was strung together post hoc simply to create the best story. Of necessity, they take the work at face value. Further, there is little recognition for reviewers who diligently undertake their work: they have many demands on their time, and reviewing an article may be the lowest priority. In fact, the reward for conscientious reviewers may be a disproportionate increase in reviewing requests!
Finally, particularly in the top-tier journals, there is a clear emphasis on the exploratory investigations that are at the heart of this problem. 97 These are excused the level of rigor that is demanded of hypothesis-testing studies, 97,98 yet in terms of their prognostic value, they are most directly comparable to Phase 1 clinical studies. Although the focus on exploratory investigations is understandable (after all, they generate scientific buzz), they should be labeled for what they are and viewed with the high level of skepticism they deserve. 98
Back to the Beginning: Improving the Grant Review Process
Although a major part of the problem relates to our current system of scientific publishing, this is just the final stage of the research process. What we really need perhaps is a much more stringent focus on the grant approval process. Research funding agencies have primary responsibility for reductions in waste that emanates from how precious research funds are invested. 5
As with those who review scientific papers for publication, Grant Committee Members are busy people who are seldom recognized or rewarded for a detailed, careful review. There is evidence that funded grants are more similar in their themes to the interests of the study section members than unfunded grants. 100 Grant reviewers may be more likely to suffer from allegiance bias, favoring grants that align with their own beliefs and interests. There is also some evidence that current grant review is inefficient and is not receptive to out-of-the-box innovation. Different approaches to the allocation of research funds have been proposed. 101 Many of these would focus more on judging the merit of people rather than projects.
In judging the merit of people, it is important to identify rigorous metrics that reward quality and reproducibility rather than volume and quantity. A PQRST index has been proposed for evaluating investigators, 102 its initials standing for productivity, quality, reproducibility, sharing, and translational potential. Consideration of these dimensions may improve the current situation. However, assessment of quality may be one of the most difficult aspects of judging the work and track record of an investigator. Such assessment may require more transparent availability of information about the design and conduct of an investigator's previous studies. It also requires some consensus on the key quality features that matter most in each field (eg, whether experiments were performed blinded, with appropriate reagents, whether experiments were repeated, whether data sets are publicly available, etc). 18 In this regard, it is again worth noting the relative concordance of ideas across fields as to what constitutes quality research (Table 3). Quality indicators that acquire the power to inform funding decisions are expected to improve matters: the adoption of a quality score would provide a powerful incentive for scientists to improve their performance specifically in the dimensions captured by that score. Optimistically, if these indicators are selected wisely and represent the quality of the work well enough, there is a real possibility that the quality of proposed and conducted research will truly improve. More pessimistically, investigators may just see this as another checklist to satisfy, with no real effect on improving their research. 6 Again, a monitoring component would be key to include alongside any change in process that attempts to recognize quality.
The general issue is currently under consideration by the National Institutes of Health leadership 50 and provides some optimism that quality indicators will factor into the grant review process going forward.
One simple proposal is that Committees should expect that all subjective analyses will be performed by blinded investigators. There should also be a requirement that reagents are validated, that a minimum number of replicates will be evaluated, and that appropriate controls will be included. 18,37 These requirements should be explicitly stated and include a rigorous defined research plan. 38 Unfortunately, these aspects of a research proposal are frequently regarded as trivial and are taken for granted: it is assumed that if the idea is a good one, that rigorous and careful scientific method will be used.
The principal responsibility for research findings rests with the investigator and their host institution. Currently, science operates under a “trust me” model that is no longer considered appropriate in corporate life or in government. 72 It has been argued that there is also a need for institutions to take a greater leadership role. 9
One proposal is that Institutions receiving federal research funding should be expected to comply with Good Institutional Practice (analogous to, eg, Good Manufacturing Practice or Good Laboratory Practice). This would ensure that Institutions commit to, and are recognized and rewarded for, ensuring a minimum research standard among their employees. A disadvantage is that this would necessarily increase the burden on investigators and Institutions. It is essential to pick the right targets, that is, which items to include under Good Institutional Practice, and to avoid simply worsening research bureaucracy. In addition to improving the overall quality of our research endeavor, many of these recommendations would, and should, have immediate benefit to the Institution itself, both in terms of the quality of personnel it attracts and its ability to commercialize its research (see Table 4 for illustrative examples).
Table 4. Issues That Could Be Addressed by a Policy of Good Institutional Practice for Basic Research
Institutions could have a stated policy regarding their expectation for research quality conducted within their laboratories. For example, this may include a requirement that subjective end points are only evaluated by investigators blinded to the experimental arms. When the issue of blinding of experiments is discussed with principal investigators, a common concern is that multiple investigators are required for blinding to be effective. This of course is the point. Within an individual laboratory, this may require that students and postdoctoral fellows work together on a project. Although sharing a project and sharing data are contrary to the prevailing ethos in many laboratories, working together on the same experiment provides a level of cross-checking, cross-validation, and objectivity that would be beneficial. 9
The institution has a responsibility to staff, students, and postdoctoral fellows. Compulsory annual refresher training for all principal investigators in experimental design, use of controls, use of validated reagents, data selection, statistical tests, and so on should be the norm. There should be similar training in experimental methods for junior researchers: this is an area where there is room for improvement even among highly regarded institutions. 103
Institutions may also contribute toward improving transparency standards, for example, by adopting policies indicating that raw data should be made available on request.
Institutions may also serve the goal of improving research credibility and efficiency if they adopt appointment and promotion standards that, instead of relying on publication in top-tier journals as a surrogate for quality, recognize the importance of reproducible research findings rather than flashy, unsubstantiated reports. 6,7
Although many investigators are clearly able to self-regulate, self-monitor, and self-censor, many others are not. For those investigators, we suggest their institution could have processes that provide an additional level of control, acting as a safety net to ensure the most appropriate scientific behaviors.
It is impossible to endorse an approach that suggests that we proceed with an ongoing research investment that is producing results the majority of which cannot be substantiated and will not stand the test of time.
It is time to rethink methods and standardization of research practices. These subjects are not nearly as superficially exciting as the exploratory studies that often seize wider interest, but their effect can be far more pervasive across the research enterprise.
Clearly not all investigators agree. Many agree with the view that we cannot afford to put programs of discovery on hold while we do a rethink on methods and standardization. 64 We offer a different perspective. If this means (and it probably does) putting some alleged advances on hold, that is completely appropriate if these touted advances are not really reproducible and truly useful. If this means (and it probably does) increasing the investment in research training and research infrastructure at the short-term expense of more bench research, that is also appropriate if the investment will eventually improve the yield of whatever bench research is done.
Corrective actions may need to involve several of the stakeholders discussed above in combination (see Table 5 for some specific recommendations). For example, if funding agencies make some right steps but these are not adopted by institutions or journals, the benefit may not be reaped. We also need to recognize that although there are many good ideas on how to improve research practices, their exact implementation can be a challenge. Evidence, ideally from experimental studies, on the effect of changes in research practices would be important to obtain and the effect of proposed changes should be properly monitored.
- Bring standards more in line with clinical trial registration
- Responsibility to protect the animal subjects from undue harm
- Protects both institutions and individuals downstream who rely on preclinical data
- It jeopardizes the strategic advantage of drug developers by enabling free-riding on investments in clinical development.
- If the logic supporting good disclosure practice extends to preclinical studies, does it extend to basic science as well?
- Cost to investigators
Most Preclinical Life Science Research Is Irreproducible Bunk
Biomedical science is broken, according to a new study. In their article, "The Economics of Reproducibility in Preclinical Research," published in the journal PLoS Biology, a team of researchers led by Leonard Freedman of the Global Biological Standards Institute reports that more than half of preclinical research cannot be replicated by other researchers. From the abstract:
Low reproducibility rates within life science research undermine cumulative knowledge production and contribute to both delays and costs of therapeutic drug development. An analysis of past studies indicates that the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%, resulting in approximately US$28,000,000,000 (US$28B)/year spent on preclinical research that is not reproducible—in the United States alone.
Back in 2012 I reported on other studies that also found that the results in about 9 out of 10 landmark biomedical papers could not be reproduced:
The scientific journal Nature published a disturbing commentary claiming that in the area of preclinical research—which involves experiments done on rodents or cells in petri dishes with the goal of identifying possible targets for new treatments in people—independent researchers doing the same experiment cannot get the same result as reported in the scientific literature.
The commentary was written by former vice president for oncology research at the pharmaceutical company Amgen Glenn Begley and M.D. Anderson Cancer Center researcher Lee Ellis. They explain that researchers at Amgen tried to confirm academic research findings from published scientific studies in search of new targets for cancer therapeutics. Over 10 years, Amgen researchers could reproduce the results from only six out of 53 landmark papers. Begley and Ellis call this a "shocking result." It is.
And ten years ago, in his groundbreaking article, "Why Most Published Research Findings Are False," Stanford University statistician John Ioannidis found:
There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias.
Science is supposed to build and organize knowledge in the form of testable explanations and predictions. If reported research results cannot be reliably replicated, they are not science.
The reproducibility “crisis”
The debate over a reproducibility crisis has been simmering for years now, amplified by growing concern over a number of reproducibility studies that have failed to replicate previous positive results. Additional evidence from larger meta-analyses of past papers also points to a lack of reproducibility in biomedical research, with potentially dire consequences for drug development and investment in research. One of the largest meta-analyses concluded that low levels of reproducibility, at best around 50% of all preclinical biomedical research, were delaying lifesaving therapies, increasing pressure on research budgets and raising the costs of drug development 1 . The paper claimed that about US$28 billion a year was spent largely fruitlessly on preclinical research in the USA alone.
A problem of statistics
However, the assertion that a 50% level of reproducibility equates to a crisis, or that many of the original studies were really fruitless, has been disputed by some specialists in replication. “A 50% level of reproducibility is generally reported as being bad, but that is a complete misconstrual of what to expect”, commented Jeffrey Mogil, who holds the Canada Research Chair in Genetics of Pain at McGill University in Montreal. “There is no way you could expect 100% reproducibility, and if you did, then the studies could not have been very good in the first place. If people could replicate published studies all the time then they could not have been cutting edge and pushing the boundaries”.
One reason not to expect 100% reproducibility in preclinical studies is that cutting-edge or exploratory research deals with a great deal of uncertainty and competing hypotheses, of which only a few can be correct. After all, there would be no need to conduct experiments at all if the outcome were completely predictable. For that reason, an initial preclinical study cannot be declared absolutely true or false, but must be judged on the weight of the evidence, usually with a significance test as a tiebreaker. The interpretation of experiments often relies on a probability (P-value) of < 0.05 as the gold-standard threshold for statistical significance, which creates a sharp but somewhat arbitrary cut-off. It means that if a study's results fall only just on the significant side of that cut-off, a replication has a high probability of refuting them, explained Malcolm Macleod, who specializes in meta-analysis of animal studies of neurological diseases at Edinburgh University in the UK. “A replication of a study that was significant just below P = 0.05, all other things being equal and the null hypothesis being indeed false, has only a 50% chance to again end up with a ‘significant’ P-value on replication”, said Macleod. “So many of the so-called ‘replication studies’ may have been false negatives”.
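Macleod's 50% figure follows from basic sampling theory: if the true effect is exactly large enough to land the original study's test statistic on the P = 0.05 threshold, a same-sized replication's statistic is centred on that threshold and clears it only about half the time. A brief simulation (our own hedged sketch, not from the article) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scenario: the original study's z-statistic sat exactly at
# the two-sided 5% critical value (p just below 0.05), and the true
# effect equals the observed one. A replication with the same design and
# sample size then has its z-statistic centred on that critical value.
z_crit = 1.959964  # two-sided 5% critical value of the standard normal

n_sims = 200_000
replication_z = rng.normal(loc=z_crit, scale=1.0, size=n_sims)

# Fraction of replications that again reach "significance"
replication_power = float(np.mean(replication_z > z_crit))
print(round(replication_power, 2))  # ≈ 0.5
```

Increasing the number of simulated replications only tightens the estimate around 50%; the conclusion does not depend on the particular critical value chosen.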
… the assertion that a 50% level of reproducibility equates to a crisis, or that many of the original studies were really fruitless, has been disputed by some specialists in replication
For this reason, replication studies need even greater statistical power than the original, Macleod argued, given that the reason for doing them is to confirm or refute previous results. They need to have "higher n's" than the original studies; otherwise, the replication study is no more likely to be correct than the original.
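A back-of-envelope calculation (our illustration of Macleod's point, not his own numbers) shows how much "higher" the n must be: to give a replication 80% power to confirm an effect that was originally only just significant at P = 0.05, the expected z-statistic must rise from about 1.96 to about 2.80, and since z grows with the square root of the sample size, n must roughly double:

```python
# Standard normal quantiles (assumed two-sided alpha = 0.05, 80% power)
z_alpha = 1.959964   # critical value for significance
z_beta = 0.841621    # quantile corresponding to 80% power

# The expected z-statistic scales with sqrt(n), so shifting it from
# z_alpha (a marginally significant original study) up to z_alpha + z_beta
# inflates the required sample size by:
inflation = ((z_alpha + z_beta) / z_alpha) ** 2
print(round(inflation, 2))  # ≈ 2.04, i.e. about double the original n
```

If the original effect was overestimated, as publication bias makes likely, the required inflation is larger still.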
This leads to a fundamental problem for the life sciences, especially preclinical research: a huge vested interest in positive results has militated against replication. Authors have grants and careers at stake, journals need strong stories to generate headlines, pharmaceutical companies have invested large amounts of money in positive results, and patients yearn for new therapies. There is also a divergence of interest between different parties in the overall research and development pipeline. Preclinical researchers need freedom to explore the borders of knowledge, while clinical researchers rely on replication to weed out false positives.
To address this dichotomy, Mogil and Macleod have proposed a new strategy for conducting health-relevant studies. “Malcolm is studying clinical trials and replicability itself, while I'm at the preclinical end of the spectrum, so our needs and takes on the problem are not the same but our analysis of the potential solution is very similar”, commented Mogil. They suggest a three-stage process to publication whereby the first stage allows for exploratory studies that generate or support hypotheses away from the yoke of statistical rigour, followed by a second confirmatory study, performed with the highest levels of rigour by an independent laboratory. A paper would then only be published after successful completion of both stages. A third stage, involving multiple centres, could then create the foundation for human clinical trials to test new drug candidates or therapies.
“The idea of this compromise is that I get left alone to fool around and not get every single preliminary study passed to statistical significance, with a lot of waste in money and time”, Mogil explained. “But then at some point I have to say ‘I've fooled around enough time that I'm so convinced by my hypothesis that I'm willing to let someone else take over’”. Mogil is aware that this would require the establishment and funding of a network of laboratories to perform confirmatory or replication studies. “I think this a perfect thing for funding agencies so I'm trying to get the NIH (National Institutes of Health) to give a consortium of pain labs a contract”, he added.
There have been various projects to reproduce results, but these merely helped to define the scale of the problem rather than provide solutions, according to Mogil. The two most prominent reproducibility projects, one for psychology and one for cancer research, were set up by the Center for Open Science, a non-profit organization founded in 2013 to “increase the openness, integrity, and reproducibility of scientific research”.
The cancer study reported its results in January 2017 2 , but it raised as many questions as it answered. Two out of five studies “substantially reproduced” the original findings, although not all experiments met the threshold of statistical significance. Two others yielded “uninterpretable results”, and one failed to replicate a paper by Erkki Ruoslahti, from the cancer research centre at Sanford Burnham Prebys Medical Discovery Institute in La Jolla, California. His original study had identified a peptide that appears to help anti-cancer drugs penetrate tumours 3 .
… replication studies need even greater statistical power than the original, Macleod argued, given that the reason for doing them is to confirm or refute previous results.
Ruoslahti has been hotly disputing the results of that replication, arguing that it was a limited study comprising just a single experiment and that the associated meta-analysis ignored previous reproduction of his results by three generations of post docs. “I do disagree with the idea of reproducibility studies”, he said. “If only one experiment is done without any troubleshooting, the result is a tossup. Anything more extensive would be prohibitively costly. So many things can go wrong in an experiment done by someone for the first time. Instead I think we should let the scientific process run its course. Findings that are not correct will disappear because others can't reproduce them or publish divergent results, after an adequate try and hopefully also explaining why the results are different”. Ruoslahti has received support from Tim Errington, manager of the Centre for Open Science's cancer reproducibility project, who agreed that a single failure to replicate should not invalidate a paper.
Method descriptions and biology
Nonetheless, the cancer reproducibility project highlighted a wider problem: that experimental methods or environmental conditions are often not reported in sufficient detail to recreate the original set up accurately. Indeed, the most obvious conclusion was that many papers provide too little detail about their methods, according to Errington. As a result, replication teams have to devote many hours to chase down protocols and reagents, which often had been developed by students or post docs no longer with the team.
The exposure of such discrepancies is itself a positive result from the replication study, Errington asserted, and it has sparked efforts to make experiments more repeatable. “The original authors, just like all scientists, are focused on conducting their current research projects, writing up their results for publication, and writing grants and job applications for advancement”, Errington noted. “Digging through old lab notebooks to find what was previously published is not as high a priority. This points to a gap that can be filled by making this information readily available to complement the publication at the time of publication. We demonstrate a way to do this with each Replication Study, where the underlying methods/data/analysis scripts are made available using https://osf.io. And unique materials that are not available can be made available for the research community to reuse, for replication or new investigations”.
Another major factor that can cause replication to fail is the biology itself. By way of example, the effect of a drug might depend on the particular metabolic or immunological state of an animal, asserted Hanno Würbel from the Division of Animal Welfare at the University of Bern in Switzerland, who has a longstanding interest in reproducibility in research. “If a treatment effect, for example a drug effect, is conditional on some phenotypic characteristics, such as the drug only working under conditions of stress, then it seems inappropriate to speak of a ‘failed’ replication. In that case both study outcomes would be ‘true’ within a certain range of conditions”, he explained. Nonetheless, discrepancies between original and replication studies could indeed enrich research. “Provided all studies were done well, different outcomes of replicate studies would be informative in telling us that conditions matter, and that we need to search further to establish the range of conditions under which a given treatment works”, Würbel said.
Preclinical researchers need freedom to explore the borders of knowledge, while clinical researchers rely on replication to weed out false positives.
Another related issue is the high level of standardization intended to make results as generally valid and reproducible as possible. But, as Würbel emphasized, this can actually have the opposite effect. “The standard approach to evidence generation in preclinical animal research is single-laboratory studies conducted under highly standardized conditions, often for both the genotype of the animals and the conditions under which they are reared, housed and tested”, he said. “Because of this, you can never know for sure whether a study outcome has or hasn't got external validity. If you think about it, this means that replication studies are inherently required by the very nature of the standard approach to preclinical animal research. Yet results of single-laboratory studies conducted under highly standardized conditions are still being sold, that is still getting published, as if the results were externally valid and reproducible, but without any proof. And then people are surprised when replication studies ‘fail’”.
Rodent animal models in particular have been highly standardized as inbred strains with the aim of making results more repeatable by eliminating genetic differences. But it also means that results cannot readily be generalized and different strains can yield different results. This has long been appreciated in some fields such as ageing research where genetic differences can have a huge impact. Steve Austad at the University of Michigan, USA, realized as early as 1999 that relying on genetically homogenous rodents for ageing research often led to conclusions that tend to reflect strain-specific idiosyncrasies. He therefore advocated the development of pathogen-free stocks from wild-trapped progenitors for study of ageing and late-life pathophysiology 4 . It has now inspired calls for greater genetic diversity among laboratory rodents for preclinical research in drug development and ageing.
Publishing negative and confirmation studies
However, perhaps the biggest elephant in the room is publication bias towards positive results and away from the null hypothesis. This affects replication insofar as not just negative but also confirmatory studies tend not to get reported. The distortion can also make failure to replicate more likely by encouraging false positives in the first place. Macleod therefore urges the whole research community to adopt a more upbeat approach to null results, as these can be just as valuable as positive ones. “Maybe we should think of studies as means to provide information. If a study provides information, non-replication is on par with initial reports”, he remarked. “In any case, replication, if well done, and regardless of whether results are in agreement or at variance with the original study, adds information that can be aggregated with the initial study and furthers our evidence”.
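Macleod's observation that replication results can be "aggregated with the initial study" is, in effect, an argument for meta-analysis. As a hedged sketch with made-up numbers (the effect estimates and standard errors below are hypothetical, not drawn from any study discussed here), fixed-effect inverse-variance pooling shows how even a "failed" replication sharpens the overall estimate rather than simply contradicting the original:

```python
# Hypothetical effect estimates and standard errors (illustrative only):
original = (0.60, 0.25)     # "significant" original study (z = 2.4)
replication = (0.10, 0.25)  # "failed" replication of the same design

def pool(studies):
    """Fixed-effect inverse-variance pooled estimate and standard error."""
    weights = [1.0 / se ** 2 for _, se in studies]
    estimate = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
    se = (1.0 / sum(weights)) ** 0.5
    return estimate, se

pooled_est, pooled_se = pool([original, replication])
# The pooled estimate sits between the two studies, with a smaller
# standard error than either alone: the replication adds information.
print(round(pooled_est, 2), round(pooled_se, 3))  # prints: 0.35 0.177
```

With equal precision the pooled estimate is simply the average of the two, and its standard error shrinks by a factor of the square root of two, which is exactly the "furthering of evidence" Macleod describes.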
… experimental methods or environmental conditions are often not reported in sufficient detail to recreate the original set up accurately
There has been some recognition of the need to promote null results, notably through the Journal of Negative Results in Biomedicine (JNRBM). Surprisingly, it is scheduled to cease publication by BioMed Central in September, on the grounds that its mission has been accomplished. The publisher argues that results which would previously have remained unpublished were now appearing in other journals. Many though would contend that null results are still greatly underrepresented in the literature and that there is a shortage of both resources and motivation for replication studies in general.
On that front, there are some promising initiatives though, such as StudySwap, a platform hosted by the Center for Open Science, to help biologists find suitable collaborators for replication studies. “StudySwap allows scholars from all over the world to replicate effects before they ever get published”, explained Martin Schweinsberg, assistant professor of organizational behaviour at the European School of Management and Technology, Berlin. “Such pre-publication independent replications (PPIRs) are important because they allow scientists to better understand the phenomenon they're studying before they submit a paper for publication”. According to Christopher Chartier from the Department of Psychology at Ashland University, Ohio, USA, a common use of StudySwap will be for two or more researchers concurrently collecting data at several different sites to combine the samples for analysis. “These types of study would result in larger and more diverse samples than any of the individuals”, he remarked. However, StudySwap is currently run by volunteers and is not yet geared up for large-scale replication work. This will require active support from major funding agencies, and there are welcome signs of this happening, according to Brian Nosek, executive director of the Center for Open Science. “For example, the NWO (Netherlands Organisation for Scientific Research) has a 3 million Euro funding line for replications of important discoveries”, he said.
The role of journals and funders
Journals also have an important role to play, Nosek added, by providing incentives for replications. He highlighted the TOP Guidelines (http://cos.io/top/) from the Center for Open Science, which specify replication as one of their eight key standards for advancing transparency in research and publication. “TOP is gaining traction across research communities with about 3,000 journal signatories so far”, said Nosek. “COS also offers free training services for doing reproducible research, and fosters adoption of incentives that can make research more open and reproducible, such as badges to acknowledge open practices”. Another COS initiative called Registered Reports (http://cos.io/rr/) promotes a publishing model where peer review is conducted prior to the outcomes of the research being known.
Many […] would contend that null results are still greatly underrepresented in the literature, and that there is a shortage of both resources and motivation for replication studies in general
There is no shortage of projects focusing on replication, and there are signs of funding bodies devoting resources, as in the Netherlands. “This is changing, and research assessment structures are I think open to measures beyond the grants in/papers out approach”, Macleod said. “The real challenge is with individual institutions and their policy and practices around academic promotion, tenure, and so on, which are still largely wedded to outdated measures such as Impact Factor and author position”.
Journals in particular have a responsibility and could help by changing outdated methods of reward and insisting on more detailed method descriptions, according to Matt Hodgkinson, Head of Research Integrity at Hindawi, one of the largest open-access journal publishers. “Incentive structures now reward publication volume and being first”, he said. “Citations are currently counted and more is considered better for the authors and journal, which can perversely reward controversial findings that fail to replicate. Instead funders and institutions should reward quality of reporting, replicability, and collaborations”. Hodgkinson added that journals should also abandon space constraints on the methods sections to allow authors to describe the experimental procedures and conditions in much more detail. In fact, a growing number of journals and publishers encourage authors to provide more details on experiments and to format their methods section so as to make it easier for other researchers to reproduce their results.
Whatever measures are taken to improve the reproducibility of biomedical research, they will have to involve all actors, from researchers to funding agencies to journals and, eventually, commercial players as the main customers of academic research. In addition, they may also require new ways to conduct research and validate experimental data.