
Teaching reproducible research in bioinformatics

Robert Castelo
robert.castelo@upf.edu - @robertclab
Dept. of Experimental and Health Sciences, Universitat Pompeu Fabra

1st. HEIRRI Conference - Barcelona - March 18th, 2016



Bioinformatics



Funding and scientific output

Funding for genomics research has reached $2.9 billion/year worldwide (Pohlhaus & Cook-Deegan, BMC Genomics, 9:472, 2008).

The number of articles published every year in the biomedical literature (as indexed in PubMed) has doubled in the last 15 years.

[Figure: Nr. of articles in PubMed per year of publication, 1950-2015.]

Source: http://www.ncbi.nlm.nih.gov/pubmed



Problems: drug development stagnation

The number of new drugs approved per billion US dollars spent on R&D has halved approximately every 9 years since 1950, falling around 80-fold in inflation-adjusted terms (Scannell et al., Nat Rev Drug Discov, 11:191-200, 2012).

Source: Fig. 1. Scannell et al. Diagnosing the decline in pharmaceutical R&D efficiency. Nat Rev Drug Discov, 11:191-200, 2012.



Problems: lack of reproducibility

[Screenshots of two articles:]

Ioannidis JPA et al. Repeatability of published microarray gene expression analyses. Nat Genet, 41:149, 2009 (doi:10.1038/ng.295). From the abstract: "We evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006. One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis."

Prinz F, Schlange T & Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov, 10:712, 2011. From the correspondence: "validation projects that were started in our company based on exciting published data have often resulted in disillusionment when key data could not be reproduced."


Problems: growth of retractions

[Screenshot: Fang FC, Steen RG & Casadevall A. Misconduct accounts for the majority of retracted scientific publications. Proc Natl Acad Sci USA, 109:17028-17033, 2012. From the abstract: "A detailed review of all 2,047 biomedical and life-science research articles indexed by PubMed as retracted on May 3, 2012 revealed that only 21.3% of retractions were attributable to error. In contrast, 67.4% of retractions were attributable to misconduct, including fraud or suspected fraud (43.4%), duplicate publication (14.2%), and plagiarism (9.8%). [...] The percentage of scientific articles retracted because of fraud has increased ∼10-fold since 1975."]

The rate of retractions is growing and is higher in high-impact-factor (HIF) journals.

[Figure: Brembs et al., Fig. 1, current trends in the reliability of science, including an exponential fit to PubMed retraction notices over time (data from pmretract.heroku.com) and the relationship between journal impact factor, individual study bias scores in candidate gene studies, and Fang and Casadevall's Retraction Index.]

Source: Fig. 1. Brembs et al. Deep impact: unintended consequences of journal rank. Front Hum Neurosci, 7:291, 2013.

"Too good to be true" results, reviewers' conflicts of interest and poor methodology seem to be potential causes (Brembs et al., 2013).



Society, journals and funding agencies start reacting

[Screenshot: Collins FS & Tabak LA. Policy: NIH plans to enhance reproducibility. Nature, 505:612-613, 2014. "Francis S. Collins and Lawrence A. Tabak discuss initiatives that the US National Institutes of Health is exploring to restore the self-correcting nature of preclinical research."]

Highlighted on the slide: "poor training probably responsible for at least some of the challenges" (of reproducibility).




Reproducibility vs Replicability

Reproducibility: start from the same samples/data, use the same methods, get the same results.

Replicability: conduct the experiment again with independent samples and/or methods to obtain confirmatory results. Replicability = Reproducibility + conducting the experiment again.

Replicability and reproducibility might be challenging in epidemiology (recruiting a cohort again) or molecular biology (complex cell manipulation).



Reproducible research in bioinformatics

In computationally oriented research (e.g., bioinformatics) reproducibility is always feasible (Claerbout and Karrenbach. Electronic documents give reproducible research a new meaning, 1992).

[Screenshot: Peng RD. Reproducible Research in Computational Science. Science, 334:1226-1227, 2011, including Fig. 1, the spectrum of reproducibility, ranging from publication only ("not reproducible") through publication plus code, code and data, and linked and executable code and data, up to full replication (the "gold standard").]


The software toolkit is already there (literate programming, version control systems, unit testing, data and code repositories, etc.).
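As a toy illustration of the unit-testing piece of that toolkit, the following sketch assumes the testthat R package; the log2cpm() helper is a made-up example function, not part of any course material.

library(testthat)

# A small analysis helper: counts-per-million on a log2 scale, with a pseudocount.
log2cpm <- function(counts) {
  log2(t(t(counts) / colSums(counts)) * 1e6 + 1)
}

test_that("log2cpm keeps dimensions and returns finite values", {
  set.seed(123)
  counts <- matrix(rpois(20, lambda = 10), nrow = 5)  # 5 genes x 4 samples
  m <- log2cpm(counts)
  expect_equal(dim(m), dim(counts))
  expect_true(all(is.finite(m)))
})

Keeping such tests in a script and running them with testthat::test_file() flags regressions whenever the analysis code changes.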

"An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."

–Jon Claerbout, Stanford University




Reproducible research in bioinformatics

Short project within the "Information Extraction from Omics Technologies" (IEO) subject of the UPF MSc Program in Bioinformatics for Health.

Students pick a publicly available high-throughput gene expression profiling data set.

They try to reproduce one of the figures of the accompanying paper or address a simple question on differential expression with a subset of the data.

Data analyses must be reproducible: coded in R (statistical language/software), within a dynamic markdown document (literate programming) that can automatically generate results and figures (see the sketch below).

Results must be described in a short 4-6 page report following a given template, where the coded analyses constitute its supplementary material.

Evaluation uses a rubric the students know in advance: justified analysis decisions, paper sectioning, text flow, figure labeling, self-contained captions, etc.
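A minimal sketch of what such a dynamic document might look like, assuming the rmarkdown/knitr toolchain plus the Bioconductor packages GEOquery, Biobase and limma are installed; the GEO accession "GSE12345" and the phenotype column used to define the two groups are placeholders, not actual course data.

---
title: "Differential expression analysis of GSE12345"
author: "Student name"
output: html_document
---

```{r setup, message=FALSE}
# Packages used throughout the analysis.
library(GEOquery)  # fetch the public data set from GEO
library(Biobase)   # ExpressionSet accessors (exprs, pData)
library(limma)     # linear models for differential expression
```

```{r data}
# getGEO() returns a list of ExpressionSet objects; take the first one.
eset <- getGEO("GSE12345")[[1]]
```

```{r de-analysis}
# Two-group comparison; the phenotype column name is a placeholder.
group  <- factor(pData(eset)$characteristics_ch1)
design <- model.matrix(~ group)
fit    <- eBayes(lmFit(exprs(eset), design))
topTable(fit, coef = 2)  # top differentially expressed probes
```

```{r session-info}
sessionInfo()  # record R and package versions for reproducibility
```

Rendering the file with rmarkdown::render("report.Rmd") re-runs every chunk and regenerates all results and figures from the raw data, which is the property the evaluation rubric rewards.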



Concluding remarks

Replicability is at the core of the scientific method, but might be challenging to achieve.

Reproducibility is a minimum standard and is always feasible in bioinformatics, given that code and data are available.

The software toolkit for reproducible research is available and growing (see, e.g., Roger Peng's Coursera materials at http://github.com/rdpeng/courses).

“The most important tool is the mindset, when starting, that the end product will be reproducible” –Keith Baggerly, MD Anderson (shared via Twitter by @kwbroman)

