A Decade of Replications: Lessons from the Quarterly Journal of Political Science

Editor’s note: this piece is contributed by Nicholas Eubank, a PhD Candidate in Political Economy at the Stanford University Graduate School of Business.

The success of science depends critically on the ability of peers to interrogate published research in an effort not only to confirm its validity but also to extend its scope and probe its limitations. Yet as social science has become increasingly dependent on computational analyses, traditional means of ensuring the accessibility of research — like peer review of written academic publications — are no longer sufficient. To truly ensure the integrity of academic research moving forward, it is necessary that published papers be accompanied by the code used to generate results. This will allow other researchers to investigate not just whether a paper’s methods are theoretically sound, but also whether they have been properly implemented and are robust to alternative specifications.

Since its inception in 2005, the Quarterly Journal of Political Science (QJPS) has sought to encourage this type of transparency by requiring all submissions to be accompanied by a replication package, consisting of data and code for generating paper results. These packages are then made available with the paper on the QJPS website. In addition, all replication packages are subject to internal review by the QJPS prior to publication. This internal review includes ensuring that the code executes smoothly, that results from the paper can be easily located, and that the results generated by the replication package match those in the paper.

This policy is motivated by the belief that publication of replication materials serves at least three important academic purposes. First, it helps directly ensure the integrity of results published in the QJPS. Although the in-house screening process constitutes a minimum bar for replication, it has nevertheless identified a remarkable number of problems in papers. In the last two years, for example, 14 of the 24 empirical papers subject to in-house review were found to have discrepancies between the results generated by authors’ own code and the results in their written manuscripts.

Second, by emphasizing the need for transparent and easy-to-interpret code, the QJPS hopes to lower the costs associated with other scholars interrogating the results of existing papers. This increases the probability other scholars will examine the code for published papers, potentially identifying errors or issues of robustness if they exist. In addition, while not all code is likely to be examined in detail, it is the hope of the QJPS that this transparency will motivate submitting authors to be especially cautious in their coding and robustness checks, preventing errors before they occur.

Third and finally, publication of transparent replication packages helps facilitate research that builds on past work. Many papers published in the QJPS represent methodological innovations, and by making the code underlying those innovations publicly accessible, we hope to lower the cost to future researchers of building on existing work.

(1) In-House Replication

The experience of the QJPS in its first decade underscores the importance of its policy of in-house review. Prior to publication, all replication packages are tested to ensure code runs cleanly, is interpretable, and generates the results in the paper.

This level of review represents a sensible compromise between the two extremes of review. On the one hand, most people would agree that an ideal replication would consist of a talented researcher re-creating a paper from scratch based solely on the paper’s written methodology section. However, undertaking such replications for every submitted paper would be cost-prohibitive in time and labor, as would having someone check an author’s code for errors line-by-line. On the other hand, direct publication of replication packages without review is also potentially problematic. Experience has shown that many authors submit replication packages that are extremely difficult to interpret or may not even run, defeating the purpose of a replication policy.

Given that the QJPS review is relatively basic, however, one might ask whether it is even worth the considerable time the QJPS invests. Experience has shown the answer is an unambiguous “yes.” Of the 24 empirical papers subject to in-house replication review since September 2012, [1] only 4 packages required no modifications. Of the remaining 20 papers, 13 had code that would not execute without errors, 8 failed to include code for results that appeared in the paper, [2] and 7 failed to include installation directions for software dependencies. Most troubling, however, 14 (58 percent) had results in the paper that differed from those generated by the author’s own code. Some of these issues were relatively small — likely arising from rounding errors during transcription — but in other cases they involved incorrectly signed or mis-labeled regression coefficients, large errors in observation counts, and incorrect summary statistics. Frequently, these discrepancies required changes to full columns or tables of results. Moreover, Zachary Peskowitz, who served as the QJPS replication assistant from 2010 to 2012, reports similar levels of replication errors during his tenure as well. The extent of the issues — which occurred despite authors having been informed their packages would be subject to review — points to the necessity of this type of in-house interrogation of code prior to paper publication.

(2) Additional Considerations for a Replication Policy

This section presents an overview of some of the most pressing and concrete considerations the QJPS has come to view as central to a successful replication policy. These considerations — and the specific policies adopted to address them — are the result of hard-learned lessons from a decade of replication experience.

2.1 Ease of Replication
The primary goal of QJPS policies is ensuring replication materials can be used and interpreted with the greatest of ease. To the QJPS, ease of replication means anyone who is interested in replicating a published article (hereafter, a “replicator”) should be able to do so as follows:

  1. Open a README.txt file in the root replication folder, and find a summary of all replication materials in that folder, including subfolders if any.
  2. After installing any required software (see Section 2.4 on Software Dependencies) and setting a working directory according to directions provided in the README.txt file, the replicator should be able simply to open and run the relevant files to generate every result and figure in the publication. This includes all results in print and/or online appendices. (See the sketch following this list.)
  3. Once the code has finished running, the replicator should be able easily to locate the output and to see where that output is reported in the paper’s text, footnotes, figures, tables, or appendices.
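
For example, an R-based package might ship a single master script that a replicator can run from the root replication folder to regenerate everything. A minimal sketch, with hypothetical file names:

    # run_all.R -- master replication script (file names are hypothetical)
    # Set the working directory to the root replication folder before running,
    # as directed in README.txt.
    source("01_clean_data.R")    # builds the analysis dataset from the original variables
    source("02_main_results.R")  # reproduces the tables in the paper
    source("03_figures.R")       # reproduces the figures and appendix results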

2.2 README.txt File

To facilitate ease of replication, all replication packages should include a README.txt file that includes, at a minimum:

  1. Table of Contents: a brief description of every file in the replication folder.
  2. Notes for Each Table and Figure: a short list of where replicators will find the code needed to replicate all parts of the publication.
  3. Base Software Dependencies: a list of all software required for replication, including the version of software used by the author (e.g. Stata 11.1, R 2.15.3, 32-bit Windows 7, OSX 10.9.4).
  4. Additional Dependencies: a list of all libraries or added functions required for replication, as well as the versions of the libraries and functions that were used and the location from which those libraries and functions were obtained.
    1. R: the current R version can be found by typing R.Version(), and information on loaded libraries can be found by typing sessionInfo() (see the sketch after this list).
    2. Stata: Stata does not specifically “load” extra functions in each session, but a list of all add-ons installed on a system can be found by typing ado dir.
  5. Seed locations: Authors are required to set seeds in their code for any analyses that employ randomness (e.g., simulations or bootstrapped standard errors; see Section 2.5 for further discussion). The README.txt file should include a list of locations where seeds are set in the analyses so that replicators can find and change the seeds to check the robustness of the results.
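
For R-based packages, the version and dependency information in items 3 and 4 can be captured directly from the session that produced the results and then pasted into the README.txt file. A minimal sketch, with a hypothetical output file name:

    # record_session.R -- capture software versions for README.txt
    # (hypothetical helper script; run in the session that produced the results)
    info <- c(R.Version()$version.string,      # base R version
              capture.output(sessionInfo()))   # loaded libraries and their versions
    writeLines(info, "session_info.txt")       # copy the relevant lines into README.txt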

2.3 Depth of Replication

The QJPS requires that every replication package include the code that computes the primary results of the paper. In other words, it is not sufficient to provide a file of pre-computed results along with the code that formats the results for LaTeX. Rather, the replication package must include everything that is needed to execute the statistical analyses or simulations that constitute the primary contribution of the paper. For example, if a paper’s primary contribution is a set of regressions, then the data and code needed to produce those regressions must be included. If a paper’s primary contribution is a simulation, then code for that simulation must be provided—not just a dataset of the simulation results. If a paper’s primary contribution is a novel estimator, then code for the estimator must be provided. And, if a paper’s primary contribution is theoretical and numeric simulation or approximation methods were used to provide the equilibrium characterization, then that code must be included.

Although the QJPS does not necessarily require the submitted code to access the data if the data are publicly available (e.g., data from the National Election Studies, or some other data repository), it does require that the dataset containing all of the original variables used in the analysis be included in the replication package. For the sake of transparency, the variables should be in their original, untransformed and unrecoded form, with code included that performs the transformations and recodings in the reported analyses. This allows replicators to assess the impact of transformations and recodings on the results.
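
For instance, rather than shipping a dataset in which a logged or dichotomized variable has already been constructed, the package should ship the original variables along with a script that performs those steps. A minimal sketch in R, with hypothetical file and variable names:

    # 01_clean_data.R -- transformations and recodings performed in code
    # (file and variable names are hypothetical)
    raw <- read.csv("data/original_variables.csv")        # original, untransformed variables

    raw$log_income   <- log(raw$income + 1)               # transformation visible to replicators
    raw$high_turnout <- as.integer(raw$turnout > 0.5)     # recoding visible to replicators

    write.csv(raw, "data/analysis_dataset.csv", row.names = FALSE)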

2.3.1 Proprietary and Non-Public Data
If an analysis relies on proprietary or non-public data, authors are required to contact the QJPS Editors before or during initial submission. Even when data cannot be released publicly, authors are often required to provide QJPS staff access to data for replication prior to publication. Although this sometimes requires additional arrangements — in the past, it has been necessary for QJPS staff to be written into IRB authorizations — in-house review is especially important in these contexts, as papers based on non-public data are difficult if not impossible for other scholars to interrogate post-publication.

2.4 Software Dependencies
Online software repositories — like CRAN or SSC — provide authors with easy access to the latest versions of powerful add-ons to standard programs like R and Stata. Yet the strength of these repositories — their ability to ensure authors are always working with the latest version of add-ons — is also a liability for replication.

Because online repositories always provide the most recent version of add-ons to users, the software provided in response to a given query actually changes over time. Experience has shown this can cause problems when authors use calls to these repositories to install add-ons (through commands like install.packages("PACKAGE") in R or ssc install PACKAGE in Stata). As scholars may attempt to replicate papers months or years after a paper has been published, changes in the software provided in response to these queries may lead to replication failures. Indeed, the QJPS has experienced replication failures due to changes in the software hosted on the CRAN server that occurred between when a paper was submitted to the QJPS and when it was reviewed.

With that in mind, the QJPS now requires authors to include copies of all software (including both base software and add-on functions and libraries) used in the replication in their replication package, as well as code that installs these packages on a replicator’s computer. The only exceptions are extremely common tools, like R, Stata, Matlab, Java, Python, or ArcMap (although Java- and Python-based applications must be included). [3]
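
In R, for example, an archived copy of an add-on library can be installed from a file shipped inside the replication package rather than from CRAN, so replicators get exactly the version the author used. A minimal sketch, with hypothetical folder and file names:

    # install_dependencies.R -- install the archived add-on versions shipped
    # with the package (folder and file names are hypothetical)
    install.packages("libraries/examplepkg_1.2-3.tar.gz",
                     repos = NULL, type = "source")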

2.5 Randomizations and Simulations

A large number of modern algorithms employ randomness in generating their results (e.g., the bootstrap). In these cases, replication requires both (a) ensuring that the exact results in the paper can be re-created, and (b) ensuring that the results in the paper are typical rather than cherry-picked outliers. To facilitate this type of analysis, authors should:

  1. Set a random number generator seed in their code so it consistently generates the exact results in the paper;
  2. Provide a note in the README.txt file indicating the location of all such commands, so replicators can remove them and test the representativeness of the results (see the sketch below).
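
A minimal sketch of what this looks like in R, using a bootstrapped standard error with stand-in data; commenting out the set.seed() line (with its location noted in README.txt) and re-running shows whether the reported figure is typical:

    # bootstrap_se.R -- seed set so the exact published number reproduces
    set.seed(20150622)                    # location of this call noted in README.txt

    x <- rnorm(500)                       # stand-in for the analysis data (hypothetical)
    boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
    round(sd(boot_means), 3)              # bootstrapped standard error reported in the paper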

In spite of these precautions, painstaking experience has shown that setting a seed is not always sufficient to ensure exact replication. For example, some libraries generate slightly different results on different operating systems (e.g. Windows versus OSX) and on different hardware architectures (e.g. 32-bit Windows 7 versus 64-bit Windows 7). To protect authors from such surprises, we encourage authors to test their code on multiple platforms, and document any resulting exceptions or complications in their README.txt file.

2.6 ArcGIS
Although we encourage authors to write replication code for their ArcGIS-based analyses using the ArcPy scripting utility, we recognize that most authors have yet to adopt this tool. For the time being, the QJPS accepts detailed, step-by-step instructions for replicating results via the ArcGIS Graphical User Interface (GUI). However, as with the inclusion and installation of add-on functions, the QJPS has made a tutorial on using ArcPy available to authors, which we hope will accelerate the transition toward use of this tool. [4]

(3) Advice to Authors
In addition to the preceding requirements, the QJPS also provides authors with some simple guidelines to help prevent common errors. These suggestions are not mandatory, but they are highly recommended.

  1. Test files on a different computer, preferably with a different operating system: Once replication code has been prepared, the QJPS suggests authors email it to themselves, then unzip and run it on a different computer. Code often contains small dependencies—things like undocumented software requirements or specific file locations—that go unnoticed until replication. Running code on a different computer often exposes these issues in a way that running the code on one’s own does not.
  2. Check every code-generated result against your final manuscript PDF: The vast majority of replication problems emerge because authors either modified their code but failed to update their manuscript, or made an error while transcribing their results into their paper. With that in mind, authors are strongly encouraged to print out a copy of their manuscript and check each result before submitting the final version of the manuscript and replication package.

(4) Conclusion

As the nature of academic research changes, becoming ever more computationally intense, so too must the peer review process. This paper provides an overview of many of the lessons learned from the QJPS’s attempt to address this need. Most importantly, however, it documents not only the importance of requiring the transparent publication of replication materials but also the strong need for in-house review of these materials prior to publication.

[1] September 2012 is when the author took over responsibility for all in-house interrogations of replication packages at the QJPS.

[2] This does not include code which failed to execute, which might also be thought of as failing to replicate results from the paper.

[3] To aid researchers in meeting this requirement, detailed instructions on how to include CRAN or SSC packages in replication packages are provided through the QJPS.

[4] ArcPy is a Python-based tool for scripting in ArcGIS.

Editor’s Note [6/22/2015]: the number of packages with discrepancies was corrected from 13 (54%) to 14 (58%) at the author’s request.


12 Responses to A Decade of Replications: Lessons from the Quarterly Journal of Political Science

  1. CDZ Neouts says:

    Support the replication movement, but have some awareness, man. Using words like interrogate or probe the results is not going to win anyone over. This type of word choice is probably more likely to turn people against replication. There does not need to be an air of superiority or “I know better than you” among the replication movement.

  2. Pingback: Links diversos: Discriminação de preços em passagens, vídeo Piketty na USP e replicação dos códigos dos artigos. | Análise Real

  3. Pingback: is economics partisan? is all of social science just wrong? | orgtheory.net

  4. I can understand how all this is frustrating from the perspective of a replicator, but many of these issues (eg, a missing “install_packages()” function) give clear error messages that are fairly easy to fix and others (eg, quirks of PRNG) may be tricky to debug for exact replication, but usually won’t make a difference (unless the author did something stupid like calculate b.s.e. based on 50 resamples, and I would argue that in such a circumstance not specifying the random seed is not a bug but a feature as it makes the inadequate resamples more obvious).
    From the perspective of a reader though the main thing that matters is the number of papers that are wrong. You said 13/24 “had results in the paper that differed from those generated by the author’s own code,” and this is the sort of error the average reader would care about, but it’s also a very heterogeneous category that you said ranges from minor rounding or transcription error to coefficients w reverse sign. Can you give a better sense of how many meaningful errors in results there were? That is, how many of the 13/24 were not just that the third decimal place of a control variable’s error term was off due to rounding or resampling variation, but errors that would make a difference to testing of enumerated hypotheses? Here’s another way to put it: how many of these errors were the sort of thing that might justify a subsequent comment and reply if you had not caught them before publication?

    • nicholaseubank says:

      Good points.

      With regard to the first, while some problems do give rise to error messages (like a missing package), the solutions are not always straightforward. If the package only requires the current CRAN version, then it is easy to fix, but oftentimes it requires either an already-out-of-date version of a package, or something not on CRAN. In either case, these are non-trivial impediments to future replication.

      Regarding the nature of errors, it is true that it would be good to know the number of errors that would necessitate subsequent comment. The problem is that this is a much more difficult question to answer — I think to do it right, we’d need to survey referees about what changes influence their views on a paper. We do have many errors that are not third decimal place rounding errors, but whether they would influence a referee’s view is another question.

      I also want to emphasize that the argument being made here is not that most unreviewed work is wrong. Rather, the argument is primarily that simply requiring replication packages be released with published papers may not accomplish the aims of journal editors if those packages are not reviewed.

      • Thanks. Very true about how library updates can change things (and you’re very clear about this in the post). Of course you also say hardware and OS versions can change things too and I shudder at trying to replicate something from ten years ago and trying to get my hands on a working PPC G4 running OS X 10.3.6, even if I have archived copies of the appropriate versions of the libraries. Ultimately my reading of your post is that even if you archive your data and script, results are not necessarily really replicable since that script assumes particular hardware, OS, software, libraries, PRNG seeds, etc. and so even though we think of code as replicable, this is only approximately true. To paraphrase Heraclitus, you can’t run the same regression twice.
        And I definitely agree that distinguishing meaningful from trivial errors is inherently subjective and so hard to calculate, but it’s also the question that many people ultimately care about. I appreciate that your point is the great difficulty of ensuring replicability, and that you have laid this out in admirable detail, but many of the links I saw to this post were perhaps misreading it as whether we can trust results. I don’t blame you for misreadings, but I was curious if you could illuminate the broader questions implied by them.

  5. nicholaseubank says:

    Yes, the “legacy” problem is definitely a large one.

    Regarding the magnitude of errors, what I can say is that many of these errors are not rounding errors in the last decimal place. As I mentioned in the article, we’ve seen sign errors, mis-labeled regression coefficients (i.e. coefficient A labeled as being for variable B and vice versa) and numeric errors that were not in the trailing digits. I think by any standard, those are things we do not expect to see in top-tier papers.

    To speculate for a moment, I think one of the reasons others find that particular paragraph troubling is that it suggests that the social sciences have yet to approach our coding with the same level of care as our writing and theory. In a sense, generating the results in one’s paper is more or less the lowest thing we can expect from replication files, and if we are regularly falling short of that, I think it leads many to worry about what else may be slipping through the cracks. To be clear, this is speculation — the QJPS does not have the resources to do the type of full analyses of these files to test this, so I don’t have systematic data on this question — but I can say anecdotally that when I have attempted to use replication files from top authors in my own research (for example, papers in the AER), I have found substantive problems with the code that go beyond this type of error, and others I have spoken with have as well.

    I spend a reasonable amount of time in the computer science department, and one thing that experience has taught me is that social scientists have a somewhat naive approach to programming. Essentially, I think our philosophy is “just be careful and don’t make mistakes”. In the computer sciences, by contrast, it is ASSUMED that errors will be made when programming, and so style guidelines have developed to both minimize the opportunities for errors to sneak in and also to ensure they get caught.

  6. Pingback: Is There a Quantitative Turn in the History of Economics (and how not to screw it up)? | The Undercover Historian

  7. Pingback: In God We Trust, All Others Show Me Your Code | Dr. Ghulam Mohey-ud-din, PhD

  8. Pingback: 2015 TPM Annual Most Viewed Post Award Winner: Nicholas Eubank | The Political Methodologist

  9. Pingback: Embrace Your Fallibility: Thoughts on Code Integrity | The Political Methodologist
