To facilitate further discussion of these proposals—and perhaps to begin to develop an actionable plan for reform—the International Methods Colloquium (IMC) hosted a panel discussion on “reproducibility and a stricter threshold for statistical significance” on October 27, 2017. The one-hour discussion included six panelists and over 240 attendees, with each panelist giving a brief initial statement concerning the proposal to “redefine statistical significance” and the remainder of the time being devoted to questions and answers from the audience. The event was recorded and can be viewed online for free at the International Methods Colloquium website.

Unfortunately, the IMC’s time limit of one hour prevented many audience members from asking their questions and having a chance to hear our panelists respond. Panelists and audience members alike agreed that the time limit was not adequate to fully explore all the issues raised by Benjamin et al. (2017). Consequently, questions that were not answered during the presentation were forwarded to all panelists, who were given a chance to respond.

The questions and answers, both minimally edited for clarity, are presented in this article. The full series of questions and answers (and this introduction) are embedded in the PDF below.


You can find a direct link to a downloadable version of the print edition here [update: a version with a minor correction has been added as of 5:23 PM on 9/26/2017]:

We are very excited to announce a new Minnesota Political Methodology Colloquium (MPMC) initiative: the Minnesota Political Methodology Graduate Student Conference. The conference is scheduled for May 4 & May 5, 2018.

The Minnesota Political Methodology Graduate Student Conference is designed to provide doctoral students with feedback on their research from peers and faculty. Research papers may focus on any substantive topic, employ any research methodology, and/or be purely methodological. We are particularly interested in novel applied work to interesting and important questions in political science, sociology, psychology, and related fields.

The conference represents a unique opportunity for graduate students in different programs, across different disciplines, and with different substantive interests to network and receive feedback on their work. Papers will receive feedback from a faculty discussant, written feedback from other panelists, and comments/suggestions from audience members.

The conference will occur over two days (May 4 and May 5, 2018) and feature at least 24 presentations in 6 panels. Proposals are due December 1, 2017.

Our keynote speaker for the event is Sara Mitchell, F. Wendell Miller Professor of Political Science at the University of Iowa.

Details about the conference may be found here.

Questions should be addressed to mpmc@umn.edu

MacKinnon and Webb offer a useful analysis of how the uncertainty of causal effects can be underestimated when observations are clustered and the treatment is applied to a very large or very small share of the clusters. Their mathematical exposition, simulation exercises, and replication analysis provide a helpful guide for how to proceed when data are poorly behaved in this way. These are valuable lessons for researchers studying impacts of policy in observational data where policies tend to be sluggish and thus do not generate much variability in the key explanatory variables.

**Correction of Two Errors**

MacKinnon and Webb find two errors in our analysis, while nonetheless concluding “we do not regard these findings as challenging the conclusions of Burden et al. (2017).” Although we are embarrassed by the mistakes, we are also grateful for their discovery.^{1} Our commitment to transparency is reflected by the fact that the data have been public for replication purposes since well before the article was published. We have posted corrected versions of the replication files and published a corrigendum with the journal where the article was originally published.

Fortunately, none of the other analyses in our article were affected. It is only Table 7 where errors affect the analysis. Tables 2 through 6 remain intact.

We concede that when corrections are made the effect of early voting drops from statistical significance in the model of the difference in the Democratic vote between 2008 and 2012. All of the various standard errors they report are far too large to reject the null hypothesis.

**The Problem of Limited Variation**

The episode highlights the tradeoffs that researchers face between applying what appears to be a theoretically superior estimation technique (i.e., difference-in-difference) and the practical constraints of a particular application (i.e., limited variation in treatment variables) that make its use intractable. In the case of our analysis, election laws do not change rapidly, and the conclusions of our analysis were largely based on cross-sectional analyses (Tables 2-6), with the difference-in-difference largely offered as a supplemental analysis.

We are in agreement with MacKinnon and Webb that models designed to estimate causal effects (or even simple relationships) may be quite tenuous when the number of clusters is small and the clusters are treated in a highly unbalanced fashion. In fact, we explained our reluctance to apply the difference-in-difference model to our data because of the limited leverage available. We were explicit about our reservations in this regard. As our article stated:

“A limitation of the difference-in-difference approach in our application is that few states actually changed their election laws between elections. As Table A1 (see Supplemental Material) shows, for some combinations of laws there are no changes at all. For others, the number of states changing is as low as one or two. As result, we cannot include some of the variables in the model because they do not change. For some other variables, the interpretation of the coefficients would be ambiguous given the small number of states involved; the dummy variables essentially become fixed effects for one or two states” (p. 572).

This is unfortunate in our application because the difference-in-difference models are likely to be viewed as more convincing than the cross-sectional models. This is why we offered theory suggesting that the more robust cross-sectional results were not likely to suffer from endogeneity.

The null result in the difference-in-difference models is not especially surprising given our warning above about the limited leverage provided by the dataset. Indeed, the same variable was insignificant in our model of the Democratic vote between 2004 and 2008 that we also reported in Table 7. We are left to conclude that the data are not amenable to detecting effects using difference-in-difference models. Perhaps researchers will collect data from more elections to provide more variation in the key variable and estimate parameters more efficiently.

In addition to simply replicating our analysis, MacKinnon and Webb also conduct an extension to explore asymmetric effects. They separate the treated states into those where early voting was adopted and where early voting was repealed. We agree that researchers ought to investigate such asymmetries. We recommended as much in our article: “As early voting is being rolled back in some states, future research should explore the potential asymmetry between the expansion and contraction of election practices” (p. 573). However, we think this is not feasible with existing data. As MacKinnon and Webb note, only two states adopted early voting and only one state repealed early voting. As a result, analyzing these cases separately as they do essentially renders the treatment variables to be little more than fixed effects for one or two states, as we warned in our article. The coefficients might be statistically significant using various standard error calculations, but it is not clear that MacKinnon and Webb are actually estimating the treatment effects rather than something idiosyncratic about one or two states.

**Conclusion**

While the errors made in our difference-in-difference analysis were regrettable, we think the greater lesson from the skilled analysis of MacKinnon and Webb is to raise further doubt about whether this tool is suitable in such a policy setting. All else equal, it may offer a superior mode of analysis; but all else is not equal. Researchers need to find the best mode of analysis to fit the limitations of the data.

**Footnotes**

Matthew D. Webb, Department of Economics, Carleton University

**Extended Abstract**

There is a large and rapidly growing literature on inference with clustered data, that is, data where the disturbances (error terms) are correlated within clusters. This type of correlation is commonly observed whenever multiple observations are associated with the same political jurisdictions. Observations might also be clustered by time periods, industries, or institutions such as hospitals or schools.

When estimating regression models with clustered data, it is very common to use a “cluster-robust variance estimator” or CRVE. However, inference for estimates of treatment effects with clustered data requires great care when treatment is assigned at the group level. This is true for both pure treatment models and difference-in-differences regressions, where the data have both a time dimension and a cross-section dimension and it is common to cluster at the cross-section level.

Even when the number of clusters is quite large, cluster-robust standard errors can be much too small if the number of treated (or control) clusters is small. Standard errors also tend to be too small when cluster sizes vary a lot, resulting in too many false positives. Bootstrap methods based on the wild bootstrap generally perform better than t-tests, but they can also yield very misleading inferences in some cases. In particular, what would otherwise be the best variant of the wild bootstrap can underreject extremely severely when the number of treated clusters is very small. Other bootstrap methods can overreject extremely severely in that case.
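A small Monte Carlo sketch can illustrate the first of these claims. It is not the authors' simulation design: the data-generating process (20 equal-sized clusters, standard normal cluster effects and noise, no true treatment effect), the function names, and the use of the t(G − 1) critical value are all assumptions chosen for illustration. The regression is a simple treatment dummy with an intercept, for which the cluster-robust variance of the treatment coefficient reduces to a sum of squared within-cluster residual totals.

```python
import random

# Two-sided 5% critical value of Student's t with 19 degrees of freedom (G - 1 for G = 20).
T_CRIT_19 = 2.093


def crve_t_stat(cluster_data, treated):
    """t-statistic for b in y = a + b*d + e with a cluster-robust variance estimate.

    cluster_data: list of lists of outcomes, one inner list per cluster.
    treated: list of booleans, True if the whole cluster is treated.
    """
    g = len(cluster_data)
    y1 = [y for ys, d in zip(cluster_data, treated) if d for y in ys]
    y0 = [y for ys, d in zip(cluster_data, treated) if not d for y in ys]
    n1, n0 = len(y1), len(y0)
    mean1, mean0 = sum(y1) / n1, sum(y0) / n0
    b_hat = mean1 - mean0  # OLS coefficient on the treatment dummy

    var = 0.0
    for ys, d in zip(cluster_data, treated):
        fitted = mean1 if d else mean0
        resid_total = sum(y - fitted for y in ys)  # within-cluster sum of residuals
        var += (resid_total / (n1 if d else n0)) ** 2

    n = n1 + n0
    var *= g / (g - 1) * (n - 1) / (n - 2)  # common small-sample correction
    return b_hat / var ** 0.5


def rejection_rate(n_treated, n_clusters=20, cluster_size=30, reps=400, seed=1):
    """Share of simulated samples in which a true null (b = 0) is rejected at 5%."""
    rng = random.Random(seed)
    treated = [i < n_treated for i in range(n_clusters)]
    rejections = 0
    for _rep in range(reps):
        data = []
        for _cluster in range(n_clusters):
            effect = rng.gauss(0, 1)  # disturbance shared within the cluster
            data.append([effect + rng.gauss(0, 1) for _obs in range(cluster_size)])
        if abs(crve_t_stat(data, treated)) > T_CRIT_19:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for n_treated in (1, 10):
        print(n_treated, "treated clusters:", rejection_rate(n_treated))
```

With a single treated cluster, that cluster's residuals sum to zero by construction, so it contributes nothing to the variance estimate; under these assumptions the test should reject a true null far more often than the nominal 5%, while the balanced design stays close to it.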

In Section 2, we briefly review the key ideas of cluster-robust covariance matrices and standard errors. In Section 3, we then explain why inference based on these standard errors can fail when there are few treated clusters. In Section 4, we discuss bootstrap methods for cluster-robust inference. In Section 5, we report (graphically) the results of several simulation experiments which illustrate just how severely both conventional and bootstrap methods can overreject or underreject when there are few treated clusters. In Section 6, the implications of these results are illustrated using an empirical example from Burden, Canon, Mayer, and Moynihan (2017). The final section concludes and provides some recommendations for empirical work.

**Full Article**

**Replication File**

Replication files for the Monte Carlo simulations and the empirical example can be found at: doi:10.7910/DVN/GBEKTO .

- We are grateful to Justin Esarey for several very helpful suggestions and to Joshua Roxborough for valuable research assistance. This research was supported, in part, by a grant from the Social Sciences and Humanities Research Council of Canada. Some of the computations were performed at the Centre for Advanced Computing at Queen’s University.

| Presenter | Date |
| --- | --- |
| Arthur Spirling (New York University) | October 20th |
| Roundtable on Reproducibility and a Stricter Threshold for Statistical Significance: Dan Benjamin (University of Southern California), Daniel Lakens (Eindhoven University of Technology), and E.J. Wagenmakers (University of Amsterdam) | October 27th |
| Will Hobbs (Northeastern University) | November 3rd |
| Adeline Lo (Princeton University) | November 10th |
| Olga Chyzh (Iowa State University) | November 17th |
| Pamela Ban (Harvard University) | December 1st |
| Teppei Yamamoto (Massachusetts Institute of Technology) | February 2nd |
| Clay Webb (University of Kansas) | February 16th |
| Mark Pickup (Simon Fraser University) | February 23rd |
| Erik Peterson (Dartmouth College) | March 2nd |
| Casey Crisman-Cox (Washington University in St. Louis) | March 9th |
| Erin Hartman (University of California, Los Angeles) | March 23rd |

Note that all presentations will begin at noon Eastern time and last for one hour. Additional information for each presentation (including a title and link to relevant paper) will be released closer to its date. You can preregister to attend a presentation by clicking on the link in the Google Calendar entry corresponding to the talk; the IMC’s Google Calendar is available at https://www.methods-colloquium.com/schedule. (Anyone can show up the day of the presentation without pre-registering if they wish as long as room remains; there are 500 seats available in each webinar.)

The International Methods Colloquium (IMC) is a weekly seminar series of methodology-related talks and roundtable discussions focusing on political methodology; the series is supported by Rice University and a grant from the National Science Foundation. The IMC is free to attend from anywhere around the world using a PC or Mac, a broadband internet connection, and our free software. You can find out more about the IMC at our website, http://www.methods-colloquium.com/, where you can register for any of these talks and/or join a talk in progress using the “Watch Now!” link. You can also watch archived talks from previous IMC seasons at this site.

A large and interdisciplinary group of researchers recently proposed redefining the conventional threshold of statistical significance from *p* < 0.05 to *p* < 0.005, both two-tailed (Benjamin et al. 2017). The purpose of the reform is to “immediately improve the reproducibility of scientific research in many fields” (p. 5); this comes in the context of recent large-scale replication efforts that have uncovered startlingly low rates of replicability among published results (e.g., Klein et al. 2014; Open Science Collaboration 2015). Recent work suggests that results that meet a more stringent standard for statistical significance will indeed be more reproducible (V. E. Johnson 2013; Esarey and Wu 2016; V. E. Johnson et al. 2017). Reproducibility should be improved by this reform because the *false discovery rate*, or the proportion of statistically significant findings that are null relationships (Benjamini and Hochberg 1995), is reduced by its implementation. Benjamin et al. (2017) explicitly disavow a requirement that results meet the stricter standard in order to be publishable, but in the past statistical significance has been used as a necessary condition for publication (T. D. Sterling 1959; T. Sterling, Rosenbaum, and Winkam 1995) and we must therefore anticipate that a redefinition of the threshold for significance may lead to a redefinition of standards for publishability.

As a method of screening empirical results for publishability, the conventional *p* < 0.05 null hypothesis significance test (NHST) procedure lies on a kind of Pareto frontier: it is difficult to improve its qualities on one dimension (i.e., increasing the replicability of published research) without degrading its qualities on some other important dimension. This makes it hard for any proposal to displace the conventional NHST: such a proposal will almost certainly be worse in some ways, even if it is better in others. For moving the threshold for statistical significance from 0.05 to 0.005, the most obvious tradeoff is that the *power* of the test to detect non-null relationships is harmed at the same time that the *size* of the test (i.e., its propensity to mistakenly reject a true null hypothesis) is reduced. The usual way of increasing the power of a study is to increase the sample size, *N*; given the fixed nature of historical time-series cross-sectional data sets and the limited budgets of those who use experimental methods, this means that many researchers may feel forced to accept less powerful studies and therefore have fewer opportunities to publish. Additionally, reducing the threshold for statistical significance can exacerbate *publication bias*, i.e., the extent to which the published literature exaggerates the magnitude of a relationship by publishing only the largest findings from the sampling distribution of a relationship (T. Sterling, Rosenbaum, and Winkam 1995; Scargle 2000; Schooler 2011; Esarey and Wu 2016).

Simply accepting lower power in exchange for a lower false discovery rate would increase the burden on the most vulnerable members of our community: assistant professors and graduate students, who must publish enough work in a short time frame to stay in the field. However, there is a way of maintaining adequate power using *p* < 0.005 significance tests without dramatically increasing sample sizes: design studies that conduct conjoint tests of multiple predictions from a single theory (instead of individual tests of single predictions). When *K*-many statistically independent tests are performed on pre-specified hypotheses that must be jointly confirmed in order to support a theory, the probability of simultaneously rejecting them all by chance is *α*^{K} where *p* < *α* is the critical condition for statistical significance in an individual test. As *K* increases, the *α* required of each individual test can be relaxed while the joint size is held fixed, and the overall power of the study often (though not always) increases. It is important that hypotheses be specified *before* data analysis and that failed predictions are reported; simply conducting many tests and reporting the statistically significant results creates a multiple comparison problem that inflates the false positive rate (Sidak 1967; Abdi 2007). Because it is usually more feasible to collect a greater depth of information about a fixed-size sample rather than to greatly expand the sample size, this version of the reform imposes fewer hardships on scientists at the beginning of their careers.

I do not think that an NHST with a lowered *α* is the best choice for adjudicating whether a result is statistically meaningful; approaches rooted in statistical decision theory and with explicit assessment of out-of-sample prediction have advantages that I consider decisive. However, if reducing *α* prompts us to design research stressing the simultaneous testing of multiple theoretical hypotheses, I believe this change would improve upon the status quo—even if passing this more selective significance test becomes a requirement for publication.

Based on the extant literature, I believe that requiring published results to pass a two-tailed NHST with *α* = 0.05 is at least partly responsible for the “replication crisis” now underway in the social and medical sciences (Wasserstein and Lazar 2016). I also anticipate that studies that can meet a lowered threshold for statistical significance will be more reproducible. In fact, I argued these two points in a recent paper (Esarey and Wu 2016). Figure 1 reproduces a relevant figure from that article; the *y*-axis in that figure is the *false discovery rate* (Benjamini and Hochberg 1995), or FDR, of a conventional NHST procedure with an *α* given by the value on the *x*-axis. Written very generically, the false discovery rate is:

$$\text{FDR} = \frac{A}{A + B}$$

or, the proportion of statistically significant results (“discoveries”) that correspond to true null hypotheses. *A* is a function of the *size* of the test, Pr(stat. sig.|null hypothesis is true). *B* is a function of the *power* of the test, Pr(stat. sig.|null hypothesis is false). Both *A* and *B* are a function of the underlying proportion of null hypotheses being proposed by researchers, Pr(null hypothesis is true).

As Figure 1 shows, when Pr(null hypothesis is true) is large, there is a very disappointing FDR among results that pass a two-tailed NHST with *α* = 0.05. For example, when 90% of the hypotheses tested by researchers are false leads (i.e., Pr(null hypothesis is true)=0.9), we may expect ≈30% of discoveries to be false even when all studies have perfect power (i.e., Pr(stat. sig.|null hypothesis is false)=1). Two different studies applying disparate methods to data from the Open Science Collaboration’s replication project data (2015) have determined that approximately 90% of researcher hypotheses are false (V. E. Johnson et al. 2017; Esarey and Liu 2017); consequently, an FDR in the published literature of at least 30% is attributable to this mechanism. Even higher FDRs will result if studies have less than perfect power.

By contrast, Figure 1 *also* shows that setting *α* ≈ 0.005 would greatly reduce the false discovery rate, even when the proportion of null hypotheses posed by researchers is extremely high. For example, if 90% of hypotheses proposed by researchers are false, the lower bound FDR in the published literature among studies meeting the higher standard would be about 5%. Although non-null results are not always perfectly replicable in underpowered studies (and null results have a small chance of being successfully replicated), a reduction in the FDR from ≈30% to ≈5% would almost certainly and drastically improve the replicability of published results.
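The two figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes one natural way of filling in the generic ratio: significant results from true nulls are *α* × Pr(null hypothesis is true), and significant results from false nulls are power × (1 − Pr(null hypothesis is true)). The function name and this decomposition are illustrative assumptions, not taken from the cited papers.

```python
def false_discovery_rate(alpha, power, pr_null):
    """Share of statistically significant results that come from true nulls.

    alpha:   size of the test, Pr(stat. sig. | null hypothesis is true)
    power:   Pr(stat. sig. | null hypothesis is false)
    pr_null: Pr(null hypothesis is true) among hypotheses researchers test
    """
    false_discoveries = alpha * pr_null          # assumed form of the numerator
    true_discoveries = power * (1.0 - pr_null)   # assumed form of the other term
    return false_discoveries / (false_discoveries + true_discoveries)


if __name__ == "__main__":
    # Perfect power and 90% false leads, as in the example in the text.
    for alpha in (0.05, 0.005):
        print(alpha, round(false_discovery_rate(alpha, 1.0, 0.9), 3))
```

Under these assumptions the rate is about 0.31 at *α* = 0.05 and about 0.04 at *α* = 0.005, in line with the ≈30% and ≈5% figures; lowering `power` below 1 raises both.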

Despite this and other known weaknesses of the current NHST, it lies on or near a Pareto frontier of possible ways to classify results as “statistically meaningful” (one aspect of being suitable for publication). That is, it is challenging to propose a revision to the NHST that will bring improvements without also bringing new or worsened problems, some of which may disproportionately impact certain segments of the discipline. Consider some dimensions on which we might rate a statistical test procedure, the first few of which I have already discussed above:

- the *false discovery rate* created by the procedure, itself a function of:
  - the *power* of the procedure to detect non-zero relationships, and
  - the *size* of the procedure, its probability of rejecting true null hypotheses;
- the degree of *publication bias* created by the procedure, i.e., the extent to which the published literature exaggerates the size of an effect by publishing only the largest findings from the sampling distribution of a relationship (T. Sterling, Rosenbaum, and Winkam 1995; Scargle 2000; Schooler 2011; Esarey and Wu 2016);
- the *number of researcher assumptions* needed to execute the procedure, a criterion related to the *consistency* of the standards implied by use of the test; and
- the *complexity*, or ease of use and interpretability, of the procedure.

It is hard to improve on the NHST in one of these dimensions without hurting performance in another area. In particular, lowering *α* from 0.05 to 0.005 would certainly lower the power of most researchers’ studies and might increase the publication bias in the literature.

Changing the *α* of the NHST from 0.05 to 0.005 is a textbook example of moving along a Pareto frontier because the size and the power of the test are in direct tension with one another. Because powerful research designs are more likely to produce publishable results when the null is false, maintaining adequate study power is especially critical to junior researchers who need publications in order to stay in the field and advance their careers. The size/power tradeoff is depicted in Figure 2.

Figure 2 depicts two size and power analyses for a coefficient of interest *β*. I assume that the standard deviation of the sampling distribution of *β* is equal to 1; I also assume 100 degrees of freedom (typically corresponding to a sample size slightly larger than 100). In the left panel (Figure 2a), the conventional NHST with *α* = 0.05 is depicted. In the right panel (Figure 2b), the Benjamin et al. (2017) proposal to decrease *α* to 0.005 is shown. The power analyses assume that the true *β* = 3, while the size analyses assume that *β* = 0.

The most darkly shaded area under the right-hand tail of the *t*-distribution under the null is the probability of incorrectly rejecting a true null hypothesis, *α*. As the figure shows, it is impossible to shrink *α* without simultaneously shrinking the lighter shaded area, where the lighter shading depicts the power of a hypothesis test to correctly reject a false null hypothesis. This tradeoff can be more or less severe (depending on *β* and the standard deviation of its sampling distribution), but always exists. The false discovery rate will still (almost always) improve when *α* falls from 0.05 to 0.005, despite the loss in power, but many more true relationships will not be discovered under the latter standard.
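The size of the power loss in this example can be approximated numerically. The sketch below is only an approximation of the setup described for Figure 2: it replaces the *t*-distribution with 100 degrees of freedom by a standard normal (an assumption made so that only the Python standard library is needed), and it counts only rejections in the correct (positive) direction. The function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(beta, alpha, se=1.0):
    """Approximate power of a two-tailed test of size alpha when the true
    coefficient is beta, using a normal approximation to the t-distribution."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # positive critical value
    # Probability that the estimate exceeds the critical value when centred at beta/se.
    return 1.0 - STD_NORMAL.cdf(z_crit - beta / se)


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        print(alpha, round(approx_power(3.0, alpha), 3))
```

Under this approximation, power for *β* = 3 falls from roughly 0.85 at *α* = 0.05 to roughly 0.58 at *α* = 0.005; exact values from a non-central *t* with 100 degrees of freedom would differ slightly.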

The size/power tradeoff is not the only compromise associated with lowering *α*: in some situations, decreased *α* can increase the publication bias associated with using statistical significance tests as a filter for publication. Of course, Benjamin et al. (2017) explicitly disavow using an NHST with lowered *α* as a necessary condition for publication; they say that “results that would currently be called ‘significant’ but do not meet the new threshold should instead be called ‘suggestive’” (p. 5). However, given the fifty-year history of requiring results to be statistically significant to be publishable (T. D. Sterling 1959; T. Sterling, Rosenbaum, and Winkam 1995), we must anticipate that this pattern could continue into the future.

Consider Figure 3, which shows two perspectives on how decreased *α* might impact publication bias. The left panel (Figure 3a) shows the percentage difference in magnitude between a true coefficient *β* = 3 and the average statistically significant coefficient using an NHST with various values of *α*.^{2} The figure shows that decreased values of *α* increase the degree of publication bias; this occurs because stricter significance tests tend to reject only the largest estimates from a sampling distribution, as also shown in Figure 2. On the other hand, the right panel (Figure 3b) shows the percentage difference in magnitude from a true coefficient drawn from a spike-and-normal prior density of coefficients:

where there is a substantial (50%) chance of a null relationship being studied by a researcher; is the indicator function for a null relationship.^{3} In this case, increasingly strict significance tests tend to *decrease* publication bias, because the effect of screening out null relationships (which are greatly overestimated by any false positive result) is stronger than the effect of estimating only a portion of the sampling distribution of non-null relationships (which drives the result in the left panel).
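The mechanism behind the left panel (a fixed, non-null true coefficient) can be sketched with the mean of a truncated normal distribution. This is an illustration under stated assumptions rather than a reproduction of Figure 3a: the sampling distribution is taken to be normal with unit standard deviation, only estimates above the positive critical value count as significant, and the function name is hypothetical. The spike-and-normal case of the right panel is not covered here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def exaggeration_pct(beta, alpha, se=1.0):
    """Percentage by which the average statistically significant estimate
    exceeds the true coefficient beta (normal approximation, positive tail only)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    a = z_crit - beta / se  # standardized truncation point
    # Inverse Mills ratio: mean shift of a normal variable truncated from below at a.
    mills = STD_NORMAL.pdf(a) / (1.0 - STD_NORMAL.cdf(a))
    mean_significant = beta + se * mills
    return 100.0 * (mean_significant - beta) / beta


if __name__ == "__main__":
    for alpha in (0.05, 0.005):
        print(alpha, round(exaggeration_pct(3.0, alpha), 1))
```

For *β* = 3 the exaggeration under these assumptions grows from about 9% at *α* = 0.05 to about 23% at *α* = 0.005, which matches the direction of the effect described for a fixed true coefficient.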

If the proposal of Benjamin et al. (2017) is simply to accept lowered power in exchange for lower false discovery rates, it is a difficult proposal to accept. First and foremost, assistant professors and graduate students must publish a lot of high-quality research in a short time frame and may be forced to leave the discipline if they cannot; higher standards that are an inconvenience to tenured faculty may be harmful to them unless standards for hiring and tenure adapt accordingly. In addition, even within a particular level of seniority, this reform may unlevel the playing field. Political scientists in areas that often use observational data with essentially fixed *N*, such as International Relations, would be disproportionately affected by such a change: more historical data cannot be created in order to raise the power of a study. Among experimenters and survey researchers, those with smaller budgets would also be disproportionately affected: they cannot afford to simply buy larger samples to achieve the necessary power. Needless to say, the effect of this reform on the scientific ecosystem would be difficult to predict and not necessarily beneficial; at first glance, such a reform seems to benefit the most senior scholars and people at the wealthiest institutions.

However, I believe that acquiescence to lower power is *not* the only option. It may be difficult or impossible for researchers to collect larger *N*, but it is considerably easier for them to measure a larger number of variables of interest *K* from their extant samples. This creates the possibility that researchers can spend more time developing and enriching their theories so that these theories make multiple predictions. Testing these predictions jointly typically allows for much greater power than testing any single prediction alone, as long as any prediction is clearly laid out *prior to analysis* and *all failed predictions are reported* in order to avoid a multiple comparison problem (Sidak 1967; Abdi 2007); predictions must be specified in advance and failed predictions must be reported because simply testing numerous hypotheses and reporting any that were confirmed tends to generate an excess of false positive results.

When *K*-many statistically independent hypothesis tests are performed using a significance threshold of *α* and *all* must be passed in order to confirm a theory,^{4} the probability of simultaneously rejecting them all by chance is *α*^{K}. Fixing *α*^{K} = 0.005, it is clear that increasing *K* allows the individual test’s *α* to be higher.^{5} Specifically, *α* must be equal to 0.005^{1/K} in order to achieve the desired size. That means that two statistically independent hypothesis tests can have their individual *α* ≈ 0.07 in order to achieve a joint size of 0.005; this is a *lower* standard on the individual level than the current 0.05 convention. When *K* = 3, the individual *α* ≈ 0.17.
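The individual thresholds quoted above follow from taking the *K*-th root of the joint size. A minimal sketch, assuming statistically independent tests that must all be passed (the function name is hypothetical):

```python
def individual_alpha(joint_size, k):
    """Per-test significance threshold such that k independent tests,
    all of which must be passed, have the given joint size."""
    return joint_size ** (1.0 / k)


if __name__ == "__main__":
    for k in (1, 2, 3):
        alpha = individual_alpha(0.005, k)
        # alpha ** k recovers the joint size, up to floating-point error
        print(k, round(alpha, 3), round(alpha ** k, 6))
```

The relationship does not hold for correlated tests, where the appropriate individual thresholds would have to be worked out separately.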

When will conducting a joint test of multiple hypotheses yield greater power than conducting a single test? If two hypothesis tests are statistically independent^{6} and conducted as part of a joint study, this will occur when:

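$$1 - \tau_1\left(t^{\star}(0.005)\right) < \left[1 - \tau_1\left(t^{\star}(\sqrt{0.005})\right)\right]\left[1 - \tau_2\left(t^{\star}(\sqrt{0.005})\right)\right] \qquad [1]$$

$$\frac{1 - \tau_1\left(t^{\star}(0.005)\right)}{1 - \tau_1\left(t^{\star}(\sqrt{0.005})\right)} < 1 - \tau_2\left(t^{\star}(\sqrt{0.005})\right) \qquad [2]$$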

Here, *τ*_{k} is the non-central cumulative *t*-density corresponding to the sampling distribution of *β*_{k}, *k* ∈ {1, 2}; I presume that *k* is sorted so that tests are in descending order of power. *t*^{⋆}(*a*) is the positive critical *t*−statistic needed to create a single two-tailed hypothesis test of size *a*. The left hand side of equation (1) is the power of the single hypothesis test with the greatest power; the right hand side of equation (1) is the power of the joint test of two hypotheses. As equation (2) shows, the power of the joint test is larger when the proportional change in power for the test of *β*_{1} is less than the power for the test of *β*_{2}. Whether this condition is met depends on many factors, including the magnitude and variability of *β*_{1} and *β*_{2}.

To illustrate the potential for power gains, I numerically calculated the power of joint tests with size 0.005 for the case of one, two, and three statistically independent individual tests. For this calculation, I assumed three relationships *β* = *β*_{k} with equal magnitude and sign, *k* ∈ {1, 2, 3}, each with a sampling distribution of unit standard deviation; thus, each one of the three tests has identical power when conducted individually. I then calculated the power of each joint test, [1 − *τ*(*t*^{⋆}(0.005^{1/*k*}))]^{*k*}, for each value of *k* and for varying values of *β*, where *τ* is the non-central cumulative *t*-density with non-centrality parameter equal to *β*. Note that, as before, I define *t*^{⋆}(*a*) as the critical *t*-statistic for a single two-tailed test with size *a*. The result is illustrated in Figure 4.

Figure 4 shows that joint hypothesis testing creates substantial gains in power over single hypothesis testing for most values of *β*. There is a small amount of power loss near the tails (where *β* ≈ 0 and *β* ≈ 6), but this is negligible.
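The pattern described for Figure 4 can be approximated with a short calculation. As a simplifying assumption, the sketch below uses a standard normal in place of the non-central *t*-distribution, gives every test the same coefficient *β* with unit standard deviation, and counts only rejections in the positive direction; the function name is hypothetical and the numbers are not those underlying the published figure.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def joint_power(beta, k, joint_size=0.005, se=1.0):
    """Approximate power of a joint test of k independent, identically
    powered hypotheses that must all be significant (normal approximation)."""
    alpha_each = joint_size ** (1.0 / k)               # per-test threshold
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha_each / 2.0)
    power_each = 1.0 - STD_NORMAL.cdf(z_crit - beta / se)
    return power_each ** k                             # all k tests must reject


if __name__ == "__main__":
    for beta in (0.0, 1.5, 3.0, 4.5):
        print(beta, [round(joint_power(beta, k), 3) for k in (1, 2, 3)])
```

For a moderately sized coefficient such as *β* = 3, power under this approximation rises as tests are added, while for *β* near zero the joint tests reject slightly less often than the single test, consistent with the small losses noted near the tails.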

There are reasons to be skeptical of the Benjamin et al. (2017) proposal to move the NHST threshold for statistical significance from *α* = 0.05 to *α* = 0.005. First, there is substantial potential for adverse effects on the scientific ecosystem: the proposal seems to advantage senior scholars at the most prominent institutions in fields that do not rely on fixed-*N* observational data. Second, the NHST with a reduced *α* is not my ideal approach to adjudicating which results are statistically meaningful; I believe it is more advantageous for political scientists to adopt a statistical decision theory-oriented approach to inference^{7} and give greater emphasis to cross-validation and out-of-sample prediction.

However, based on the evidence presented here and in related work, I believe that moving the threshold for statistical significance from *α* = 0.05 to *α* = 0.005 would benefit political science *if* we adapt to this reform by developing richer, more robust theories that admit multiple predictions. Such a reform would reduce the false discovery rate without reducing power or unduly disadvantaging underfunded scholars or subfields that rely on historical observational data, even if meeting the stricter standard for significance became a necessary condition for publication. It would also force us to focus on improving our body of theory; our extant theories lead us to propose hypotheses that are wrong as much as 90% of the time (V. E. Johnson et al. 2017; Esarey and Liu 2017). Software that automates the calculation of appropriate critical *t*-statistics for correlated joint hypothesis tests would make it easier for substantive researchers to make this change, and ought to be developed in future work. This software has already been created for joint tests involving interaction terms in generalized linear models by Esarey and Sumner (2017), but the procedure needs to be adapted for the more general case of any type of joint hypothesis test.

Abdi, Herve. 2007. “The Bonferonni and Sidak Corrections for Multiple Comparisons.” In *Encyclopedia of Measurement and Statistics*, edited by Neil Salkind. Thousand Oaks, CA: Sage. URL: https://goo.gl/EgNhQQ accessed 8/5/2017.

Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E. J. Wagenmakers, Richard Berk, Kenneth A. Bollen, et al. 2017. “Redefine Statistical Significance.” *Nature Human Behaviour* Forthcoming: 1–18. URL: https://osf.io/preprints/psyarxiv/mky9j/ accessed 7/31/2017.

Benjamini, Y., and Y. Hochberg. 1995. “Controlling the false discovery rate: a practical and powerful approach to multiple testing.” *Journal of the Royal Statistical Society. Series B (Methodological)* 57 (1). JSTOR: 289–300. URL: http://www.jstor.org/stable/10.2307/2346101.

Esarey, Justin, and Nathan Danneman. 2015. “A Quantitative Method for Substantive Robustness Assessment.” *Political Science Research and Methods* 3 (1). Cambridge University Press: 95–111.

Esarey, Justin, and Vera Liu. 2017. “A Prospective Test for Replicability and a Retrospective Analysis of Theoretical Prediction Strength in the Social Sciences.” Poster presented at the 2017 Texas Methods Meeting at the University of Houston. URL: http://jee3.web.rice.edu/replicability-package-poster.pdf accessed 8/1/2017.

Esarey, Justin, and Jane Lawrence Sumner. 2017. “Marginal Effects in Interaction Models: Determining and Controlling the False Positive Rate.” *Comparative Political Studies* forthcoming: 1–39. URL: http://jee3.web.rice.edu/interaction-overconfidence.pdf accessed 8/5/2017.

Esarey, Justin, and Ahra Wu. 2016. “Measuring the Effects of Publication Bias in Political Science.” *Research & Politics* 3 (3). SAGE Publications Sage UK: London, England: 1–9. URL: https://doi.org/10.1177/2053168016665856 accessed 8/1/2017.

Johnson, Valen E. 2013. “Revised Standards for Statistical Evidence.” *Proceedings of the National Academy of Sciences* 110 (48). National Acad Sciences: 19313–7.

Johnson, Valen E, Richard D. Payne, Tianying Wang, Alex Asher, and Soutrik Mandal. 2017. “On the Reproducibility of Psychological Science.” *Journal of the American Statistical Association* 112 (517). Taylor & Francis: 1–10.

Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams, Stepan Bahnik, Michael J. Bernstein, Konrad Bocian, et al. 2014. “Investigating Variation in Replicability.” *Social Psychology* 45 (3): 142–52. doi:10.1027/1864-9335/a000178.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” *Science* 349 (6251): aac4716. doi:10.1126/science.aac4716.

Scargle, Jeffrey D. 2000. “Publication Bias: The ‘File-Drawer’ Problem in Scientific Inference.” *Journal of Scientific Exploration* 14: 91–106.

Schooler, Jonathan. 2011. “Unpublished Results Hide the Decline Effect.” *Nature* 470: 437.

Sidak, Zbynek. 1967. “Rectangular confidence regions for the means of multivariate normal distributions.” *Journal of the American Statistical Association* 62 (318): 626–33.

Sterling, T. D., W. L. Rosenbaum, and J. J. Weinkam. 1995. “Publication Decisions Revisited: The Effect of the Outcome of Statistical Tests on the Decision to Publish and Vice Versa.” *The American Statistician* 49: 108–12.

Sterling, Theodore D. 1959. “Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance—or Vice Versa.” *Journal of the American Statistical Association* 54 (285): 30–34. doi:10.1080/01621459.1959.10501497.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on P-Values: Context, Process, and Purpose.” *The American Statistician* 70 (2): 129–33. URL: http://dx.doi.org/10.1080/00031305.2016.1154108 accessed 8/5/2017.

The code to replicate Figures 2-4 is available at http://dx.doi.org/10.7910/DVN/C6QTF2. Figure 1 is a reprint of a figure originally published in Esarey and Wu; the replication file for that publication is available at http://dx.doi.org/10.7910/DVN/2BF2HB.

- I thank Jeff Grim, Martin Kavka, Tim Salmon, and Mike Ward for helpful comments on a previous draft of this paper, particularly in regard to the effect of lowered *α* on junior scholars and the question of whether statistical significance should be required for publication.
- Specifically, I measure , where is a statistically significant estimate.
- Here, I measure , where *μ*_{β} is the mean of *f*(*β*).
- The points in this paragraph are similar to those made by Esarey and Sumner (2017, pp. 15–19).
- For statistically correlated tests, the degree to which is smaller; at the limit where all the hypothesis tests are perfectly correlated, .
- Correlated significance tests require the creation of a joint distribution *τ* on the right hand side of equation (1) and the determination of a critical value *t*^{⋆} such that ; while practically important, this analysis is not as demonstratively illuminating as the case of statistically independent tests.
- In a paper with Nathan Danneman (2015), I show that a simple, standardized approach could reduce the rate of false positives without harming our power to detect true positives (see Figure 4 in that paper). Failing this, I would prefer a statistical significance decision explicitly tied to expected replicability, which requires information about researchers’ propensity to test null hypotheses as well as their bias toward positive findings (Esarey and Liu 2017). These changes would increase the *complexity* and the *number of researcher assumptions* of a statistical assessment procedure relative to the NHST, but not (in my opinion) to a substantial degree.

**Abstract:** *Most contributors to a recent *Political Analysis* symposium on time series analysis suggest that in order to maintain equation balance, one cannot combine stationary, integrated, and/or fractionally integrated variables with general error correction models (GECMs) and the equivalent autoregressive distributed lag (ADL) models. This definition of equation balance implicates most previous uses of these models in political science and circumscribes their use moving forward. The claim thus is of real consequence and worthy of empirical substantiation, which the contributors did not provide. Here we address the issue. First, we highlight the difference between estimating unbalanced equations and mixing orders of integration, the former of which clearly is a problem and the latter of which is not, at least not necessarily. Second, we assess some of the consequences of mixing orders of integration by conducting simulations using stationary and integrated time series. Our simulations show that with an appropriately speciﬁed model, regressing a stationary variable on an integrated one, or the reverse, does not increase the risk of spurious results. We then illustrate the potential importance of these conclusions with an applied example—income inequality in the United States.*^{[1]}

*Political Analysis* (PA) recently hosted a symposium on time series analysis that built upon De Boef and Keele’s (2008) inﬂuential time series article in the *American Journal of Political Science*. Equation balance was an important point of emphasis throughout the symposium. In their classic work on the subject, Banerjee, Dolado, Galbraith and Hendry (1993, 164) explain that an unbalanced equation is a regression, “in which the regressand is not the same order of integration as the regressors, or any linear combination of the regressors.” The contributors to this symposium were right to emphasize the importance of equation balance, as unbalanced equations can produce serially correlated residuals (e.g., Pagan and Wickens 1989) and spurious relationships (e.g., Banerjee et al. 1993, 79).

Throughout the *PA* symposium, however, equation balance is defined and applied in different ways. Grant and Lebo (2016, 7) follow Banerjee et al.’s definition when they explain that a general error correction model (GECM), or autoregressive distributed lag (ADL), is balanced if co-integration is present.[2] Keele, Linn and Webb (2016a, 83) implicitly make this same point in their second contribution to the symposium when they cite Banerjee et al. (1993) in their discussion of equation balance. Yet, other parts of the symposium seem to apply a stricter standard of equation balance, stating that when estimating a GECM/ADL all time series must be the same order of integration. As Grant and Lebo write in the abstract of their first article, “Time series of various orders of integration—stationary, non-stationary, explosive, near-and fractionally integrated—should not be analyzed together… That is, without equation balance the model is misspecified and hypothesis tests and longrun-multipliers are unreliable.” Keele, Linn and Webb (2016b, 34) similarly write, “no regression model is appropriate when the orders of integration are mixed because no long-run relationship can exist when the equation is unbalanced.” Box-Steffensmeier and Helgason (2016, 2) make the point by stating, “when studying the relationship between two (or more) series, the analyst must ensure that they are of the same level of integration; that is, they have to be balanced.” Although Freeman (2016) offers a more nuanced perspective on equation balance, many of the symposium contributors could be interpreted as recommending that scholars never mix orders of integration.[3] Indeed, in their concluding article, Lebo and Grant write, “One point of agreement among the papers here is that equation balance is an important and neglected topic. One cannot mix together stationary, unit-root, and fractionally integrated variables in either the GECM or the ADL” (p.79).

It is possible that these authors did not mean for these quotes to be taken literally. However, we both have recently been asked to review articles that have used these quotes to justify analytic decisions with time series data.[4] Thus, we think the claims should be reviewed carefully. This is especially the case because Grant and Lebo could be interpreted as applying these strict standards in some of their empirical applications. For example, in their discussion of Sánchez Urribarrí, Schorpp, Randazzo and Songer (2011), Grant and Lebo write, “both the UK and US models are unbalanced—each DV is stationary, and the inclusion of unit-root IVs has compromised the results” (Supplementary Materials, p.36). Researchers might take this statement to imply that including stationary and unit root variables automatically produces an unbalanced equation.

In addition to holding implications for practitioners, the strict interpretation of equation balance holds implications for the vast number of existing time series articles that employ GECM/ADL models without pre–whitening the data to ensure equal orders of integration across all series. Lebo and Grant (2016, 79) point out, for example, “FI [fractional integration] methods allow us to create a balanced equation from dissimilar data. By ﬁltering each series by its own (*p*, *d*, *q*) noise model, the residuals of each can be rendered (0, 0, 0) so that you can investigate how *X’s* deviations from its own time-dependent patterns affect *Y’s* deviations from its own time-dependent patterns.” Fortunately, existing time series analysis that does not pre-whiten the data need not be automatically dismissed. The strict interpretation of equation balance—i.e., that mixing orders of integration is always problematic with the GECM/ADL—is not accurate. As noted above, the contributors to the symposium may indeed understand this point. But based on the quotes above, we feel that it is important to clarify for practitioners that an unbalanced equation is not synonymous with mixing orders of integration. While related, they are not the same, and while the former is always a problem the latter is not.

We begin by showing that equation balance does not necessarily require that all series have the same order of integration with the GECM/ADL. This is important because the classic examples in the literature of unbalanced equations include series of different orders of integration (see, for example, Banerjee et al. (1993, 79) and Maddala and Kim (1998, 252)). But our results are not at odds with these scholars, as their examples all assume a relationship with no dynamics. When using a GECM/ADL to model dynamic processes, even mixed orders of integration can produce balanced equations. This conclusion is consistent with Banerjee et al. (1993), who write, “The moral of the econometricians’ story is the need to keep track of the orders of integration on both sides of the regression equation, *which usually means incorporating dynamics*; models that have restrictive dynamic structures are relatively likely to give misleading inferences simply for reasons of inconsistency of orders of integration” (p.192, italics ours).

We believe the *PA* symposium was not suﬃciently clear that adding dynamics can solve the equation balance problem with mixed orders of integration. Thus, a key contribution of our article is to show how appropriate model speciﬁcation can be used to produce equation balance and avoid inﬂating the rate of spurious regression—even when the model includes series with different orders of integration. Our particular focus is analysis that mixes stationary *I*(0) and integrated *I*(1) time series. In practice, researchers might encounter other types of time series, such as fractionally integrated, near-integrated, or explosive series. Evaluating every type of time series and the vast number of ways different orders of integration could appear in a regression model is beyond the scope of this paper. Our goal is more basic, but still important. We aim to demonstrate that there are exceptions to the claim that, “The order of integration needs to be consistent across all series in a model” (Grant and Lebo 2016, 4) and that these exceptions can hold important implications for social science research.

More specifically, we show that when data are either stationary or (first order) integrated, scenarios exist in which a GECM/ADL that includes both types of series can be estimated without problem. Our simulations show that regressing an integrated variable on a stationary one (or the reverse) does not increase the risk of spurious results when modeled correctly. While this may be a simple point, we think it is a crucial one. As mentioned above, if readers interpreted the previous quotes from the *PA* symposium as defining equation balance to mean that different orders of integration cannot be mixed, most existing research that employs the ADL/GECM model would be called into question. Given that *Political Analysis* is one of the most cited journals in political science and the symposium included some of the top time series practitioners in the discipline, we believe it is valuable to clarify that mixing orders of integration is not always a problem and that existing time series research is not inherently flawed. Furthermore, the one article that has responded to particular claims made in the symposium contribution did not address the symposium’s definition of equation balance (Enns et al. 2016).[5] We hope our article helps clarify the concept of equation balance for those who use time series analysis.

We also illustrate the importance of our ﬁndings with an applied example—income inequality in the United States. The example illustrates how the use of pre-whitening to force variables to be of equal orders of integration (when the equation is already balanced) can be quite costly, leading researchers to fail to detect relationships.[6]

The contributors to the *PA* symposium were all correct to emphasize equation balance. Time series analysis requires a balanced equation. An unbalanced equation is misspecified by definition, typically resulting in serially correlated residuals and an increased probability of Type I errors.[7] As noted above, Banerjee et al. (1993, 164, italics ours) explain that an unbalanced equation is a regression, “in which the regressand is not the same order of integration as the regressors, *or any linear combination of the regressors*.” Our primary concern is that much of the discussion in the *PA* symposium seems to focus on the order of integration of each variable in the equation without acknowledging that a “linear combination of the regressors” can also produce equation balance. We worry that researchers might interpret this focus to mean that equation balance requires each series in the model to be the same order of integration.[8] Such a conclusion would be wrong. As the previous quote from Banerjee et al. (1993) indicates (see also Maddala and Kim 1998, 251), if the regressand and the regressors are not the same order of integration, the equation will still be balanced if a linear combination of the variables is the same order of integration.

As Grant and Lebo (2016, 7) and Keele, Linn and Webb (2016a, 83) acknowledge, cointegration offers a useful illustration of how an equation can be balanced even when the regressand and regressors are not the same order of integration.[9] Consider two integrated *I*(1) variables, *Y* and *X*, in a standard GECM model:

(1)

Clearly, the equation mixes orders of integration. We have a stationary regressand ($\Delta Y_t$) and a combination of integrated ($Y_{t-1}$, $X_{t-1}$) and stationary ($\Delta X_t$) regressors. However, if *X* and *Y* are cointegrated, the equation is still balanced. To see why, we can rewrite Equation 1 as:

(2)

*X* and *Y* are cointegrated when *X* and *Y* are both integrated (of the same order) and the coefficients on $Y_{t-1}$ and $X_{t-1}$ are non-zero (and the coefficient on $Y_{t-1}$ is negative). Because cointegration ensures that *Y* and *X* maintain an equilibrium relationship, a linear combination of these variables exists that is stationary (that is, if we regress *Y* on *X*, in levels, the residuals would be stationary).[10] As noted above, this (stationary) linear combination is captured by the error correction term in Equation 2. Additionally, since *Y* and *X* are both integrated of order one, $\Delta Y_t$ and $\Delta X_t$ will be stationary. Thus, cointegration ensures that the equation is balanced: the regressand ($\Delta Y_t$) and either the regressors ($\Delta X_t$) or a linear combination of the regressors are all stationary. Importantly, if we added a stationary regressor to the model, e.g., if we thought innovations in *Y* were also influenced by a stationary variable, the equation would still be balanced.
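Written out, the rearrangement from Equation 1 to Equation 2 is short. The coefficient symbols below ($\alpha_0$, $\alpha_1$, $\beta_0$, $\beta_1$) follow a common single-regressor GECM convention and are an assumption here, not necessarily the notation of the original equations:

```latex
\begin{align*}
\Delta Y_t &= \alpha_0 + \alpha_1 Y_{t-1} + \beta_0 \Delta X_t + \beta_1 X_{t-1} + \varepsilon_t \\
           &= \alpha_0 + \alpha_1 \left( Y_{t-1} + \frac{\beta_1}{\alpha_1} X_{t-1} \right)
              + \beta_0 \Delta X_t + \varepsilon_t
\end{align*}
```

In the second line, the term in parentheses is the linear combination of the two $I(1)$ levels. Under cointegration that combination is $I(0)$, so every term on the right-hand side matches the $I(0)$ regressand.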

The fact that the GECM, which mixes stationary and integrated regressand and regressors, is appropriate when cointegration is present demonstrates that equation balance does not require the series to be the same order of integration. As we have mentioned, Grant and Lebo (2016, 7) acknowledge that a GECM is balanced if co-integration is present and Keele, Linn and Webb (2016a, 83) make this point in their second contribution to the symposium citing Banerjee et al. (1993) in their discussion of equation balance. However, as noted above, we have begun to encounter research that interprets other statements in the symposium to mean that analysts can never mix orders of integration. For example, in their discussion of Volscho and Kelly (2012), Grant and Lebo write that the “data is a mix of data types (stationary and integrated), so any hypothesis tests will be based on unbalanced equations” (supplementary appendix, p. 48). But is this really the case? The above example shows that when cointegration is present, equation balance can exist even when the orders of integration are mixed.

Below, we use simulations to illustrate two seemingly less well-known scenarios when equation balance exists despite different orders of integration. Again, our goal is not to identify all cases where different orders of integration can result in equation balance. Rather, we want to show that researchers should not automatically equate different orders of integration with an unbalanced equation. Situations exist where it is completely appropriate to estimate models with different orders of integration.

We begin with an integrated *Y* and a stationary *X*. At ﬁrst glance, estimating a relationship between these variables, which requires mixing an *I*(1) and *I*(0) series, might seem problematic. Grant and Lebo (2016, 4) explain, “Mixing together series of various orders of integration will mean a model is misspecified” and in econometric texts, mixing *I*(1) and *I*(0) series offers a classic example of an unbalanced equation (Banerjee et al. 1993, 79, Maddala and Kim 1998, 252).[11]

It is still possible to estimate the relationship between an integrated *Y* and a stationary *X* in a correctly speciﬁed and balanced equation. First, we must recognize that when Banerjee et al. (1993) (see also Mankiw and Shapiro (1986) and Maddala and Kim (1998)) state that an *I*(1) and *I*(0) series represent an unbalanced equation, they are modeling the equation:

(3)

Equation 3 is indeed unbalanced (and thus misspecified) as the regressand is integrated and the regressor is stationary. This result does not, however, mean that we cannot consider these two series. A stationary series, *X*, might be related to innovations in an integrated series, *Y*. If so, we could model this process with an autoregressive distributed lag model:

(4)

Much as before, this might appear to still be an unbalanced equation. We continue to mix *I*(1) and *I*(0) series, which seemingly violates Lebo and Grant’s (2016, 71) conclusion that, “One cannot mix together stationary, unit–root, and fractionally integrated variables in either the GECM or the ADL.”[12] However, since *Y* is *I*(1), the coefficient on $Y_{t-1}$ equals 1, which means the lagged level can be moved to the left-hand side as $Y_t - Y_{t-1} = \Delta Y_t$. Thus, we can rewrite the equation as,

(5)

Because *Y* is an integrated, *I*(1), series, $\Delta Y_t$ must be stationary. Thus, the regressand and regressors are all *I*(0) series. As Banerjee et al. (1993, 169) explain, “regressions that are linear transformations of each other have identical statistical properties. What is important, therefore, is the possibility of transforming in such a way that the regressors are integrated of the same order as the regressand.”[13] Thus, Equation 5 shows that the ADL in Equation 4 is indeed balanced. (Because the GECM is algebraically equivalent to the ADL, the GECM would, by definition, also be balanced in this example.)
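The step from the ADL in Equation 4 to Equation 5 can be made explicit. The symbols are an assumed, conventional notation rather than a reproduction of the original equations, and the substitution $\alpha_1 = 1$ is the unit-root condition described in the text:

```latex
\begin{align*}
Y_t &= \alpha_0 + \alpha_1 Y_{t-1} + \beta_0 X_t + \beta_1 X_{t-1} + \varepsilon_t
      && \text{(ADL)} \\
Y_t - Y_{t-1} &= \alpha_0 + \beta_0 X_t + \beta_1 X_{t-1} + \varepsilon_t
      && \text{(imposing } \alpha_1 = 1\text{)} \\
\Delta Y_t &= \alpha_0 + \beta_0 X_t + \beta_1 X_{t-1} + \varepsilon_t
\end{align*}
```

Both sides of the last line are $I(0)$: $\Delta Y_t$ because $Y$ is $I(1)$, and $X_t$ and $X_{t-1}$ because $X$ is stationary.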

The above discussion suggests that we can use an ADL to estimate the relationship between an integrated *Y* and stationary *X*. To test these expectations, we conduct a series of Monte Carlo experiments. We generate an integrated *Y* with the following DGP:

(6)

We generate the stationary time series *X* with the following DGP, where the autoregressive parameter equals 0.0 or 0.5:

(7)

Notice that *X* and *Y* are independent series. Particularly with dependent series that contain a unit root (as is the case here), the dominant concern in time series literature is the potential for estimating spurious relationships (e.g., Granger and Newbold 1974, Grant and Lebo 2016, Yule 1926). Thus, our ﬁrst simulations seek to identify the percentage of analyses that would incorrectly reject the null hypothesis of no relationship between a stationary *X* and integrated *Y* with an ADL. As noted above, in light of the recommendations in the *PA* symposium to never mix orders of integration, this approach seems highly problematic. However, if the equation is balanced as we suggest, the false rejection rate in our simulations should only be about 5 percent.

In the following simulations, *T* is set to 50 and then 1,000. These values allow us to evaluate both a short time series that political scientists often encounter and a long time series that will approximate the asymptotic properties of the series. We use the DGP from Equations 6 and 7, above, to generate 1,000 simulated data sets. Recall that in our stationary series, the autoregressive parameter equals 0.5 or 0.0 and *Y* and *X* are never related. To evaluate the relationship between *X* and *Y*, we estimate the ADL model in Equation 4.[14]

Table 1 reports the average estimated relationship across all simulations between *X* and *Y* (the coefficients on $X_t$ and $X_{t-1}$ in Equation 4) and the percent of simulations in which these relationships were statistically significant. The mean estimated relationship is close to zero and the Type I error rate is close to 5 percent. With this ADL specification, when *Y* is integrated and *X* is stationary, mixing integrated and stationary time series does not increase the risk of spurious regression.[15]
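A rough, standard-library-only sketch of this kind of experiment follows. It is not the authors' simulation code. It assumes that *Y* is a pure random walk, that *X* is an independent AR(1) process, and that a normal critical value of 1.96 is adequate. For speed it uses *T* = 200 and 400 replications rather than the settings in the text. All function names are hypothetical.

```python
import random


def ols(rows, y):
    """OLS via the normal equations; returns coefficients and standard errors."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Augmented matrix [X'X | X'y | I], reduced by Gauss-Jordan elimination.
    aug = [xtx[i] + [xty[i]] + [float(i == j) for j in range(k)] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] for i in range(k)]
    inv = [aug[i][k + 1:] for i in range(k)]  # (X'X)^{-1}
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (len(y) - k)
    se = [(sigma2 * inv[i][i]) ** 0.5 for i in range(k)]
    return beta, se


def simulate_once(T, theta, rng):
    """One draw: Y is a random walk, X an independent AR(1) with parameter theta."""
    y, x = [0.0], [0.0]
    for _ in range(T):
        y.append(y[-1] + rng.gauss(0, 1))
        x.append(theta * x[-1] + rng.gauss(0, 1))
    # ADL(1,1): regress Y_t on a constant, Y_{t-1}, X_t and X_{t-1}.
    rows = [[1.0, y[t - 1], x[t], x[t - 1]] for t in range(1, T + 1)]
    beta, se = ols(rows, y[1:])
    return beta[2] / se[2], beta[3] / se[3]  # t-statistics on X_t and X_{t-1}


def false_rejection_rates(T=200, theta=0.5, sims=400, crit=1.96, seed=1):
    """Share of simulations in which each X coefficient looks 'significant'."""
    rng = random.Random(seed)
    hits = [0, 0]
    for _ in range(sims):
        for i, t in enumerate(simulate_once(T, theta, rng)):
            hits[i] += abs(t) > crit
    return [h / sims for h in hits]


if __name__ == "__main__":
    print(false_rejection_rates())
```

Because *X* and *Y* are generated independently, any rejection is a false positive, so the two returned shares can be read directly as estimated Type I error rates for the coefficients on $X_t$ and $X_{t-1}$.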

Results in Table 2 show that the same pattern of results emerges when *X* is integrated and *Y* is stationary.[16] Most time series analysis in the political and social sciences could be accused of mixing orders of integration. Thus, the recommendations of the *PA* symposium could be interpreted as calling this research into question. We have shown, however, that mixing orders of integration does not automatically imply an unbalanced equation. It also does not automatically lead to spurious results.

We think the foregoing discussion and analyses offer compelling evidence that, despite the range of statements about equation balance in the *PA* symposium, mixing orders of integration when using a GECM/ADL does not automatically pose a problem to researchers. Of course, to a large degree the previous sections reiterate and unpack what econometricians have shown mathematically (e.g., Sims, Stock and Watson 1990), and so may come as little surprise to some readers (especially those who have not read the *PA* Symposium). Here, we use an applied example to illustrate the importance of correctly understanding equation balance. We turn to a recent article by Volscho and Kelly (2012) that analyzes the rapid income growth among the super-rich in the United States (US). They estimate a GECM of pre-tax income growth among the top 1% and find evidence that political, policy, and economic variables influence the proportion of income going to those at the top. Critically for our purposes, they include stationary and integrated variables on the right-hand side, which Grant and Lebo (2016, 26) actually single out as a case where the “GECM model [is] inappropriate with mixed orders of integration.” Grant and Lebo go on to assert that Volscho and Kelly’s “data is a mix of data types (stationary and integrated), so any hypothesis tests will be based on unbalanced equations” (supplementary appendix, p. 48). Based on the conclusion that mixing orders of integration produces an unbalanced equation, Grant and Lebo employ fractional error correction technology and find that none of the political or policy variables (and only some economic variables) matter for incomes among the top 1%.

These are very different findings, ones with potential policy consequences, and so it is important to reconsider what Volscho and Kelly did, and whether the mixed orders of integration pose a problem for their analysis. To begin our analysis, we present the dependent variable from Volscho and Kelly, the total pre-tax income share of the top 1% for the period between 1913 and 2008.[17] In Figure 1 we can see that income shares start off quite high and then drop and then return to inter-war levels toward the end of the series. The variable thus exhibits none of the trademarks of a stationary series, i.e., it is not mean-reverting, and looks to contain a unit root instead. Notice that the same is true for the shorter period encompassed by Volscho and Kelly’s analysis, 1949-2008. Augmented Dickey-Fuller (ADF) and Phillips–Perron unit root tests confirm these suspicions, and are summarized in the first row of Table 3, below.[18]

**Figure 1: The Top 1 Percent’s Share of Pre-tax Income in the United States, 1913 to 2008**

What about the independent variables? Here, we ﬁnd a mix (see Table 3). Some variables clearly are nonstationary and also appear to contain unit roots: the capital gains tax rate, union membership, the Treasury Bill rate, Gross Domestic Product (logged), and the Standard and Poor 500 composite index. The top marginal tax rate also is clearly nonstationary and we cannot reject a unit root even when taking into account the secular (trending) decline over time. The results for the Shiller Home Price Index are mixed and trade openness is on the statistical cusp, and there is reason—based on the size of the autoregressive parameter (-0.29) and the fact that we reject the unit root over a longer stretch of time—to assume that the variable is stationary. For the other variables included in the analysis, we reject the null hypothesis of a unit root: Democratic president, and the Percentage of Democrats in Congress. These ﬁndings seem to comport with what Volscho and Kelly found (see their supplementary materials).[19]
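To give a sense of what a unit root test computes, here is a minimal sketch of a plain Dickey-Fuller regression with a constant and no lag augmentation. It is not the ADF or Phillips–Perron specification behind Table 3, and it uses simulated series rather than the Volscho and Kelly data. The function name is hypothetical, and the comparison value of about -2.86 is the standard asymptotic 5 percent critical value for the constant-only case.

```python
import random


def dickey_fuller_t(series):
    """t-statistic on rho in  dY_t = a + rho * Y_{t-1} + e_t  (no lagged differences)."""
    lag = series[:-1]
    diff = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(diff)
    mx, my = sum(lag) / n, sum(diff) / n
    sxx = sum((x - mx) ** 2 for x in lag)
    sxy = sum((x - mx) * (d - my) for x, d in zip(lag, diff))
    rho = sxy / sxx
    a = my - rho * mx
    ssr = sum((d - a - rho * x) ** 2 for x, d in zip(lag, diff))
    se = (ssr / (n - 2) / sxx) ** 0.5  # standard error of rho
    return rho / se


rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(300)]  # stationary by construction
walk = [0.0]
for e in noise:
    walk.append(walk[-1] + e)                  # unit root by construction

if __name__ == "__main__":
    print("white noise:", round(dickey_fuller_t(noise), 2))
    print("random walk:", round(dickey_fuller_t(walk), 2))
```

A statistic well below the critical value (as for the white-noise series) rejects the unit-root null; a statistic above it does not. In applied work the lag length and deterministic terms matter, which is why the augmented versions reported in the text are used.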

Volscho and Kelly proceed to estimate a GECM of the top 1% income share including current ﬁrst differences and lagged levels of the stationary and integrated variables. So far, the diagnostics support their decision (integrated DV, some IVs are integrated, and we ﬁnd evidence of cointegration).[20] The fact that stationary variables are also included in the model should not affect equation balance. However, in order to evaluate the robustness of Volscho and Kelly’s results, we re-consider their data with Pesaran and Shin’s ARDL (Autoregressive Distributed Lag) critical bounds testing approach (Pesaran, Shin and Smith 2001). Although political scientists typically refer to the autoregressive distributed lag model as an ADL, Pesaran, Shin and Smith (2001) prefer ARDL. For their bounds test of cointegration, they estimate the model as a GECM.[21]

The ARDL approach is one of the approaches recommended by Grant and Lebo and is especially advantageous in the current context because two critical values are provided, one which assumes all stationary regressors and one which assumes all integrated regressors. Values in between these “bounds” correspond to a mix of integrated and stationary regressors, meaning the bounds approach is especially appropriate when the analysis includes both types of regressors. Grant and Lebo (2016, 19) correctly acknowledge that “With the bounds testing approach, the regressors can be of mixed orders of integration—stationary, non-stationary, or fractionally integrated—and the use of bounds allow the researcher to make inferences even when the integration of the regressors is unknown or uncertain.”[22] Since Table 3 indicates we have a mix of stationary and integrated regressors, if our test statistic exceeds the upper bound, we will have evidence of cointegration.

The ARDL approach proceeds in several steps.[23] First, if the dependent variable is integrated, the ARDL model (which is equivalent to the GECM) is estimated. Next, if the residuals from this model are stationary, an F-test is conducted to evaluate the null hypothesis that the combined effect of all lagged variables in the model equals zero. This F statistic is compared to the appropriate critical values (Pesaran, Shin and Smith 2001). We rely on the small-sample critical values from Narayan (2005). If there is evidence of cointegration, both long and short-run relationships from the initial ARDL (i.e., ADL/GECM) model can be evaluated.

Our analysis focuses on Column 5 from Volscho and Kelly’s Table 1, which is their preferred model. The first column of our Table 4, below, shows that we successfully replicate their results. The ARDL analysis appears in Column 2.[24] The key difference between this specification and that of Volscho and Kelly is that they (based on a Breusch-Godfrey test) employed the Prais-Winsten estimator to correct for serially correlated errors and we do not. Our decision reflects the fact that other tests do not reject the null of white noise, e.g., the Portmanteau (Q) test produces a p-value of 0.12, and it allows us to compare the results with and without the correction. Also note that an expanded model including lagged differenced dependent and independent variables (see Appendix Table A-1) produces very similar estimates to those shown in column 2 of Table 4, and a Breusch-Godfrey test indicates that the resulting residuals are uncorrelated.

To begin with, we need to test for cointegration. For this, we compare the F-statistic from the lagged variables (6.54) with the Narayan (2005) upper (*I*(1)) critical value (3.82), which provides evidence of cointegration.[25] The bounds t-test also supports this inference, as the t-statistic (-7.75) for the parameter is greater (in absolute terms) than the *I*(1) bound tabulated by Pesaran, Shin, and Smith (2001, 303). Returning to the results in Column 2, we see that the ARDL approach produces similar conclusions to Column 1. (Philips (2016) uses the ARDL approach to re-consider the first model in Volscho and Kelly’s (2012) Table 1 and also obtains similar results.) The coefficients for all but two of the independent variables have similar effects, i.e., the same sign and statistical significance.[26] The exceptions are Divided Government_{(t-1)} and Trade Openness, for which the coefficients using the two approaches are similar but the standard errors differ substantially. Consistent with the existing research on the subject, we find evidence that economics, politics, and policy matter for the share of income going to the top 1 percent.
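The decision rule behind the bounds F-test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' replication code. The function name is hypothetical, and the lower (*I*(0)) bound used in the example is a placeholder, because the text reports only the upper (*I*(1)) critical value of 3.823.

```python
def bounds_test_decision(f_stat, lower_bound, upper_bound):
    """Classify a bounds-test F statistic against the I(0) and I(1) critical values."""
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    if f_stat > upper_bound:
        # Above the I(1) bound: evidence of cointegration regardless of the
        # mix of stationary and integrated regressors.
        return "cointegration"
    if f_stat < lower_bound:
        # Below the I(0) bound: no evidence of cointegration.
        return "no cointegration"
    # Between the bounds: the result depends on the regressors' orders of integration.
    return "inconclusive"


# F statistic and I(1) critical value as reported in the text.
F_STAT = 6.54
UPPER_I1 = 3.823
# ASSUMPTION: the I(0) bound is not reported in the text; 2.5 is a placeholder
# chosen only so the example runs. Look up the real value in Narayan (2005).
LOWER_I0_PLACEHOLDER = 2.5

print(bounds_test_decision(F_STAT, LOWER_I0_PLACEHOLDER, UPPER_I1))  # cointegration
```

Because the reported F statistic lies above the upper bound, the conclusion here does not depend on the placeholder lower bound.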

Although Grant and Lebo (2016, 18) recommend both the ARDL approach and a three-step fractional error correction model (FECM) approach, they only report the results for the latter in their re-analysis of Volscho and Kelly.[27] It turns out that the two approaches produce very different results. This can be seen in column 3 of Table 4, which reports Grant and Lebo’s FECM reanalysis of Volscho and Kelly Model 5 (from Grant and Lebo’s supplementary appendix, p. 50). With their approach, only the change in stock prices (Real S&P 500 Index) and Trade Openness are statistically significant (p<.05) predictors of income shares, though levels of stock prices and trade openness also matter via the FECM component, which captures disequilibria between those variables and lagged income shares. Despite theoretical and empirical evidence suggesting that the marginal tax rate (Mertens 2015, Piketty, Saez and Stantcheva 2014), union strength (Jacobs and Myers 2014, Pontusson 2013, Western and Rosenfeld 2011), and the partisan composition of government (Bartels 2008, Hibbs 1977, Kelly 2009) can influence the pre-tax income of the upper 1 percent, we would conclude that only trade openness and stock prices influence the pre-tax income share of the richest Americans. Of course, analysts might reasonably prefer alternative models to the ones Volscho and Kelly estimate, perhaps opting for a more parsimonious specification, allowing endogenous relationships, and/or including alternate lag specifications. The key point is that, given the particular model, the ARDL and three-step FECM produce very different estimates.
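To make the pre-whitening step of the three-step FECM more concrete, the sketch below shows one common way to fractionally difference a series: expand $(1-L)^d$ as a binomial series and truncate it at the start of the sample. This is an illustration under stated assumptions, not Grant and Lebo's implementation. The function names are hypothetical, and estimating *d* itself is outside the scope of the sketch.

```python
def frac_diff_weights(d, n_weights):
    """Weights of (1 - L)^d: w_0 = 1 and w_k = w_{k-1} * (k - 1 - d) / k."""
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply (1 - L)^d to a series, treating pre-sample values as zero."""
    w = frac_diff_weights(d, len(series))
    return [
        sum(w[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]


# Sanity check: d = 1 reproduces ordinary first differences. The first value
# is left unchanged because it has no predecessor in the sample.
print(frac_diff([1.0, 3.0, 6.0, 10.0], 1.0))  # [1.0, 2.0, 3.0, 4.0]

# With a fractional d, every past observation receives a nonzero weight.
print(frac_diff_weights(0.5, 4))  # [1.0, -0.5, -0.125, -0.0625]
```

With *d* = 0 the series is returned unchanged, and with *d* = 1 the filter is the usual first difference. Values in between remove only part of the persistence, so the choice of *d* determines how much long-run information survives the transformation.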

In his contribution to the *PA* symposium, John Freeman wrote, “It now is clear that equation balance is not understood by political scientists” (Freeman 2016, 50). Our goal has been to help clarify misconceptions about equation balance. In particular, we have shown that mixing orders of integration in a GECM/ADL model does not automatically lead to an unbalanced equation. As the title of Lebo and Grant’s second contribution to the symposium (“Equation Balance and Dynamic Political Modeling”) illustrates, equation balance was a central theme of the symposium. Although others have responded to particular criticisms within the *PA* symposium (e.g., Enns et al. 2016), this article is the first to address the symposium’s discussion and recommendations related to equation balance.

Because they are related, it is easy to (erroneously) conclude that mixing orders of integration is synonymous with an unbalanced equation. It would be wrong, however, to reach this conclusion. We have focused on two types of time series: stationary and unit–root series and we have found that situations exist when it is unproblematic—and inconsequential—to mix these types of series (because the equation is balanced).[28]

These results help clarify existing time series research (e.g., Banerjee et al. 1993, Sims, Stock and Watson 1990) by showing that when we use a GECM/ADL to model dynamic processes, even mixed orders of integration can produce balanced equations. The ﬁndings also lead to three recommendations for researchers. First, scholars should not automatically dismiss existing time series research that mixes orders of integration. Even when series are of different orders of integration or when the equation transforms variables in a way that leads to different orders of integration, the equation may still be balanced and the model correctly speciﬁed. In fact, we identified, and our simulations conﬁrmed, speciﬁc scenarios when integrated and stationary time series can be analyzed together. Second, as we showed with our simulations and with our applied example, researchers must evaluate whether they have equation balance based on both the univariate properties of their variables and the model they specify. Third and ﬁnally, our results show that researchers do not always need to pre-whiten their data to ensure equation balance. Although pre-whitening time series will sometimes be appropriate, we have shown that this step is not a necessary condition for equation balance. This is important because such data transformations are potentially quite costly, speciﬁcally, in the presence of equilibrium relationships. As we saw above, Grant and Lebo’s decision to pre-whiten Volscho and Kelly’s data with their three-step FECM may be one such example.

Banerjee, Anindya, Juan Dolado, John W. Galbraith and David F. Hendry. 1993. *Co-Integration, Error Correction, and the Econometric Analysis of Non-Stationary Data*. Oxford: Oxford University Press.

Bartels, Larry M. 2008. *Unequal Democracy*. Princeton: Princeton University Press.

Box-Steffensmeier, Janet and Agnar Freyr Helgason. 2016. “Introduction to Symposium on Time Series Error Correction Methods in Political Science.” *Political Analysis* 24(1):1–2.

De Boef, Suzanna and Luke Keele. 2008. “Taking Time Seriously.” *American Journal of Political Science* 52(1):184–200.

Enns, Peter K., Nathan J. Kelly, Takaaki Masaki and Patrick C. Wohlfarth. 2016. “Don’t Jettison the General Error Correction Model Just Yet: A Practical Guide to Avoiding Spurious Regression with the GECM.” *Research and Politics* 3(2):1–13.

Ericsson, Neil R. and James G. MacKinnon. 2002. “Distributions of Error Correction Tests for Cointegration.” *Econometrics Journal* 5(2):285–318.

Esarey, Justin. 2016. “Fractionally Integrated Data and the Autodistributed Lag Model: Results from a Simulation Study.” *Political Analysis* 24(1):42–49.

Freeman, John R. 2016. “Progress in the Study of Nonstationary Political Time Series: A Comment.” *Political Analysis* 24(1):50–58.

Granger, Clive W.J., Namwon Hyung and Yongil Jeon. 2001. “Spurious Regressions with Stationary Series.” *Applied Economics* 33:899–904.

Granger, Clive W.J. and Paul Newbold. 1974. “Spurious Regressions in Econometrics.” *Journal of Econometrics* 2(2):111–120.

Grant, Taylor and Matthew J. Lebo. 2016. “Error Correction Methods with Political Time Series.” *Political Analysis* 24(1):3–30.

Hibbs, Jr., Douglas A. 1977. “Political Parties and Macroeconomic Policy.” *American Political Science Review* 71(4):1467–1487.

Jacobs, David and Lindsey Myers. 2014. “Union Strength, Neoliberalism, and Inequality.” *American Sociological Review* 79(4):752–774.

Keele, Luke, Suzanna Linn and Clayton McLaughlin Webb. 2016a. “Concluding Comments.” *Political Analysis* 24(1):83–86.

Keele, Luke, Suzanna Linn and Clayton McLaughlin Webb. 2016b. “Treating Time with All Due Seriousness.” *Political Analysis* 24(1):31–41.

Kelly, Nathan J. 2009. *The Politics of Income Inequality in the United States*. New York: Cambridge University Press.

Lebo, Matthew J. and Taylor Grant. 2016. “Equation Balance and Dynamic Political Modeling.” *Political Analysis* 24(1):69–82.

Maddala, G.S. and In-Moo Kim. 1998. *Unit Roots, Cointegration, and Structural Change*. New York: Cambridge University Press.

Mankiw, N. Gregory and Matthew D. Shapiro. 1986. “Do We Reject Too Often? Small Sample Properties of Tests of Rational Expectations Models.” *Economics Letters* 20(2):139–

Mertens, Karel. 2015. “Marginal Tax Rates and Income: New Time Series Evidence.” https://mertens.economics.cornell.edu/papers/MTRI_september2015.pdf.

Murray, Michael P. 1994. “A Drunk and Her Dog: An Illustration of Cointegration and Error Correction.” *The American Statistician* 48(1):37–39.

Narayan, Paresh Kumar. 2005. “The Saving and Investment Nexus for China: Evidence from Cointegration Tests.” *Applied Economics* 37(17):1979–1990.

Pagan, A.R. and M.R. Wickens. 1989. “A Survey of Some Recent Econometric Methods.” *The Economic Journal* 99(398):962–1025.

Pesaran, Hashem M., Yongcheol Shin and Richard J. Smith. 2001. “Bounds Testing Approaches to the Analysis of Level Relationships.” *Journal of Applied Econometrics* 16(3):289–326.

Philips, Andrew Q. 2016. “Have Your Cake and Eat it Too? Cointegration and Dynamic Inference from Autoregressive Distributed Lag Models.” Working Paper.

Piketty, Thomas and Emmanuel Saez. 2003. “Income Inequality in the United States, 1913–1998.” *Quarterly Journal of Economics* 118(1):1–39.

Piketty, Thomas, Emmanuel Saez and Stefanie Stantcheva. 2014. “Optimal Taxation of Top Labor Incomes: A Tale of Three Elasticities.” *American Economic Journal: Economic Policy* 6(1):230–271.

Pontusson, Jonas. 2013. “Unionization, Inequality and Redistribution.” *British Journal of Industrial Relations* 51(4):797–825.

Sánchez Urribarrí, Raúl A., Susanne Schorpp, Kirk A. Randazzo and Donald R. Songer. 2011. “Explaining Changes to Rights Litigation: Testing a Multivariate Model in a Comparative Framework.” *Journal of Politics* 73(2):391–405.

Sims, Christopher A., James H. Stock and Mark W. Watson. 1990. “Inference in Linear Time Series Models with Some Unit Roots.” *Econometrica* 58(1):113–144.

Volscho, Thomas W. and Nathan J. Kelly. 2012. “The Rise of the Super-Rich: Power Resources, Taxes, Financial Markets, and the Dynamics of the Top 1 Percent, 1949 to 2008.” *American Sociological Review* 77(5):679–699.

Western, Bruce and Jake Rosenfeld. 2011. “Unions, Norms, and the Rise of U.S. Wage Inequality.” *American Sociological Review* 76(4):513–537.

Wlezien, Christopher. 2000. “An Essay on ‘Combined’ Time Series Processes.” *Electoral Studies* 19(1):77–93.

Yule, G. Udny. 1926. “Why do we Sometimes get Nonsense-Correlations between Time-Series?–A Study in Sampling and the Nature of Time-Series.” *Journal of the Royal Statistical Society* 89:1–63.

[1] A previous version of this paper was presented at the Texas Methods Conference, 2017. We would like to thank Neal Beck, Patrick Brandt, Harold Clarke, Justin Esarey, John Freeman, Nate Kelly, Jamie Monogan, Mark Pickup, Pablo Pinto, Randy Stevenson, Thomas Volscho, and two anonymous reviewers for helpful comments and suggestions. All replication materials are available on *The Political Methodologist* Dataverse site (https://dataverse.harvard.edu/dataverse/tpmnewsletter).

[2] The GECM and ADL are the same model (e.g., Banerjee et al. 1993, De Boef and Keele 2008, Esarey 2016). However, since the two models estimate different quantities of interest (Enns, Kelly, Masaki and Wohlfarth 2016), they are often discussed as two separate models.

[3] Speciﬁcally, Freeman (2016, 50) explains, “KLWs [Keele, Linn, and Webb] claim that unbalanced equations are ‘nonsensical’ (16, fn. 4) and GLs [Grant and Lebo] recommendation to ‘set aside’ unbalanced equations (7) are a bit overdrawn. Banerjee et al. (1993) and others discuss the estimation of unbalanced equations. They simply stress the need to use particular nonstandard distributions in these cases.”

[4] Given the prominence of the authors as well as the *Political Analysis* journal, it is perhaps not surprising that practitioners have begun to adopt these recommendations.

[5] Enns et al. (2016) focused on how to correctly implement and interpret the GECM.

[6] Of course, equation balance is not the only relevant consideration. Researchers must check that their model satisfies other assumptions, such as no autocorrelation in the residuals and no omitted variables.

[7] See, e.g., Banerjee et al. (1993, 164-168), Maddala and Kim (1998, 251-252), and Pagan and Wickens (1989, 1002).

[8] For example, Grant and Lebo (p.7-8) write, “Additionally, any loss of equation balance makes a cointegration test dubious so, again, if the dependent variable is *I*(1), then the model should only include *I*(1) independent variables.”

[9] See Murray (1994) for a discussion of cointegration.

[10] This, in fact, is the first step of the Engle-Granger two-step method of testing for cointegration.

[11] Interestingly, existing simulations show that despite being unbalanced regressions, we will not ﬁnd evidence that unrelated *I*(0) and *I*(1) series are (spuriously) related in a simple bivariate regression if the *I*(0) variable is AR(0) (see, e.g., Banerjee et al. (1993, 79), Granger, Hyung and Jeon (2001, 901), and Maddala and Kim (1998, 252)). Banerjee et al. explain that the only way in which OLS can make the regression consistent and minimize the sum of squares is to drive the coefficient to zero (p.80). Our own simulations conﬁrm that when estimating unbalanced regressions with AR(1) and *I*(1) series, both serial correlation and inﬂated Type I error rates emerge.

[12] Although fractionally integrated variables may also be of interest to researchers, this example focuses on stationary and integrated processes, which offer a clear illustration of the consequences of mixing orders of integration.

[13] Banerjee et al. (1993) wrote this in the context of a discussion of equation balance among cointegrated variables, but the point applies equally well in this context.

[14] The ADL is mathematically equivalent to the general error correction model (GECM), so the GECM would produce the same results, as long as the parameters are interpreted correctly (see Enns et al. 2016).

[15] The simulations reported in Table 1 also indicate that the ADL specification addresses the issue of serially correlated residuals, which would not be the case with an unbalanced regression. When $\theta_x=0$ and *T*=50, a Breusch-Godfrey test rejects the null of no serial correlation just 6.5% of the time. When *T*=1,000, we find evidence of serially correlated residuals in just 4.6% of the simulations. When $\theta_x=0.5$, the corresponding rates are 6.4% (*T*=50) and 4.6% (*T*=1,000).

[16] The fact that we do not observe evidence of an increased rate of spurious regression in Table 2, particularly when *Y* is AR(1), implies that we do not have an equation balance problem. We also ﬁnd that the simulations in Table 2 tend not to produce serially correlated residuals (we only reject the null of no serial correlation in 6.3% and 5.2% of simulations when *T*=50 and 4.9% and 3.3% of simulations when *T*=1,000).

[17] These data, which come from Volscho and Kelly, were originally compiled by Piketty and Saez (2003).

[18] These results are consistent with the unit root tests Volscho and Kelly report in the supplementary materials to their article. Grant and Lebo’s analysis also supports this conclusion. In their supplementary appendix, Grant and Lebo estimate the order of integration *d*=0.93 with a standard error of (0.10), indicating they cannot reject the null hypothesis that *d*=1.0.

[19] Although the dependent variable is pre-tax income, Volscho and Kelly identify several mechanisms that could lead tax rates to inﬂuence pre-tax income share (also see, Mertens 2015, Piketty, Saez and Stantcheva 2014). Based on existing research, it also would not be surprising if we observed evidence of a relationship between the top 1 percent’s income share and union strength (for recent examples, see Jacobs and Myers 2014, Pontusson 2013, Western and Rosenfeld 2011) and the partisan composition of government (Bartels 2008, Hibbs 1977, Kelly 2009).

[20] When using the error correction parameter in the GECM to evaluate cointegration, the correct Ericsson and MacKinnon (2002) critical values must be used. When doing so, we find evidence of cointegration for Volscho and Kelly’s (2012) preferred specification (Model 5).

[21] Recall that the ADL, ARDL, and GECM all refer to equivalent models.

[22] It is not clear why Grant and Lebo seemingly contradict their statement that “Mixing together series of various orders of integration will mean a model is misspeciﬁed” (p.4) in this context, especially since the ARDL is equivalent to the GECM, but they are correct to do so.

[23] For a concise overview of the ARDL approach, see http://davegiles.blogspot.ca/2013/06/ardl-models-part-ii-bounds-tests.html.

[24] We exactly follow their lag structure and the assumption of a single endogenous variable, which seemingly is incorrect but possibly intractable.

[25] The 5 percent critical value when *T*=60 with an unrestricted intercept and no trend is 3.823. Narayan (2005) only reports critical values for up to 7 regressors. However, the size of the critical value decreases as the number of regressors increases (Narayan 2005, Pesaran, Shin and Smith 2001), so our reliance on the critical value based on 7 regressors is actually a conservative test of cointegration. We also tested for cointegration allowing for short-run effects of all integrated variables and we again find evidence of cointegration (F = 4.32).

[26] This reveals that explicitly taking into account serial correlation, which Volscho and Kelly did, has modest consequences.

[27] As Grant and Lebo (2016, 18) explain, the three-step FECM proceeds as follows. First, *Y* is regressed on *X* and the residuals are obtained. The fractional difference parameter, *d*, is then estimated for each of the three series (*Y*, *X*, and the residuals). Grant and Lebo explain that if *d* for the residuals is less than *d* for both *X* and *Y*, then error correction is occurring. If this is the case, the researcher then fractionally differences *Y*, *X*, and the residuals, each by its own *d* value. Finally, the researcher regresses the fractionally differenced *Y* on the fractionally differenced *X* and the lag of the fractionally differenced residuals (Grant and Lebo 2016, 18). This regression produces the results reported in Column 3 of Table 4.

[28] Of course, other statistical assumptions must also be satisﬁed. In other work, we have also considered combined time series that contain both stationary and unit–root properties (Wlezien 2000). We ﬁnd that when we analyze combined time series with mixed orders of integration, we are able to detect true relationships in the data. These results further highlight the fact that mixed orders of integration do not automatically imply an unbalanced regression.

Over the past six months, *Political Analysis* has made two important transitions: the move to Cambridge University Press and to Cambridge’s new publishing platform, Core. We are excited about these changes. Working with The Press as a publishing partner expands our journal’s reach, while providing benefits to members of the Society for Political Methodology (for example, SPM’s new website, https://www.cambridge.org/core/membership/spm).

Yet it’s the transition to Cambridge Core that is most exciting. This new publishing platform provides innovative synergies between our journal and other relevant journals, as well as books and Elements (Cambridge’s new digital enterprise combining the best of journals and books), all of which will benefit our readers and, more generally, students and scholars in social science methodology. Some changes and features will be immediately recognizable to readers; others will appear later, as new features and content emerge on Cambridge Core, as the Press adds new functionality, and as researchers learn how to take advantage of the platform’s capabilities.

For example, in the past, most journals published a manuscript in one location, while ancillary materials (most importantly, supplementary and replication materials) were archived independently. Now, all content (the primary text, supplementary material, code, data) will be in one place, or easily accessible via links. Readers of *Political Analysis* will be able to toggle back and forth between the online published version of a manuscript and the ancillary materials. This accessibility will continue to evolve as the Cambridge team and we work to build direct connections among the manuscript, data, and code.

As mentioned, emerging synergies will enable manuscripts published in *Political Analysis* to be connected to related papers published on Cambridge Core, in political science, and across the social and data sciences. Readers will be able to build their own content within the Cambridge universe by connecting manuscripts topically, methodologically, or however they find useful. As editors of *Political Analysis*, we’ll be able to make these connections as well, for example through “virtual issues,” which could include curated content from the *American Political Science Review* or *Political Science Research and Methods*, among other journals.

And there are of course the tens of thousands of books on Cambridge Core, providing content for a new type of virtual issue for our readers, where we can combine journal and book content. For example, we will be able to publish virtual issues on topics that will include manuscripts from *Political Analysis*, chapters from book series like Analytical Methods for Social Research, and material from Elements.

I think this is exciting stuff. I hope everyone agrees. The future of academic publishing is now, and we are excited to play a part in its creation.

]]>Will is probably best known as an expert on human rights, terrorism, and civil conflict. These topics are now in vogue in political science, but weren’t when Will started his career. It is in part because of his work that they are now recognized as important topics in the discipline.

But readers of *The Political Methodologist* may also know that Will made several contributions to the field of political methodology, although I’m not sure he considered himself “a methodologist.” I think his achievement is particularly notable because his interest in the field developed relatively late in his life. I think most of us struggle just to stay abreast of new developments and avoid becoming too out of date in our own narrow fields after we leave graduate school. Will grew well beyond his initial methods training, eventually co-authoring several papers that introduced new statistical models of special application to substantive problems in International Relations. He also co-authored a book, *A Mathematics Course for Political and Social Research*, that many graduate programs (including the one at Rice University, my current employer) use as a part of their methods training.

Will was a professor at Florida State University when I was a graduate student there. It was partly due to his mentorship that I decided to focus on political methodology. This was somewhat of a risky decision because FSU was not really known for producing methodologists at the time. Nevertheless, he and Bumba Mukherjee encouraged me to take a paper that I’d written for our MLE class and turn it into what eventually became my first publication in *Political Analysis*. Everything good in my life, at least as it pertains to my work, stems from that decision. I owe them both a lot.

It’s really sad to me that Will felt like a misfit because I know so many people that loved and respected him, including me. Will was one of the first people who suggested to me that I might be on the autism spectrum. I don’t know whether I am on that spectrum or not, but I do know how hard it is to say and do things that upset people without truly understanding why. Will did that a lot. I thought of him as someone who showed that you could succeed professionally and personally—you could make a real positive difference to science and in people’s lives—despite that. That was something I really needed to know at the time.

I remember that Will used to call FSU “the island of misfit toys” because (he felt) many of us landed there because of some issue or thing in our background that had kept us out of what he considered to be a more prestigious venue. I guess I’ve always thought the misfit toys were the only ones really worth knowing.

I’ll miss you, Will.

]]>