On January 11 and 12, 2018, the fifth Asian Political Methodology Meeting was held at Seoul National University, Republic of Korea. The meeting was co-sponsored by the Department of Political Science and International Relations at Seoul National University and the Program for Quantitative and Analytical Political Science at Princeton University.

This year’s program, available at https://asiapolmeth.princeton.edu/online-program, had eight sessions, including two keynote speeches and two poster sessions. In total, 14 papers and 30 posters were presented, and 95 registered participants (25 foreign and 70 local) attended the conference, joined by many unregistered participants. Participants came from Australia, China, Germany, Hong Kong, Ireland, Japan, Republic of Korea, Singapore, the United Kingdom, and the United States.

The invited keynote speakers were Prof. Michael D. Ward of Duke University, who spoke about analyzing relational (network) data using statistical methods, and Prof. Meeyoung Cha of KAIST, Republic of Korea, who presented a method of detecting fake news in online social media using machine learning techniques. After the keynote speech by Prof. Ward, the conference moved to sessions on “Big Data in Political Methodology,” “Experimental Methods,” and “Bayesian Analysis.” The first day of the conference ended with the first poster session, consisting of faculty and post-doc participants.

The second day of the conference started with a session on “Political Methodology for Lobby” and then moved to a session on “Statistical Methods for Representation.” After the second poster session, presented by graduate students, Prof. Cha gave her keynote speech, and the conference closed with a session on “Analyzing Congress using Statistical Methods.”

To make this conference successful, six graduate students at Seoul National University voluntarily contributed their time and resources for two months: Soonhong Cho, Suji Kang, Doeun Kim, Sunyoung Park, Sooahn Shin, and Hyein Yang, in alphabetical order. On behalf of the program committee, we sincerely appreciate their help and contributions.

*The program committee for this conference included Jong Hee Park (committee chair and local host: Seoul National University, Republic of Korea), Fang-Yi Chiou (Academia Sinica, Taiwan), Kentaro Fukumoto (Gakushuin University, Japan), Benjamin Goldsmith (University of Sydney, Australia), Kosuke Imai (Princeton University, USA), and Xun Pang (Tsinghua University, China).*

The 2019 Annual Meeting will be held in Kyoto, Japan. We look forward to seeing you in Kyoto next year!

*Jong Hee Park is Professor, Department of Political Science and International Relations, Seoul National University, Republic of Korea.*


Political Science Research & Methods (PSRM) is the first Cambridge University Press journal to adopt Code Ocean, an extension of the journal’s existing policy requiring authors to deposit the data necessary to reproduce the results in their articles. A PSRM article with the Code Ocean widget embedded on Cambridge Core, the Press’s publishing platform, can be seen here. The widget enables readers to view and run the code without leaving Cambridge Core.

The release also indicates that similar adoptions might follow at other CUP journals.

The “information for contributors” instructions at *PSRM* have not been updated to reflect this change, but the Butler, Karpowitz, and Pope (2017) article linked in the press release indicates how this policy might change the integration of replication code into articles.

The full press release is available at this link.


Since Brambor, Clark and Golder’s (2006) article in *Political Analysis* (hereafter BCG), our understanding of interaction models has improved significantly, and most empirical scholars have now integrated the tools to execute and interpret interaction models properly. In particular, one of BCG’s main recommendations was to include all constitutive terms of the interaction in the model specification. However, BCG acknowledge (in the text surrounding equation 7 of their paper) that there is a mathematically equivalent model specification that allows researchers to exclude certain constitutive terms from an interaction model when one of the modifying variables is discrete. A recent review experience made me realize that this exception is not as widely recognized as BCG’s core advice to include all constitutive terms, suggesting that a brief note to the scholarly community might help publicize it. In the next section, I show the equivalency of BCG’s standard specification and this alternative specification. I then provide a brief example of both approaches applied to a substantive case, Adams et al.’s (2006) study “Are Niche Parties Fundamentally Different from Mainstream Parties?”, and show that we obtain the same results with either approach.

Overall, I show that while the two model specifications are equivalent, each has some advantages in terms of the interpretation of the regression results. On the one hand, the advantage of the standard specification is to present directly in the regression results whether the *difference in the marginal effects *of *X *on *Y *between the categories of the modifying variable *Z *is statistically significant. On the other hand, the main benefit of the alternative approach is to present directly in the regression results the *marginal effects *of *X *under each category of the modifying variable *Z*. Researchers may thus choose between the two equivalent specifications depending on the results they want to present and emphasize.

**1 Equivalency of the Standard and Alternative Specifications**

In order to show the equivalency of BCG’s standard specification and the alternative specification when one of the modifying variables is discrete, I take as an example a dependent variable *Y* that is a function of an interaction between a continuous variable *X* and a dummy variable *D*. BCG’s standard approach to interaction models indicates that we must multiply the variables *X* and *D* and include this interaction term, as well as the constitutive terms *X* and *D*, in the regression model. Specifically, the standard specification is the following:

*Y *= *b*_{0} + *b*_{1}*D *+ *b*_{2}*X *+ *b*_{3}*XD *+ *ϵ *(1)

where X is continuous and D is a dummy variable (0,1). The marginal effect of *X *when *D *= 0 is given by *b*_{2} while the marginal effect of *X *when *D *= 1 is given by *b*_{2} + *b*_{3}.²

The alternative approach, explained briefly in BCG (see equation 7) and Wright (1976), consists in treating the dummy variable *D* as two dummy variables: *D*, the original variable, which equals 0 or 1, and *D*_{0}, the inverse of *D*, which equals 1 when *D* = 0 and 0 when *D* = 1. For example, if *D* is a dummy variable where 1 represents democratic countries and 0 authoritarian countries, *D*_{0} would simply be the inverse dummy variable where 1 represents authoritarian countries and 0 democratic countries. Consequently, *D* + *D*_{0} = 1 and *D*_{0} = 1 *− D*. The alternative approach consists in multiplying *X* respectively with *D* and *D*_{0}, and including all constitutive terms in the regression model except *X* and one of the dummy variables, *D* or *D*_{0}. The reason for including only *D* or *D*_{0} is that the two variables are perfectly collinear. Nor is it possible to include *X*, because of perfect multicollinearity with *XD* and *XD*_{0}.

The alternative specification is thus the following:

*Y *= *a*_{0} + *a*_{1}*D *+ *a*_{2}*XD *+ *a*_{3}*XD*_{0} + *ϵ *(2)

Equation 2 can be rewritten as

*Y *= *a*_{0} + *a*_{1}*D *+ *a*_{2}*XD *+ *a*_{3}*X*(1 *− D*) + *ϵ *(3)

Equation 3 highlights explicitly that we do not necessarily need to create *D*_{0} but only to multiply *X *by (1 *− D*). In equations 2 and 3, the marginal effect of *X *when *D *= 0 (i.e. when *D*_{0} = 1) is given by *a*_{3} while the marginal effect of *X *when *D *= 1 is given by *a*_{2}.³

The main advantage of this alternative specification is that, for each category of the discrete modifying variable *D* (0 and 1 in this case), the marginal effect of *X* and its associated standard error are provided directly in the regression results. This is not the case in the standard approach, where only one of these results is directly provided (i.e. *b*_{2}, the effect of *X* when *D* = 0). Consequently, we need to add up *b*_{2} and *b*_{3} to obtain the effect of *X* when *D* = 1. This is easy to do in Stata with the *lincom* command (`lincom _b[coef] + _b[coef]`).
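For readers who want to verify the equivalence numerically outside Stata, the following pure-Python sketch simulates data, fits both specifications by OLS, and checks the coefficient mappings *b*_{2} = *a*_{3}, *b*_{2} + *b*_{3} = *a*_{2}, and *b*_{3} = *a*_{2} *− a*_{3}. The data-generating process and all variable names are illustrative assumptions, not taken from any study discussed here.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for c in range(k):                       # forward elimination with pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k):
                A[r][j] -= f * A[c][j]
            b[r] -= f * b[c]
    beta = [0.0] * k
    for c in reversed(range(k)):             # back substitution
        beta[c] = (b[c] - sum(A[c][j] * beta[j] for j in range(c + 1, k))) / A[c][c]
    return beta

random.seed(1)
n = 500
D = [random.randint(0, 1) for _ in range(n)]
X = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.5 * d + 2.0 * x - 3.0 * x * d + random.gauss(0, 1)
     for d, x in zip(D, X)]

# Standard specification (equation 1): Y = b0 + b1*D + b2*X + b3*X*D
b0, b1, b2, b3 = ols([[1.0, d, x, x * d] for d, x in zip(D, X)], y)
# Alternative specification (equation 3): Y = a0 + a1*D + a2*X*D + a3*X*(1 - D)
a0, a1, a2, a3 = ols([[1.0, d, x * d, x * (1 - d)] for d, x in zip(D, X)], y)

assert abs(b2 - a3) < 1e-6          # marginal effect of X when D = 0
assert abs((b2 + b3) - a2) < 1e-6   # marginal effect of X when D = 1
assert abs(b3 - (a2 - a3)) < 1e-6   # difference in marginal effects
```

The two design matrices span the same column space, so the fitted values are identical and only the coefficient labels change; the same check can be run on any dataset.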

A disadvantage of the alternative specification is that the regression results do not indicate whether the difference between the marginal effects of *X* when *D* = 0 and when *D* = 1 is statistically significant. This is the advantage of the standard approach, which provides this information directly in the regression results. If the coefficient *b*_{3} is statistically significant in equation 1, the marginal effect of *X* when *D* = 1 is statistically different from the marginal effect of *X* when *D* = 0. To answer this question with the alternative approach, we must test the equality of *a*_{2} and *a*_{3}. This is also straightforward in Stata with the *test* command (`test _b[coef] = _b[coef]`) or with *lincom* (`lincom _b[coef] - _b[coef]`).

The alternative specification can easily be generalized to discrete variables with multiple categories, whether the discrete variable is nominal or ordinal. The procedure is the same. We first create a dummy variable for each category of the discrete modifying variable. We then multiply *X* with each of these dummy variables and include all constitutive terms in the equation (except *X* and one of the dummy variables). This specification also allows researchers to evaluate directly the magnitude of the substantive effect of *X* across the different values of the discrete modifying variable without *including all constitutive terms of the interaction explicitly*.

**2 Replication of “Are Niche Parties Fundamentally Different From Mainstream Parties?”**

In this section, I compare the results of the standard and alternative approaches to interaction models by replicating Adams et al.’s (2006) study “Are Niche Parties Fundamentally Different from Mainstream Parties?”, published in the *American Journal of Political Science*. The article examines two main research questions. First, the authors examined whether mainstream parties are more responsive than niche parties to shifts in public opinion when adjusting their policy programs. Second, building on this prediction, they examined whether niche parties are penalized more heavily at the polls than mainstream parties when they moderate their policy positions. Here, I only replicate their model associated with the first question.

Adams et al. (2006) tested these hypotheses in seven advanced democracies (Italy, Britain, Greece, Luxembourg, Denmark, the Netherlands, and Spain) over the 1976-1998 period. They measure parties’ policy positions on the left-right ideological scale with data from the Comparative Manifesto Project (CMP). Surveys from the Eurobarometer are used to locate respondents on the corresponding left-right ideological dimension. Public opinion is measured as the average of all respondents’ self-placements. Finally, the authors coded Communist, Green, and Nationalist parties as niche parties with a dummy variable.

In Table 1, I examine party responsiveness to public opinion and present the results of the standard and alternative approaches. Adams et al. (2006) use the standard approach and interact the variable *public opinion shift* with the dummy variable *niche party*. The dependent variable is the change in a party’s left-right position. Adams et al. (2006) thus specified a dynamic model in which they assess whether a change in public opinion influences a change in party positions between two elections. The models include fixed effects for countries and a number of control variables (see the original study for the justifications). The specification of the standard approach in column (1) is the following:

∆*party position* = *b*_{0} + *b*_{1}∆*public opinion* + *b*_{2}*niche party* + *b*_{3}(∆*public opinion* × *niche party*) + *controls*

In column (1) of Table 1, I display the same results as those published in Table 1 of the Adams et al. (2006) article. The results in column (1) support the authors’ argument that niche parties are less responsive than mainstream parties to changes in public opinion. The coefficient of *public opinion shift* (0.97) is positive and statistically significant, indicating that when public opinion moves to the left (right), mainstream parties (niche party = 0) adjust their policy positions accordingly to the left (right). The coefficient of *public opinion shift × niche party* indicates that niche parties are less responsive than mainstream parties to shifts in public opinion by −1.52 points on the left-right scale, and the difference is statistically significant (p < 0.01).

In column (2) of Table 1, I display the results of the alternative approach. The specification is now the following:

∆*party position* = *b*_{0} + *b*_{1}*niche party* + *b*_{2}(∆*public opinion* × *niche party*) + *b*_{3}(∆*public opinion* × *mainstream party*) + *controls*

where *mainstream party* equals (1 *− niche party*).

It is important to highlight that the results in columns (1) and (2) are mathematically equivalent. For example, the coefficients of the control variables are exactly the same in both columns. There are some differences, however, in the interpretation of the interaction effect. In column (2), the coefficient of *public opinion shift – mainstream party* (0.97) equals the coefficient of *public opinion shift* in column (1). This is because *public opinion shift – mainstream party* in column (2) indicates the impact of a change in public opinion on the positions of mainstream parties, just as *public opinion shift* does in column (1). On the other hand, the coefficient of *public opinion shift – niche party* in column (2) equals −0.55 and is statistically significant at the 0.05 level. This indicates that when public opinion moves to the left (right), niche parties adjust their policy positions in the opposite direction, to the right (left). This result is not explicitly displayed in column (1) under the standard approach. The coefficient of *public opinion shift – niche party* in column (2) is in fact the sum of the coefficients of *public opinion shift* and *public opinion shift × niche party* in column (1): 0.97 − 1.52 = −0.55. In column (2), a Wald test indicates that the difference between the effects of *public opinion shift – niche party* and *public opinion shift – mainstream party* is statistically significant at the 0.01 level, exactly as indicated by the coefficient of *public opinion shift × niche party* in column (1).

Overall, researchers may choose between two equivalent specifications when one of the modifying variables in an interaction model is discrete: BCG’s specification, which includes all constitutive terms of the interaction, and an alternative specification that does not include all constitutive terms *explicitly*. Each specification has its advantages for interpreting the interaction effect. The advantage of the alternative approach is to present directly the *marginal effects* of an independent variable X on Y for each category of the discrete modifying variable Z. The advantage of BCG’s approach is to present directly whether the *difference in the marginal effects* of X on Y between the categories of Z is statistically significant. In either specification, researchers then need to perform one additional step: testing whether the difference in the marginal effects is statistically significant (in the alternative specification) or calculating the marginal effects under each category of the discrete modifying variable (in the standard specification).

**Notes**

¹Assistant Professor, School of Political Studies, University of Ottawa, 120 University, Ottawa, ON, K1N 6N5, Canada (bferland@uottawa.ca). I thank James Adams, Michael Clark, Lawrence Ezrow, and Garrett Glasgow for sharing their data. I also thank Justin Esarey for his helpful comments on the paper.

² The marginal effect of X in equation 1 is given by *b*_{2} + *b*_{3}*D*.

³ The marginal effect of X in equations 2 and 3 is calculated by *a*_{2}*D *+ *a*_{3}*D*_{0}.

Note also that equation 1 on the left-hand side equals either equation 2 or 3 on the right-hand side:

*b*_{0} + *b*_{1}*D *+ *b*_{2}*X *+ *b*_{3}*XD *+ *ϵ*=*a*_{0} + *a*_{1}*D *+ *a*_{2}*XD *+ *a*_{3}*X*(1 *− D*) + *ϵ *

*b*_{0} + *b*_{1}*D *+ *b*_{2}*X *+ *b*_{3}*XD *+ *ϵ*=*a*_{0} + *a*_{1}*D *+ *a*_{2}*XD *+ *a*_{3}*X − a*_{3}*XD *+ *ϵ*

It is possible then to isolate *XD *on the right-hand side:

*b*_{0} + *b*_{1}*D *+ *b*_{2}*X *+ *b*_{3}*XD *+ *ϵ*=*a*_{0} + *a*_{1}*D *+ *a*_{3}*X *+ (*a*_{2} *− a*_{3})*XD *+ *ϵ*

Assuming that the models on the left-hand side and right-hand side are estimated with the same data, *b*_{0} would equal *a*_{0}, *b*_{1} would equal *a*_{1}, *b*_{2} would equal *a*_{3} (i.e. the estimated parameter of *X*(1 *− D*)), and *b*_{3} would equal (*a*_{2} *− a*_{3}).

**References**

Adams, James, Michael Clark, Lawrence Ezrow and Garrett Glasgow. 2006. “Are Niche Parties Fundamentally Different from Mainstream Parties? The Causes and the Electoral Consequences of Western European Parties’ Policy Shifts, 1976-1998.” *American Journal of Political Science *50(3):513–529.

Brambor, Thomas, William Roberts Clark and Matt Golder. 2006. “Understanding Interaction Models: Improving Empirical Analyses.” *Political Analysis *14:63–82.

Wright, Gerald C. 1976. “Linear Models for Evaluating Conditional Relationships.” *American Journal of Political Science *20(2):349–373.

To facilitate further discussion of these proposals—and perhaps to begin to develop an actionable plan for reform—the International Methods Colloquium (IMC) hosted a panel discussion on “reproducibility and a stricter threshold for statistical significance” on October 27, 2017. The one-hour discussion included six panelists and over 240 attendees, with each panelist giving a brief initial statement concerning the proposal to “redefine statistical significance” and the remainder of the time being devoted to questions and answers from the audience. The event was recorded and can be viewed online for free at the International Methods Colloquium website.

Unfortunately, the IMC’s time limit of one hour prevented many audience members from asking their questions and having a chance to hear our panelists respond. Panelists and audience members alike agreed that the time limit was not adequate to fully explore all the issues raised by Benjamin et al. (2017). Consequently, questions that were not answered during the presentation were forwarded to all panelists, who were given a chance to respond.

The questions and answers, both minimally edited for clarity, are presented in this article. The full series of questions and answers (and this introduction) is embedded in the PDF below.


You can find a direct link to a downloadable version of the print edition here [update: a version with a minor correction has been added as of 5:23 PM on 9/26/2017]:

We are very excited to announce a new Minnesota Political Methodology Colloquium (MPMC) initiative: the Minnesota Political Methodology Graduate Student Conference. The conference is scheduled for May 4 & May 5, 2018.

The Minnesota Political Methodology Graduate Student Conference is designed to provide doctoral students with feedback on their research from peers and faculty. Research papers may focus on any substantive topic, employ any research methodology, and/or be purely methodological. We are particularly interested in novel applied work on interesting and important questions in political science, sociology, psychology, and related fields.

The conference represents a unique opportunity for graduate students in different programs, across different disciplines, and with different substantive interests to network and receive feedback on their work. Papers will receive feedback from a faculty discussant, written feedback from other panelists, and comments/suggestions from audience members.

The conference will occur over two days (May 4 and May 5, 2018) and feature at least 24 presentations in 6 panels. Proposals are due December 1, 2017.

Our keynote speaker for the event is Sara Mitchell, F. Wendell Miller Professor of Political Science at the University of Iowa.

Details about the conference may be found here.

Questions should be addressed to mpmc@umn.edu

MacKinnon and Webb offer a useful analysis of how the uncertainty of causal effects can be underestimated when observations are clustered and the treatment is applied to a very large or very small share of the clusters. Their mathematical exposition, simulation exercises, and replication analysis provide a helpful guide for how to proceed when data are poorly behaved in this way. These are valuable lessons for researchers studying the impacts of policy in observational data, where policies tend to be sluggish and thus do not generate much variability in the key explanatory variables.

**Correction of Two Errors**

MacKinnon and Webb find two errors in our analysis, while nonetheless concluding that “we do not regard these findings as challenging the conclusions of Burden et al. (2017).” Although we are embarrassed by the mistakes, we are also grateful for their discovery.^{1} Our commitment to transparency is reflected in the fact that the data have been public for replication purposes since well before the article was published. We have posted corrected versions of the replication files and published a corrigendum with the journal where the article was originally published.

Fortunately, none of the other analyses in our article were affected. Only Table 7 is affected by the errors; Tables 2 through 6 remain intact.

We concede that, once the corrections are made, the effect of early voting loses statistical significance in the model of the difference in the Democratic vote between 2008 and 2012. All of the various standard errors they report are far too large to reject the null hypothesis.

**The Problem of Limited Variation**

The episode highlights the tradeoffs that researchers face between applying what appears to be a theoretically superior estimation technique (i.e., difference-in-difference) and the practical constraints of a particular application (i.e., limited variation in treatment variables) that make its use impractical. In the case of our analysis, election laws do not change rapidly, and the conclusions of our analysis were largely based on cross-sectional analyses (Tables 2-6), with the difference-in-difference results largely offered as a supplemental analysis.

We are in agreement with MacKinnon and Webb that models designed to estimate causal effects (or even simple relationships) may be quite tenuous when the number of clusters is small and the clusters are treated in a highly unbalanced fashion. In fact, we explained our reluctance to apply the difference-in-difference model to our data because of the limited leverage available. We were explicit about our reservations in this regard. As our article stated:

“A limitation of the difference-in-difference approach in our application is that few states actually changed their election laws between elections. As Table A1 (see Supplemental Material) shows, for some combinations of laws there are no changes at all. For others, the number of states changing is as low as one or two. As result, we cannot include some of the variables in the model because they do not change. For some other variables, the interpretation of the coefficients would be ambiguous given the small number of states involved; the dummy variables essentially become fixed effects for one or two states” (p. 572).

This is unfortunate in our application because the difference-in-difference models are likely to be viewed as more convincing than the cross-sectional models. This is why we offered theory suggesting that the more robust cross-sectional results were not likely to suffer from endogeneity.

The null result in the difference-in-difference models is not especially surprising given our warning above about the limited leverage provided by the dataset. Indeed, the same variable was insignificant in our model of the Democratic vote between 2004 and 2008 that we also reported in Table 7. We are left to conclude that the data are not amenable to detecting effects using difference-in-difference models. Perhaps researchers will collect data from more elections to provide more variation in the key variable and estimate parameters more efficiently.

In addition to simply replicating our analysis, MacKinnon and Webb also conduct an extension to explore asymmetric effects. They separate the treated states into those where early voting was adopted and those where it was repealed. We agree that researchers ought to investigate such asymmetries. We recommended as much in our article: “As early voting is being rolled back in some states, future research should explore the potential asymmetry between the expansion and contraction of election practices” (p. 573). However, we think this is not feasible with existing data. As MacKinnon and Webb note, only two states adopted early voting and only one state repealed it. As a result, analyzing these cases separately, as they do, essentially renders the treatment variables little more than fixed effects for one or two states, as we warned in our article. The coefficients might be statistically significant using various standard error calculations, but it is not clear that MacKinnon and Webb are actually estimating treatment effects rather than something idiosyncratic about one or two states.

**Conclusion**

While the errors in our difference-in-difference analysis were regrettable, we think the greater lesson from MacKinnon and Webb’s skilled analysis is that this tool may simply be unsuitable in such a policy setting. All else being equal, it may offer a superior mode of analysis; but all else is not equal. Researchers need to find the mode of analysis that best fits the limitations of their data.

**Footnotes**

Matthew D. Webb, Department of Economics, Carleton University

**Extended Abstract**

There is a large and rapidly growing literature on inference with clustered data, that is, data where the disturbances (error terms) are correlated within clusters. This type of correlation is commonly observed whenever multiple observations are associated with the same political jurisdictions. Observations might also be clustered by time periods, industries, or institutions such as hospitals or schools.

When estimating regression models with clustered data, it is very common to use a “cluster-robust variance estimator” or CRVE. However, inference for estimates of treatment effects with clustered data requires great care when treatment is assigned at the group level. This is true for both pure treatment models and difference-in-differences regressions, where the data have both a time dimension and a cross-section dimension and it is common to cluster at the cross-section level.
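To fix ideas, here is a minimal pure-Python sketch of the CRVE "sandwich" formula for a one-regressor model. It is an illustration under stated assumptions, not the authors' replication code: the function name, the simulated data-generating process, and the common G/(G−1) × (N−1)/(N−k) small-sample correction are all choices made for this example.

```python
import random

def crve_se(x, y, cluster):
    """OLS slope and its cluster-robust SE for y = b0 + b1*x (2x2 sandwich)."""
    n = len(y)
    X = [(1.0, xi) for xi in x]
    # bread: (X'X)^{-1}, inverted directly for the 2x2 case
    sxx = [[sum(a[i] * a[j] for a in X) for j in range(2)] for i in range(2)]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    inv = [[sxx[1][1] / det, -sxx[0][1] / det],
           [-sxx[1][0] / det, sxx[0][0] / det]]
    # OLS coefficients and residuals
    sxy = [sum(a[i] * yi for a, yi in zip(X, y)) for i in range(2)]
    beta = [inv[i][0] * sxy[0] + inv[i][1] * sxy[1] for i in range(2)]
    resid = [yi - beta[0] - beta[1] * xi for xi, yi in zip(x, y)]
    # meat: sum over clusters g of (X_g' e_g)(X_g' e_g)'
    groups = sorted(set(cluster))
    meat = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        s = [sum(X[i][j] * resid[i] for i in range(n) if cluster[i] == g)
             for j in range(2)]
        for i in range(2):
            for j in range(2):
                meat[i][j] += s[i] * s[j]
    G = len(groups)
    c = (G / (G - 1)) * ((n - 1) / (n - 2))   # small-sample correction, k = 2
    # sandwich: c * inv * meat * inv
    im = [[sum(inv[i][k] * meat[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    V = [[c * sum(im[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    return beta[1], V[1][1] ** 0.5            # slope and its clustered SE

random.seed(7)
cluster = [g for g in range(20) for _ in range(25)]        # 20 clusters of 25
shock = {g: random.gauss(0, 1) for g in range(20)}         # cluster-level error
x = [random.gauss(0, 1) for _ in cluster]
y = [2.0 * xi + shock[g] + random.gauss(0, 1) for xi, g in zip(x, cluster)]

b1, se = crve_se(x, y, cluster)
print(round(b1, 3), round(se, 3))
```

With a true slope of 2.0 in the simulation, the estimate lands near 2 and the SE accounts for the within-cluster error correlation that a conventional homoskedastic SE would ignore.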

Even when the number of clusters is quite large, cluster-robust standard errors can be much too small if the number of treated (or control) clusters is small. Standard errors also tend to be too small when cluster sizes vary a lot, resulting in too many false positives. Bootstrap methods based on the wild bootstrap generally perform better than t-tests, but they can also yield very misleading inferences in some cases. In particular, what would otherwise be the best variant of the wild bootstrap can underreject extremely severely when the number of treated clusters is very small. Other bootstrap methods can overreject extremely severely in that case.
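The wild cluster bootstrap idea can be sketched in a few lines. The stripped-down illustration below imposes the null hypothesis, draws Rademacher weights at the cluster level, and compares raw bootstrap slopes to the original estimate; the full procedure the authors study compares studentized statistics instead, and all names and the data-generating process here are assumptions for illustration only.

```python
import random

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

random.seed(42)
G, n_g = 10, 30
cluster = [g for g in range(G) for _ in range(n_g)]
x = [float(g % 2) for g in cluster]                # treatment assigned at the cluster level
u = {g: random.gauss(0, 1) for g in range(G)}      # cluster-level shock
y = [u[g] + random.gauss(0, 1) for g in cluster]   # true treatment effect is zero

b_hat = slope(x, y)

# Impose the null (slope = 0): the restricted fit is just the grand mean
y_bar = sum(y) / len(y)
resid = [yi - y_bar for yi in y]

B, extreme = 999, 0
for _ in range(B):
    w = {g: random.choice([-1.0, 1.0]) for g in range(G)}  # one Rademacher draw per cluster
    y_star = [y_bar + w[g] * e for g, e in zip(cluster, resid)]
    if abs(slope(x, y_star)) >= abs(b_hat):
        extreme += 1

p_value = (extreme + 1) / (B + 1)   # symmetric bootstrap p-value
print(round(p_value, 3))
```

Because the weights are drawn per cluster rather than per observation, the bootstrap samples preserve the within-cluster dependence structure; with only two possible weights per treated cluster, one can also see why the method degenerates when very few clusters are treated.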

In Section 2, we briefly review the key ideas of cluster-robust covariance matrices and standard errors. In Section 3, we then explain why inference based on these standard errors can fail when there are few treated clusters. In Section 4, we discuss bootstrap methods for cluster-robust inference. In Section 5, we report (graphically) the results of several simulation experiments which illustrate just how severely both conventional and bootstrap methods can overreject or underreject when there are few treated clusters. In Section 6, the implications of these results are illustrated using an empirical example from Burden, Canon, Mayer, and Moynihan (2017). The final section concludes and provides some recommendations for empirical work.

**Full Article**

**Replication File**

Replication files for the Monte Carlo simulations and the empirical example can be found at doi:10.7910/DVN/GBEKTO.

- We are grateful to Justin Esarey for several very helpful suggestions and to Joshua Roxborough for valuable research assistance. This research was supported, in part, by a grant from the Social Sciences and Humanities Research Council of Canada. Some of the computations were performed at the Centre for Advanced Computing at Queen’s University.

Arthur Spirling (New York University) | October 20th |

Roundtable on Reproducibility and a Stricter Threshold for Statistical Significance: Dan Benjamin (University of Southern California), Daniel Lakens (Eindhoven University of Technology), and E.J. Wagenmakers (University of Amsterdam) | October 27th |

Will Hobbs (Northeastern University) | November 3rd |

Adeline Lo (Princeton University) | November 10th |

Olga Chyzh (Iowa State University) | November 17th |

Pamela Ban (Harvard University) | December 1st |

Teppei Yamamoto (Massachusetts Institute of Technology) | February 2nd |

Clay Webb (University of Kansas) | February 16th |

Mark Pickup (Simon Fraser University) | February 23rd |

Erik Peterson (Dartmouth College) | March 2nd |

Casey Crisman-Cox (Washington University in St. Louis) | March 9th |

Erin Hartman (University of California, Los Angeles) | March 23rd |

Note that all presentations will begin at noon Eastern time and last for one hour. Additional information for each presentation (including a title and link to relevant paper) will be released closer to its date. You can preregister to attend a presentation by clicking on the link in the Google Calendar entry corresponding to the talk; the IMC’s Google Calendar is available at https://www.methods-colloquium.com/schedule. (Anyone can show up the day of the presentation without pre-registering if they wish as long as room remains; there are 500 seats available in each webinar.)

The International Methods Colloquium (IMC) is a weekly seminar series of methodology-related talks and roundtable discussions focusing on political methodology; the series is supported by Rice University and a grant from the National Science Foundation. The IMC is free to attend from anywhere around the world using a PC or Mac, a broadband internet connection, and our free software. You can find out more about the IMC at our website, http://www.methods-colloquium.com/, where you can register for any of these talks and/or join a talk in progress using the “Watch Now!” link. You can also watch archived talks from previous IMC seasons at this site.

A large and interdisciplinary group of researchers recently proposed redefining the conventional threshold of statistical significance from *p* < 0.05 to *p* < 0.005, both two-tailed (Benjamin et al. 2017). The purpose of the reform is to “immediately improve the reproducibility of scientific research in many fields” (p. 5); this comes in the context of recent large-scale replication efforts that have uncovered startlingly low rates of replicability among published results (e.g., Klein et al. 2014; Open Science Collaboration 2015). Recent work suggests that results that meet a more stringent standard for statistical significance will indeed be more reproducible (V. E. Johnson 2013; Esarey and Wu 2016; V. E. Johnson et al. 2017). Reproducibility should be improved by this reform because the *false discovery rate*, or the proportion of statistically significant findings that are null relationships (Benjamini and Hochberg 1995), is reduced by its implementation. Benjamin et al. (2017) explicitly disavow a requirement that results meet the stricter standard in order to be publishable, but in the past statistical significance has been used as a necessary condition for publication (T. D. Sterling 1959; T. Sterling, Rosenbaum, and Winkam 1995) and we must therefore anticipate that a redefinition of the threshold for significance may lead to a redefinition of standards for publishability.

As a method of screening empirical results for publishability, the conventional *p* < 0.05 null hypothesis significance test (NHST) procedure lies on a kind of Pareto frontier: it is difficult to improve its qualities on one dimension (i.e., increasing the replicability of published research) without degrading its qualities on some other important dimension. This makes it hard for any proposal to displace the conventional NHST: such a proposal will almost certainly be worse in some ways, even if it is better in others. For moving the threshold for statistical significance from 0.05 to 0.005, the most obvious tradeoff is that the *power* of the test to detect non-null relationships is harmed at the same time that the *size* of the test (i.e., its propensity to mistakenly reject a true null hypothesis) is reduced. The usual way of increasing the power of a study is to increase the sample size, *N*; given the fixed nature of historical time-series cross-sectional data sets and the limited budgets of those who use experimental methods, this means that many researchers may feel forced to accept less powerful studies and therefore have fewer opportunities to publish. Additionally, reducing the threshold for statistical significance can exacerbate *publication bias*, i.e., the extent to which the published literature exaggerates the magnitude of a relationship by publishing only the largest findings from the sampling distribution of a relationship (T. Sterling, Rosenbaum, and Winkam 1995; Scargle 2000; Schooler 2011; Esarey and Wu 2016).

Simply accepting lower power in exchange for a lower false discovery rate would increase the burden on the most vulnerable members of our community: assistant professors and graduate students, who must publish enough work in a short time frame to stay in the field. However, there is a way of maintaining adequate power using *p* < 0.005 significance tests without dramatically increasing sample sizes: design studies that conduct conjoint tests of multiple predictions from a single theory (instead of individual tests of single predictions). When *K*-many statistically independent tests are performed on pre-specified hypotheses that must be jointly confirmed in order to support a theory, the chance of simultaneously rejecting them all by chance is *α*^{K}, where *p* < *α* is the critical condition for statistical significance in an individual test. As *K* increases, the *α* value for each individual study can fall and the overall power of the study often (though not always) increases. It is important that hypotheses be specified *before* data analysis and that failed predictions be reported; simply conducting many tests and reporting the statistically significant results creates a multiple comparison problem that inflates the false positive rate (Sidak 1967; Abdi 2007). Because it is usually more feasible to collect a greater depth of information about a fixed-size sample rather than to greatly expand the sample size, this version of the reform imposes fewer hardships on scientists at the beginning of their careers.

I do not think that an NHST with a lowered *α* is the best choice for adjudicating whether a result is statistically meaningful; approaches rooted in statistical decision theory and with explicit assessment of out-of-sample prediction have advantages that I consider decisive. However, if reducing *α* prompts us to design research stressing the simultaneous testing of multiple theoretical hypotheses, I believe this change would improve upon the status quo—even if passing this more selective significance test becomes a requirement for publication.

Based on the extant literature, I believe that requiring published results to pass a two-tailed NHST with *α* = 0.05 is at least partly responsible for the “replication crisis” now underway in the social and medical sciences (Wasserstein and Lazar 2016). I also anticipate that studies that can meet a lowered threshold for statistical significance will be more reproducible. In fact, I argued these two points in a recent paper (Esarey and Wu 2016). Figure 1 reproduces a relevant figure from that article; the *y*-axis in that figure is the *false discovery rate* (Benjamini and Hochberg 1995), or FDR, of a conventional NHST procedure with an *α* given by the value on the *x*-axis. Written very generically, the false discovery rate is:

FDR = *A* / (*A* + *B*), where *A* = Pr(stat. sig. | null hypothesis is true) ⋅ Pr(null hypothesis is true) and *B* = Pr(stat. sig. | null hypothesis is false) ⋅ Pr(null hypothesis is false),

or, the proportion of statistically significant results (“discoveries”) that correspond to true null hypotheses. *A* is a function of the *size* of the test, Pr(stat. sig.|null hypothesis is true). *B* is a function of the *power* of the test, Pr(stat. sig.|null hypothesis is false). Both *A* and *B* are also functions of the underlying proportion of null hypotheses being proposed by researchers, Pr(null hypothesis is true).

As Figure 1 shows, when Pr(null hypothesis is true) is large, there is a very disappointing FDR among results that pass a two-tailed NHST with *α* = 0.05. For example, when 90% of the hypotheses tested by researchers are false leads (i.e., Pr(null hypothesis is true) = 0.9), we may expect ≈30% of discoveries to be false even when all studies have perfect power (i.e., Pr(stat. sig.|null hypothesis is false) = 1). Two different studies applying disparate methods to the Open Science Collaboration’s (2015) replication project data have determined that approximately 90% of researcher hypotheses are false (V. E. Johnson et al. 2017; Esarey and Liu 2017); consequently, an FDR in the published literature of at least 30% is attributable to this mechanism. Even higher FDRs will result if studies have less than perfect power.

By contrast, Figure 1 *also* shows that setting *α* ≈ 0.005 would greatly reduce the false discovery rate, even when the proportion of null hypotheses posed by researchers is extremely high. For example, if 90% of hypotheses proposed by researchers are false, the lower bound FDR in the published literature among studies meeting the higher standard would be about 5%. Although non-null results are not always perfectly replicable in underpowered studies (and null results have a small chance of being successfully replicated), a reduction in the FDR from ≈30% to ≈5% would almost certainly and drastically improve the replicability of published results.
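The arithmetic behind these bounds is easy to verify. The sketch below is my own illustration, not the article's replication code; the helper name `false_discovery_rate` is hypothetical:

```python
# False discovery rate of an NHST used as a publication filter:
#   FDR = A / (A + B), where
#   A = Pr(stat. sig. | null true)  * Pr(null true)    -- false discoveries
#   B = Pr(stat. sig. | null false) * Pr(null false)   -- true discoveries
# Illustrative re-derivation of Figure 1's lower bound.

def false_discovery_rate(alpha, power, pr_null):
    a = alpha * pr_null        # significant results that are true nulls
    b = power * (1 - pr_null)  # significant results that are real effects
    return a / (a + b)

# 90% of proposed hypotheses are null and all studies have perfect power:
print(false_discovery_rate(alpha=0.05, power=1.0, pr_null=0.9))   # ~0.31
print(false_discovery_rate(alpha=0.005, power=1.0, pr_null=0.9))  # ~0.043
```

With perfect power, lowering *α* from 0.05 to 0.005 moves the lower-bound FDR from roughly 31% to roughly 4%, consistent with the ≈30% and ≈5% figures above; lower power raises both numbers.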

Despite this and other known weaknesses of the current NHST, it lies on or near a Pareto frontier of possible ways to classify results as “statistically meaningful” (one aspect of being suitable for publication). That is, it is challenging to propose a revision to the NHST that will bring improvements without also bringing new or worsened problems, some of which may disproportionately impact certain segments of the discipline. Consider some dimensions on which we might rate a statistical test procedure, the first few of which I have already discussed above:

- the *false discovery rate* created by the procedure, itself a function of:
  - the *power* of the procedure to detect non-zero relationships, and
  - the *size* of the procedure, its probability of rejecting true null hypotheses;
- the degree of *publication bias* created by the procedure, i.e., the extent to which the published literature exaggerates the size of an effect by publishing only the largest findings from the sampling distribution of a relationship (T. Sterling, Rosenbaum, and Winkam 1995; Scargle 2000; Schooler 2011; Esarey and Wu 2016);
- the *number of researcher assumptions* needed to execute the procedure, a criterion related to the *consistency* of the standards implied by use of the test; and
- the *complexity*, or ease of use and interpretability, of the procedure.

It is hard to improve on the NHST in one of these dimensions without hurting performance in another area. In particular, lowering *α* from 0.05 to 0.005 would certainly lower the power of most researchers’ studies and might increase the publication bias in the literature.

Changing the *α* of the NHST from 0.05 to 0.005 is a textbook example of moving along a Pareto frontier because the size and the power of the test are in direct tension with one another. Because powerful research designs are more likely to produce publishable results when the null is false, maintaining adequate study power is especially critical to junior researchers who need publications in order to stay in the field and advance their careers. The size/power tradeoff is depicted in Figure 2.

Figure 2 depicts two size and power analyses for a coefficient of interest *β*. I assume that *σ*_{β}, the standard deviation of the sampling distribution of *β*, is equal to 1; I also assume 100 degrees of freedom (typically corresponding to a sample size slightly larger than 100). In the left panel (Figure 2a), the conventional NHST with *α* = 0.05 is depicted. In the right panel (Figure 2b), the Benjamin et al. (2017) proposal to decrease *α* to 0.005 is shown. The power analyses assume that the true *β* = 3, while the size analyses assume that *β* = 0.

The most darkly shaded area under the right hand tail of the *t*-distribution under the null is the probability of incorrectly rejecting a true null hypothesis, *α*. As the figure shows, it is impossible to shrink *α* without simultaneously shrinking the lighter shaded area, where the lighter shading depicts the power of a hypothesis test to correctly reject a false null hypothesis. This tradeoff can be more or less severe (depending on *β* and *σ*_{β}), but always exists. The false discovery rate will still (almost always) improve when *α* falls from 0.05 to 0.005, despite the loss in power, but many more true relationships will not be discovered under the latter standard.
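The tradeoff in Figure 2 can be reproduced numerically. The following sketch uses SciPy's noncentral *t* distribution under the same assumptions as the figure (sampling standard deviation of 1, 100 degrees of freedom, true *β* = 3); it is my own illustration rather than the article's replication code:

```python
# Size/power tradeoff of a two-tailed t-test, as in Figure 2:
# shrinking alpha (the size) also shrinks the power to detect beta = 3.
from scipy.stats import nct, t

def power_two_tailed(alpha, beta, df=100):
    t_star = t.ppf(1 - alpha / 2, df)  # positive critical value
    # Probability that |t| exceeds t_star under the noncentral t
    # with noncentrality parameter beta (sd of sampling dist. = 1):
    return (1 - nct.cdf(t_star, df, beta)) + nct.cdf(-t_star, df, beta)

print(power_two_tailed(alpha=0.05, beta=3))   # conventional threshold
print(power_two_tailed(alpha=0.005, beta=3))  # proposed threshold
```

Under these assumptions, power falls from roughly 0.85 at *α* = 0.05 to roughly 0.55 at *α* = 0.005: the same relationship is detected far less often under the stricter standard.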

The size/power tradeoff is not the only compromise associated with lowering *α*: in some situations, decreased *α* can increase the publication bias associated with using statistical significance tests as a filter for publication. Of course, Benjamin et al. (2017) explicitly disavow using an NHST with lowered *α* as a necessary condition for publication; they say that “results that would currently be called ‘significant’ but do not meet the new threshold should instead be called ‘suggestive’” (p. 5). However, given the fifty-year history of requiring results to be statistically significant in order to be publishable (T. D. Sterling 1959; T. Sterling, Rosenbaum, and Winkam 1995), we must anticipate that this pattern could continue into the future.

Consider Figure 3, which shows two perspectives on how decreased *α* might impact publication bias. The left panel (Figure 3a) shows the percentage difference in magnitude between a true coefficient *β* = 3 and the average statistically significant coefficient using an NHST with various values of *α*.^{2} The figure shows that decreased values of *α* increase the degree of publication bias; this occurs because stricter significance tests tend to reject only the largest estimates from a sampling distribution, as also shown in Figure 2. On the other hand, the right panel (Figure 3b) shows the percentage difference in magnitude from a true coefficient drawn from a spike-and-normal prior density of coefficients:

*f*(*β*) = 0.5 ⋅ *I*(*β* = 0) + 0.5 ⋅ *ϕ*(*β*)

where there is a substantial (50%) chance of a null relationship being studied by a researcher, *I*(*β* = 0) is the indicator function for a null relationship, and *ϕ* is the density of the normal component.^{3} In this case, increasingly strict significance tests tend to *decrease* publication bias, because the effect of screening out null relationships (whose magnitude is greatly overestimated by any false positive result) is stronger than the effect of reporting only a portion of the sampling distribution of non-null relationships (which drives the result in the left panel).

If the proposal of Benjamin et al. (2017) is simply to accept lowered power in exchange for lower false discovery rates, it is a difficult proposal to accept. First and foremost, assistant professors and graduate students must publish a lot of high-quality research in a short time frame and may be forced to leave the discipline if they cannot; higher standards that are an inconvenience to tenured faculty may be harmful to them unless standards for hiring and tenure adapt accordingly. In addition, even within a particular level of seniority, this reform may unlevel the playing field. Political scientists in areas that often rely on observational data with essentially fixed *N*, such as International Relations, would be disproportionately affected by such a change: more historical data cannot be created in order to raise the power of a study. Among experimenters and survey researchers, those with smaller budgets would also be disproportionately affected: they cannot afford to simply buy larger samples to achieve the necessary power. The effect of this reform on the scientific ecosystem would be difficult to predict and not necessarily beneficial; at first glance, such a reform seems to benefit the most senior scholars and people at the wealthiest institutions.

However, I believe that acquiescence to lower power is *not* the only option. It may be difficult or impossible for researchers to collect larger *N*, but it is considerably easier for them to measure a larger number of variables of interest *K* from their extant samples. This creates the possibility that researchers can spend more time developing and enriching their theories so that these theories make multiple predictions. Testing these predictions jointly typically allows for much greater power than testing any single prediction alone, as long as every prediction is clearly laid out *prior to analysis* and *all failed predictions are reported* in order to avoid a multiple comparison problem (Sidak 1967; Abdi 2007); predictions must be specified in advance and failed predictions must be reported because simply testing numerous hypotheses and reporting only those that were confirmed tends to generate an excess of false positive results.

When *K*-many statistically independent hypothesis tests are performed using a significance threshold of *α* and *all* must be passed in order to confirm a theory,^{4} the chance of simultaneously rejecting them all by chance is *α*^{K}. Fixing the joint size at 0.005, it is clear that increasing *K* allows the individual test’s *α* to be higher.^{5} Specifically, *α* must be equal to 0.005^{1/K} in order to achieve the desired size. That means that two statistically independent hypothesis tests can have their individual *α* ≈ 0.07 in order to achieve a joint size of 0.005; this is a *lower* standard on the individual level than the current 0.05 convention. When *K* = 3, the individual *α* ≈ 0.17.
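The individual-test thresholds follow directly from *α* = 0.005^{1/K}. A quick check (my own sketch; the helper name `individual_alpha` is hypothetical):

```python
# Individual-test significance threshold that yields a joint size of
# 0.005 when K independent tests must all be passed: alpha = 0.005**(1/K).

def individual_alpha(joint_size, k):
    return joint_size ** (1 / k)

for k in (1, 2, 3):
    print(k, individual_alpha(0.005, k))
# K = 1: 0.005; K = 2: ~0.0707; K = 3: ~0.171
```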

When will conducting a joint test of multiple hypotheses yield greater power than conducting a single test? If two hypothesis tests are statistically independent^{6} and conducted as part of a joint study, this will occur when:

[1]  1 − *τ*_{1}(*t*^{⋆}(0.005)) < [1 − *τ*_{1}(*t*^{⋆}(0.005^{1/2}))] ⋅ [1 − *τ*_{2}(*t*^{⋆}(0.005^{1/2}))]

[2]  [1 − *τ*_{1}(*t*^{⋆}(0.005))] / [1 − *τ*_{1}(*t*^{⋆}(0.005^{1/2}))] < 1 − *τ*_{2}(*t*^{⋆}(0.005^{1/2}))

Here, *τ*_{k} is the non-central cumulative *t*-density corresponding to the sampling distribution of *β*_{k}, *k* ∈ {1, 2}; I assume the tests are indexed in descending order of power. *t*^{⋆}(*a*) is the positive critical *t*-statistic needed to create a single two-tailed hypothesis test of size *a*. The left hand side of equation (1) is the power of the single hypothesis test with the greatest power; the right hand side of equation (1) is the power of the joint test of two hypotheses. As equation (2) shows, the power of the joint test is larger when the proportional change in power for the test of *β*_{1} is less than the power of the test of *β*_{2}. Whether this condition is met depends on many factors, including the magnitude and variability of *β*_{1} and *β*_{2}.

To illustrate the potential for power gains, I numerically calculated the power of joint tests with joint size 0.005 for the case of one, two, and three statistically independent individual tests. For this calculation, I assumed three relationships *β* = *β*_{k} of equal magnitude and sign, *k* ∈ {1, 2, 3}, each with a sampling standard deviation of 1; thus, each one of the three tests has identical power when conducted individually. I then calculated the power of the joint test for each value of *k* and for varying values of *β*, where *τ* is the non-central cumulative *t*-density with non-centrality parameter equal to *β*. Note that, as before, I define *t*^{⋆}(*a*) as the critical *t*-statistic for a single two-tailed test with size *a*. The result is illustrated in Figure 4.
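A calculation in the spirit of Figure 4 can be sketched as follows (my own SciPy-based illustration, not the article's replication code; the 100 degrees of freedom are my assumption, carried over from the Figure 2 setup):

```python
# Power of a joint test of K identical, statistically independent
# hypotheses, holding the joint size fixed at 0.005. Each individual
# test uses alpha = 0.005**(1/K); by independence, the joint power is
# the product of the K individual powers.
from scipy.stats import nct, t

def joint_power(beta, k, joint_size=0.005, df=100):
    alpha_k = joint_size ** (1 / k)          # individual-test threshold
    t_star = t.ppf(1 - alpha_k / 2, df)
    single = (1 - nct.cdf(t_star, df, beta)) + nct.cdf(-t_star, df, beta)
    return single ** k

for beta in (1, 2, 3, 4):
    print(beta, [round(joint_power(beta, k), 3) for k in (1, 2, 3)])
```

At *β* = 0 the joint power collapses to the joint size (0.005) for every *K*, while for intermediate *β* the joint tests are substantially more powerful than the single test, consistent with Figure 4.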

Figure 4 shows that joint hypothesis testing creates substantial gains in power over single hypothesis testing for most values of *β*. There is a small amount of power loss near the tails (where *β* ≈ 0 and *β* ≈ 6), but this is negligible.

There are reasons to be skeptical of the Benjamin et al. (2017) proposal to move the NHST threshold for statistical significance from *α* = 0.05 to *α* = 0.005. First, there is substantial potential for adverse effects on the scientific ecosystem: the proposal seems to advantage senior scholars at the most prominent institutions in fields that do not rely on fixed-*N* observational data. Second, the NHST with a reduced *α* is not my ideal approach to adjudicating which results are statistically meaningful; I believe it is more advantageous for political scientists to adopt a statistical decision theory-oriented approach to inference^{7} and give greater emphasis to cross-validation and out-of-sample prediction.

However, based on the evidence presented here and in related work, I believe that moving the threshold for statistical significance from *α* = 0.05 to *α* = 0.005 would benefit political science *if* we adapt to this reform by developing richer, more robust theories that admit multiple predictions. Such a reform would reduce the false discovery rate without reducing power or unduly disadvantaging underfunded scholars or subfields that rely on historical observational data, even if meeting the stricter standard for significance became a necessary condition for publication. It would also force us to focus on improving our body of theory; our extant theories lead us to propose hypotheses that are wrong as much as 90% of the time (V. E. Johnson et al. 2017; Esarey and Liu 2017). Software that automates the calculation of appropriate critical *t*-statistics for correlated joint hypothesis tests would make it easier for substantive researchers to make this change, and ought to be developed in future work. This software has already been created for joint tests involving interaction terms in generalized linear models by Esarey and Sumner (2017), but the procedure needs to be adapted for the more general case of any type of joint hypothesis test.

Abdi, Herve. 2007. “The Bonferonni and Sidak Corrections for Multiple Comparisons.” In *Encyclopedia of Measurement and Statistics*, edited by Neil Salkind. Thousand Oaks, CA: Sage. URL: https://goo.gl/EgNhQQ accessed 8/5/2017.

Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E. J. Wagenmakers, Richard Berk, Kenneth A. Bollen, et al. 2017. “Redefine Statistical Significance.” *Nature Human Behaviour* Forthcoming: 1–18. URL: https://osf.io/preprints/psyarxiv/mky9j/ accessed 7/31/2017.

Benjamini, Y., and Y. Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” *Journal of the Royal Statistical Society. Series B (Methodological)* 57 (1): 289–300. URL: http://www.jstor.org/stable/10.2307/2346101.

Esarey, Justin, and Nathan Danneman. 2015. “A Quantitative Method for Substantive Robustness Assessment.” *Political Science Research and Methods* 3 (1): 95–111.

Esarey, Justin, and Vera Liu. 2017. “A Prospective Test for Replicability and a Retrospective Analysis of Theoretical Prediction Strength in the Social Sciences.” Poster presented at the 2017 Texas Methods Meeting at the University of Houston. URL: http://jee3.web.rice.edu/replicability-package-poster.pdf accessed 8/1/2017.

Esarey, Justin, and Jane Lawrence Sumner. 2017. “Marginal Effects in Interaction Models: Determining and Controlling the False Positive Rate.” *Comparative Political Studies* forthcoming: 1–39. URL: http://jee3.web.rice.edu/interaction-overconfidence.pdf accessed 8/5/2017.

Esarey, Justin, and Ahra Wu. 2016. “Measuring the Effects of Publication Bias in Political Science.” *Research & Politics* 3 (3): 1–9. URL: https://doi.org/10.1177/2053168016665856 accessed 8/1/2017.

Johnson, Valen E. 2013. “Revised Standards for Statistical Evidence.” *Proceedings of the National Academy of Sciences* 110 (48): 19313–7.

Johnson, Valen E., Richard D. Payne, Tianying Wang, Alex Asher, and Soutrik Mandal. 2017. “On the Reproducibility of Psychological Science.” *Journal of the American Statistical Association* 112 (517): 1–10.

Klein, Richard A., Kate A. Ratliff, Michelangelo Vianello, Reginald B. Adams, Stepan Bahnik, Michael J. Bernstein, Konrad Bocian, et al. 2014. “Investigating Variation in Replicability.” *Social Psychology* 45 (3): 142–52. doi:10.1027/1864-9335/a000178.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” *Science* 349 (6251): aac4716. doi:10.1126/science.aac4716.

Scargle, Jeffrey D. 2000. “Publication Bias: The ‘File-Drawer’ Problem in Scientific Inference.” *Journal of Scientific Exploration* 14: 91–106.

Schooler, Jonathan. 2011. “Unpublished Results Hide the Decline Effect.” *Nature* 470: 437.

Sidak, Zbynek. 1967. “Rectangular Confidence Regions for the Means of Multivariate Normal Distributions.” *Journal of the American Statistical Association* 62 (318): 626–33.

Sterling, T.D., W. L. Rosenbaum, and J. J. Winkam. 1995. “Publication Decisions Revisited: The Effect of the Outcome of Statistical Tests on the Decision to Publish and Vice Versa.” *The American Statistician* 49: 108–12.

Sterling, Theodore D. 1959. “Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance—or Vice Versa.” *Journal of the American Statistical Association* 54 (285): 30–34. doi:10.1080/01621459.1959.10501497.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on *p*-Values: Context, Process, and Purpose.” *The American Statistician* 70 (2): 129–33. URL: http://dx.doi.org/10.1080/00031305.2016.1154108 accessed 8/5/2017.

The code to replicate Figures 2-4 is available at http://dx.doi.org/10.7910/DVN/C6QTF2. Figure 1 is a reprint of a figure originally published in Esarey and Wu (2016); the replication file for that publication is available at http://dx.doi.org/10.7910/DVN/2BF2HB.

1. I thank Jeff Grim, Martin Kavka, Tim Salmon, and Mike Ward for helpful comments on a previous draft of this paper, particularly in regard to the effect of lowered *α* on junior scholars and the question of whether statistical significance should be required for publication.
2. Specifically, I measure 100 ⋅ (E[*β̂*_{sig}] − *β*)/*β*, where *β̂*_{sig} is a statistically significant estimate.
3. Here, I measure 100 ⋅ (E[*β̂*_{sig}] − *μ*_{β})/*μ*_{β}, where *μ*_{β} is the mean of *f*(*β*).
4. The points in this paragraph are similar to those made by Esarey and Sumner (2017, pp. 15–19).
5. For statistically correlated tests, the degree to which the individual-test *α* can rise is smaller; at the limit where all the hypothesis tests are perfectly correlated, the individual *α* equals the joint size itself.
6. Correlated significance tests require the creation of a joint distribution *τ* on the right hand side of equation (1) and the determination of a critical value *t*^{⋆} such that the joint test has the desired size; while practically important, this analysis is not as demonstratively illuminating as the case of statistically independent tests.
7. In a paper with Nathan Danneman (2015), I show that a simple, standardized approach could reduce the rate of false positives without harming our power to detect true positives (see Figure 4 in that paper). Failing this, I would prefer a statistical significance decision explicitly tied to expected replicability, which requires information about researchers’ propensity to test null hypotheses as well as their bias toward positive findings (Esarey and Liu 2017). These changes would increase the *complexity* and the *number of researcher assumptions* of a statistical assessment procedure relative to the NHST, but not (in my opinion) to a substantial degree.