Understanding Equation Balance in Time Series Regression

Peter K. Enns and Christopher Wlezien

Abstract: Most contributors to a recent Political Analysis symposium on time series analysis suggest that in order to maintain equation balance, one cannot combine stationary, integrated, and/or fractionally integrated variables with general error correction models (GECMs) and the equivalent autoregressive distributed lag (ADL) models. This definition of equation balance implicates most previous uses of these models in political science and circumscribes their use moving forward. The claim thus is of real consequence and worthy of empirical substantiation, which the contributors did not provide. Here we address the issue. First, we highlight the difference between estimating unbalanced equations and mixing orders of integration, the former of which clearly is a problem and the latter of which is not, at least not necessarily. Second, we assess some of the consequences of mixing orders of integration by conducting simulations using stationary and integrated time series. Our simulations show that with an appropriately specified model, regressing a stationary variable on an integrated one, or the reverse, does not increase the risk of spurious results. We then illustrate the potential importance of these conclusions with an applied example—income inequality in the United States.[1]

Political Analysis (PA) recently hosted a symposium on time series analysis that built upon De Boef and Keele’s (2008) influential time series article in the American Journal of Political Science. Equation balance was an important point of emphasis throughout the symposium. In their classic work on the subject, Banerjee, Dolado, Galbraith and Hendry (1993, 164) explain that an unbalanced equation is a regression, “in which the regressand is not the same order of integration as the regressors, or any linear combination of the regressors.” The contributors to this symposium were right to emphasize the importance of equation balance, as unbalanced equations can produce serially correlated residuals (e.g., Pagan and Wickens 1989) and spurious relationships (e.g., Banerjee et al. 1993, 79).

Throughout the PA symposium, however, equation balance is defined and applied in different ways. Grant and Lebo (2016, 7) follow Banerjee et al.’s definition when they explain that a general error correction model (GECM)—or autoregressive distributed lag (ADL)—is balanced if co-integration is present.[2] Keele, Linn and Webb (2016a, 83) implicitly make this same point in their second contribution to the symposium when they cite Banerjee et al. (1993) in their discussion of equation balance. Yet, other parts of the symposium seem to apply a stricter standard of equation balance, stating that when estimating a GECM/ADL all time series must be the same order of integration. As Grant and Lebo write in the abstract of their first article, “Time series of various orders of integration—stationary, non-stationary, explosive, near- and fractionally integrated—should not be analyzed together… That is, without equation balance the model is misspecified and hypothesis tests and long-run multipliers are unreliable.” Keele, Linn and Webb (2016b, 34) similarly write, “no regression model is appropriate when the orders of integration are mixed because no long-run relationship can exist when the equation is unbalanced.” Box-Steffensmeier and Helgason (2016, 2) make the point by stating, “when studying the relationship between two (or more) series, the analyst must ensure that they are of the same level of integration; that is, they have to be balanced.” Although Freeman (2016) offers a more nuanced perspective on equation balance, many of the symposium contributors could be interpreted as recommending that scholars never mix orders of integration.[3] Indeed, in their concluding article, Lebo and Grant write, “One point of agreement among the papers here is that equation balance is an important and neglected topic. One cannot mix together stationary, unit-root, and fractionally integrated variables in either the GECM or the ADL” (p.79).

It is possible that these authors did not mean for these quotes to be taken literally. However, we both have recently been asked to review articles that have used these quotes to justify analytic decisions with time series data.[4] Thus, we think the claims should be reviewed carefully. This is especially the case because Grant and Lebo could be interpreted as applying these strict standards in some of their empirical applications. For example, in their discussion of Sánchez Urribarrí, Schorpp, Randazzo and Songer (2011), Grant and Lebo write, “both the UK and US models are unbalanced—each DV is stationary, and the inclusion of unit-root IVs has compromised the results” (Supplementary Materials, p.36). Researchers might take this statement to imply that including stationary and unit root variables automatically produces an unbalanced equation.

In addition to its implications for practitioners, the strict interpretation of equation balance holds implications for the vast number of existing time series articles that employ GECM/ADL models without pre-whitening the data to ensure equal orders of integration across all series. Lebo and Grant (2016, 79) point out, for example, “FI [fractional integration] methods allow us to create a balanced equation from dissimilar data. By filtering each series by its own (p, d, q) noise model, the residuals of each can be rendered (0, 0, 0) so that you can investigate how X’s deviations from its own time-dependent patterns affect Y’s deviations from its own time-dependent patterns.” Fortunately, existing time series analysis that does not pre-whiten the data need not be automatically dismissed. The strict interpretation of equation balance—i.e., that mixing orders of integration is always problematic with the GECM/ADL—is not accurate. As noted above, the contributors to the symposium may indeed understand this point. But based on the quotes above, we feel that it is important to clarify for practitioners that an unbalanced equation is not synonymous with mixing orders of integration. While related, they are not the same, and while the former is always a problem the latter is not.

We begin by showing that equation balance does not necessarily require that all series have the same order of integration with the GECM/ADL. This is important because the classic examples in the literature of unbalanced equations include series of different orders of integration (see, for example, Banerjee et al. (1993, 79) and Maddala and Kim (1998, 252)). But our results are not at odds with these scholars, as their examples all assume a relationship with no dynamics. When using a GECM/ADL to model dynamic processes, even mixed orders of integration can produce balanced equations. This conclusion is consistent with Banerjee et al. (1993), who write, “The moral of the econometricians’ story is the need to keep track of the orders of integration on both sides of the regression equation, which usually means incorporating dynamics; models that have restrictive dynamic structures are relatively likely to give misleading inferences simply for reasons of inconsistency of orders of integration” (p.192, italics ours).

We believe the PA symposium was not sufficiently clear that adding dynamics can solve the equation balance problem with mixed orders of integration. Thus, a key contribution of our article is to show how appropriate model specification can be used to produce equation balance and avoid inflating the rate of spurious regression—even when the model includes series with different orders of integration. Our particular focus is analysis that mixes stationary I(0) and integrated I(1) time series. In practice, researchers might encounter other types of time series, such as fractionally integrated, near-integrated, or explosive series. Evaluating every type of time series and the vast number of ways different orders of integration could appear in a regression model is beyond the scope of this paper. Our goal is more basic, but still important. We aim to demonstrate that there are exceptions to the claim that, “The order of integration needs to be consistent across all series in a model” (Grant and Lebo 2016, 4) and that these exceptions can hold important implications for social science research.

More specifically, we show that when data are either stationary or (first order) integrated, scenarios exist when a GECM/ADL that includes both types of series can be estimated without problem. Our simulations show that regressing an integrated variable on a stationary one (or the reverse) does not increase the risk of spurious results when modeled correctly. While this may be a simple point, we think it is a crucial one. As mentioned above, if readers interpreted the previous quotes from the PA symposium as defining equation balance to mean that different orders of integration cannot be mixed, most existing research that employs the ADL/GECM model would be called into question. Given the fact that Political Analysis is one of the most cited journals in political science and the symposium included some of the top time series practitioners in the discipline, we believe it is valuable to clarify that mixing orders of integration is not always a problem and that existing time series research is not inherently flawed. Furthermore, the one article that has responded to particular claims made in the symposium contribution did not address the symposium’s definition of equation balance (Enns et al. 2016).[5] We hope our article helps clarify the concept of equation balance for those who use time series analysis.

We also illustrate the importance of our findings with an applied example—income inequality in the United States. The example illustrates how the use of pre-whitening to force variables to be of equal orders of integration (when the equation is already balanced) can be quite costly, leading researchers to fail to detect relationships.[6]

Clarifying Equation Balance

The contributors to the PA symposium were all correct to emphasize equation balance. Time series analysis requires a balanced equation. An unbalanced equation is mis-specified by definition, typically resulting in serially correlated residuals and an increased probability of Type I errors.[7] As noted above, Banerjee et al. (1993, 164, italics ours) explain that an unbalanced equation is a regression, “in which the regressand is not the same order of integration as the regressors, or any linear combination of the regressors.” Our primary concern is that much of the discussion in the PA symposium seems to focus on the order of integration of each variable in the equation without acknowledging that a “linear combination of the regressors” can also produce equation balance. We worry that researchers might interpret this focus to mean that equation balance requires each series in the model to be the same order of integration.[8] Such a conclusion would be wrong. As the previous quote from Banerjee et al. (1993) indicates (see also Maddala and Kim 1998, 251), if the regressand and the regressors are not the same order of integration, the equation will still be balanced if a linear combination of the variables is the same order of integration.

As Grant and Lebo (2016, 7) and Keele, Linn and Webb (2016a, 83) acknowledge, cointegration offers a useful illustration of how an equation can be balanced even when the regressand and regressors are not the same order of integration.[9] Consider two integrated I(1) variables, Y and X, in a standard GECM:

\Delta Y_t = \alpha_0 + \alpha_1 Y_{t-1} + \gamma_1 \Delta X_t + \beta_1 X_{t-1} + \epsilon_t. (1)

Clearly, the equation mixes orders of integration. We have a stationary regressand (\Delta Y_t) and a combination of integrated (Y_{t-1}, X_{t-1}) and stationary (\Delta X_t) regressors. However, if X and Y are cointegrated, the equation is still balanced. To see why, we can rewrite Equation 1 as:

\Delta Y_t = \alpha_0 + \alpha_1 (Y_{t-1} + \frac{\beta_1}{\alpha_1} X_{t-1}) + \gamma_1 \Delta X_{t} + \epsilon_t. (2)

X and Y are cointegrated when X and Y are both integrated (of the same order) and \alpha_1 and \beta_1 are non-zero (and \alpha_1 < 0). Because cointegration ensures that Y and X maintain an equilibrium relationship, a linear combination of these variables exists that is stationary (that is, if we regress Y on X, in levels, the residuals would be stationary).[10] As noted above, this (stationary) linear combination is captured by (Y_{t-1} + \frac{\beta_1}{\alpha_1}X_{t-1}). Additionally, since Y and X are both integrated of order one, \Delta Y and \Delta X will be stationary. Thus, cointegration ensures that the equation is balanced: the regressand (\Delta Y_t) and either the regressors (\Delta X_t) or a linear combination of the regressors (Y_{t-1} + \frac{\beta_1}{\alpha_1}X_{t-1}) are all stationary. Importantly, if we added a stationary regressor to the model, e.g., if we thought innovations in Y were also influenced by a stationary variable, the equation would still be balanced.
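A short simulation (our illustration, not from the symposium) makes the balance argument concrete. We generate a cointegrated pair with cointegrating vector (1, -1) and estimate Equation 1 by ordinary least squares. Even though the equation mixes stationary and integrated terms, the estimated error-correction coefficient \alpha_1 is negative and the lagged-level coefficients recover the equilibrium relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

# Two cointegrated I(1) series: X is a random walk and Y tracks X,
# so Y - X is stationary (cointegrating vector (1, -1)).
x = np.cumsum(rng.normal(size=T))
y = x + rng.normal(size=T)

# Estimate the GECM of Equation 1 by OLS:
# delta Y_t on a constant, Y_{t-1}, delta X_t, and X_{t-1}.
dy = np.diff(y)
dx = np.diff(x)
Z = np.column_stack([np.ones(T - 1), y[:-1], dx, x[:-1]])
a0, a1, g1, b1 = np.linalg.lstsq(Z, dy, rcond=None)[0]

print(f"alpha_1 = {a1:.2f}, gamma_1 = {g1:.2f}, beta_1 = {b1:.2f}")
```

Here the cointegrating vector is known by construction; with real data the equilibrium relationship would be unknown and subject to testing, but the point about balance is the same.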

The fact that the GECM—which mixes stationary and integrated regressand and regressors—is appropriate when cointegration is present demonstrates that equation balance does not require the series to be the same order of integration. As we have mentioned, Grant and Lebo (2016, 7) acknowledge that a GECM is balanced if co-integration is present and Keele, Linn and Webb (2016a, 83) make this point in their second contribution to the symposium citing Banerjee et al. (1993) in their discussion of equation balance. However, as noted above, we have begun to encounter research that interprets other statements in the symposium to mean that analysts can never mix orders of integration. For example, in their discussion of Volscho and Kelly (2012), Grant and Lebo write that the “data is a mix of data types (stationary and integrated), so any hypothesis tests will be based on unbalanced equations” (supplementary appendix, p. 48). But is this really the case? The above example shows that when cointegration is present, equation balance can exist even when the orders of integration are mixed.

Below, we use simulations to illustrate two seemingly less well-known scenarios when equation balance exists despite different orders of integration. Again, our goal is not to identify all cases where different orders of integration can result in equation balance. Rather, we want to show that researchers should not automatically equate different orders of integration with an unbalanced equation. Situations exist where it is completely appropriate to estimate models with different orders of integration.

Equation Balance with Mixed Orders of Integration: Simulation Results

We begin with an integrated Y and a stationary X. At first glance, estimating a relationship between these variables, which requires mixing an I(1) and I(0) series, might seem problematic. Grant and Lebo (2016, 4) explain, “Mixing together series of various orders of integration will mean a model is misspecified” and in econometric texts, mixing I(1) and I(0) series offers a classic example of an unbalanced equation (Banerjee et al. 1993, 79, Maddala and Kim 1998, 252).[11]

It is still possible to estimate the relationship between an integrated Y and a stationary X in a correctly specified and balanced equation. First, we must recognize that when Banerjee et al. (1993) (see also Mankiw and Shapiro (1986) and Maddala and Kim (1998)) state that an I(1) and I(0) series represent an unbalanced equation, they are modeling the equation:

Y_t = \alpha + \beta X_{t-1} + u_t. (3)

Equation 3 is indeed unbalanced (and thus misspecified) as the regressand is integrated and the regressor is stationary. This result does not, however, mean that we cannot consider these two series. A stationary series, X, might be related to innovations in an integrated series, Y. If so, we could model this process with an autoregressive distributed lag model:

Y_t = \alpha + \alpha_1 Y_{t-1} + \beta_1 X_t + \beta_2 X_{t-1} + \epsilon_t. (4)

Much as before, this might appear to still be an unbalanced equation. We continue to mix I(1) and I(0) series, which seemingly violates Lebo and Grant’s (2016, 71) conclusion that, “One cannot mix together stationary, unit-root, and fractionally integrated variables in either the GECM or the ADL.”[12] However, since Y is I(1), \alpha_1 = 1, which means Y_t - \alpha_1 Y_{t-1} = \Delta Y_t. Thus, we can rewrite the equation as,

\Delta Y_t = \alpha + \beta_1 X_t + \beta_2 X_{t-1} + \epsilon_t. (5)

Because Y is an integrated, I(1), series, \Delta Y must be stationary. Thus, the regressand and regressors are all I(0) series. As Banerjee et al. (1993, 169) explain, “regressions that are linear transformations of each other have identical statistical properties. What is important, therefore, is the possibility of transforming in such a way that the regressors are integrated of the same order as the regressand.”[13] Thus, Equation 5 shows that the ADL in Equation 4 is indeed balanced. (Because the GECM is algebraically equivalent to the ADL, the GECM would—by definition—also be balanced in this example.)
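The algebraic equivalence invoked here is worth spelling out. Subtracting Y_{t-1} from both sides of Equation 4, then adding and subtracting \beta_1 X_{t-1}, yields the GECM form:

\Delta Y_t = \alpha + (\alpha_1 - 1) Y_{t-1} + \beta_1 \Delta X_t + (\beta_1 + \beta_2) X_{t-1} + \epsilon_t.

When \alpha_1 = 1, the coefficient on Y_{t-1} is zero, and because \beta_1 \Delta X_t + (\beta_1 + \beta_2) X_{t-1} = \beta_1 X_t + \beta_2 X_{t-1}, the expression reduces exactly to Equation 5.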

The above discussion suggests that we can use an ADL to estimate the relationship between an integrated Y and stationary X. To test these expectations, we conduct a series of Monte Carlo experiments. We generate an integrated Y with the following DGP:

Y_t= Y_{t-1} + \epsilon_{yt},\; \epsilon_{yt}\sim N(0,1). (6)

We generate the stationary time series X, with the following DGP, where \theta_x equals 0.0 or 0.5:

X_t=\theta_x X_{t-1} + \epsilon_{xt},\; \epsilon_{xt}\sim N(0,1). (7)

Notice that X and Y are independent series. Particularly with dependent series that contain a unit root (as is the case here), the dominant concern in the time series literature is the potential for estimating spurious relationships (e.g., Granger and Newbold 1974, Grant and Lebo 2016, Yule 1926). Thus, our first simulations seek to identify the percentage of analyses that would incorrectly reject the null hypothesis of no relationship between a stationary X and integrated Y with an ADL. As noted above, in light of the recommendations in the PA symposium to never mix orders of integration, this approach seems highly problematic. However, if the equation is balanced as we suggest, the false rejection rate in our simulations should only be about 5 percent.

In the following simulations, T is set to 50 and then 1,000. These values allow us to evaluate both a short time series that political scientists often encounter and a long time series that will approximate the asymptotic properties of the series. We use the DGPs in Equations 6 and 7, above, to generate 1,000 simulated data sets. Recall that in our stationary series, \theta_x equals 0.5 or 0.0 and Y and X are never related. To evaluate the relationship between X and Y, we estimate the ADL model in Equation 4.[14]
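The design just described can be sketched in a few lines of code. This is our own illustration, not the authors' simulation code, and it uses a conventional t critical value as an approximation:

```python
import numpy as np

rng = np.random.default_rng(42)
T, SIMS = 50, 1000
rejections = 0

for _ in range(SIMS):
    # Independent series: Y contains a unit root (Equation 6) and X is a
    # stationary AR(1) with theta_x = 0.5 (Equation 7); they are never related.
    y = np.cumsum(rng.normal(size=T))
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = 0.5 * x[t - 1] + rng.normal()

    # ADL of Equation 4: Y_t on a constant, Y_{t-1}, X_t, and X_{t-1}.
    Z = np.column_stack([np.ones(T - 1), y[:-1], x[1:], x[:-1]])
    b = np.linalg.lstsq(Z, y[1:], rcond=None)[0]
    resid = y[1:] - Z @ b
    s2 = resid @ resid / (len(resid) - Z.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))
    if abs(b[2] / se[2]) > 2.01:  # approximate 5% two-sided t value, df ~ 45
        rejections += 1

rate = rejections / SIMS
print(f"false rejection rate on beta_1: {rate:.3f}")
```

If the equation is balanced, as we argue, the rejection rate should hover near the nominal 5 percent rather than ballooning the way classic spurious regressions do.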


Table 1 reports the average estimated relationship across all simulations between X and Y (\hat{\beta_1} and \hat{\beta_2} in Equation 4) and the percent of simulations in which these relationships were statistically significant. The mean estimated relationship is close to zero and the Type I error rate is close to 5 percent. With this ADL specification, when Y is integrated and X is stationary, mixing integrated and stationary time series does not increase the risk of spurious regression.[15]

Results in Table 2 show that the same pattern of results emerges when X is integrated and Y is stationary.[16] Most time series analysis in the political and social sciences could be accused of mixing orders of integration. Thus, the recommendations of the PA symposium could be interpreted as calling this research into question. We have shown, however, that mixing orders of integration does not automatically imply an unbalanced equation. It also does not automatically lead to spurious results.


The Rise of the Super Rich: Reconsidering Volscho and Kelly (2012)

We think the foregoing discussion and analyses offer compelling evidence that, despite the range of statements about equation balance in the PA symposium, mixing orders of integration when using a GECM/ADL does not automatically pose a problem to researchers. Of course, to a large degree the previous sections reiterate and unpack what econometricians have shown mathematically (e.g., Sims, Stock and Watson 1990), and so may come as little surprise to some readers (especially those who have not read the PA Symposium). Here, we use an applied example to illustrate the importance of correctly understanding equation balance. We turn to a recent article by Volscho and Kelly (2012) that analyzes the rapid income growth among the super-rich in the United States (US). They estimate a GECM of pre-tax income growth among the top 1% and find evidence that political, policy, and economic variables influence the proportion of income going to those at the top. Critically for our purposes, they include stationary and integrated variables on the right-hand side, which Grant and Lebo (2016, 26) actually single out as a case where the “GECM model [is] inappropriate with mixed orders of integration.” Grant and Lebo go on to assert that Volscho and Kelly’s “data is a mix of data types (stationary and integrated), so any hypothesis tests will be based on unbalanced equations” (supplementary appendix, p. 48). Based on the conclusion that mixing orders of integration produces an unbalanced equation, Grant and Lebo employ fractional error correction technology and find that none of the political or policy variables (and only some economic variables) matter for incomes among the top 1%.

These are very different findings, ones with potential policy consequences, and so it is important to reconsider what Volscho and Kelly did—and whether the mixed orders of integration pose a problem for their analysis. To begin our analysis, we present the dependent variable from Volscho and Kelly, the total pre-tax income share of the top 1% for the period between 1913 and 2008.[17] In Figure 1 we can see that income shares start off quite high, then drop, and then return to inter-war levels toward the end of the series. The variable thus exhibits none of the trademarks of a stationary series, i.e., it is not mean-reverting, and looks to contain a unit root instead. Notice that the same is true for the shorter period encompassed by Volscho and Kelly’s analysis, 1949-2008. Augmented Dickey-Fuller (ADF) and Phillips-Perron unit root tests confirm these suspicions, and are summarized in the first row of Table 3, below.[18]


Figure 1: The Top 1 Percent’s Share of Pre-tax Income in the United States, 1913 to 2008  

What about the independent variables? Here, we find a mix (see Table 3). Some variables clearly are nonstationary and also appear to contain unit roots: the capital gains tax rate, union membership, the Treasury Bill rate, Gross Domestic Product (logged), and the Standard and Poor 500 composite index. The top marginal tax rate also is clearly nonstationary and we cannot reject a unit root even when taking into account the secular (trending) decline over time. The results for the Shiller Home Price Index are mixed and trade openness is on the statistical cusp, and there is reason—based on the size of the autoregressive parameter (-0.29) and the fact that we reject the unit root over a longer stretch of time—to assume that the variable is stationary. For the other variables included in the analysis, we reject the null hypothesis of a unit root: Democratic president and the Percentage of Democrats in Congress. These findings seem to comport with what Volscho and Kelly found (see their supplementary materials).[19]
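The logic of these unit-root diagnostics can be sketched as follows. This is our simplified illustration on simulated rather than actual series, using a plain Dickey-Fuller regression with a constant and no lag augmentation (the tests reported in Table 3 are augmented Dickey-Fuller and Phillips-Perron tests, which handle serial correlation in the errors):

```python
import numpy as np

def df_tstat(series):
    """t-statistic on y_{t-1} in the simple Dickey-Fuller regression
    delta y_t = c + rho * y_{t-1} + e_t (no lag augmentation)."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    Z = np.column_stack([np.ones(len(dy)), y[:-1]])
    b = np.linalg.lstsq(Z, dy, rcond=None)[0]
    resid = dy - Z @ b
    s2 = resid @ resid / (len(dy) - 2)
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
    return b[1] / se

rng = np.random.default_rng(1)
T = 60  # roughly the length of Volscho and Kelly's 1949-2008 series
walk = np.cumsum(rng.normal(size=T))  # unit-root series
ar = np.zeros(T)
for t in range(1, T):
    ar[t] = 0.3 * ar[t - 1] + rng.normal()  # stationary AR(1)

# Reject the unit-root null when the t-statistic is more negative than the
# Dickey-Fuller 5% critical value (about -2.9 with a constant at this T).
print(f"walk: {df_tstat(walk):.2f}, AR(1): {df_tstat(ar):.2f}")
```

The stationary series should produce a sharply negative statistic while the random walk typically does not, which is the pattern summarized for the actual variables in Table 3.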


Volscho and Kelly proceed to estimate a GECM of the top 1% income share including current first differences and lagged levels of the stationary and integrated variables. So far, the diagnostics support their decision (integrated DV, some IVs are integrated, and we find evidence of cointegration).[20] The fact that stationary variables are also included in the model should not affect equation balance. However, in order to evaluate the robustness of Volscho and Kelly’s results, we re-consider their data with Pesaran and Shin’s ARDL (Autoregressive Distributed Lag) critical bounds testing approach (Pesaran, Shin and Smith 2001). Although political scientists typically refer to the autoregressive distributed lag model as an ADL, Pesaran, Shin and Smith (2001) prefer ARDL. For their bounds test of cointegration, they estimate the model as a GECM.[21]

The ARDL approach is one of the approaches recommended by Grant and Lebo and is especially advantageous in the current context because two critical values are provided, one which assumes all stationary regressors and one which assumes all integrated regressors. Values in between these “bounds” correspond to a mix of integrated and stationary regressors, meaning the bounds approach is especially appropriate when the analysis includes both types of regressors. Grant and Lebo (2016, 19) correctly acknowledge that “With the bounds testing approach, the regressors can be of mixed orders of integration—stationary, non-stationary, or fractionally integrated—and the use of bounds allow the researcher to make inferences even when the integration of the regressors is unknown or uncertain.”[22] Since Table 3 indicates we have a mix of stationary and integrated regressors, if our F-statistic exceeds the upper bound, we will have evidence of cointegration.

The ARDL approach proceeds in several steps.[23] First, if the dependent variable is integrated, the ARDL model (which is equivalent to the GECM) is estimated. Next, if the residuals from this model are stationary, an F-test is conducted to evaluate the null hypothesis that the combined effect of all lagged variables in the model equals zero. This F statistic is compared to the appropriate critical values (Pesaran, Shin and Smith 2001). We rely on the small-sample critical values from Narayan (2005). If there is evidence of cointegration, both long and short-run relationships from the initial ARDL (i.e., ADL/GECM) model can be evaluated.
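The mechanics of the F-test step can be sketched as follows. This is our simplified illustration with a single regressor, no lag selection, and simulated data; applied work should use the tabulated Pesaran, Shin and Smith (2001) or Narayan (2005) critical values and a properly specified lag structure:

```python
import numpy as np

def bounds_f(y, x):
    """F-statistic for the joint null that the lagged-level terms are zero in
    the GECM  delta y_t = c + a*y_{t-1} + b*x_{t-1} + g*delta x_t + e_t."""
    dy, dx = np.diff(y), np.diff(x)
    Zu = np.column_stack([np.ones(len(dy)), y[:-1], x[:-1], dx])  # unrestricted
    Zr = np.column_stack([np.ones(len(dy)), dx])        # lagged levels dropped

    def ssr(Z):
        b = np.linalg.lstsq(Z, dy, rcond=None)[0]
        resid = dy - Z @ b
        return resid @ resid

    q = 2                                # restrictions: the two lagged levels
    df = len(dy) - Zu.shape[1]
    return ((ssr(Zr) - ssr(Zu)) / q) / (ssr(Zu) / df)

rng = np.random.default_rng(7)
T = 80
x = np.cumsum(rng.normal(size=T))  # integrated regressor
y = x + rng.normal(size=T)         # cointegrated with x by construction

f_stat = bounds_f(y, x)
# Compare f_stat to the I(1) upper-bound critical value; exceeding it indicates
# cointegration even when the regressors mix orders of integration.
print(f"bounds F = {f_stat:.2f}")
```

With a genuinely cointegrated pair the statistic lands far above the upper bound, which is the same inference pattern reported for the Volscho and Kelly data below.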

Our analysis focuses on Column 5 from Volscho and Kelly’s Table 1, which is their preferred model. The first column of our Table 4, below, shows that we successfully replicate their results. The ARDL analysis appears in Column 2.[24] The key difference between this specification and Volscho and Kelly’s is that they (based on a Breusch-Godfrey test) employed the Prais-Winsten estimator to correct for serially correlated errors and we do not. Our decision reflects the fact that other tests do not reject the null of white noise, e.g., the Portmanteau (Q) test produces a p-value of 0.12, and it allows us to compare the results with and without the correction. Also note that an expanded model including lagged differenced dependent and independent variables (see Appendix Table A-1) produces very similar estimates to those shown in column 2 of Table 4, and a Breusch-Godfrey test indicates that the resulting residuals are uncorrelated.

To begin with, we need to test for cointegration. For this, we compare the F-statistic from the lagged variables (6.54) with the Narayan (2005) upper (I(1)) critical value (3.82), which provides evidence of cointegration.[25] The Bounds t-test also supports this inference, as the t-statistic (-7.75) for the Y_{t-1} parameter is greater (in absolute terms) than the I(1) bound tabulated by Pesaran, Shin, and Smith (2001, 303). Returning to the results in Column 2, we see that the ARDL approach produces similar conclusions to Column 1. (Philips (2016) uses the ARDL approach to re-consider the first model in Volscho and Kelly’s (2012) Table 1 and also obtains similar results.) The coefficients for all but two of the independent variables have similar effects, i.e., the same sign and statistical significance.[26] The exceptions are Divided Government(t-1) and \Delta Trade Openness, for which the coefficients using the two approaches are similar but the standard errors differ substantially. Consistent with the existing research on the subject, we find evidence that economics, politics, and policy matter for the share of income going to the top 1 percent.


Although Grant and Lebo (2016, 18) recommend both the ARDL approach and a three-step fractional error correction model (FECM) approach, they only report the results for the latter in their re-analysis of Volscho and Kelly.[27] It turns out that the two approaches produce very different results. This can be seen in column 3 of Table 4, which reports Grant and Lebo’s FECM reanalysis of Volscho and Kelly Model 5 (from Grant and Lebo’s supplementary appendix, p. 50). With their approach, only the change in stock prices (Real S&P 500 Index) and Trade Openness are statistically significant (p<.05) predictors of income shares, though levels of stock prices and trade openness also matter via the FECM component, which captures disequilibria between those variables and lagged income shares. Despite theoretical and empirical evidence suggesting that the marginal tax rate (Mertens 2015, Piketty, Saez and Stantcheva 2014), union strength (Jacobs and Myers 2014, Pontusson 2013, Western and Rosenfeld 2011), and the partisan composition of government (Bartels 2008, Hibbs 1977, Kelly 2009) can influence the pre-tax income of the upper 1 percent, we would conclude that only trade openness and stock prices influence the pre-tax income share of richest Americans. Of course, analysts might reasonably prefer alternative models to the ones Volscho and Kelly estimate, perhaps opting for a more parsimonious specification, allowing endogenous relationships, and/or including alternate lag specifications. The key point is that, given the particular model, the ARDL and three–step FECM produce very different estimates.


In his contribution to the PA symposium, John Freeman wrote, “It now is clear that equation balance is not understood by political scientists” (Freeman 2016, 50). Our goal has been to help clarify misconceptions about equation balance. In particular, we have shown that mixing orders of integration in a GECM/ADL model does not automatically lead to an unbalanced equation. As the title of Lebo and Grant’s second contribution to the symposium (“Equation Balance and Dynamic Political Modeling”) illustrates, equation balance was a central theme of the symposium. Although others have responded to particular criticisms within the PA symposium (e.g., Enns et al. 2016), this article is the first to address the symposium’s discussion and recommendations related to equation balance.

Because they are related, it is easy to (erroneously) conclude that mixing orders of integration is synonymous with an unbalanced equation. It would be wrong, however, to reach this conclusion. We have focused on two types of time series, stationary and unit-root series, and we have found that situations exist when it is unproblematic—and inconsequential—to mix these types of series (because the equation is balanced).[28]

These results help clarify existing time series research (e.g., Banerjee et al. 1993, Sims, Stock and Watson 1990) by showing that when we use a GECM/ADL to model dynamic processes, even mixed orders of integration can produce balanced equations. The findings also lead to three recommendations for researchers. First, scholars should not automatically dismiss existing time series research that mixes orders of integration. Even when series are of different orders of integration or when the equation transforms variables in a way that leads to different orders of integration, the equation may still be balanced and the model correctly specified. In fact, we identified, and our simulations confirmed, specific scenarios when integrated and stationary time series can be analyzed together. Second, as we showed with our simulations and with our applied example, researchers must evaluate whether they have equation balance based on both the univariate properties of their variables and the model they specify. Third and finally, our results show that researchers do not always need to pre-whiten their data to ensure equation balance. Although pre-whitening time series will sometimes be appropriate, we have shown that this step is not a necessary condition for equation balance. This is important because such data transformations are potentially quite costly, specifically, in the presence of equilibrium relationships. As we saw above, Grant and Lebo’s decision to pre-whiten Volscho and Kelly’s data with their three-step FECM may be one such example.


Banerjee, Anindya, Juan Dolado, John W. Galbraith and David F. Hendry. 1993. Co-Integration, Error Correction, and the Econometric Analysis of Non-Stationary Data. Oxford: Oxford University Press.

Bartels, Larry M. 2008. Unequal Democracy. Princeton: Princeton University Press.

Box-Steffensmeier, Janet and Agnar Freyr Helgason. 2016. “Introduction to Symposium on Time Series Error Correction Methods in Political Science.” Political Analysis 24(1):1–2.

De Boef, Suzanna and Luke Keele. 2008. “Taking Time Seriously.” American Journal of Political Science 52(1):184–200.

Enns, Peter K., Nathan J. Kelly, Takaaki Masaki and Patrick C. Wohlfarth. 2016. “Don’t Jettison the General Error Correction Model Just Yet: A Practical Guide to Avoiding Spurious Regression with the GECM.” Research and Politics 3(2):1–13.

Ericsson, Neil R. and James G. MacKinnon. 2002. “Distributions of Error Correction Tests for Cointegration.” Econometrics Journal 5(2):285–318.

Esarey, Justin. 2016. “Fractionally Integrated Data and the Autodistributed Lag Model: Results from a Simulation Study.” Political Analysis 24(1):42–49.

Freeman, John R. 2016. “Progress in the Study of Nonstationary Political Time Series: A Comment.” Political Analysis 24(1):50–58.

Granger, Clive W.J., Namwon Hyung and Yongil Jeon. 2001. “Spurious Regressions with Stationary Series.” Applied Economics 33:899–904.

Granger, Clive W.J. and Paul Newbold. 1974. “Spurious Regressions in Econometrics.” Journal of Econometrics 2(2):111–120.

Grant, Taylor and Matthew J. Lebo. 2016. “Error Correction Methods with Political Time Series.” Political Analysis 24(1):3–30.

Hibbs, Jr., Douglas A. 1977. “Political Parties and Macroeconomic Policy.” American Political Science Review 71(4):1467–1487.

Jacobs, David and Lindsey Myers. 2014. “Union Strength, Neoliberalism, and Inequality.” American Sociological Review 79(4):752–774.

Keele, Luke, Suzanna Linn and Clayton McLaughlin Webb. 2016a. “Concluding Comments.” Political Analysis 24(1):83–86.

Keele, Luke, Suzanna Linn and Clayton McLaughlin Webb. 2016b. “Treating Time with All Due Seriousness.” Political Analysis 24(1):31–41.

Kelly, Nathan J. 2009. The Politics of Income Inequality in the United States. New York: Cambridge University Press.

Lebo, Matthew J. and Taylor Grant. 2016. “Equation Balance and Dynamic Political Modeling.” Political Analysis 24(1):69–82.

Maddala, G.S. and In-Moo Kim. 1998. Unit Roots, Cointegration, and Structural Change. New York: Cambridge University Press.

Mankiw, N. Gregory and Matthew D. Shapiro. 1986. “Do We Reject Too Often? Small Sample Properties of Tests of Rational Expectations Models.” Economics Letters 20(2):139–145.

Mertens, Karel. 2015. “Marginal Tax Rates and Income: New Time Series Evidence.” https://mertens.economics.cornell.edu/papers/MTRI_september2015.pdf.

Murray, Michael P. 1994. “A Drunk and Her Dog: An Illustration of Cointegration and Error Correction.” The American Statistician 48(1):37–39.

Narayan, Paresh Kumar. 2005. “The Saving and Investment Nexus for China: Evidence from Cointegration Tests.” Applied Economics 37(17):1979–1990.

Pagan, A.R. and M.R. Wickens. 1989. “A Survey of Some Recent Econometric Methods.” The Economic Journal 99(398):962–1025.

Pesaran, Hashem M., Yongcheol Shin and Richard J. Smith. 2001. “Bounds Testing Approaches to the Analysis of Level Relationships.” Journal of Applied Econometrics 16(3):289–326.

Philips, Andrew Q. 2016. “Have Your Cake and Eat it Too? Cointegration and Dynamic Inference from Autoregressive Distributed Lag Models.” Working Paper.

Piketty, Thomas and Emmanuel Saez. 2003. “Income Inequality in the United States, 1913–1998.” Quarterly Journal of Economics 118(1):1–39.

Piketty, Thomas, Emmanuel Saez and Stefanie Stantcheva. 2014. “Optimal Taxation of Top Labor Incomes: A Tale of Three Elasticities.” American Economic Journal: Economic Policy 6(1):230–271.

Pontusson, Jonas. 2013. “Unionization, Inequality and Redistribution.” British Journal of Industrial Relations 51(4):797–825.

Sánchez Urribarrí, Raúl A., Susanne Schorpp, Kirk A. Randazzo and Donald R. Songer. 2011. “Explaining Changes to Rights Litigation: Testing a Multivariate Model in a Comparative Framework.” Journal of Politics 73(2):391–405.

Sims, Christopher A., James H. Stock and Mark W. Watson. 1990. “Inference in Linear Time Series Models with Some Unit Roots.” Econometrica 58(1):113–144.

Volscho, Thomas W. and Nathan J. Kelly. 2012. “The Rise of the Super-Rich: Power Resources, Taxes, Financial Markets, and the Dynamics of the Top 1 Percent, 1949 to 2008.” American Sociological Review 77(5):679–699.

Western, Bruce and Jake Rosenfeld. 2011. “Unions, Norms, and the Rise of U.S. Wage Inequality.” American Sociological Review 76(4):513–537.

Wlezien, Christopher. 2000. “An Essay on ‘Combined’ Time Series Processes.” Electoral Studies 19(1):77–93.

Yule, G. Udny. 1926. “Why do we Sometimes get Nonsense-Correlations between Time-Series?–A Study in Sampling and the Nature of Time-Series.” Journal of the Royal Statistical Society 89:1–63.


[1] A previous version of this paper was presented at the Texas Methods Conference, 2017. We would like to thank Neal Beck, Patrick Brandt, Harold Clarke, Justin Esarey, John Freeman, Nate Kelly, Jamie Monogan, Mark Pickup, Pablo Pinto, Randy Stevenson, Thomas Volscho, and two anonymous reviewers for helpful comments and suggestions. All replication materials are available on The Political Methodologist Dataverse site  (https://dataverse.harvard.edu/dataverse/tpmnewsletter).

[2] The GECM and ADL are the same model (e.g., Banerjee et al. 1993, De Boef and Keele 2008, Esarey 2016). However, since the two models estimate different quantities of interest (Enns, Kelly, Masaki and Wohlfarth 2016), they are often discussed as two separate models.

[3] Specifically, Freeman (2016, 50) explains, “KLWs [Keele, Linn, and Webb] claim that unbalanced equa­tions are ‘nonsensical’ (16, fn. 4) and GLs [Grant and Lebo] recommendation to ‘set aside’ unbalanced equations (7) are a bit overdrawn. Banerjee et al. (1993) and others discuss the estimation of unbalanced equations. They simply stress the need to use particular nonstandard distributions in these cases.”

[4] Given the prominence of the authors as well as the Political Analysis journal, it is perhaps not surprising that practitioners have begun to adopt these recommendations.

[5] Enns et al. (2016) focused on how to correctly implement and interpret the GECM.

[6] Of course, equation balance is not the only relevant consideration. Researchers must check that their model satisfies other assumptions, such as no autocorrelation in the residuals and no omitted variables.

[7] See, e.g., Banerjee et al. (1993, 164-168), Maddala and Kim (1998, 251-252), and Pagan and Wickens (1989, 1002).

[8] For example, Grant and Lebo (p.7-8) write, “Additionally, any loss of equation balance makes a coin­tegration test dubious so, again, if the dependent variable is I(1), then the model should only include I(1) independent variables.”

[9] See Murray (1994) for a discussion of cointegration.

[10] This, in fact, is the first step of the Engle-Granger two-step method of testing for cointegration.

[11] Interestingly, existing simulations show that despite being unbalanced regressions, we will not find evidence that unrelated I(0) and I(1) series are (spuriously) related in a simple bivariate regression if the I(0) variable is AR(0) (see, e.g., Banerjee et al. (1993, 79), Granger, Hyung and Jeon (2001, 901), and Maddala and Kim (1998, 252)). Banerjee et al. explain that the only way in which OLS can make the regression consistent and minimize the sum of squares is to drive the coefficient to zero (p.80). Our own simulations confirm that when estimating unbalanced regressions with AR(1) and I(1) series, both serial correlation and inflated Type I error rates emerge.
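The contrast described in this footnote can be sketched with a small simulation (a hypothetical illustration, not the simulation code reported in the article): regress an unrelated stationary series on a random walk in a simple bivariate regression, with the stationary series either white noise (AR(0)) or AR(1). All parameter values are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(7)

def t_stat(y, x):
    """t-statistic on the slope from a bivariate OLS regression of y on x."""
    n = len(y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return b / math.sqrt(s2 / sxx)

def rejection_rate(rho, sims=1000, T=100):
    """Share of simulations where an unrelated I(1) regressor looks
    'significant' at the 5% level when y is an AR(rho) process."""
    hits = 0
    for _ in range(sims):
        x, level = [], 0.0
        for _ in range(T):                 # x: independent random walk, I(1)
            level += random.gauss(0, 1)
            x.append(level)
        y, prev = [], 0.0
        for _ in range(T):                 # y: unrelated stationary series
            prev = rho * prev + random.gauss(0, 1)
            y.append(prev)
        hits += abs(t_stat(y, x)) > 1.96
    return hits / sims

r_ar0 = rejection_rate(0.0)   # white-noise y: rejections near the nominal 5%
r_ar1 = rejection_rate(0.5)   # AR(1) y: serial correlation inflates rejections
print(r_ar0, r_ar1)
```

Consistent with the sources cited above, the AR(0) case should reject at roughly the nominal rate, while the AR(1) case produces an inflated Type I error rate.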

[12] Although fractionally integrated variables may also be of interest to researchers, this example focuses on stationary and integrated processes, which offer a clear illustration of the consequences of mixing orders of integration.

[13] Banerjee et al. (1993) wrote this in the context of a discussion of equation balance among cointegrated variables, but the point applies equally well in this context.

[14] The ADL is mathematically equivalent to the general error correction model (GECM), so the GECM would produce the same results, as long as the parameters are interpreted correctly (see Enns et al. 2016).

[15] The simulations reported in Table 1 also indicate that the ADL specification addresses the issue of serially correlated residuals, which would not be the case with an unbalanced regression. When $\theta_x=0$ and T=50, a Breusch-Godfrey test rejects the null of no serial correlation just 6.5% of the time. When T=1,000, we find evidence of serially correlated residuals in just 4.6% of the simulations. When $\theta_x=0.5$, the corresponding rates are 6.4% (T=50) and 4.6% (T=1,000).

[16] The fact that we do not observe evidence of an increased rate of spurious regression in Table 2, partic­ularly when Y is AR(1), implies that we do not have an equation balance problem. We also find that the simulations in Table 2 tend not to produce serially correlated residuals (we only reject the null of no serial correlation in 6.3% and 5.2% of simulations when T=50 and 4.9% and 3.3% of simulations when T=1,000).

[17] These data, which come from Volscho and Kelly, were originally compiled by Piketty and Saez (2003).

[18] These results are consistent with the unit root tests Volscho and Kelly report in the supplementary materials to their article. Grant and Lebo’s analysis also supports this conclusion. In their supplementary appendix, Grant and Lebo estimate the order of integration as d=0.93 with a standard error of 0.10, indicating they cannot reject the null hypothesis that d=1.0.

[19] Although the dependent variable is pre-tax income, Volscho and Kelly identify several mechanisms that could lead tax rates to influence pre-tax income share (see also Mertens 2015, Piketty, Saez and Stantcheva 2014). Based on existing research, it also would not be surprising if we observed evidence of a relationship between the top 1 percent’s income share and union strength (for recent examples, see Jacobs and Myers 2014, Pontusson 2013, Western and Rosenfeld 2011) and the partisan composition of government (Bartels 2008, Hibbs 1977, Kelly 2009).

[20] When using the error correction parameter in the GECM to evaluate cointegration, the correct Ericsson and MacKinnon (2002) critical values must be used. When doing so, we find evidence of cointegration for Volscho and Kelly’s (2012) preferred specification (Model 5).

[21] Recall that the ADL, ARDL, and GECM all refer to equivalent models.

[22] It is not clear why Grant and Lebo seemingly contradict their statement that “Mixing together series of various orders of integration will mean a model is misspecified” (p.4) in this context, especially since the ARDL is equivalent to the GECM, but they are correct to do so.

[23] For a concise overview of the ARDL approach, see http://davegiles.blogspot.ca/2013/06/ardl-models-part-ii-bounds-tests.html.

[24] We exactly follow their lag structure and the assumption of a single endogenous variable, which seemingly is incorrect but possibly intractable.

[25] The 5 percent critical value when T=60 with an unrestricted intercept and no trend is 3.823. Narayan (2005) only reports critical values for up to 7 regressors. However, the size of the critical value decreases as the number of regressors increases (Narayan 2005, Pesaran, Shin and Smith 2001), so our reliance on the critical value based on 7 regressors is actually a conservative test of cointegration. We also tested for cointegration allowing for short-run effects of all integrated variables and we again find evidence of cointegration (F = 4.32).

[26] This reveals that explicitly taking into account serial correlation, which Volscho and Kelly did, has modest consequences.

[27] As Grant and Lebo (2016, 18) explain, the three-step FECM proceeds as follows. First, Y is regressed on X and the residuals are obtained. The fractional difference parameter, d, is then estimated for each of the three series (Y, X, and the residuals). Grant and Lebo explain that if d for the residuals is less than d for both X and Y, then error correction is occurring. If this is the case, the researcher then fractionally differences Y, X, and the residuals, each by its own d value. Finally, the researcher regresses the fractionally differenced Y on the fractionally differenced X and the lag of the fractionally differenced residuals (Grant and Lebo 2016, 18). This regression produces the results reported in Column 3 of Table 4.
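The fractional-differencing step of this procedure can be illustrated with the standard binomial expansion of (1 − L)^d. This is a minimal sketch of that one step only; estimating d itself requires a separate estimator (e.g., a log-periodogram regression) and is not shown.

```python
def frac_diff(x, d):
    """Fractionally difference series x by (1 - L)^d using the expanding
    binomial weights w_0 = 1, w_k = w_{k-1} * (k - 1 - d) / k."""
    w = [1.0]
    for k in range(1, len(x)):
        w.append(w[-1] * (k - 1 - d) / k)
    # Each filtered value is a weighted sum of current and past observations.
    return [sum(w[j] * x[t - j] for j in range(t + 1)) for t in range(len(x))]

# With d = 1 this reduces to the ordinary first difference
# (the first value is just x[0], since no earlier observation exists):
out = frac_diff([1.0, 3.0, 6.0, 10.0], 1.0)
print(out)  # [1.0, 2.0, 3.0, 4.0]
```

In the three-step FECM, this filter would be applied to Y, X, and the residuals, each with its own estimated d, before the final regression.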

[28] Of course, other statistical assumptions must also be satisfied. In other work, we have also considered combined time series that contain both stationary and unit–root properties (Wlezien 2000). We find that when we analyze combined time series with mixed orders of integration, we are able to detect true relationships in the data. These results further highlight the fact that mixed orders of integration do not automatically imply an unbalanced regression.

Appendix 1: Alternate ARDL Model Specification



The Future of Academic Publishing is Now

[Editor’s note: this post is contributed by R. Michael Alvarez, Co-Editor of Political Analysis and Professor of Political Science at Caltech.]

Over the past six months, Political Analysis has made two important transitions: the move to Cambridge University Press and to Cambridge’s new publishing platform, Core.  We are excited about these changes. Working with The Press as a publishing partner expands our journal’s reach, while providing benefits to members of the Society for Political Methodology  (for example, SPM’s new website, https://www.cambridge.org/core/membership/spm).

Yet it’s the transition to Cambridge Core that is most exciting. This new publishing platform provides innovative synergies between our journal and other relevant journals, as well as books and Elements–Cambridge’s new digital enterprise combining the best of journals and books; all of which will benefit our readers, and more generally students and scholars in social science methodology. Some changes and features will be immediately recognizable to readers; others will appear later, as new features and content emerge on Cambridge Core, as the Press adds new functionality, and as researchers learn how to take advantage of the platform’s capabilities.

For example, in the past, most journals published a manuscript in one location, while ancillary materials—most importantly, supplementary and replication materials—were archived independently. Now, all content—the primary text, supplementary material, code, data—will be in one place, or easily accessible via links. Readers of Political Analysis will be able to toggle back and forth between the online published version of a manuscript and the ancillary materials. This accessibility will continue to evolve as we and the Cambridge team work to build direct connections among the manuscript, data, and code.

As mentioned, emerging synergies will enable manuscripts published in Political Analysis to be connected to related papers published on Cambridge Core, in political science, and across the social and data sciences. Readers will be able to build their own content within the Cambridge universe by connecting manuscripts topically, methodologically, or however they find useful.  As editors of Political Analysis, we’ll be able to make these connections as well, for example “virtual issues,” which could include curated content from the American Political Science Review or Political Science Research and Methods, among other journals.

And there are of course the tens of thousands of books on Cambridge Core, providing content for a new type of virtual issue for our readers, where we can combine journal and book content. For example, we will be able to publish virtual issues on topics that will include manuscripts from Political Analysis, chapters from book series like Analytical Methods for Social Research, and material from Elements.

I think this is exciting stuff. I hope everyone agrees. The future of academic publishing is now, and we are excited to play a part in its creation.


Will H. Moore, Scholar and Mentor

I can’t let the recent death of my friend and co-author Will H. Moore go by without remarking upon the impact that his scholarly work and mentorship had on the scientific community.

Will is probably best known as an expert on human rights, terrorism, and civil conflict. These topics are now in vogue in political science, but weren’t when Will started his career. It is in part because of his work that they are now recognized as important topics in the discipline.

But readers of The Political Methodologist may also know that Will made several contributions to the field of political methodology, although I’m not sure he considered himself “a methodologist.” I think his achievement is particularly notable because his interest in the field developed relatively late in his life. I think most of us struggle just to stay abreast of new developments and avoid becoming too out of date in our own narrow fields after we leave graduate school. Will grew well beyond his initial methods training, eventually co-authoring several papers that introduced new statistical models of special application to substantive problems in International Relations. He also co-authored a book, A Mathematics Course for Political and Social Research, that many graduate programs (including the one at Rice University, my current employer) use as a part of their methods training.

Will was a professor at Florida State University when I was a graduate student there. It was partly due to his mentorship that I decided to focus on political methodology. This was somewhat of a risky decision because FSU was not really known for producing methodologists at the time. Nevertheless, he and Bumba Mukherjee encouraged me to take a paper that I’d written for our MLE class and turn it into what eventually became my first publication in Political Analysis. Everything good in my life, at least as it pertains to my work, stems from that decision. I owe them both a lot.

It’s really sad to me that Will felt like a misfit because I know so many people that loved and respected him, including me. Will was one of the first people who suggested to me that I might be on the autism spectrum. I don’t know whether I am on that spectrum or not, but I do know how hard it is to say and do things that upset people without truly understanding why. Will did that a lot. I thought of him as someone who showed that you could succeed professionally and personally—you could make a real positive difference to science and in people’s lives—despite that. That was something I really needed to know at the time.

I remember that Will used to call FSU “the island of misfit toys” because (he felt) many of us landed there because of some issue or thing in our background that had kept us out of what he considered to be a more prestigious venue. I guess I’ve always thought the misfit toys were the only ones really worth knowing.

I’ll miss you, Will.


Call for conference presentation proposals and a special issue of the Journal of Defense Modeling and Simulation: Forecasting in the Social Sciences for National Security

[Editor’s note: The following announcement comes from Ryan Baird, a Political Scientist at the Joint Warfare Analysis Center.]

Call for Presentation Proposals:  Forecasting in the Social Sciences for National Security

In recent years, the subject of forecasting has steadily increased in importance as policy makers and academic researchers attempt to respond to changes in world events. Academic projects seeking to advance the state of the art in forecasting have a natural policy application, with the potential for improving the world by helping to identify and respond to problems and opportunities before they fully emerge. On July 26th (date may change slightly), National Defense University in Washington D.C. will host a small conference co-sponsored by U.S. Strategic Command to advance the state of the art in social science forecasting as applied to national security. The goal of this conference is to attract papers that engage in rigorous theoretical and empirical research on the application of forecasting to national security problem sets. These papers should focus on applications that support policymakers, military planners, or the warfighter on the ground through forecasting of national security–relevant events or decisions. Relevant examples include, but are definitely not limited to:

  • Interstate or intrastate conflict initiation or termination
  • The creation or abandonment of a WMD program
  • Extra-legal attempts to seize control of a state, re-alignment of formal or informal alliances
  • Methodological approaches may include, but are not limited to, large sample statistical analysis, laboratory experiments, field experiments, formal models and computer simulations.

Space is limited for this conference.  Airfare and lodging expenses will be covered for all accepted presenters.

Additionally, we are pleased to say that we have secured space in the Journal of Defense Modeling and Simulation for a special issue. All accepted conference presenters will have their papers considered for the special issue.

As discussants for the conference we are fortunate enough to have:

Dr. Scott de Marchi of Duke University whose work focuses on mathematical methods, especially computational social science, machine learning, and mixed methods. Substantively, he examines individual decision-making in contexts that include the American Congress and presidency, bargaining in legislatures, interstate conflict, and voting behavior. He has been an external fellow at the Santa Fe Institute and the National Defense University and is currently a principal investigator for NSF’s EITM program.

Dr. Jay Ulfelder is a political scientist whose research interests include democratization, political violence, social unrest, state collapse, and forecasting methods. He has served as research director for the Political Instability Task Force, a U.S. government-funded research program that develops statistical models to forecast various political events around the world. He has spent most of the past three years working with the U.S. Holocaust Memorial Museum’s Center for the Prevention of Genocide to develop the Early Warning Project, a public early-warning system for mass atrocities.

To propose a 12-minute presentation, to be followed by a paper submission for the special issue of JDMS, please email a title and abstract of no more than 300 words, along with a short statement about why you are interested in this conference, to Justin Duncliffe (justin.a.dunnicliff@ndu.edu) by June 2.

Key Dates:

June 2 – Deadline for presentation proposals

June 16 – Decisions on conference acceptance and travel awards

July 16 – Drafts of the accepted conference papers are due

July 26 – Conference takes place at National Defense University in Washington D.C. (final date may move slightly within this week)

Dec 31 – Final manuscript due for the special issue of the Journal of Defense Modeling and Simulation

Papers submitted should not be concurrently under review at another journal, or similar venue.

The guest editors for the special issue will be:

Dr. Scott de Marchi – Duke University

Dr. Jay Ulfelder – Consultant

Dr. Ryan Baird – U.S. Strategic Command

Please email Justin Duncliffe with any questions.


New Print Edition Released!

The newest print edition of The Political Methodologist is now available! Click this link to read now:

Volume 24, Number 1


2016 Year in Review (and the Most-Viewed Post!)

The Political Methodologist is still in a transitional period, with the search for a new editorial team (and possibly a new publication structure) still ongoing. But 2016 was a great year for new work in TPM, and that’s been reflected in our readership statistics.

In 2016, articles in The Political Methodologist were viewed 46,807 times by 34,324 unique visitors.


This is slightly less than our viewership for the 2015 year (52,000 views and 37,800 visitors) but still an excellent performance and reflective of the important role that TPM plays as an outlet for discussion of topical and practical issues of interest to the political methodology community.

Our special issue on peer review was a big part of the new content on TPM, and indeed our most viewed post for 2016 came from this special issue. With 3319 views in 2016 (and December of 2015, when the post was originally made), “An Editor’s Thoughts on the Peer Review Process” by Sara McLaughlin Mitchell is the most-viewed post on The Political Methodologist in 2016. Congratulations!

I also wish to acknowledge that “Making High-Resolution Graphics for Academic Publishing” by Thomas Leeper (originally posted in 2013) is still by far the most popular post on TPM, garnering 20,168 views in 2016 alone. There is no official award or recognition for this distinction, but it is pretty amazing.

On behalf of the (now prior) editorial team, thanks to everyone who contributed to The Political Methodologist under our editorship!


The .GOV Internet Archive: A Big Data Resource for Political Science

By Emily Kalah Gade, John Wilkerson, and Anne Washington

“Big data” will transform social science research. By big data, we primarily mean datasets that are so large that they cannot be analyzed using traditional data processing techniques. However, big data is further distinguished by diverse types of information and the rapid accumulation of that information.[1] We introduce one recently released big data resource, and discuss its promise along with potential pitfalls. For nearly 20 years, governments have used the web to share information and communicate with citizens and the world. .GOV is an archive of nearly two decades of content from .gov domains (US federal, state, local) organized into a database format that is nine times larger than the entire print content of the Library of Congress (90 terabytes, or 90,000 gigabytes).[2] Big data resources like .GOV pose novel analytic challenges in terms of how to access and analyze so much data. In addition to the difficulty posed by its size, big data is often messy. Additionally, .GOV is neither a complete nor a representative sample of government presence on the web across time.

The Internet Archive

In 1963, J.C.R. Licklider of the Advanced Research Projects Agency (ARPA) drafted a “Memorandum For Members and Affiliates of the Intergalactic Computer Network” (emphasis added). Subsequent discussions ultimately led to the creation of ARPANET in 1968. Soon after, major government departments and agencies were constructing their own “nets” (DOE and MFENet/HEPNet, NASA and SPAN). In 1989, Tim Berners-Lee proposed (among other things) using hypertext links to enable users to post and search for information on the internet, creating the World Wide Web. The first commercial contracts for managing network addresses were awarded in the early 1990s. In 1995, the internet was officially recognized by the Federal Networking Council, and Netscape Navigator, “the web browser for everyone,” went public.

In 1996, a non-profit organization, the Internet Archive (IA) assumed the ambitious task of documenting the public web. The current collection contains more than 450 billion web page “captures” (downloads of URL linked pages and metadata) dating back to 1995. The best way to quickly appreciate what’s in the IA holdings is to visit the WayBack Machine website, where specific historical website captures (e.g. the White House home page from Dec. 27, 1996) can be viewed.

.GOV: Government on the Internet

The Internet Archive also curates a sub-collection: .GOV.[4] .GOV contains approximately 1.1 billion page captures of URLs with a .gov suffix (from 1996 through Sept 30, 2013). At the federal level, this includes the official websites of elected officials, departments, agencies, consulates, embassies, USAID missions and much more.[5]

.GOV offers four types of data from each webpage capture: the link data (the page URL and every other URL/hyperlink found on the page); the parsed text of the page; the full content of the page (the text including HTML markup, images, video files, etc.); and the CDX index file that is used to access the page via the Wayback Machine.

Messy Data

There is no way to download the entire content of the internet, or even a representative sample. The IA (as well as major search firms such as Google) capture content by “crawling” from one page to another. Starting from a limited number of “seed” URLs (web page addresses), a “bot” (software program) collects content from all of the URLs found on the originating page, then all of the URLs on those pages (etc.) until it encounters no more unique pages, or a user-defined search constraint tells it to stop. This sequential process inevitably offers an incomplete snapshot of a constantly evolving World Wide Web. In 2008, the official Google blog reported that developers had collected 1 trillion unique URLs in a single concerted effort but also noted that “the number of pages out there is infinite.”[6]
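The crawl logic just described is essentially a breadth-first traversal of the web's link graph. A toy sketch (the link graph below is invented for illustration; a real crawler would fetch pages over the network and extract their links):

```python
from collections import deque

# Toy link graph standing in for the web: each URL maps to the URLs found on it.
LINKS = {
    "https://seed.gov/": ["https://seed.gov/a", "https://seed.gov/b"],
    "https://seed.gov/a": ["https://seed.gov/b", "https://seed.gov/c"],
    "https://seed.gov/b": ["https://seed.gov/a"],
    "https://seed.gov/c": [],
}

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: capture each unique URL once, following links
    until no new pages remain or the page budget (a user-defined search
    constraint) is exhausted."""
    seen, queue, captured = set(seeds), deque(seeds), []
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        captured.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:     # never queue a page twice
                seen.add(link)
                queue.append(link)
    return captured

pages = crawl(["https://seed.gov/"])
print(pages)
```

Note how the result depends entirely on the seeds and the link structure: any page not reachable from the seed list is simply never captured, which is one reason archive crawls are incomplete.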

Crawl results are also incomplete because web pages are sometimes located behind firewalls (the “dark web”), or include scripts that discourage bots from collecting content. The Internet Archive will also delete a website at the owner’s request. We have also discovered other limitations of the .GOV data that users should be aware of in designing projects.[7]

The quality of the Internet Archive also improves over time, both because of changes in the way the Web is used and because of changes in the way the Internet Archive conducted its crawls. Figure 1 displays how often the White House website was captured across four different years starting in 1997.[8] The Wayback Machine indicates that whitehouse.gov was crawled just 3 times in 1997. In 2001, it was not crawled at all in the month of August and then hundreds of times in the three months following the terrorist attacks on September 11. In 2007, it was captured much more often – at least once a week. And in 2014, whitehouse.gov was captured at least once a day.


Figure 1: Frequency of whitehouse.gov crawls (selected years)

The most complete .GOV crawls occurred during three-month time periods (Nov–Jan) of election years starting in 2004.[9] Using congressional website URLs as the seeds, the IA captured more government web presence than before. Figure 2 indicates spikes in unique .GOV URLs captured during election years. For example, the number triples from about 500 million to 1.5 billion between 2003 and 2004.


Figure 2: Total .GOV Unique URLs

Although .GOV is less than ideal as a data resource from a conventional social science perspective, there is no other option for investigating two decades of White House website content, or the content of millions of other pages of government website content. Importantly, because these crawls contain snapshots of each page, researchers could hypothetically examine language that agencies or individuals chose to remove – something scraping those same pages now could not provide. The challenge researchers face is finding the hidden gems in a resource that cannot be easily explored.

Big Data and Distributed Computing

.GOV is an excellent platform for learning about “big data” analysis techniques. The basic challenge is that the dataset is too large (90,000 gigabytes) to download and explore. Big data is stored and managed differently. Traditional databases (aka “structured data”) are organized into neat rows and columns.[10] Big data projects rely on more flexible data storage processes where portions of the data are distributed across a cluster of computers. Each computer in the cluster is a node, and portions of the data are stored in “buckets” or “bins” within each node (see Figure 3). To access the data, researchers use special software to send simultaneous requests to the different nodes. The piecemeal results of these multiple queries are then recombined into a much smaller, single working dataset.
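The query-and-recombine pattern just described can be sketched in ordinary Python (a toy stand-in only: the real cluster uses Hadoop, and the node contents here are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for data distributed across a cluster: each "node" holds one
# bucket of records (hypothetical capture counts keyed by domain).
NODES = {
    "node-1": {"whitehouse.gov": 3, "epa.gov": 5},
    "node-2": {"whitehouse.gov": 7, "nsf.gov": 2},
    "node-3": {"noaa.gov": 4, "epa.gov": 1},
}

def query_node(node_id, domain):
    # Simulate one node answering the same query over its own bucket.
    return NODES[node_id].get(domain, 0)

def distributed_count(domain):
    # Send the query to every node simultaneously, then recombine the
    # piecemeal results into a single, much smaller answer.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda n: query_node(n, domain), NODES)
        return sum(partials)

print(distributed_count("whitehouse.gov"))
```

The same scatter-gather idea underlies the Hadoop queries described below, just at vastly larger scale.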



Figure 3: Hadoop System for .GOV

Querying .GOV

The .GOV database is currently hosted on a Hadoop computing cluster operated by a commercial datacloud service, Altiscale (www.altiscale.com). Within the cluster housing .GOV, the data are distributed across nine separate “buckets.” Each bucket contains thousands of large (100mb) WARC (Web Archive Container) files (or “ARC” files for earlier records). Each of these WARC files then contains thousands of individual webpage capture records. As mentioned, each capture record includes the parsed text, the URLs found on the page, and the full content of the capture (including images and video files), along with the CDX index file. The CDX file includes useful metadata about specific records that can be used to find and exclude particular records, such as the URL, timestamp, Content Digest, MIME type, HTTP Status Code, and the WARC file where the record is located.

The data are accessed using Apache software programs. Apache Pig and Hive are SQL-based languages that can be used for basic data processing, such as joining or merging files, searching for specific URLs, and more generally retrieving data of interest. Many Apache commands will be familiar to users with a working knowledge of SQL, R, or Python. To search all of the capture records in the .GOV database, one must write a query to search thousands of WARC files across each of the nine buckets.

Obtaining a key and creating a workbench

Here we describe the big picture process of querying .GOV. In the next section, we present some preliminary findings using the parsed text data. The specific annotated scripts used to accomplish the latter can be found in the online Appendix.

Users must first gain access to the Altiscale computing cluster by requesting an “ssh” key (detailed on Altiscale’s website – you have to email accounts@altiscale.com to request a key). Each key owner is granted a local workbench (an Apache Work Station (AWS)) on the cluster that is similar to the “desktop” of a personal computer and contains the Apache software programs needed to query the database. About 20 GB of storage is also provided (the .GOV database is about 4,500 times larger).

Writing scripts to extract information

  1. Specifying what is to be collected

Apache Hive and Pig are used to execute SQL-like queries. This can be done on the command line directly (there is no GUI option), but it is easier to write and store scripts on the workbench, and then write a command to execute them across the buckets of interest. For example, one can write an Apache Pig script that requests each parsed text file from a specified URL (e.g., whitehouse.gov), separates the parsed text fields (“date,” “URL,” “content,” “title,” etc.), searches each field of each record for a keyword or regular expression, and then counts how many times a match occurs. The full parsed text could also be downloaded in order to explore the content in more detail later. But with so much data (1.1 billion pages), such a collection can quickly become too large to export.

The functionality of Pig and Hive (like SQL) is limited. For example, Pig will return and count the webpage captures that contain a keyword (true/false) for a date range (e.g., per month/year), but it can’t compute the frequency of keyword mentions. To do more detailed or custom analysis, researchers can write user defined functions (UDFs) in Python. The Python script is stored on the workbench as a .py file and then called by the Pig script.[11]
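A minimal sketch of such a UDF in plain Python (the keyword patterns here are hypothetical; on the cluster the function would also carry Pig’s @outputSchema decorator, as in the Appendix scripts):

```python
import re
from collections import defaultdict

# Hypothetical keyword patterns; the real lists appear in the online appendix.
KEYWORDS = [r"climate\s+change", r"global\s+warming"]

def count_keywords(content):
    # Return (term, frequency) pairs plus the page's total word count --
    # the frequency information that Pig alone cannot compute.
    counts = defaultdict(int)
    if not content:
        return [("total", 0)]
    counts["total"] = len(content.split())
    for pattern in KEYWORDS:
        hits = len(re.findall(pattern, content, re.IGNORECASE))
        if hits > 0:
            counts[pattern] = hits
    return sorted(counts.items())

print(count_keywords("Climate change and more climate  change."))
```

The Pig script then calls the function on each capture’s content field and aggregates the returned counts.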

Processing time is a major consideration. Even simple jobs such as keyword counts can take hours or even days to run over so much data. More computationally intensive methods, such as topic models, may be impractical. The best way to discover whether a script is going to work and how long it will take is to test it on a subset of the data, such as on just one WARC file in one of the buckets. Running a complete job without testing it is likely to lead to many hours or days of waiting only to discover that it did not work. Linux “Screen” (already installed on the cluster) can then be used to run the script across the cluster remotely (so that your own computer can be used for other things).[12]

  2. Providing instructions about where to search on the cluster

The CDX files provide guidance that makes it possible to limit queries to particular URLs, date ranges, WARC files etc.[13] For example, the first query might identify and produce a list of all of the URLs that contain the keyword. The next query would focus on extracting the relevant information (such as keyword and total word counts) from that more limited set of URLs.
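The two-stage logic can be illustrated in Python over simplified CDX-style records (the field layout is abridged and the sample rows are invented for illustration):

```python
# Each CDX row carries metadata such as URL, timestamp, MIME type, and status.
records = [
    {"url": "whitehouse.gov/blog", "timestamp": "20060315", "mime": "text/html", "status": "200"},
    {"url": "whitehouse.gov/img/seal.png", "timestamp": "20060315", "mime": "image/png", "status": "200"},
    {"url": "epa.gov/climate", "timestamp": "20010401", "mime": "text/html", "status": "404"},
]

def select(records, url_substring, start, end):
    # Keep only successful HTML captures of a URL within a date range,
    # so the expensive full-text query runs over far fewer files.
    return [
        r for r in records
        if url_substring in r["url"]
        and start <= r["timestamp"][:8] <= end
        and r["mime"] == "text/html"
        and r["status"] == "200"
    ]

hits = select(records, "whitehouse.gov", "20060101", "20061231")
print([r["url"] for r in hits])
```

On the cluster the same filtering is expressed as a Hive query against the CDX tables rather than Python over dictionaries.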

  3. Concatenating and exporting the results

Query results for each WARC or ARC file (containing thousands of captures) are stored separately on the cluster. Additional scripts must be written to concatenate them. Whether the combined results are small enough to export can also be determined at this point.[14]
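A local sketch of such a concatenation step (in Python; the paths and file contents are invented — on the cluster itself the same job is done with additional scripts or a command such as `hadoop fs -getmerge`):

```python
import glob
import os
import tempfile

def concatenate(result_dir, out_path):
    # Append every per-WARC part file produced by a query into one file.
    with open(out_path, "w") as out:
        for part in sorted(glob.glob(os.path.join(result_dir, "part-*"))):
            with open(part) as f:
                out.write(f.read())

# Demonstration with temporary files standing in for query output:
d = tempfile.mkdtemp()
for i, text in enumerate(["a,1\n", "b,2\n"]):
    with open(os.path.join(d, "part-%05d" % i), "w") as f:
        f.write(text)
combined = os.path.join(d, "combined.txt")
concatenate(d, combined)
print(open(combined).read())
```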

Application: Government Attention to the Financial Crisis, Terrorism, and Climate Change

As a starting point to discovering what’s in .GOV, we investigate keyword frequencies for three recent issues in American politics. We hope to observe patterns consistent with what is generally known about the issues, and perhaps more novel patterns that begin to illustrate the potential of this new data source.

Collecting the data

We first created a limited list of top level URLs (departments, agencies and political institutions) relevant to the three issues. For example, the regular expression “.house.gov” theoretically captures every web page of every branch of the official website of the U.S. House of Representatives that the IA collected. This includes, among other things, every Representative’s official website (e.g., pelosi.house.gov) and every House committee’s website (e.g., agriculture.house.gov). Aggregating results during the collection process in this way means that we cannot pull out just the results for a particular member or committee’s website. That would require a different query using more specific URLs.
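One subtlety worth noting: in a regular expression, an unescaped dot matches any character, so the pattern “.house.gov” is broader than it looks. A small Python illustration (the URLs are invented):

```python
import re

# Hypothetical captured URLs:
urls = [
    "pelosi.house.gov/press-release",
    "agriculture.house.gov/hearings",
    "whitehouse.gov/briefing-room",
]

# An unescaped dot matches any character, so ".house.gov" also matches
# "whitehouse.gov" (the "e" before "house" satisfies the dot). Escaping
# the dots restricts matches to actual House subdomains.
loose = re.compile(r".house.gov")
strict = re.compile(r"\.house\.gov")

print([u for u in urls if loose.search(u)])   # all three URLs
print([u for u in urls if strict.search(u)])  # House sites only
```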

We then counted keyword mentions on every subpage of that root URL. We first developed broad lists of keywords related to the three issues. After obtaining results, we created more refined lists by dropping terms that seemed problematic or were used less often (see online appendix). For example, we dropped “security” from the terrorism keyword list because it was too general (e.g., financial security). Running the query over all WARC files took about five days of processing time. Altogether, the results reported below are based on 8.3 billion keyword hits generated by searching about 600 billion words found on the parsed text pages of the specified URLs.

Focusing on raw counts of keywords gives more weight to larger domains. Any changes in attention to terrorism at the much larger State Department will swamp changes at the Bureau of Alcohol, Tobacco and Firearms (ATF). We focus on the proportion of attention given to the issue within an agency or political institution, by dividing the number of keyword hits by total website words. A proportion-based approach also does a better job of controlling for the expanding size of government web presence.
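The measure for each organization is then simply keyword hits divided by total website words (the counts below are illustrative, not actual results):

```python
def attention_share(keyword_hits, total_words):
    # Proportion of a site's words devoted to the issue's keywords.
    return keyword_hits / float(total_words) if total_words else 0.0

# Hypothetical counts for a large and a small agency:
state = attention_share(50000, 10000000)  # many hits, enormous site
atf = attention_share(500, 50000)         # fewer hits, small site
print(state, atf)  # the smaller agency devotes a larger share of attention
```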

Overall Trends

One way to begin to assess the validity of using website content to study political attention is to ask whether changes in content correlate with known events. For each issue in Figure 4, we identified the URLs (federal government organizations) thought to play a role on the issue (see Appendix II for these lists and URLs). The graphs then report average proportions of attention across these URLs.[15] Attention to terrorism increases after 9/11/2001, but government-wide attention to terrorism increases most dramatically from 2005 to 2006. Institutionalization is almost certainly part of the explanation. We are capturing attention to terrorism (relative to other issues) on the websites of government agencies. The Department of Homeland Security was not created until 2003 and one of the purposes of its creation was to re-orient the missions of existing agencies (such as FEMA) towards preventing and responding to terrorism. In addition, while 9/11 was an important focusing event for the US, terrorism worldwide continued to increase post 9/11. As we might expect, there is no evidence of equivalent shocks for climate change.


Figure 4: Issue Attention Across .GOV

Diffusion of Attention to Terrorism

Political scientists have long been interested in how “focusing events” impact political attention (Birkland 1998). Many studies have examined the impact of 9/11 on the organization and activities of specific government agencies and departments. Here, we ask how attention to terrorism spread across government departments and agencies. Entropy is a measure of disorder that is frequently used to study the dispersion of political attention (Boydstun et al. 2014). Figure 5 confirms that attention to terrorism in the federal government became more dispersed post 9/11.[16]
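A sketch of the entropy calculation (Shannon entropy over agencies’ shares of attention; this illustrates the idea rather than reproducing Boydstun et al.’s exact measure, and the shares are invented):

```python
import math

def entropy(shares):
    # Shannon entropy of the distribution of an issue's attention across
    # agencies; higher values mean attention is more evenly dispersed.
    return -sum(p * math.log(p) for p in shares if p > 0)

concentrated = entropy([0.97, 0.01, 0.01, 0.01])  # one agency dominates
dispersed = entropy([0.25, 0.25, 0.25, 0.25])     # attention spread evenly
print(concentrated < dispersed)
```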


Figure 5: Diffusion in Attention Across .GOV

A Financial “Bubble”?

One of the questions raised in congressional hearings after the 2007-08 financial crisis was whether it could have been anticipated and averted. A related question is whether government agencies saw it coming. As an historical archive, .GOV may provide some clues. Here we simply examine “bubble” mentions across organizations (as a proportion of total website words). Figure 6 indicates that references to bubbles spike in the elected branches after the meltdown, whereas bubble mentions at the four agencies most responsible for the economy increase 2-3 years ahead of the crisis. Bubbles also see increased attention at the Federal Reserve before the stock market sell-off in 2001.


Figure 6: Attention to Financial Crisis Across .GOV

Framing Climate Change

Early in President G.W. Bush’s first term of office, pollster Frank Luntz advised Republicans to talk about “climate change” rather than “global warming” because focus groups saw the latter as more of a threat (Leiserowitz et al. 2014, 7). Subsequent academic research also found that the public is somewhat more likely to support action to address global warming. However, it seems as though conservatives also spend much of their time ridiculing global warming. Recently Senator James Inhofe (R-OK) brought a snowball to the Senate floor to question scientists’ claims that 2014 was one of the warmest years on record (Bump 2015). If conservatives have discredited global warming in the eyes of the public, then proponents of climate action may have less incentive to use that frame.

In Figure 7, values above .50 indicate that climate change mentions are more common than global warming mentions. “Agencies” refers to the average emphasis on climate change for four agencies with central roles (the EPA, NSF, NOAA, and NASA). According to Figure 7, scientific agencies have always emphasized climate change over global warming, with climate change increasingly favored in recent years. For the elected branches, the patterns are more variable and seem to support the notion that conservatives control the global warming frame. In Congress, global warming has been a more popular frame during periods of Republican control (2001-2008; 2011-2013) and has been used more often over time. The patterns for the White House do not support what Luntz advised. Global warming receives more attention than climate change for most of the years of the Bush administration. The Obama administration, in contrast, has gone all in for climate change. Although preliminary, these results do suggest that conservatives have defanged what was once the most effective frame for winning public support for climate action.
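The plotted quantity is the share of the two frames’ mentions going to “climate change” (the counts below are illustrative only):

```python
def climate_change_share(cc_mentions, gw_mentions):
    # Fraction of framing mentions that use "climate change"; values above
    # 0.50 mean climate change is the more common frame.
    total = cc_mentions + gw_mentions
    return cc_mentions / float(total) if total else 0.5

print(climate_change_share(300, 100))  # 0.75
```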


Figure 7: Attention To Climate Change Across .GOV


Accessing this big data resource requires new skills and a new mindset. In terms of skills, we hope that our description of the process and working scripts lower the bar. In terms of mindset, political scientists working with statistical methods are used to immediate results. Exploring .GOV in this way is not an option (the Wayback Machine is probably the best way to get a sense of what’s in .GOV), and it can take days or even weeks to run a query. On the other hand, .GOV contains insights available nowhere else. Although the current database has important limitations, the Internet Archive recently embarked on a collaboration with many partners to scrape all federal government agencies as completely as possible prior to the end of the Obama administration. If this effort is successful and if similar efforts follow in subsequent years, .GOV will be an even more valuable resource for investigating a wide range of questions about the federal bureaucracy and federal programs.


“Bash Shell Basic Commands.” GNU Software.

Boydstun, A. E., Bevan, S. and Thomas, H. F. (2014), “The Importance of Attention Diversity and How to Measure It.” Policy Studies Journal, 42: 173–196. doi: 10.1111/psj.12055

Bump, P. “Jim Inhofe’s Snowball Has Disproven Climate Change Once and for All.” The Washington Post, 26 Feb. 2015. Web. 28 June 2016.

Birkland, T. A. (1998). “Focusing events, mobilization, and agenda setting.” Journal of Public Policy. 18(01), 53-74.

Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). “An adaptive model for optimizing performance of an incremental web crawler”. Tenth Conference on World Wide Web (Hong Kong: Elsevier Science): 106–113.

“The History of the Internet.” The Internet Society.  

“The Internet Archive.” Internet Archive. https://archive.org/

Kahn, R. (1972). “Communications Principles for Operating Systems.” Internal BBN memorandum.

Leiner et al. “Brief History of the Internet.” 

Leiserowitz, A., et al. (2014). “What’s in a Name? Global Warming vs. Climate Change.” Yale Project on Climate Change Communication, May 2014. Web. 28 June 2016.

Licklider, J. C. R. (1963). “Memorandum for Members and Affiliates of the Intergalactic Computer Network.” Washington, DC.

Najork, M and J. L. Wiener. (2001). “Breadth-first crawling yields high-quality pages.” Tenth Conference on World Wide Web, (Hong Kong: Elsevier Science): 114–118.

“Pig Manual.” Apache Systems.

“The Rise of 3G.” THE WORLD IN 2010. International Telecommunication Union (ITU)

Sagiroglu, S., & Sinanc, D. (2013, May). “Big data: A review.” In Collaboration Technologies and Systems (CTS), 2013 International Conference (pp. 42-47). IEEE. 

“SSH Key (Secure Shell).” (2006).

Vance, A. (2009). “Hadoop, a Free Software Program, Finds Uses Beyond Search.” The New York Times.

Appendix I

The following script flags all web pages that include one or more mentions of the term “climate change” and stores the full text of those captures. We begin with an overview of the process of running jobs on the cluster, and then provide specific code. For questions, please contact the authors.


Running scripts on the cluster requires a basic understanding of bash (Unix) shell commands using the Command Line on a home computer (on a Mac, this is the program “Terminal”). For a basic rundown of bash commands, see here.

Begin by opening a bash shell on a home desktop, and using an ssh key obtained from Altiscale to login. Once logged in, you will be on your personal workbench and now have to use a script editor (such as Vi). Come up with a name for the script, open the editor, and then either paste or write the desired script in the editor, close and save the file (to your personal workbench on the cluster).

Scripts must be written in Hadoop-accessible languages, such as Apache Pig, Hive, Giraph or Oozie. Apache languages are SQL-like, which means if you have experience with SQL, MySQL, SQLite or PostgreSQL (or R or Python), the jump should not be too big. For text processing, Apache Pig is most appropriate, whereas for link analysis, Hive is best. The script below is written in Apache Pig, and a manual can be found at https://pig.apache.org/. For an example of some scripts written for this cluster, see here. It may be easiest to “clone” the “archive analysis” repository hosted on GitHub by Vinay Goel, or the three basic scripts from Emily Gade’s .govDataAnalysis repository, and use those as a starting point. If you don’t know how to use GitHub, see here.

Because Apache languages have limited functionality, users may want to write user defined functions in a program like Python. A tutorial about how to do this can be found here.

Once a script is written, you will want to run it on a segment of the cluster. This requires another set of Unix style Hadoop shell commands. Users must then specify the file path(s), the desired output directory, and where the script can be found.

Getting a Key

As discussed above, this script is run from your workbench on the cluster. To gain access, you will need to set up an SSH “key” with Altiscale. Once you have obtained and sent your SSH key to Altiscale, you can log in using any bash shell from your desktop with the command “ssh altiscale”.

Locating the Data

The Altiscale cluster houses 9 “buckets” of .GOV data. Each bucket contains hundreds or thousands of Web Archive Files (older versions are “ARC” files and newer versions are “WARC” files, but they all have the same fields). Each WARC/ARC file contains captures from the same crawl, but (a) it won’t contain all of the captures from a given crawl, and (b) because the crawl does many things simultaneously, captures of a single site can be located in different WARC files.

With so much data, there is no simple “table” or directory that can be consulted to locate a specific web page. The best way to find specific pages is to use Hive to query the CDX database. See Vinay Goel’s GitHub for details about how to query CDX. If a user knows exactly what he or she wants (all the captures of the whitehouse.gov main page, or all the captures from September 11, 2001), the CDX can tell you where to find them. Otherwise, users will want to query all of the buckets because there is no easy way to learn where results are stored. (Though we advise first testing scripts on a single bucket or WARC file.)

First, use the command-line SSH interface to query the data directories and see which buckets or files to run a job over. This requires Hadoop syntax to “talk” to the cluster where all the data is stored. The cluster has a user-specific directory where users can store the results of scrapes; a user’s local workbench does not have enough space to save them.

Whenever users “talk” from their local workbench to the main cluster, they need to use `hadoop fs -` followed by the bash shell command of interest. For a list of Hadoop-friendly bash shell commands, see here.

For example, the line of code

hadoop fs -ls

pulls a listing of the files in your personal saved portion of the cluster (in addition to the local workbench, each user has a file directory to save the results). As well,

hadoop fs -ls /dataset-derived/gov/parsed/arcs/bucket-2/

would draw up all the files in Bucket #2 of the parsed text ARCS directory.

Defining Search Terms

Scripts that deal with text are best written in Apache Pig. Hadoop also supports Apache Hive, Giraph and Spark. To find and collect terms or URLs of interest, users will need to write a script. For example, users might write a script to flag any captures that mention a global warming term, and return the date of the capture, URL, page title, checksum, and the parsed text. This script is saved on your local workbench and needs to have a .pig suffix. Users will need to use a shell editor such as vi to write and store the script (details about how to use vi can be found above). The script is below. The first four lines set defaults, including the memory allocation.

Script begins:

SET default_parallel 100;
SET mapreduce.map.memory.mb 8192; 
SET mapred.max.map.failures.percent 10;
REGISTER lib/ia-porky-jar-with-dependencies.jar;
DEFINE FROMJSON org.archive.porky.FromJSON();
DEFINE SequenceFileLoader org.archive.porky.SequenceFileLoader();
DEFINE SURTURL org.archive.porky.SurtUrlKey();

The sequence file loader pulls the files out of the ARC/WARC format and makes them readable. Note that when the files were put into the ARC/WARC format, they were run through an HTML parser to remove the HTML boilerplate. However, if a file was not HTML to begin with, the parser will just produce symbols, and the loader will not fix that. Users will have to deal with those records separately.

When loading data on the command line (instructions below), give the data a name (here $I_PARSED_DATA) and make sure to use the same “name” for the data in the command-line command. This is a stand-in for the name of the directory or file over which you will run a script.

Archive = LOAD '$I_PARSED_DATA' USING SequenceFileLoader()
AS (key:chararray, value:chararray);
Archive = FOREACH Archive GENERATE FROMJSON(value) AS m:[];
Archive = FILTER Archive BY m#'errorMessage' is null;
ExtractedCounts = FOREACH Archive GENERATE m#'url' AS src:chararray,
   SURTURL(m#'url') AS surt:chararray,
   REPLACE(m#'digest','sha1:','') AS checksum:chararray,
   SUBSTRING(m#'date', 0, 8) AS date:chararray,
   REPLACE(m#'code', '[^\\p{Graph}]', ' ') AS code:chararray,
   REPLACE(m#'title', '[^\\p{Graph}]', ' ') AS title:chararray,
   REPLACE(m#'description', '[^\\p{Graph}]', ' ') AS description:chararray,
   REPLACE(m#'content', '[^\\p{Graph}]', ' ') AS content:chararray;

The above code block says: for each key and value pair, pull out the following fields. Chararray means character array – a list of characters with no limits on what sort of content may be included in that field. The SUBSTRING line selects the first eight characters of the date string (year, month, day); the full format is year, month, day, hour, minute, second. Unicode errors can wreak havoc on scripts and outputs. The regular expression \p{Graph} means “all printed characters” – i.e., NOT newlines, carriage returns, etc. So this query finds anything that is not text, punctuation, or whitespace, and replaces it with a space. Also note that because Pig is implemented in Java, users need two escape characters in these scripts (whereas only one is needed in Python).
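Python’s re module does not support \p{Graph}, but the same cleaning step can be approximated with an explicit character class covering printable ASCII plus ordinary whitespace (an approximation of, not an exact match to, the Java behavior):

```python
import re

def strip_nonprintable(text):
    # Replace anything that is not printable ASCII or ordinary whitespace
    # with a space, approximating the '[^\p{Graph}]' cleanup in the Pig
    # script above.
    return re.sub(r"[^\x20-\x7e\t\n\r]", " ", text)

print(strip_nonprintable("budget\x00 figures"))  # 'budget  figures'
```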

UniqueCaptures = FILTER ExtractedCounts BY content MATCHES
'.*natural\\s+disaster.*' OR content MATCHES '.*desertification.*'
OR content MATCHES '.*climate\\s+change.*' OR content MATCHES
'.*pollution.*' OR content MATCHES '.*food\\s+security.*';

This keeps only the pages that contain the keywords of interest (in this case, words related to climate change).

STORE UniqueCaptures INTO '$O_DATA_DIR' USING PigStorage('\u0001');

This stores the results under the file name given to it.

The “using PigStorage” function allows users to set their own delimiters. I chose a Unicode delimiter because commas/tabs show up in the existing text. And since I stripped out all non-printable Unicode above, this delimiter clearly marks a new field. Save this script to your local workbench.
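Results stored with the '\u0001' delimiter can later be read back like any delimited file; a minimal sketch (the sample line is invented, with field order following the script above):

```python
import csv
import io

# One line of output as the script above would store it ('\x01' between
# fields; the field values here are invented for illustration).
raw = "whitehouse.gov/blog\x0120060315\x01climate change\x013\n"

reader = csv.reader(io.StringIO(raw), delimiter="\x01")
rows = list(reader)
print(rows)
```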

Another option would be to count all the mentions of specific terms. Instead of the above, users would run:

SET default_parallel 100;
SET mapreduce.map.memory.mb 8192;
SET mapred.max.map.failures.percent 10;
REGISTER lib/ia-porky-jar-with-dependencies.jar;

This line allows you to load user-defined functions from a Python file:

REGISTER 'UDFs.py' USING jython AS myfuncs;
DEFINE FROMJSON org.archive.porky.FromJSON();
DEFINE SequenceFileLoader org.archive.porky.SequenceFileLoader();
DEFINE SURTURL org.archive.porky.SurtUrlKey();
Archive = LOAD '$I_PARSED_DATA' USING SequenceFileLoader()
AS (key:chararray, value:chararray);
Archive = FOREACH Archive GENERATE FROMJSON(value) AS m:[];
Archive = FILTER Archive BY m#'errorMessage' is null;
ExtractedCounts = FOREACH Archive GENERATE m#'url' AS src:chararray,
   SURTURL(m#'url') AS surt:chararray,
   REPLACE(m#'digest','sha1:','') AS checksum:chararray,
   SUBSTRING(m#'date', 0, 8) AS date:chararray,
   REPLACE(m#'code', '[^\\p{Graph}]', ' ') AS code:chararray,
   REPLACE(m#'title', '[^\\p{Graph}]', ' ') AS title:chararray,
   REPLACE(m#'description', '[^\\p{Graph}]', ' ') AS description:chararray,
   REPLACE(m#'content', '[^\\p{Graph}]', ' ') AS content:chararray;

If a user has a function that selects certain URLs of interest and groups all other URLs as “other,” they would run it only on the URL field. And if a user has a function that collects words of interest and counts them, as well as total words, they should run it on the content field. Code for using those UDFs would look something like this:

UniqueCaptures = FOREACH ExtractedCounts GENERATE myfuncs.pickURLs(src) AS URLs,
   src AS src,
   surt AS surt,
   checksum AS checksum,
   date AS date,
   myfuncs.Threat_countWords(content) AS counts;

In Pig, the default delimiter is '\n' (newline), but many '\n' characters appear in the text, so one must get rid of all the new lines in the text. This will affect our ability to do text parsing by paragraph, but sentence-level parsing will still be possible. Code to get rid of the '\n' (newline) delimiters, which cause problems when reading results into tables, might look something like this:

UniqueCaptures = FOREACH UniqueCaptures GENERATE REPLACE(content, '\n', ' ');

To get TOTAL number of counts of web pages, rather than simply unique observations, merge with checksum data:

Checksum = LOAD '$I_CHECKSUM_DATA' USING PigStorage() AS (surt:chararray, 
date:chararray, checksum:chararray);
CountsJoinChecksum = JOIN UniqueCaptures BY (surt, checksum), 
Checksum BY (surt, checksum);
FullCounts = FOREACH CountsJoinChecksum GENERATE
   UniqueCaptures::src as src,
   Checksum::date as date,
   UniqueCaptures::counts as counts,
   UniqueCaptures::URLs as URLs;

This groups the counts by the original “source” or URL:

GroupedCounts = GROUP FullCounts BY src;

This fills in the missing counts and stores results:

GroupedCounts = FOREACH GroupedCounts GENERATE
   group AS src,
   FLATTEN(myfuncs.fillInCounts(FullCounts)) AS (year:int,
   month:int, word:chararray, count:int, filled:int,
   afterlast:int, URLs:chararray);
STORE GroupedCounts INTO '$O_DATA_DIR';

The UDFs mentioned here (pickURLs, Threat_countWords, and fillInCounts) are written in Python and can be seen at the bottom of this Appendix.

Running the Script

To run this script, type the following code into the command line after logging in to the Altiscale cluster with your ssh key. Users will select the file or bucket they want to run the script over and type in an “output” directory (this will appear in your home/saved data on the cluster, not on your local workbench). Finally, users need to tell Hadoop which script they want to run. I_PARSED_DATA was defined in the script above as the location of the data to run over; here we are telling the computer that this bucket is the I_PARSED_DATA. Next, one must load the CHECKSUM data, and finally, give the output directory and the location of your script.

The following should be run all as one line:

pig -p I_PARSED_DATA=/dataset-derived/gov/parsed/arcs/bucket-2/ -p I_CHECKSUM_DATA=/dataset/gov/url-ts-checksum/ -p O_DATA_DIR=place_where_you_want_the_file_to_end_up location_of_your_script/scriptname.pig

Exporting Results

Lastly, to retrieve results from the cluster, users need to open a new Unix shell on their local machine that is NOT logged in to the cluster with their ssh key. Then type the location of the file to copy, followed by the path where it should go on the desktop. For example:

The following should be run all as one line:

scp -r altiscale:~/results_location/ location_on_your_computer_you_want_to_move_results_to/

Python UDFs

#import packages
from collections import defaultdict
import sys
import re

#define output schema so the UDF can talk to Pig
@outputSchema("url:chararray")
# define Function
def pickURLs(url):
     # these can be any regular expressions
     keyURLs = [
         # URL regular expressions elided; see Appendix II
     ]
     for i in range(len(keyURLs)):
         tmp = len(re.findall(keyURLs[i], url, re.IGNORECASE))
         if tmp > 0:
             return keyURLs[i]
     return 'other'

# counting words
#define output schema as a "bag" with the word and then the count of the word
@outputSchema("counts:bag{tuple(word:chararray, count:int)}")
def Threat_countWords(content):
     # these can be any regular expressions
     Threat_Words = [
         # threat keyword regular expressions elided; see the online appendix
     ]
     #if you want a total for each URL or page, include a total count
     threat_counts = defaultdict(int)
     threat_counts['total'] = 0
     if not content or not isinstance(content, unicode):
         return [('total', 0)]
     threat_counts['total'] = len(content.split())
     for i in range(len(Threat_Words)):
         tmp = len(re.findall(Threat_Words[i], content, re.IGNORECASE))
         if tmp > 0:
             threat_counts[Threat_Words[i]] = tmp
     # Convert counts to bag
     countBag = []
     for word in threat_counts.keys():
         countBag.append((word, threat_counts[word]))
     return countBag

## filling in counts using CHECKSUM and carrying over counts
## from the "last seen" count
@outputSchema("counts:bag{tuple(year:int, month:int, word:chararray, count:int, filled:int, afterLast:int, URLs:chararray)}")
def fillInCounts(data):
     outBag = []
     firstYear = 2013
     firstMonth = 9
     lastYear = 0
     lastMonth = 0
# used to compute averages for months with multiple captures
# word -> (year, month) -> count
     counts = defaultdict(lambda : defaultdict(list))
     lastCaptureOfMonth = defaultdict(int)
     endOfMonthCounts = defaultdict(lambda: defaultdict(lambda: {'date': '', 'count': 0}))
     seenDates = {}
#ask for max observed date
     for (src, date, wordCounts, urls) in data:
         for (word, countTmp) in wordCounts:
           year = int(date[0:4])
           month = int(date[4:6])
           if isinstance(countTmp, str) or isinstance(countTmp, int):
               count = int(countTmp)
           ymtup = (year, month)
           counts[word][ymtup].append(count)
           if date > lastCaptureOfMonth[ymtup]:
               lastCaptureOfMonth[ymtup] = date
           if date > endOfMonthCounts[word][ymtup]['date']:
               endOfMonthCounts[word][ymtup]['date'] = date
               endOfMonthCounts[word][ymtup]['count'] = count
           seenDates[(year, month)] = True
           if year < firstYear:
               firstYear = year
               firstMonth = month
           elif year == firstYear and month < firstMonth:
               firstMonth = month
           elif year > lastYear:
               lastYear = year
               lastMonth = month
           elif year == lastYear and month > lastMonth:
               lastMonth = month
     for word in counts.keys():
# The data was collected until Sep 2013
# make sure that you aren't continuing into the future
         years = range(firstYear, 2014)
         useCount = 0
         afterLast = False
         filled = False
         ymLastUsed = (0,0)
         for y in years:
           if y > lastYear:
               afterLast = True
           if y == firstYear:
               mStart = firstMonth
               mStart = 1
           if y == 2013:
               mEnd = 9
               mEnd = 12
           for m in range(mStart, mEnd+1):
              if y == lastYear and m > lastMonth:
              if (y,m) in seenDates:
# Output sum, as we will divide by sum of totals later
                 useCount = sum(counts[word][(y,m)])
                 ymLastUsed = (y,m)
                 filled = False
# If we didn't see this date in the capture, we want to use the last capture we saw
# previously (we might have two captures in Feb, so for Feb we output both,
# but to fill-in for March we would only output the final Feb count)
# Automatically output an assumed total for each month (other words
# may no longer exist)
                 if endOfMonthCounts[word][ymLastUsed]['date'] ==                 lastCaptureOfMonth[ymLastUsed]:
                     useCount = endOfMonthCounts[word][ymLastUsed]['count']
                 filled = True
              if useCount == 0:
              outBag.append((y, m, word, useCount, int(filled), int(afterLast), urls))
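
The core of this step is the carry-forward fill: months with no capture inherit the last observed end-of-month count. The helper below is our own minimal sketch of that logic for a single word (the function name, simplified signature, and return layout are ours, not the original script's):

```python
def fill_monthly_counts(captures, end_year, end_month):
    """Carry the last observed monthly count forward into unobserved months.

    captures: dict mapping (year, month) -> count for observed months.
    Returns a list of (year, month, count, filled) tuples, where filled == 1
    marks months whose count was copied from the previous observed month.
    """
    out = []
    first_y = min(y for (y, m) in captures)
    last_count = 0
    for y in range(first_y, end_year + 1):
        # start at the first observed month in the first year, January otherwise
        m_start = min(m for (yy, m) in captures if yy == first_y) if y == first_y else 1
        m_end = end_month if y == end_year else 12
        for m in range(m_start, m_end + 1):
            if (y, m) in captures:
                last_count = captures[(y, m)]
                out.append((y, m, last_count, 0))
            elif last_count:
                out.append((y, m, last_count, 1))
    return out
```

For example, a word captured in Nov 2012 and Jan 2013 is filled forward through Dec 2012 and Feb 2013, with the `filled` flag marking the interpolated months.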

Appendix II: Lists of URLs and Terms


Figure 8: URLs used for this study


Figure 9: Terrorism Terms


Figure 10: Finance Terms


Figure 11: Climate Terms


[1] For example, see this article on understanding Big Data: Sagiroglu, S., & Sinanc, D. (2013, May). Big data: A review. In 2013 International Conference on Collaboration Technologies and Systems (CTS) (pp. 42-47). IEEE.

[2] In thinking about using the volume of the Library of Congress as a unit of measure, see “A ‘Library of Congress’ Worth of Data” by Leslie Johnston, April 25, 2012.

[4] See the Internet Archive’s description of their sub-collections here.

[5] .GOV also includes state and local websites that use the .gov suffix. Whereas the Wayback Machine makes it possible to view date-specific individual websites, the .GOV collection can be used to investigate patterns across websites and over time.

[6] See Google’s Official Blog (July 25, 2008) for discussion at “We knew the web was big…”

[7] These are listed in the online Appendix.

[8] The graphs are copied from Wayback Machine search results for whitehouse.gov.

[9] According to Vinay Goel, senior data engineer at the IA, the Library of Congress contracted with the IA to systematically capture congressional websites during these time periods.

[10] See Sagiroglu & Sinanc.

[11] See the Pig Wiki and Pig manual.

[12] For instructions about how to use Screen see here.

[13] Instructions for querying the CDX file can be found here. For queries that cannot be restricted in advance (e.g., when the research objective is to identify all parsed text files that contain a particular keyword), breaking a job into steps can be more efficient.

[14] Instructions for exporting documents from the Altiscale cluster can be found here. If documents cannot be exported, Apache Giraph is designed to facilitate analyses and graphing on the cluster.

[15] The different proportions for the different issues are not comparable because they are dependent on the keyword lists. Financial crisis term usage (as a proportion of all terms) spikes upward in 2007-08 as expected (and also in 2001 when there was another stock market decline).

[16] Our measure is based on the proportion of domain content for 23 departments and agencies, where entropy is based on each domain’s proportion of the sum of all agencies’ proportions.
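
As an illustration of the measure described in this note, the entropy of the agencies' shares can be computed as below; this is our own sketch, not the authors' code, and it assumes the per-domain proportions are supplied as a simple list:

```python
import math

def domain_entropy(proportions):
    """Shannon entropy (in bits) of each domain's share of the summed proportions."""
    total = sum(proportions)
    shares = [p / total for p in proportions if p > 0]
    return -sum(s * math.log(s, 2) for s in shares)
```

Entropy is maximized when content is spread evenly across domains (e.g., four equal shares give 2 bits) and falls toward zero as content concentrates in a single domain.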
