18 Spurious Regression
It is said that all stationary series are alike, but each non-stationary series is non-stationary in its own way (a play on Leo Tolstoy's famous line: all happy families are alike; each unhappy family is unhappy in its own way).
In all previous chapters, we have been working with stationary processes. We have shown that regression techniques and asymptotic results similar to those for \(iid\) observations, albeit not exactly the same, carry over to stationary processes. If a time series is not stationary, we transform it into a stationary one by taking differences.
This chapter is devoted to the study of non-stationary time series. Special attention is given to unit root processes. We will see that the theory for non-stationary processes is entirely different from that for stationary processes, which makes unit root analysis a rather independent topic.
The obsession with unit roots in academia has faded in recent decades (I do not know if this assessment is accurate). Although the topic possesses immense theoretical interest, it does not seem to provide proportionate value for applied studies. Nonetheless, it is indispensable for a comprehensive understanding of time series analysis.
We start by pointing out that it is very dangerous to blindly include non-stationary variables in a regression. To illustrate this, consider two random walks:
\[ \begin{aligned} x_{t} &= x_{t-1} + \epsilon_{t},\quad\epsilon_{t}\overset{iid}\sim N(0,\sigma_X^2)\\ y_{t} &= y_{t-1} + \eta_{t},\quad\eta_{t}\overset{iid}\sim N(0,\sigma_Y^2) \end{aligned} \]
where \(\epsilon_t\) and \(\eta_t\) are independent of each other. We can simulate and plot two such series:
set.seed(1)
# generate two random walk processes
x = cumsum(rnorm(200))
y = cumsum(rnorm(200))
plot.ts(cbind(x,y), main="Two Random Walks")
We would expect the two series to be uncorrelated, as they are generated by two independent random processes. However, if we regress \(y_t\) on \(x_t\), we are likely to find a statistically significant relationship even though the true coefficient is zero. This is called a spurious regression.
\[ y_t = \alpha + \beta x_t + u_t \]
lm(y ~ x) |> summary()
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.831 -3.904 1.021 3.223 9.731
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.23975 0.56584 5.726 3.78e-08 ***
x 0.15324 0.06886 2.225 0.0272 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.049 on 198 degrees of freedom
Multiple R-squared: 0.0244, Adjusted R-squared: 0.01948
F-statistic: 4.953 on 1 and 198 DF, p-value: 0.02718
Note that if we difference the two series to make them stationary, the spurious correlation disappears.
lm(diff(y) ~ diff(x)) |> summary()
Call:
lm(formula = diff(y) ~ diff(x))
Residuals:
Min 1Q Median 3Q Max
-2.93078 -0.55055 -0.02841 0.67557 2.58926
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03963 0.07203 0.55 0.583
diff(x) -0.02171 0.07756 -0.28 0.780
Residual standard error: 1.015 on 197 degrees of freedom
Multiple R-squared: 0.0003974, Adjusted R-squared: -0.004677
F-statistic: 0.07831 on 1 and 197 DF, p-value: 0.7799
If we simulate many pairs of independent random walks, we will find that a large percentage of the regressions report statistically significant relationships even though the series are independent.
# Simulation parameters
n <- 100                   # Number of observations in each random walk
num_sim <- 1000            # Number of simulations
pvals <- numeric(num_sim)  # To store p-values

# Loop over simulations
for (i in 1:num_sim) {
  # Generate two independent random walks of length n
  x <- cumsum(rnorm(n))
  y <- cumsum(rnorm(n))

  # Run linear regression of y on x and record the p-value on x
  model <- lm(y ~ x)
  pval <- summary(model)$coefficients[2, 4]
  pvals[i] <- pval
}

# Calculate percentage of simulations with p-value < 0.05
sig_percent <- mean(pvals < 0.05) * 100
print(paste("Percentage of significant regressions:", sig_percent, "%"))
[1] "Percentage of significant regressions: 77 %"
This example gives you a quantitative feel for how frequently spurious results can occur. What's more interesting, however, is that we can eliminate the spurious correlation by including lags of the dependent and independent variables; intuitively, the lagged \(y\) absorbs the stochastic trend, so the remaining error is stationary:
# Compute lags for x and y
y_lag = dplyr::lag(y)
x_lag = dplyr::lag(x)

# Regression with lags
lm(y ~ y_lag + x + x_lag) |> summary()
Call:
lm(formula = y ~ y_lag + x + x_lag)
Residuals:
Min 1Q Median 3Q Max
-2.6094 -0.6136 0.0433 0.7026 2.4059
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.34804 0.32558 1.069 0.288
y_lag 0.91403 0.03597 25.413 <2e-16 ***
x 0.12296 0.11279 1.090 0.278
x_lag -0.05127 0.11040 -0.464 0.643
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.076 on 95 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.9206, Adjusted R-squared: 0.9181
F-statistic: 367.3 on 3 and 95 DF, p-value: < 2.2e-16
Non-stationary time series can cause trouble, but they are also fascinating to explore. In what follows, we will focus on two types of non-stationary processes: trend-stationary processes and unit root processes, which are the most common types of non-stationary series encountered in economics and finance. Non-stationary series with exponential growth can be transformed into a linear trend (for example, by taking logarithms), and hence are not of particular interest. We will start with the relatively easy trend-stationary processes, and spend most of the chapter on unit root processes.
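As a quick preview of the two types, here is a minimal sketch (the AR coefficient, drift, and sample size are arbitrary values chosen for illustration, not anything used elsewhere in this chapter): a trend-stationary series built as a deterministic linear trend plus stationary AR(1) noise, and a unit root series built as a random walk with the same drift.
set.seed(42)
n <- 200
time <- 1:n

# Trend-stationary: deterministic linear trend plus stationary AR(1) noise
u <- arima.sim(model = list(ar = 0.5), n = n)
trend_stationary <- 0.05 * time + u

# Unit root with drift: shocks accumulate and never die out
unit_root <- cumsum(0.05 + rnorm(n))

plot.ts(cbind(trend_stationary, unit_root), main = "Trend-Stationary vs Unit Root")
The two paths can look superficially similar, but removing the deterministic trend makes the first stationary, whereas the second must be differenced.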