33  VECM*

33.1 Cointegrated Systems

Definition 33.1 An \(n \times 1\) vector \(y_t\) is said to be cointegrated if each of its elements individually is \(I(1)\) and there exists a non-zero vector \(a\in\mathbb R^n\) such that \(a'y_t\) is stationary. \(a\) is called a cointegrating vector.

If there are \(h<n\) linearly independent cointegrating vectors \((a_1, a_2, \dots,a_h)\), then any non-zero linear combination \(k_1a_1 + k_2a_2+\dots+k_ha_h\) is also a cointegrating vector. Thus, we say \((a_1, a_2, \dots,a_h)\) form a basis for the space of cointegrating vectors.
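To make Definition 33.1 concrete, here is a minimal simulation sketch (assuming numpy and statsmodels are available). Two series share a common random-walk trend, so each is \(I(1)\), while the combination with \(a = (1, -2)'\) eliminates the trend and is stationary:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller  # augmented Dickey-Fuller test

rng = np.random.default_rng(0)
T = 1000

# Common stochastic trend: a pure random walk
w = np.cumsum(rng.normal(size=T))

# Two I(1) series driven by the same trend, plus stationary noise
y1 = w + rng.normal(size=T)
y2 = 0.5 * w + rng.normal(size=T)

# Each series is I(1): the ADF test fails to reject a unit root
print("ADF p-value, y1:", adfuller(y1)[1])
print("ADF p-value, y2:", adfuller(y2)[1])

# a'y with a = (1, -2)' removes the common trend and is stationary
print("ADF p-value, y1 - 2*y2:", adfuller(y1 - 2 * y2)[1])
```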

Cointegrated systems can be represented by a Vector Error Correction Model (VECM):

\[ \Delta y_t = \Pi y_{t-1} + \Gamma_1\Delta y_{t-1} + \Gamma_2\Delta y_{t-2} + \dots + \Gamma_{p-1}\Delta y_{t-p+1} + \mu+\epsilon_t \tag{33.1}\]

Theorem 33.1 (Engle-Granger Representation Theorem) A set of \(I(1)\) variables is cointegrated if and only if it admits an error correction model (ECM) representation.

Therefore, it is inappropriate to model a cointegrated system with a VAR in differences: doing so omits the term \(\Pi y_{t-1}\), which amounts to a misspecification.

The cointegration term can be factored further as \(\Pi=\underset{n \times h}{A}\;\underset{h \times n}{B'}\), where \(B\) contains the cointegrating vectors and \(A\) contains the adjustment coefficients. In a bivariate example, it looks like

\[ \begin{bmatrix}\Delta y_{1t} \\ \Delta y_{2t}\end{bmatrix} = \begin{bmatrix}\alpha_1 \\ \alpha_2\end{bmatrix} \begin{bmatrix}\beta_1 & \beta_2\end{bmatrix} \begin{bmatrix}y_{1t-1} \\ y_{2t-1}\end{bmatrix} + \sum_{j=1}^{p-1} \begin{bmatrix} \gamma_{j,11} & \gamma_{j,12} \\ \gamma_{j,21} & \gamma_{j,22} \end{bmatrix} \begin{bmatrix}\Delta y_{1t-j} \\ \Delta y_{2t-j}\end{bmatrix} + \begin{bmatrix}\epsilon_{1t} \\ \epsilon_{2t}\end{bmatrix} \]

The economic interpretation is that \(\beta_1 y_{1t} + \beta_2 y_{2t}\) represents some long-run equilibrium relationship between the two variables. The parameters \(\alpha_1,\alpha_2\) describe the speed of adjustment, that is, how each variable reacts to deviations from the equilibrium path. Small values of \(|\alpha_i|\) imply a relatively unresponsive reaction, meaning it takes a long time to return to the equilibrium.

Note that \(\Pi = AB'\) cannot be a full-rank matrix: if there were \(n\) independent cointegrating vectors, then any linear combination of the components of \(y_t\) would be stationary, which effectively means \(y_t\) itself is stationary. Nor can \(\Pi\) be zero: in that case the system is fully characterized by a differenced VAR and there is no cointegration. Therefore, a cointegrated system requires \(0<h<n\).
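As an illustration, the sketch below fits a bivariate VECM on simulated data with one cointegrating relation, using statsmodels' VECM class (the argument and attribute names, including deterministic="n" for no deterministic terms, reflect my reading of the statsmodels API; treat this as a sketch rather than a definitive recipe):

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import VECM

rng = np.random.default_rng(1)
T = 1000
w = np.cumsum(rng.normal(size=T))                # shared I(1) trend
y = np.column_stack([w + rng.normal(size=T),
                     0.5 * w + rng.normal(size=T)])

# Bivariate VECM with one lagged difference and cointegrating rank h = 1
model = VECM(y, k_ar_diff=1, coint_rank=1, deterministic="n")
res = model.fit()

print("adjustment coefficients A (alpha):\n", res.alpha)
print("cointegrating vectors B (beta):\n", res.beta)
```

The estimated \(\hat\beta\) should be proportional to \((1, -2)'\), since \(y_{1t} - 2y_{2t}\) is the stationary combination in this simulation.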

A three-variable example with two cointegrating relations \((h=2)\) looks like:

\[ \begin{bmatrix}\Delta y_{1t} \\ \Delta y_{2t} \\ \Delta y_{3t}\end{bmatrix} = \begin{bmatrix} \alpha_{11} & \alpha_{12}\\ \alpha_{21} & \alpha_{22}\\ \alpha_{31} & \alpha_{32}\\ \end{bmatrix} \begin{bmatrix} \beta_{11} & \beta_{12} & \beta_{13}\\ \beta_{21} & \beta_{22} & \beta_{23}\\ \end{bmatrix} \begin{bmatrix}y_{1t-1} \\ y_{2t-1} \\ y_{3t-1}\end{bmatrix} + \cdots \]

In general, if \(y_t\) has \(n\) non-stationary components, there can be at most \(n-1\) linearly independent cointegrating vectors. The number of cointegrating vectors is also called the cointegrating rank.

Note that a single-equation ECM is equivalent to an ARDL (autoregressive distributed lag) model:

\[ \begin{aligned} &\Delta y_t = \alpha(y_{t-1} - \delta-\beta x_{t-1}) + \gamma\Delta x_t + u_t \\ \Leftrightarrow &\ y_t = (\alpha+1)y_{t-1} + \gamma x_t - (\alpha\beta+\gamma)x_{t-1} -\alpha\delta + u_t \\ \Leftrightarrow &\ y_t = b_1y_{t-1} + b_2x_t + b_3x_{t-1} + c + u_t \end{aligned} \]
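A quick numerical check of this equivalence (a sketch, assuming numpy and statsmodels are available): simulate from the ECM, fit the ARDL form by OLS, and invert the mapping \(b_1 = \alpha+1\), \(b_2 = \gamma\), \(b_3 = -(\alpha\beta+\gamma)\), \(c = -\alpha\delta\):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
T = 5000
alpha, beta, gamma, delta = -0.3, 2.0, 0.5, 1.0

x = np.cumsum(rng.normal(size=T))        # I(1) regressor
y = np.zeros(T)
for t in range(1, T):
    # ECM: dy_t = alpha*(y_{t-1} - delta - beta*x_{t-1}) + gamma*dx_t + u_t
    dy = alpha * (y[t-1] - delta - beta * x[t-1]) \
         + gamma * (x[t] - x[t-1]) + rng.normal()
    y[t] = y[t-1] + dy

# ARDL form: y_t = c + b1*y_{t-1} + b2*x_t + b3*x_{t-1} + u_t
X = sm.add_constant(np.column_stack([y[:-1], x[1:], x[:-1]]))
c, b1, b2, b3 = sm.OLS(y[1:], X).fit().params

print("alpha =", b1 - 1)                 # should be close to -0.3
print("gamma =", b2)                     # close to 0.5
print("beta  =", -(b3 + b2) / (b1 - 1))  # close to 2.0
print("delta =", -c / (b1 - 1))          # close to 1.0
```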

Given a (possibly) cointegrated system, we would like to know whether any cointegrating relations exist and how many cointegrating vectors there are. Johansen (1991) provides a likelihood-based method to test and estimate a cointegrated system. But before we introduce Johansen's method, we need some prerequisite knowledge of canonical correlations.

33.2 Canonical Correlation

Principal component analysis (PCA) finds a linear combination of \([x_1\ x_2\ \dots x_n]\) that produces the largest variance. What if we want to extend the analysis to the correlations between two datasets: \(\underset{T \times n}{X} = [x_1\ x_2\ \dots x_n]\) and \(\underset{T \times m}{Y} = [y_1\ y_2\ \dots y_m]\)? The cross-dataset covariance matrix is

\[ \underset{n\times m}{\Sigma_{XY}} = \begin{bmatrix} \text{cov}(x_1,y_1) & \text{cov}(x_1,y_2) & \dots & \text{cov}(x_1,y_m) \\ \text{cov}(x_2,y_1) & \text{cov}(x_2,y_2) & \dots & \text{cov}(x_2,y_m) \\ \vdots & \vdots & \ddots & \vdots \\ \text{cov}(x_n,y_1) & \text{cov}(x_n,y_2) & \dots & \text{cov}(x_n,y_m) \\ \end{bmatrix} \]

Canonical correlation analysis (CCA) seeks two vectors \(a \in \mathbb R^n\) and \(b \in \mathbb R^m\) such that \(a'X\) and \(b'Y\) maximize the correlation \(\rho =\text{corr}(a'X, b'Y)\). The transformed random variables \(U = a'X\) and \(V = b'Y\) are the first pair of canonical variables. The second pair of canonical variables maximizes the same correlation subject to being uncorrelated with the first pair, and so on.

Suppose we want to choose \(a\) and \(b\) to maximize

\[ \rho = \frac{a'\Sigma_{XY}b}{\sqrt{a'\Sigma_{XX}a}\sqrt{b'\Sigma_{YY}b}} \]

Since rescaling \(a\) and \(b\) leaves \(\rho\) unchanged, we may impose the normalization \(a'\Sigma_{XX}a = b'\Sigma_{YY}b = 1\). Form the Lagrangian

\[ \mathcal{L} = a'\Sigma_{XY}b - \frac{\mu}{2}(a'\Sigma_{XX}a - 1) -\frac{\nu}{2}(b'\Sigma_{YY}b -1) \]

The first-order conditions are

\[ \begin{aligned} &\frac{\partial\mathcal{L}}{\partial a} = \Sigma_{XY}b -\mu\ \Sigma_{XX}a = 0 \\ &\frac{\partial\mathcal{L}}{\partial b} = \Sigma_{YX}a -\nu\ \Sigma_{YY}b = 0 \\ \end{aligned} \]

Premultiplying the first condition by \(a'\) and the second by \(b'\) shows that \(\mu = \nu = a'\Sigma_{XY}b\). Substituting each condition into the other then implies

\[ \begin{aligned} \Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX} a &= \mu\nu a =\lambda a \\[1em] \Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY} b &= \mu\nu b =\lambda b \end{aligned} \]

Therefore, \(a\) is an eigenvector of \(\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\). The associated eigenvalue \(\lambda = \mu\nu\) equals the squared maximized correlation \(\rho^{*2}\). To see this, premultiply the first-order condition by \(a'\):

\[ a'\Sigma_{XY}b = \mu\cdot a'\Sigma_{XX}a=\mu=\rho^* \]

so \(\mu=\rho^*\), and symmetrically \(\nu=\rho^*\), hence \(\lambda=\mu\nu=\rho^{*2}\). Likewise, \(b\) is an eigenvector of \(\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\) with the same eigenvalue.

So the canonical correlations can be computed as follows:

  1. Compute the eigenvalues and eigenvectors of \[\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\]

  2. Sort the eigenvalues as \(\lambda_1 \geq \lambda_2 \geq\dots\geq \lambda_n\), with \(a_1,a_2,\dots,a_n\) the corresponding eigenvectors normalized such that \(a_i'\Sigma_{XX}a_i=1\).

  3. Compute the eigenvalues and eigenvectors of \[\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\]

  4. Sort the eigenvalues as \(\lambda_1 \geq \lambda_2 \geq\dots\geq \lambda_m\), with \(b_1,b_2,\dots,b_m\) the corresponding eigenvectors normalized such that \(b_i'\Sigma_{YY}b_i=1\).

  5. \((a_k'X, b_k'Y)\) is the \(k\)-th pair of canonical variables, where \(k\leq\min\{m,n\}\); and \(\rho_k = \sqrt{\lambda_k}\) is the \(k\)-th largest canonical correlation.

Note that, if \(\underset{n \times k}{A} = [a_1\ a_2\ \dots a_k]\), \(\underset{m \times k}{B} = [b_1\ b_2\ \dots b_k]\), we would have \(A'\Sigma_{XY}B = \Lambda\) where

\[ \Lambda = \begin{bmatrix} \rho_1 & & \\ & \rho_2 & \\ & & \ddots & \\ & & & \rho_k \end{bmatrix} \]

is a diagonal matrix. Therefore, canonical variables transform the covariance matrix between \(X\) and \(Y\) into a diagonal matrix, where entries on the diagonal best summarize the correlations between the two datasets.
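The following numpy sketch implements steps 1-5 directly on simulated data and checks that the largest eigenvalue equals the squared correlation of the first pair of canonical variables (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, m = 2000, 3, 2
X = rng.normal(size=(T, n))
Y = 0.8 * X[:, :m] + rng.normal(size=(T, m))   # correlated with X

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
Sxx, Syy = Xc.T @ Xc / T, Yc.T @ Yc / T
Sxy = Xc.T @ Yc / T

# Step 1: eigen-decomposition of Sxx^{-1} Sxy Syy^{-1} Syx
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
lam, A = np.linalg.eig(M)
lam, A = lam.real, A.real

# Step 2: sort eigenvalues in descending order; normalize a' Sxx a = 1
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]
a1 = A[:, 0] / np.sqrt(A[:, 0] @ Sxx @ A[:, 0])

# From the first-order conditions, b is proportional to Syy^{-1} Syx a
b1 = np.linalg.solve(Syy, Sxy.T @ a1)
b1 /= np.sqrt(b1 @ Syy @ b1)

# Step 5: the squared canonical correlation equals the eigenvalue
rho = np.corrcoef(Xc @ a1, Yc @ b1)[0, 1]
print("largest eigenvalue lambda_1:", lam[0])
print("squared canonical correlation rho_1^2:", rho**2)
```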

33.3 Johansen’s Procedure

  1. Estimate by OLS two regressions \[ \begin{aligned} \Delta y_t &= \hat\Psi_0 + \hat\Psi_1\Delta y_{t-1} + \dots +\hat\Psi_{p-1}\Delta y_{t-p+1} +\hat u_t \\ y_{t-1} &= \hat\Theta_0 + \hat\Theta_1\Delta y_{t-1} + \dots +\hat\Theta_{p-1}\Delta y_{t-p+1} +\hat v_t \\ \end{aligned} \] Save \(\hat u_t\) and \(\hat v_t\).

  2. Compute the canonical correlations between \(\hat u_t\) and \(\hat v_t\). Find the eigenvalues of \(\Sigma_{vv}^{-1}\Sigma_{vu}\Sigma_{uu}^{-1}\Sigma_{uv}\). Sort them from the largest to the smallest: \(\hat\lambda_1 \geq \hat\lambda_2 \geq \dots\geq \hat\lambda_n\). Then the maximized log-likelihood subject to the constraint that there are \(h\) cointegrating relations is given by \[ \ell^*=-\frac{Tn}{2}\log(2\pi)-\frac{Tn}{2}-\frac{T}{2}\log|\hat\Sigma_{uu}| -\frac{T}{2}\sum_{i=1}^{h}\log(1-\hat\lambda_i) \] Testing for \(h\) cointegrating relations is then equivalent to testing whether \[ -\frac{T}{2}\sum_{i=h+1}^{n}\log(1-\hat\lambda_i) = 0. \]

  3. Calculate the MLE of the parameters. The cointegrating matrix is given by \[\hat B = \begin{bmatrix}\hat b_1 & \hat b_2 & \dots & \hat b_h\end{bmatrix}\] where \(\hat b_i\) are the eigenvectors of \(\hat\Sigma_{vv}^{-1}\hat\Sigma_{vu}\hat\Sigma_{uu}^{-1}\hat\Sigma_{uv}\) associated with the \(h\) largest eigenvalues, normalized such that \(\hat b_i'\hat\Sigma_{vv}\hat b_i=1\). The adjustment matrix and other parameters are given by \[ \begin{aligned} \hat A &= \hat\Sigma_{uv}\hat B\\ \hat\Pi &= \hat A\hat B' \\ \hat\Gamma_i &= \hat\Psi_i - \hat\Pi\hat\Theta_i\\ \hat\mu &= \hat\Psi_0 - \hat\Pi\hat\Theta_0. \end{aligned} \]
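statsmodels ships an implementation of this procedure, coint_johansen; below is a minimal usage sketch on three simulated \(I(1)\) series with one cointegrating relation (det_order=-1 selects the no-deterministic-term case, per my reading of the API):

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

rng = np.random.default_rng(4)
T = 1000
w = np.cumsum(rng.normal(size=T))                     # one common trend
y = np.column_stack([w + rng.normal(size=T),          # cointegrated pair
                     0.5 * w + rng.normal(size=T),
                     np.cumsum(rng.normal(size=T))])  # independent I(1)

# det_order=-1: no deterministic terms; k_ar_diff = p - 1 lagged differences
res = coint_johansen(y, det_order=-1, k_ar_diff=1)
print("eigenvalues:", res.eig)          # sorted lambda_1 >= ... >= lambda_n
print("trace statistics:", res.lr1)     # one per null h = 0, 1, ..., n-1
print("trace critical values (90/95/99%):\n", res.cvt)
```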

To understand this procedure, note that if we treat \(\Pi\) as given, the MLE for Equation 33.1 is equivalent to estimating the coefficients by OLS:

\[ \Delta y_t - \Pi y_{t-1} = \hat\Gamma_1\Delta y_{t-1} + \hat\Gamma_2\Delta y_{t-2} + \dots + \hat\Gamma_{p-1}\Delta y_{t-p+1} + \hat\mu+ \hat\epsilon_t \]

The log-likelihood function becomes

\[ \ell(\Pi, \Omega) = -\frac{T}{2}\log |2\pi\Omega| - \frac{1}{2}\sum_{t=1}^{T} (\hat\epsilon_t'\Omega^{-1}\hat\epsilon_t). \]

Step 1 does this in two separate regressions, in which

\[ \begin{aligned} \hat\Gamma_i &= \hat\Psi_i - \Pi\hat\Theta_i \\ \hat\epsilon_t &= \hat u_t - \Pi\hat v_t \\ \end{aligned} \]

Thus, the log-likelihood can be rewritten as

\[ \ell(\Pi, \Omega) = -\frac{T}{2}\log |2\pi\Omega| - \frac{1}{2}\sum_{t=1}^{T} [(\hat u_t - \Pi\hat v_t)'\Omega^{-1} (\hat u_t - \Pi \hat v_t)] \]

Concentrating out \(\Omega\), the maximizing value is

\[ \hat\Omega = \frac{1}{T}\sum_{t=1}^{T}[(\hat u_t - \Pi\hat v_t)(\hat u_t - \Pi \hat v_t)'] \]

Substituting this into the \(\ell\) function

\[ \ell(\Pi) = -\frac{Tn}{2}\log (2\pi) - \frac{Tn}{2}- \frac{T}{2}\log \left|\frac{1}{T}\sum_{t=1}^{T}[(\hat u_t - \Pi\hat v_t)(\hat u_t - \Pi \hat v_t)']\right| \]

Thus, maximizing \(\ell\) is equivalent to minimizing

\[ \left|\frac{1}{T}\sum_{t=1}^{T}[(\hat u_t - \Pi\hat v_t)(\hat u_t - \Pi \hat v_t)']\right| \]

by choosing \(\Pi\). If \(\hat u_t\) were a scalar, the optimal \(\hat\Pi\) minimizing \(T^{-1}\sum_t (\hat u_t-\Pi \hat v_t)^2\) would simply be the OLS estimator. Similarly, for the vector case, we have

\[ \hat\Pi = \left(\frac{1}{T}\sum_t \hat u_t \hat v_t'\right)\left(\frac{1}{T}\sum_t \hat v_t \hat v_t'\right)^{-1} = \hat\Sigma_{uv}\hat\Sigma_{vv}^{-1} \]

Suppose we have the canonical decomposition \(A'\Sigma_{uv}B = \Lambda\), where \(\Lambda\) is the diagonal matrix of canonical correlations \(\hat\rho_1,\dots,\hat\rho_n\) between \(\hat u_t\) and \(\hat v_t\), with \(A'\Sigma_{uu}A=I\) and \(B'\Sigma_{vv}B=I\). The estimator then reduces to

\[ \begin{aligned} \hat\Pi &= (A'^{-1}\Lambda B^{-1})(B'^{-1}B^{-1})^{-1} = A'^{-1}\Lambda B' \\[1em] &= A'^{-1}\begin{bmatrix}\hat\rho_1 \\ & \hat\rho_2 \\ & & \ddots \\ & & & \hat\rho_n\end{bmatrix} B' \end{aligned} \]

The minimized 'squared residual' determinant is

\[ \begin{aligned} \left|\frac{1}{T}\sum_{t=1}^{T}[(\hat u_t - \Pi\hat v_t)(\hat u_t - \Pi \hat v_t)']\right| &= \left| A'^{-1} \frac{1}{T}\sum_{t=1}^{T}(A'\hat u_t - \Lambda B'\hat v_t)(A'\hat u_t - \Lambda B'\hat v_t)' A^{-1} \right| \\[1em] &= |A|^{-2} |I-\Lambda\Lambda'| \\[1em] &= |A|^{-2} \begin{vmatrix}1-\hat\rho_1^2 \\ & 1-\hat\rho_2^2 \\ & & \ddots \\ & & & 1-\hat\rho_n^2\end{vmatrix} \\[1em] &= |A|^{-2} \prod_{i=1}^{n} (1-\hat\lambda_i), \end{aligned} \]

since \(\hat\lambda_i = \hat\rho_i^2\).

This explains the likelihood function in Step 2 (note that \(|A|^{-2} = |\hat\Sigma_{uu}|\), since \(A'\hat\Sigma_{uu}A = I\)). The MLE \(\hat\Pi\) above is subject to no constraint. However, if there are any cointegrating relations, \(\Pi\) cannot be of full rank. If we restrict the rank of \(\Pi\) to \(h\), the minimized squared-residual determinant is achieved by keeping the \(h\) largest \(\lambda\)'s and setting the rest to zero.
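To connect the derivation back to Step 2, the sketch below runs the two auxiliary regressions and the canonical-correlation eigenproblem by hand and compares the eigenvalues with coint_johansen; the two should agree, assuming I have matched statsmodels' trimming and deterministic-term conventions:

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

rng = np.random.default_rng(5)
T = 1000
w = np.cumsum(rng.normal(size=T))
y = np.column_stack([w + rng.normal(size=T),
                     0.5 * w + rng.normal(size=T)])

dy = np.diff(y, axis=0)   # Delta y
Z = dy[:-1]               # regressor: Delta y_{t-1} (no deterministic terms)

# Step 1: residuals of Delta y_t and of y_{t-1} on the lagged difference
u = dy[1:] - Z @ np.linalg.lstsq(Z, dy[1:], rcond=None)[0]
v = y[1:-1] - Z @ np.linalg.lstsq(Z, y[1:-1], rcond=None)[0]

Suu, Svv = u.T @ u / len(u), v.T @ v / len(v)
Suv = u.T @ v / len(u)

# Step 2: eigenvalues of Svv^{-1} Svu Suu^{-1} Suv, sorted descending
M = np.linalg.solve(Svv, Suv.T) @ np.linalg.solve(Suu, Suv)
lam = np.sort(np.linalg.eigvals(M).real)[::-1]

print("by hand:    ", lam)
print("statsmodels:", coint_johansen(y, det_order=-1, k_ar_diff=1).eig)
```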

33.4 Hypothesis Testing

Test 1: At most \(h\) cointegrating relations

Hypothesis:

\[ \begin{aligned} &H_0: \text{there are no more than }h\text{ cointegrating relations}\\ &H_1: \text{there are more than }h\text{ cointegrating relations}\\ \end{aligned} \]

Test statistic:

\[ \lambda_{\text{trace}} = 2(\ell_1^* - \ell_0^*) = -T\sum_{i=h+1}^{n}\log(1-\hat\lambda_i) \]

If \(H_0\) is true, \(\lambda_{\text{trace}}\) should be close to zero. Critical values are provided in the table below. Case 1 has no constant or deterministic trend; Case 2 includes a constant in the cointegrating relations but no deterministic trend; Case 3 includes a deterministic time trend.

Critical values of Johansen's likelihood ratio test of the null hypothesis of \(h\) cointegrating relations against the alternative of no restrictions (columns give significance levels):

| Case | \(n-h\) | \(T\) | 0.10  | 0.05  | 0.025 | 0.01  |
|------|---------|-------|-------|-------|-------|-------|
| 1    | 1       | 400   | 2.86  | 3.84  | 4.93  | 6.51  |
| 1    | 2       | 400   | 10.47 | 12.53 | 14.43 | 16.31 |
| 2    | 1       | 400   | 6.69  | 8.08  | 9.66  | 11.58 |
| 2    | 2       | 400   | 15.58 | 17.84 | 19.61 | 21.96 |
| 3    | 1       | 400   | 2.82  | 3.96  | 5.33  | 6.94  |
| 3    | 2       | 400   | 13.34 | 15.20 | 17.30 | 19.31 |
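As a usage sketch, coint_johansen exposes the trace statistics and tabulated critical values (attributes lr1 and cvt, per my reading of the statsmodels API), so the sequential trace test can be run as follows:

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

rng = np.random.default_rng(6)
T = 1000
w = np.cumsum(rng.normal(size=T))
y = np.column_stack([w + rng.normal(size=T),          # cointegrated pair
                     0.5 * w + rng.normal(size=T),
                     np.cumsum(rng.normal(size=T))])  # independent I(1)

res = coint_johansen(y, det_order=-1, k_ar_diff=1)

# Reject H0 (at most h relations) when the trace statistic exceeds the
# 5% critical value (cvt columns correspond to the 90%, 95%, 99% levels)
for h in range(y.shape[1]):
    stat, cv = res.lr1[h], res.cvt[h, 1]
    verdict = "reject" if stat > cv else "fail to reject"
    print(f"H0: at most {h} relations: trace = {stat:.2f}, "
          f"5% cv = {cv:.2f} -> {verdict}")
```

With one cointegrating relation in the simulated data, the loop should reject at \(h=0\) and fail to reject at \(h=1\), pointing to a cointegrating rank of one.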

Test 2: \(h\) cointegrating relations vs. \(h+1\)

Hypothesis:

\[ \begin{aligned} &H_0: \text{there are }h\text{ cointegrating relations}\\ &H_1: \text{there are }h+1\text{ cointegrating relations}\\ \end{aligned} \]

Test statistic:

\[ \lambda_{\text{max}} = 2(\ell_1^* -\ell_0^*)=-T\log(1-\hat\lambda_{h+1}) \]

The critical values are given in the table below.

Critical values of Johansen's likelihood ratio test of the null hypothesis of \(h\) cointegrating relations against the alternative of \(h+1\) relations (columns give significance levels):

| Case | \(n-h\) | \(T\) | 0.10  | 0.05  | 0.025 | 0.01  |
|------|---------|-------|-------|-------|-------|-------|
| 1    | 1       | 400   | 2.86  | 3.84  | 4.96  | 6.51  |
| 1    | 2       | 400   | 9.52  | 11.44 | 13.27 | 15.69 |
| 2    | 1       | 400   | 6.69  | 8.08  | 9.66  | 11.58 |
| 2    | 2       | 400   | 12.78 | 14.60 | 16.40 | 18.78 |
| 3    | 1       | 400   | 2.82  | 3.96  | 5.33  | 6.94  |
| 3    | 2       | 400   | 12.10 | 14.04 | 15.81 | 17.94 |
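For the \(\lambda_{\max}\) variant, statsmodels provides select_coint_rank, which runs the sequential tests and reports the selected rank; a sketch, with argument names per my reading of the API:

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import select_coint_rank

rng = np.random.default_rng(7)
T = 1000
w = np.cumsum(rng.normal(size=T))
y = np.column_stack([w + rng.normal(size=T),
                     0.5 * w + rng.normal(size=T),
                     np.cumsum(rng.normal(size=T))])

# Sequential maximum-eigenvalue tests at the 5% level: increase h until
# the first failure to reject, which gives the selected cointegrating rank
rank = select_coint_rank(y, det_order=-1, k_ar_diff=1,
                         method="maxeig", signif=0.05)
print(rank.summary())
print("selected rank:", rank.rank)
```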