34  Intro to Bayes

We have completed our tour of classical time series topics, covering univariate and multivariate, stationary and non-stationary models. This chapter turns to the Bayesian approach to time series analysis. It is not an essential component of the classical treatment of the subject, but given its rising popularity and importance, it is an exciting topic that should not be missed. Bayesian statistics is a whole new world for frequentist statisticians. We introduce the Bayesian approach with an example.

34.1 The Sunrise Problem

Question: What is the probability that the sun will rise tomorrow?

We do not consider the physics here. Suppose we want to answer this question using statistics alone. What we need to do is observe how many days the sun has risen in the past and make some inference about the future. If we collected data on the past \(n\) days, we would have observed the sun rising every single day (the sun rises even on cloudy or rainy days). If we want to calculate the probability of a sunrise event, denoted by \(A\), the frequentist approach would give \(P(A) = \frac{n}{n} = 1\). The probability is always \(1\) no matter how many observations we have. That is a bit quirky, even before we look at a confidence interval. A 100 percent probability does not sound right, as nothing can be so certain.

Let’s have a look at how Laplace solved this problem in the 18th century. Let \(x_t\) be a random variable such that

\[ x_t = \begin{cases} 1, \quad\text{the sun rises on day } t \text{ (with probability }\theta)\\ 0, \quad\text{otherwise} \end{cases} \]

In other words, \(x_t\) follows a Bernoulli distribution, \(x_t\sim\text{Bern}(\theta)\). There is an unknown parameter \(\theta\), which is what we aim to estimate. Before we observe any data, we have no knowledge about \(\theta\), so we assume it is distributed uniformly: \(\theta\sim\text{Unif}(0,1)\). That is, it can be any value between \(0\) and \(1\).

Suppose we have observed the data for \(n\) days: \(x_1,x_2,\dots,x_n\). Assume these observations are \(i.i.d.\) Define \(S_n\) as the total number of sunrises that have occurred:

\[ S_n = x_1 + x_2 + \dots + x_n \]

We know \(S_n\) follows a Binomial distribution \(S_n\sim\text{Bin}(n,\theta)\), with the probability mass function

\[ P(S_n = k | \theta) = \binom{n}{k}\theta^k (1-\theta)^{n-k} \]

Our goal is to find the distribution of \(\theta\) given \(S_n\), that is, an estimate of the probability of a sunrise after observing the data.

Recall that Bayes’ rule allows us to invert the conditional probability:

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]

Using this formula, we have

\[ \begin{aligned} P(\theta|S_n=k) &=\frac{P(S_n=k|\theta) P(\theta)}{P(S_n=k)}= \frac{P(S_n=k|\theta) P(\theta)}{\int_0^1 P(S_n=k|\theta) P(\theta)d\theta}\\[1em] &= \frac{\binom{n}{k}\theta^k (1-\theta)^{n-k}\cdot\mathbb{I}(0\leq\theta\leq 1)} {\int_0^1\binom{n}{k}\theta^k (1-\theta)^{n-k}\cdot\mathbb{I}(0\leq\theta\leq 1)d\theta}\\[1em] &= \begin{cases}\frac{\binom{n}{k}\theta^k (1-\theta)^{n-k}} {\int_0^1\binom{n}{k}\theta^k (1-\theta)^{n-k}d\theta}&\quad\text{if}\ 0\leq\theta\leq 1\\ 0&\quad\text{otherwise} \end{cases} \end{aligned} \]

If \(k=n\), we have

\[ P(\theta|S_n=n) = \frac{\theta^n}{\int_0^1\theta^n d\theta}=(n+1)\theta^n \]

for \(0\leq\theta\leq 1\). Now we are ready to calculate the probability of the sun rising tomorrow after observing \(n\) sunrises:

\[ \begin{aligned} P(x_{n+1}=1|S_n=n) &= \int_0^1 P(x_{n+1}=1|\theta)P(\theta|S_n=n)d\theta \\ &=\int_0^1 \theta\cdot(n+1)\theta^n d\theta \\ &= \frac{n+1}{n+2} \end{aligned} \]

As \(n\to\infty\), the probability approaches \(1\). Personally, I think this is a much more reasonable answer than the frequentist one, in which you always get probability \(1\).
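As a quick sanity check, here is a minimal Python sketch (assuming NumPy/SciPy are available; it is an illustration, not part of Laplace’s derivation) that integrates the posterior \((n+1)\theta^n\) numerically and compares the predictive probability with the closed form \(\frac{n+1}{n+2}\).

```python
from scipy import integrate

def prob_sunrise_tomorrow(n):
    """P(x_{n+1} = 1 | S_n = n): posterior predictive under a Unif(0,1) prior."""
    posterior = lambda theta: (n + 1) * theta**n            # posterior density on [0, 1]
    value, _ = integrate.quad(lambda t: t * posterior(t), 0.0, 1.0)
    return value

for n in (1, 10, 1000):
    # numerical integral vs. the closed-form answer (n + 1) / (n + 2)
    print(n, round(prob_sunrise_tomorrow(n), 6), round((n + 1) / (n + 2), 6))
```

The two columns agree, and both creep toward \(1\) as \(n\) grows, matching the limit discussed above.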

34.2 The Bayesian Approach

This illustration shows every tenet of the Bayesian approach. We start with a prior distribution for an unknown parameter \(\theta\). In the previous example, we modeled it as a uniform distribution to reflect our ignorance, but the prior can be any distribution expressing our subjective belief about the parameter before we see the data. Note how this contrasts with the frequentist approach, in which the parameter is a fixed unknown number.

The principle of Bayesian analysis is then to combine the prior information with the information contained in the data to obtain an updated distribution that accounts for both sources of information, known as the posterior distribution. This is done using Bayes’ rule:

\[ \underbrace{p(\theta|X)}_{\text{posterior}} = \frac{\overbrace{p(X|\theta)}^{\text{likelihood}} \times \overbrace{p(\theta)}^{\text{prior}}} {\underbrace{p(X)}_{\text{normalizing constant}}} \]

The posterior distribution \(p(\theta|X)\) is the central object for Bayesian inference as it combines all the information we have about \(\theta\). Note that \(p(\theta|X)\) is a function of \(\theta\). Since the denominator \(p(X)\) is independent of \(\theta\), it only plays the role of a normalizing constant to ensure the posterior is a valid probability density function that integrates to \(1\). It is therefore convenient to ignore it and rewrite the posterior as

\[ p(\theta|X) \propto p(X|\theta)p(\theta) \]
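To make the proportionality concrete, here is a small sketch (an illustration added here, using the sunrise example and NumPy) that approximates the posterior on a grid: evaluate likelihood \(\times\) prior at many candidate values of \(\theta\), then normalize numerically so the result integrates to \(1\).

```python
import numpy as np

# Grid approximation of p(theta | X) ∝ p(X | theta) * p(theta) for the sunrise example.
theta = np.linspace(0.0, 1.0, 1001)        # candidate parameter values
prior = np.ones_like(theta)                # Unif(0,1) prior density
n, k = 100, 100                            # k sunrises observed in n days
likelihood = theta**k * (1.0 - theta)**(n - k)
unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, theta)   # normalize to integrate to 1

print(np.trapz(theta * posterior, theta))  # posterior mean, close to (k + 1) / (n + 2)
```

The grid plays the role of the denominator \(p(X)\): we never need its closed form, only a numerical normalization.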

One difficulty of Bayesian inference is that the denominator \(p(X)\) is often intractable, especially when the parameter is high dimensional. For a parameter vector \(\theta = (\theta_1,\theta_2,\dots,\theta_d)\),

\[ p(X) = \int_{\theta_1}\int_{\theta_2}\dots\int_{\theta_d} p(X,\theta_1,\theta_2,\dots,\theta_d)\,d\theta_1 d\theta_2\dots d\theta_d \]

But the relative posterior densities of different parameter values are easy to compute:

\[ \frac{p(\theta_A|X)}{p(\theta_B|X)}=\frac{p(X|\theta_A)p(\theta_A)}{p(X|\theta_B)p(\theta_B)} \]

This allows us to sample from the posterior distribution even though its pdf is known only up to the normalizing constant. We will return to this point when we discuss computational Bayesian methods.
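As a preview of those computational methods, here is a minimal random-walk Metropolis sketch (a toy illustration for the Bernoulli example above, not a production sampler): the accept/reject step uses only ratios of posterior densities, so the unknown \(p(X)\) cancels out.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_unnormalized_posterior(theta, k, n):
    """log(likelihood * prior) for k successes in n Bernoulli trials, Unif(0,1) prior."""
    if not 0.0 < theta < 1.0:
        return -np.inf                        # outside the prior's support
    return k * np.log(theta) + (n - k) * np.log(1.0 - theta)

k, n = 90, 100
theta, draws = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.05)   # random-walk proposal
    # Acceptance depends only on the ratio of posteriors, so p(X) never appears.
    log_ratio = log_unnormalized_posterior(proposal, k, n) - log_unnormalized_posterior(theta, k, n)
    if np.log(rng.uniform()) < log_ratio:
        theta = proposal
    draws.append(theta)

print(np.mean(draws[5_000:]))   # close to the analytical posterior mean (k + 1) / (n + 2) ≈ 0.892
```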

The relative weight of the prior versus the data in determining the posterior depends on (i) how strong the prior is, and (ii) how much data we have. If the prior is so strong (very small variance, i.e., little uncertainty) that seeing the data will hardly change our beliefs, the posterior is mostly determined by the prior. Conversely, if the data are so abundant that the evidence overwhelms any prior belief, the impact of the prior becomes negligible. A small numerical illustration follows.
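The sketch below uses a conjugate Beta prior (an assumption introduced here for illustration; the uniform prior above is the special case \(\text{Beta}(1,1)\)). The posterior mean is a compromise between the prior mean and the sample proportion, and which one dominates depends on the prior’s strength and the sample size.

```python
# With a Beta(a, b) prior and k successes in n Bernoulli trials, the posterior is
# Beta(a + k, b + n - k), so the posterior mean is (a + k) / (a + b + n).
def posterior_mean(a, b, k, n):
    return (a + k) / (a + b + n)

print(posterior_mean(1, 1, 9, 10))        # weak prior, little data: ~0.83, pulled toward the data
print(posterior_mean(50, 50, 9, 10))      # strong prior centered at 0.5: ~0.54, prior dominates
print(posterior_mean(50, 50, 900, 1000))  # strong prior but abundant data: ~0.86, data dominates
```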

34.3 Frequentist vs Bayesian

Frequentists and Bayesians hold different philosophies about statistics. Frequentists view our sample as the result of one of an infinite number of exactly repeated experiments. The data are randomly sampled from a fixed population distribution. The unknown parameters are properties of the population and are therefore fixed. The purpose of statistics is to make inference about the population parameters (the ultimate truth) from limited samples. The uncertainty associated with this process arises from sampling: because we do not have the entire population, each sample tells only part of the truth about it, so our inference about the parameters can never be perfect due to sampling error. Frequentists conduct hypothesis tests by assuming a hypothesis (about the population parameter) is true and calculating the probability of obtaining the observed sample data.

In the Bayesian worldview, probability is an expression of subjective belief (a measure of certainty in a belief), which can be updated in light of new data. Parameters are probabilistic rather than fixed, which reflects our uncertainty about them. The essence of Bayesian inference is to update the probability of a ‘hypothesis’ given the data we have obtained; Bayes’ rule is all we need. All information is summarized in the posterior distribution, and there is no need for explicit hypothesis testing.

| Frequentist | Bayesian |
|---|---|
| Probability is the limit of frequency | Probability is uncertainty |
| Parameters are fixed unknown numbers | Parameters are random variables |
| Data is a random sample from the population | Data is fixed/given |
| LLN/CLT | Bayes’ rule |

In time series analysis, there are good reasons to be Bayesian. Perhaps the frequentist perspective makes sense in a cross section, where it is intuitive to imagine drawing different samples from the population. In time series, however, we have only one realization, and it is difficult to imagine where we would obtain another sample. It is more natural to take a Bayesian perspective: for example, we hold some prior belief about how inflation and unemployment are related (the Phillips curve) and then update that belief with data.

Frequentists often criticize Bayesians’ priors as entirely subjective. Bayesians would respond that frequentists also make prior assumptions that they are not even aware of. Frequentist inference relies on the LLN and CLT, which implicitly assumes the asymptotic approximation is accurate at the sample size at hand. In settings like VAR models, there are a large number of parameters to estimate but only a limited number of observations; are the asymptotic properties really plausible? Bayesians believe it is better to make our assumptions explicit.

Apart from the philosophical differences, in practice frequentists and Bayesians may well give similar results (though the results should be interpreted differently). After all, when data are plentiful, the influence of the prior diminishes to zero.