Bayesian Statistics

Introduction to Bayesian Statistics

Bayesian statistics provides an alternative framework for statistical inference that treats parameters as random variables with probability distributions. Rather than estimating fixed but unknown parameters, Bayesian methods characterize full uncertainty about parameters through probability distributions. This approach naturally incorporates prior knowledge and provides intuitive probability interpretations for results.

The Bayesian framework dates to the 18th-century work of Thomas Bayes, who developed the theorem for updating probabilities given new evidence. For many years, Bayesian methods were limited by computational difficulty. The advent of modern computing, particularly Markov chain Monte Carlo (MCMC) methods, has enabled widespread application of Bayesian approaches.

The philosophical distinction between Bayesian and frequentist approaches lies in how they treat probability. Frequentists treat parameters as fixed and data as random. Bayesians treat parameters as random and data as fixed (given). This difference affects interpretation but often leads to similar conclusions in practice.

Bayes' Theorem

Bayes' theorem provides the mathematical foundation for Bayesian inference. It shows how to update prior beliefs given observed data to produce posterior beliefs.

Theorem Statement

Bayes' theorem states: P(θ|data) = P(data|θ) × P(θ) / P(data), where θ represents parameters and data represents observed data.

The posterior probability P(θ|data) is what we want—our updated belief about parameters after seeing the data. The likelihood P(data|θ) shows how probable the observed data is under different parameter values. The prior P(θ) represents our belief before seeing the data. The marginal likelihood P(data) normalizes the result.

The theorem shows that posterior belief is proportional to prior belief times the likelihood. Data that is more probable under certain parameter values increases our belief in those values.
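
The proportionality between posterior and prior times likelihood can be illustrated with a grid approximation. The sketch below is only an illustration under assumed inputs: a coin-flip (binomial) likelihood, a flat prior, and hypothetical data of 7 successes in 10 trials. The function name grid_posterior is made up for this example.

```python
def grid_posterior(successes, trials, grid_size=101):
    """Approximate the posterior of a success probability on an evenly spaced grid."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # flat prior (an assumption for this sketch)
    # Binomial likelihood up to a constant; the constant cancels when normalizing
    likelihood = [t ** successes * (1 - t) ** (trials - successes) for t in thetas]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)  # plays the role of P(data) on the grid
    posterior = [u / total for u in unnormalized]
    return thetas, posterior

thetas, posterior = grid_posterior(successes=7, trials=10)
most_probable = thetas[posterior.index(max(posterior))]
print(most_probable)  # 0.7
```

Parameter values under which the observed data are more probable receive more posterior weight, which is why the grid peaks at the observed proportion when the prior is flat.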

Simple Example

Consider a simple example: testing for a rare disease. The disease prevalence is 1% (prior probability of having disease). A test is 95% accurate: it correctly identifies 95% of diseased people (sensitivity) and correctly identifies 95% of healthy people (specificity). Someone tests positive—what is the probability they actually have the disease?

Using Bayes' theorem:

P(disease|positive) = P(positive|disease) × P(disease) / P(positive)

P(positive) = P(positive|disease) × P(disease) + P(positive|healthy) × P(healthy) = 0.95 × 0.01 + 0.05 × 0.99 = 0.059

P(disease|positive) = 0.95 × 0.01 / 0.059 ≈ 0.161

Even with a positive test, the probability of having the disease is only about 16%.

This counterintuitive result arises because the base rate (prevalence) is low. The example illustrates how priors affect conclusions and why Bayesian updating matters.
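
The disease-testing calculation can be checked with a few lines of Python. This is a minimal sketch; the function name posterior_given_positive is chosen for illustration only.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' theorem."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity  # false positive rate
    # Law of total probability for P(positive)
    p_positive = (p_pos_given_disease * prevalence
                  + p_pos_given_healthy * (1.0 - prevalence))
    return p_pos_given_disease * prevalence / p_positive

p = posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(round(p, 3))  # 0.161
```

Raising the prevalence argument to 0.10 with the same test gives a posterior of about 0.68, which shows how strongly the base rate drives the result.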

Prior Distributions

Prior distributions represent knowledge about parameters before observing data. The choice of prior is a key feature of Bayesian analysis and involves both statistical and substantive considerations.

Types of Priors

Informative priors specify substantial knowledge about parameter values. They are appropriate when strong prior information exists from previous studies or expert knowledge. They have significant influence on posterior results.

Weakly informative priors constrain parameters without strongly specifying values. They prevent unreasonable conclusions while allowing data to dominate. They are more diffuse than informative priors but still restrict the parameter space.

Non-informative priors attempt to minimize prior influence, letting data dominate posterior results. They are appropriate when little prior information exists. Flat priors (constant density over the parameter range) are the most familiar example, although a prior that is flat on one scale is not flat after the parameter is transformed.

Prior Selection

Prior selection involves both statistical and domain considerations. Historical data can inform priors in empirical Bayes approaches. Expert opinion can be formally incorporated through elicitation. Sensitivity analysis examines how different priors affect conclusions.

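Vague priors can be unintentionally informative, especially in high dimensions, where they may concentrate probability in regions that do not reflect ignorance. Proper priors integrate to one and guarantee a proper posterior, which improper priors do not; being proper does not by itself make a prior uninformative.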

Priors should be specified before examining the data being analyzed. Tuning a prior to the same data used in the likelihood uses the data twice and tends to understate posterior uncertainty.

Posterior Distribution

The posterior distribution represents updated belief about parameters after seeing data. It combines prior belief with information from observed data.

Posterior Computation

For simple models, analytical solutions provide posterior distributions. Conjugate priors produce posteriors in the same family as priors, enabling closed-form solutions. Beta-Binomial, Normal-Normal, and Gamma-Poisson are conjugate pairs.
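
For the Beta-Binomial pair, the conjugate update amounts to adding counts to the prior parameters. The sketch below assumes a hypothetical Beta(2, 2) prior and hypothetical data of 7 successes and 3 failures; the helper names are illustrative.

```python
def beta_binomial_update(a, b, successes, failures):
    """Beta(a, b) prior plus binomial data gives a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

a_post, b_post = beta_binomial_update(a=2, b=2, successes=7, failures=3)
print(a_post, b_post)                       # 9 5
print(round(beta_mean(a_post, b_post), 3))  # 0.643
```

The posterior mean of about 0.643 lies between the prior mean (0.5) and the observed proportion (0.7), as expected when prior and data are combined.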

For complex models, numerical methods are required. Markov Chain Monte Carlo (MCMC) methods sample from posterior distributions. Gibbs sampling, Metropolis-Hastings, and Hamiltonian Monte Carlo are common MCMC algorithms.

The posterior contains all information about parameters. Point estimates derive from posterior distributions using various loss functions. The posterior mean minimizes expected squared error. The posterior median minimizes absolute error. The posterior mode (MAP estimate) corresponds to the highest density point.

Posterior Summaries

Posterior means provide point estimates under squared error loss. Posterior medians provide point estimates under absolute error loss. Posterior modes maximize posterior density.

Posterior standard deviations measure uncertainty. Smaller standard deviations indicate more precise estimates. Standard deviations are directly interpretable as uncertainty measures.

Credible intervals (sometimes called Bayesian confidence intervals) provide interval estimates. A 95% credible interval contains the parameter with 0.95 posterior probability, conditional on the model and prior. This direct probability interpretation distinguishes Bayesian intervals from frequentist confidence intervals, whose 95% refers to the long-run coverage of the procedure rather than to any single interval.
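
When the posterior is available as a set of draws, these summaries can be computed directly from the draws. The following sketch assumes, purely for illustration, that the posterior is Beta(9, 5) and simulates draws from it with the standard library; in practice the draws would come from an MCMC sampler. It computes an equal-tailed interval, which is one of several possible credible intervals.

```python
import random
import statistics

def summarize(draws, level=0.95):
    """Posterior mean, median, standard deviation, and equal-tailed credible interval."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * n)]
    upper = ordered[int((1.0 - tail) * n) - 1]
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "sd": statistics.stdev(ordered),
        "interval": (lower, upper),
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [rng.betavariate(9, 5) for _ in range(20000)]  # stand-in for MCMC output
summary = summarize(draws)
print(summary)
```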

Bayesian Inference for Common Models

Bayesian methods apply across many statistical models. Understanding common applications illustrates the general framework.

Bayesian t-Test

The Bayesian t-test models differences between groups using normal distributions with conjugate priors. The posterior for the mean difference can be derived analytically under appropriate priors.

The Bayesian approach provides direct probability statements about the mean difference. We can state that the probability of the difference exceeding zero is, say, 0.95, which is more intuitive than stating "the difference is significant at the 0.05 level."

Bayesian approaches also provide Bayesian hypothesis testing through Bayes factors comparing evidence for different hypotheses. The Bayes factor quantifies the relative evidence for one model versus another.

Bayesian Regression

Bayesian regression places priors on regression coefficients and uses MCMC to sample from the posterior. This approach provides full posterior distributions for all parameters, enabling inference about any function of coefficients.

The posterior can incorporate prior information about likely coefficient values. Shrinkage priors (such as the horseshoe prior) support variable selection by shrinking the coefficients of irrelevant predictors strongly toward zero while leaving large coefficients nearly unshrunk.

Bayesian regression naturally handles uncertainty in predictions. The posterior predictive distribution incorporates both parameter uncertainty and residual variability, so prediction intervals reflect both sources of uncertainty.

Hierarchical Models

Hierarchical (mixed effects) models have multiple levels of parameters with priors connecting levels. They are appropriate for data with grouped structure or multiple sources of variation.

Partial pooling across groups provides a compromise between complete pooling (ignoring group differences) and no pooling (treating groups completely separately). This is particularly valuable when some groups have few observations.
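
Partial pooling can be made concrete in the simplest normal-normal case. The sketch below assumes the within-group standard deviation (sigma), the between-group standard deviation (tau), and the overall mean are all known, which is rarely true in practice; a full hierarchical model would estimate them. All numbers are hypothetical.

```python
def partial_pool(group_mean, n_obs, grand_mean, sigma, tau):
    """Posterior mean of a group effect under a normal-normal model with known variances."""
    data_precision = n_obs / sigma ** 2   # information from the group's own data
    prior_precision = 1.0 / tau ** 2      # information from the population level
    weight = data_precision / (data_precision + prior_precision)
    return weight * group_mean + (1.0 - weight) * grand_mean

# Two groups with the same observed mean but different sample sizes
small_group = partial_pool(group_mean=80.0, n_obs=2, grand_mean=70.0, sigma=10.0, tau=5.0)
large_group = partial_pool(group_mean=80.0, n_obs=50, grand_mean=70.0, sigma=10.0, tau=5.0)
print(round(small_group, 2))  # 73.33
print(round(large_group, 2))  # 79.26
```

The group with only two observations is pulled most of the way toward the overall mean, while the group with fifty observations stays close to its own mean.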

MCMC software like Stan, JAGS, and PyMC provides flexible frameworks for fitting hierarchical models. These tools handle complex model structures that can be difficult to fit with standard frequentist software.

Bayesian Model Comparison

Bayesian model comparison provides formal methods for comparing different models or hypotheses. This addresses questions about model complexity and relative evidence.

Bayes Factors

The Bayes factor compares marginal likelihoods of models: BF₁₂ = P(data|Model 1) / P(data|Model 2). It represents the ratio of evidence for Model 1 versus Model 2.

Bayes factors are continuous measures of evidence. Values greater than 1 support Model 1; values less than 1 support Model 2. A commonly used scale adapted from Jeffreys provides interpretation guidelines: a BF of 1-3 is "barely worth mentioning," 3-10 is "moderate," 10-30 is "strong," 30-100 is "very strong," and >100 is "decisive."

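Bayes factors naturally penalize model complexity. A more complex model spreads its prior probability over a wider range of possible data sets, so it assigns less marginal likelihood to any particular data set unless the extra flexibility is needed to fit the data. This provides an automatic Occam's razor.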
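
A small worked case where both marginal likelihoods have closed forms: comparing a fair-coin model (θ fixed at 0.5) with a model that places a uniform prior on θ. Both models and the data are assumptions made for illustration. Under the uniform prior, the marginal probability of k successes in n trials is 1 / (n + 1) for every k.

```python
from math import comb

def bayes_factor_fair_vs_uniform(k, n):
    """BF for Model 1 (theta = 0.5) against Model 2 (theta ~ Uniform(0, 1))."""
    marginal_fair = comb(n, k) * 0.5 ** n   # P(data | Model 1)
    marginal_uniform = 1.0 / (n + 1)        # P(data | Model 2), theta integrated out
    return marginal_fair / marginal_uniform

bf_balanced = bayes_factor_fair_vs_uniform(k=5, n=10)
bf_lopsided = bayes_factor_fair_vs_uniform(k=9, n=10)
print(round(bf_balanced, 2))  # 2.71
print(round(bf_lopsided, 2))  # 0.11
```

With 5 successes in 10 trials the simpler fair-coin model is mildly favored, because the flexible model spreads its marginal likelihood evenly over all possible outcomes. With 9 successes the evidence shifts toward the flexible model.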

Bayesian Model Averaging

Instead of selecting a single model, Bayesian model averaging weights predictions by posterior model probabilities. This accounts for model uncertainty and typically provides better predictions than model selection.

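The posterior probability of each model is proportional to its prior probability times its marginal likelihood, normalized across all models (equivalently, its prior probability times its Bayes factor against a common reference model). Predictions average across models with weights equal to posterior model probabilities.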

Bayesian model averaging is particularly valuable when models make substantially different predictions. The average incorporates uncertainty about which model is correct.
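
The weighting can be sketched in a few lines, taking each posterior model probability as proportional to prior probability times marginal likelihood. The marginal likelihoods, prior probabilities, and per-model predictions below are hypothetical placeholders; obtaining real marginal likelihoods is usually the hard part.

```python
def posterior_model_probs(marginal_likelihoods, prior_probs):
    """Normalize prior probability times marginal likelihood across models."""
    unnormalized = [m * p for m, p in zip(marginal_likelihoods, prior_probs)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def model_average(predictions, weights):
    """Weighted average of per-model point predictions."""
    return sum(w * y for w, y in zip(weights, predictions))

weights = posterior_model_probs(marginal_likelihoods=[0.03, 0.01], prior_probs=[0.5, 0.5])
averaged = model_average(predictions=[10.0, 20.0], weights=weights)
print([round(w, 2) for w in weights])  # [0.75, 0.25]
print(round(averaged, 2))              # 12.5
```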

Information Criteria

Information criteria like WAIC (Widely Applicable Information Criterion) and LOO-CV (Leave-One-Out Cross-Validation) estimate out-of-sample predictive accuracy. They answer a different question than the Bayes factor: they assess how well a model is expected to predict new data, rather than comparing marginal likelihoods.

These criteria remain computationally tractable for complex models where computing marginal likelihoods is infeasible, because they can be estimated from pointwise log-likelihood values evaluated at posterior draws. They are particularly useful in MCMC settings.

The criteria include penalty terms for model complexity. They favor models that balance fit and parsimony. They can be used for model selection or model averaging using weights derived from information criteria.

Markov Chain Monte Carlo

MCMC methods sample from posterior distributions when analytical solutions are unavailable. These computational methods have enabled Bayesian methods for complex models.

Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm generates a Markov chain whose stationary distribution is the target posterior. It proceeds by proposing new parameter values and accepting or rejecting each one according to an acceptance probability.

The proposal distribution generates candidate points. The acceptance probability compares the posterior density at the candidate to the current point, adjusting for asymmetric proposals. This ensures the chain visits regions in proportion to their posterior density.

The algorithm requires a "burn-in" period before the chain reaches stationarity. Multiple chains with different starting points can diagnose convergence. Diagnostics should always be checked before using MCMC results.
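
The propose-then-accept loop can be sketched as a random-walk Metropolis sampler, the special case with a symmetric Gaussian proposal where the correction for asymmetric proposals cancels. The target here is an assumed stand-in posterior, Normal(3, 1), specified through its unnormalized log density; the step size and burn-in length are arbitrary choices for this sketch.

```python
import math
import random
import statistics

def random_walk_metropolis(log_target, start, n_samples, step=2.0, seed=0):
    """Sample from an unnormalized log density with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(current)
    chain = []
    n_accepted = 0
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        log_ratio = proposal_lp - current_lp  # symmetric proposal: no Hastings term
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_lp = proposal, proposal_lp
            n_accepted += 1
        chain.append(current)  # a rejected proposal repeats the current value
    return chain, n_accepted / n_samples

def log_target(theta):
    """Unnormalized log density of Normal(3, 1), an assumed example target."""
    return -0.5 * (theta - 3.0) ** 2

chain, accept_rate = random_walk_metropolis(log_target, start=0.0, n_samples=20000)
kept = chain[2000:]  # discard burn-in
print(statistics.mean(kept), statistics.stdev(kept), accept_rate)
```

Only ratios of the target density enter the acceptance step, so the normalizing constant P(data) never needs to be computed.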

Gibbs Sampling

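Gibbs sampling is a special case of Metropolis-Hastings in which each proposal is drawn from the full conditional distribution of a parameter, so the acceptance probability is always 1. It samples each parameter conditional on the current values of all others, cycling through parameters sequentially.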

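Gibbs sampling requires full conditional distributions that can be sampled directly, which is typically the case in models with conditionally conjugate prior-likelihood pairs. It is simple and needs no proposal tuning. However, it can mix slowly when parameters are highly correlated.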
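
A standard textbook illustration is a bivariate normal target with unit variances and correlation rho, where each full conditional is itself normal: x given y is Normal(rho × y, 1 - rho²), and symmetrically for y. The target and the value rho = 0.8 are assumptions chosen for this sketch, not a model from the text.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    x, y = 0.0, 0.0
    draws = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, cond_sd)  # draw x from its full conditional given y
        y = rng.gauss(rho * x, cond_sd)  # draw y from its full conditional given the new x
        draws.append((x, y))
    return draws

def correlation(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
    return cov / math.sqrt(var_x * var_y)

draws = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
estimated_rho = correlation(draws[1000:])  # discard early draws
print(round(estimated_rho, 2))
```

As rho approaches 1 the conditional standard deviation shrinks, so each update moves only a short distance; this is the slow mixing under strong correlation described above.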

Modern probabilistic programming languages (Stan, PyMC, JAGS) use MCMC methods automatically. Users specify models, and software handles sampling details.

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) uses gradient information to propose efficient moves through the parameter space. It typically explores high-dimensional posteriors far more efficiently than random-walk Metropolis-Hastings.

HMC works well for models with strongly correlated parameters. It requires the log posterior and its gradient, which restricts it to continuous parameters. Stan computes the gradients by automatic differentiation, making HMC accessible to users.

No-U-Turn Sampler (NUTS) extends HMC by automatically tuning trajectory length. This makes HMC more automatic and robust. Stan uses NUTS as its default sampler.
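
The mechanics of a basic HMC step (momentum draw, leapfrog integration, accept/reject) can be sketched in one dimension. This is plain HMC with a fixed, hand-picked step size and trajectory length, not NUTS and not Stan's implementation; the target is an assumed Normal(3, 1) with its gradient written by hand.

```python
import math
import random
import statistics

def hmc(neg_log_post, grad, start, n_samples, step_size=0.2, n_steps=20, seed=0):
    """One-dimensional HMC with a fixed-length leapfrog trajectory."""
    rng = random.Random(seed)
    q = start
    chain = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # fresh momentum each iteration
        q_new, p_new = q, p
        # Leapfrog: half momentum step, alternating full steps, final half momentum step
        p_new -= 0.5 * step_size * grad(q_new)
        for i in range(n_steps):
            q_new += step_size * p_new
            if i != n_steps - 1:
                p_new -= step_size * grad(q_new)
        p_new -= 0.5 * step_size * grad(q_new)
        # Accept or reject based on the change in total energy
        h_old = neg_log_post(q) + 0.5 * p * p
        h_new = neg_log_post(q_new) + 0.5 * p_new * p_new
        log_accept = h_old - h_new
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            q = q_new
        chain.append(q)
    return chain

def neg_log_post(q):
    """Negative unnormalized log density of Normal(3, 1), an assumed example target."""
    return 0.5 * (q - 3.0) ** 2

def grad(q):
    """Gradient of neg_log_post."""
    return q - 3.0

chain = hmc(neg_log_post, grad, start=0.0, n_samples=5000)
print(statistics.mean(chain[500:]), statistics.stdev(chain[500:]))
```

The step size and number of leapfrog steps are exactly the quantities that need tuning in plain HMC, and choosing the trajectory length automatically is the part NUTS addresses.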

Bayesian Decision Theory

Bayesian decision theory formalizes decision-making under uncertainty by combining inference with utility functions. This provides a framework for optimal decisions.

Loss Functions

Loss functions specify the cost of different decisions. Squared error loss (L(θ, a) = (θ - a)²) penalizes errors quadratically. Absolute error loss (L(θ, a) = |θ - a|) penalizes linearly. Zero-one loss penalizes any error equally.

The optimal decision minimizes expected loss with respect to the posterior distribution. This decision depends on the loss function as well as the posterior.

Different decisions are optimal under different loss functions. Sensitivity to loss function choice should be examined in practice.
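
The dependence of the optimal decision on the loss function can be checked numerically by minimizing expected loss over a grid of candidate actions. The sketch below uses draws from a right-skewed Gamma(2, 1) distribution as a hypothetical posterior, so that the mean and median differ noticeably; the grid and the sample size are arbitrary choices.

```python
import random
import statistics

def expected_loss(action, draws, loss):
    """Monte Carlo estimate of posterior expected loss for one action."""
    return sum(loss(theta, action) for theta in draws) / len(draws)

def best_action(draws, loss, candidates):
    """Candidate action with the smallest estimated expected loss."""
    return min(candidates, key=lambda a: expected_loss(a, draws, loss))

def squared_loss(theta, a):
    return (theta - a) ** 2

def absolute_loss(theta, a):
    return abs(theta - a)

rng = random.Random(1)
draws = [rng.gammavariate(2.0, 1.0) for _ in range(2000)]  # stand-in posterior draws
candidates = [i * 0.02 for i in range(301)]                 # actions from 0.00 to 6.00

best_squared = best_action(draws, squared_loss, candidates)
best_absolute = best_action(draws, absolute_loss, candidates)
print(best_squared, statistics.mean(draws))
print(best_absolute, statistics.median(draws))
```

Up to the grid resolution, the minimizer under squared error loss matches the mean of the draws and the minimizer under absolute error loss matches their median.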

Bayesian Optimization

Bayesian optimization uses Bayesian models to optimize expensive-to-evaluate functions. It builds a posterior model of the function and selects evaluation points balancing exploration and exploitation.

This approach is valuable when function evaluations are costly (like hyperparameter tuning or drug discovery). It requires a prior over functions and an acquisition function guiding evaluation selection.

Gaussian processes provide flexible prior models for Bayesian optimization. They provide posterior predictions with uncertainty estimates. Expected improvement and upper confidence bound are common acquisition functions.
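
Expected improvement has a closed form when the surrogate's prediction at a candidate point is Gaussian. The sketch below assumes a maximization problem and takes the predictive mean (mu) and standard deviation (sigma) as given inputs rather than fitting a Gaussian process; the two candidate points are hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement over best_so_far for a Normal(mu, sigma^2) prediction."""
    if sigma <= 0.0:
        return max(mu - best_so_far, 0.0)  # no uncertainty: improvement is deterministic
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * normal_cdf(z) + sigma * normal_pdf(z)

best_so_far = 1.0
ei_certain = expected_improvement(mu=1.0, sigma=0.1, best_so_far=best_so_far)    # near the incumbent
ei_uncertain = expected_improvement(mu=0.8, sigma=1.0, best_so_far=best_so_far)  # lower mean, high uncertainty
print(ei_certain, ei_uncertain)
```

The second candidate has a lower predicted mean but a larger expected improvement because of its uncertainty, which is how the acquisition function trades exploitation against exploration.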

Key Takeaways

Bayesian statistics treats parameters as random variables with probability distributions
Prior distributions represent knowledge before seeing data; posterior distributions represent updated knowledge
Bayes' theorem shows how to update prior beliefs given observed data
MCMC methods enable Bayesian inference for complex models
Bayes factors provide formal model comparison without ad-hoc criteria
Bayesian approaches provide intuitive probability interpretations unavailable in frequentist frameworks
