Statistical Modeling Fundamentals

Topic: Statistical Modeling

Nature of Statistical Models

Statistical models are simplified mathematical representations of real-world phenomena. They capture essential features while ignoring irrelevant detail. The modeling process involves making assumptions about data-generating mechanisms and translating those assumptions into testable mathematical forms.

George Box's aphorism "all models are wrong, but some are useful" captures an essential truth. Models are approximations; they cannot capture every nuance of reality. However, useful models capture important relationships well enough to enable prediction, inference, and understanding.

Model building is iterative. Initial models are refined as understanding develops. Diagnostics reveal assumption violations requiring model modification. The process continues until a satisfactory model emerges or resource limits are reached.

Model Specification

Model specification involves explicitly stating which variables are related, in what ways, and through what functional forms. This is the foundation upon which analysis rests.

Variable Selection

Dependent (response, outcome) variables are what we want to predict or explain. Independent (predictor, explanatory) variables are those hypothesized to influence the dependent variable.

The distinction between dependent and independent variables often reflects causal assumptions. In observational data, this distinction might reflect temporal precedence or theoretical reasoning rather than proven causation.

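Including relevant variables is essential. Omitting a variable that affects the outcome and is correlated with the included predictors leads to omitted variable bias. Including irrelevant variables generally reduces precision by inflating the variance of the estimates. The goal is a specification that contains the variables supported by theory and evidence, and no more.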

Functional Form

Linear models assume linear (straight-line) relationships. Non-linear models allow curved relationships. The choice affects interpretation and predictions.

Polynomial terms allow curved relationships while maintaining linear model estimation. Log transformations linearize certain relationships. Interactions model how relationships depend on other variables.
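
As a minimal sketch of the first point, the example below fits a quadratic curve with ordinary linear least squares. The data are synthetic and the coefficient values are illustrative assumptions; the point is only that the model is curved in x yet linear in its coefficients.

```python
import numpy as np

# Synthetic data (illustrative only): a curved relationship with noise.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(scale=0.1, size=x.size)

# Design matrix with columns 1, x, x^2. The model is curved in x
# but linear in the coefficients, so linear least squares applies.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("intercept, linear, quadratic:", coef)
```

An interaction would be added the same way, as an extra column holding the product of two predictors.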

The choice of functional form should be guided by theory, data exploration, and scientific goals. Overly complex forms might overfit; overly simple forms might miss important patterns.

Model Fitting

Model fitting estimates parameters that make the model best match observed data. Different estimation principles suit different situations.

Least Squares Estimation

Ordinary Least Squares (OLS) minimizes the sum of squared residuals. Under the classical assumptions, the OLS estimator is unbiased, consistent, and has minimum variance among linear unbiased estimators (the Gauss-Markov theorem).

The mathematical solution involves matrix operations. For linear models, explicit formulas exist. For non-linear models, iterative optimization is required.
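
For the linear case, the explicit formula is the solution of the normal equations, (X'X) b = X'y. The sketch below applies it to synthetic data; the variable names and the true coefficients are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (illustrative only): intercept plus two predictors.
rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# Closed-form OLS: solve the normal equations (X'X) beta = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals are orthogonal to every column of X at the solution.
residuals = y - X @ beta_hat
rss = float(residuals @ residuals)

print("estimates:", beta_hat)
print("residual sum of squares:", rss)
```

In practice a routine such as `np.linalg.lstsq` is usually preferred over forming X'X directly, because it is more stable when predictors are nearly collinear.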

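Ordinary least squares is appropriate when error terms have equal variance and are uncorrelated. Weighted least squares addresses heteroscedasticity by weighting observations inversely to their error variance. Generalized least squares handles a general error covariance structure, including autocorrelation.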

Maximum Likelihood Estimation

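Maximum Likelihood Estimation (MLE) finds the parameter values that maximize the likelihood, that is, the probability (or probability density) of the observed data viewed as a function of the parameters. Under regularity conditions, maximum likelihood estimators are consistent, asymptotically normal, and asymptotically efficient.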

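For normal linear models, MLE gives the same coefficient estimates as OLS, although its estimate of the error variance divides by n rather than by the residual degrees of freedom. For non-normal distributions or non-linear models, MLE differs from OLS and might be preferred.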

The likelihood function determines the objective. Optimization finds the parameter values that maximize it, usually by working with the log-likelihood. For most models no closed-form solution exists, so numerical methods are required.
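
To make the numerical step concrete, here is a minimal sketch of Newton's method maximizing the log-likelihood of a logistic regression, a model with no closed-form solution. The data are synthetic, and the function name `log_likelihood` and the iteration limit are illustrative choices rather than a reference implementation.

```python
import numpy as np

# Synthetic binary outcomes (illustrative only).
rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

def log_likelihood(beta):
    # Bernoulli log-likelihood with a logit link:
    # sum over observations of y * eta - log(1 + exp(eta)).
    eta = X @ beta
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

beta = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))      # fitted probabilities
    gradient = X.T @ (y - mu)                   # score vector
    hessian = -(X.T * (mu * (1.0 - mu))) @ X    # second derivatives
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step                          # Newton update
    if np.max(np.abs(step)) < 1e-10:
        break

print("MLE:", beta, "log-likelihood:", log_likelihood(beta))
```

At the maximum the gradient is numerically zero, which is a convenient convergence check.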

Bayesian Estimation

Bayesian estimation treats parameters as random variables. Prior distributions represent initial beliefs. Data update these beliefs through Bayes' theorem to produce posterior distributions.

Posterior distributions summarize what is known about the parameters given the model, the prior, and the data. Point estimates, such as the posterior mean or median, derive from the distribution. Credible intervals provide uncertainty measures.
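
A small worked example may help. The sketch below uses the conjugate Beta-binomial model, where the posterior is available in closed form. The data (18 successes in 50 trials) and the uniform Beta(1, 1) prior are hypothetical choices made purely for illustration.

```python
import numpy as np

# Hypothetical data: 18 successes in 50 trials.
successes, trials = 18, 50

# Assumed prior: Beta(1, 1), i.e. uniform on the success probability.
prior_a, prior_b = 1.0, 1.0

# Conjugacy: Beta prior + binomial likelihood gives a Beta posterior.
post_a = prior_a + successes
post_b = prior_b + (trials - successes)

# Point estimate: the posterior mean.
posterior_mean = post_a / (post_a + post_b)

# Approximate 95% equal-tailed credible interval from posterior draws.
rng = np.random.default_rng(3)
draws = rng.beta(post_a, post_b, size=100_000)
lower, upper = np.quantile(draws, [0.025, 0.975])

print(f"posterior mean {posterior_mean:.3f}, interval ({lower:.3f}, {upper:.3f})")
```

Drawing samples is unnecessary for a Beta posterior, but it mirrors how summaries are computed when the posterior is only available through simulation.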

Markov chain Monte Carlo (MCMC) methods enable Bayesian estimation for complex models by drawing samples from the posterior when it has no closed form. This has extended Bayesian methods to previously intractable problems.

Model Diagnostics

Model diagnostics evaluate whether assumptions are satisfied. Violations can bias results or make inference invalid.

Residual Analysis

Residuals (observed minus predicted values) reveal model shortcomings. Plotting residuals against predicted values checks assumptions about error structure.

Heteroscedasticity appears as changing variance in residual plots. Non-linearity appears as curved patterns. Non-normality appears in residual histograms or Q-Q plots.

Formal tests complement visual diagnosis. Tests for heteroscedasticity, non-normality, and other violations exist. However, tests have limited power with small samples and can flag practically negligible violations in very large ones.

Influence Analysis

Influence measures assess how much each observation affects results. Cook's distance combines residual magnitude and leverage. High influence points warrant special attention.
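
The sketch below computes leverage and Cook's distance for a simple linear regression using the standard formula D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2. The data are synthetic, and one unusual observation is planted deliberately so that it stands out.

```python
import numpy as np

# Synthetic data (illustrative only).
rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)

# Plant one point that is extreme on the predictor and far off the line.
x[0], y[0] = 6.0, -5.0

X = np.column_stack([np.ones(n), x])
p = X.shape[1]

# Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
mse = resid @ resid / (n - p)

# Cook's distance combines residual size and leverage.
cooks_d = resid**2 / (p * mse) * leverage / (1.0 - leverage) ** 2

print("most influential observation:", int(np.argmax(cooks_d)))
```

The leverages sum to the number of coefficients, which is a quick sanity check on the computation.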

Observations might be influential because they are unusual on predictors, have large residuals, or both. Each type of influence affects different aspects of results.

Outliers might be errors requiring correction, genuine extremes requiring separate analysis, or something in between. Understanding outliers is important for appropriate handling.

Model Comparison

Model comparison evaluates different models fit to the same data. This guides model selection and provides evidence about underlying relationships.

Information Criteria

Information criteria (AIC, BIC, DIC) balance model fit against complexity. Among models fit to the same data, lower values indicate the preferred model. The penalty for complexity differs across criteria.

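AIC (Akaike Information Criterion) estimates, up to a constant, the expected Kullback-Leibler divergence between the fitted model and the true data-generating process. BIC (Bayesian Information Criterion) approximates minus twice the log marginal likelihood, so it relates to posterior model probability. Both penalize the number of parameters: AIC adds 2 per parameter, while BIC adds log(n) per parameter, which is the heavier penalty once n is 8 or more.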
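
As an illustration, the sketch below computes AIC = 2k - 2 log L and BIC = k log(n) - 2 log L for polynomial fits of increasing degree under a Gaussian error model. The helper names `gaussian_loglik` and `aic_bic` are hypothetical, and the data are synthetic with a quadratic signal.

```python
import numpy as np

def gaussian_loglik(y, fitted):
    # Gaussian log-likelihood evaluated at the MLE of the variance (RSS / n).
    n = y.size
    rss = float(np.sum((y - fitted) ** 2))
    return -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)

def aic_bic(loglik, k, n):
    # k counts every estimated parameter, including the error variance.
    aic = 2.0 * k - 2.0 * loglik
    bic = k * np.log(n) - 2.0 * loglik
    return aic, bic

# Synthetic data (illustrative only): the true curve is quadratic.
rng = np.random.default_rng(5)
x = np.linspace(0.0, 4.0, 80)
y = 1.0 + 0.8 * x**2 + rng.normal(scale=0.5, size=x.size)

results = {}
for name, degree in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
    X = np.vander(x, degree + 1)              # polynomial design matrix
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    loglik = gaussian_loglik(y, X @ coef)
    results[name] = aic_bic(loglik, k=degree + 2, n=y.size)

for name, (aic, bic) in results.items():
    print(f"{name:10s} AIC={aic:8.2f} BIC={bic:8.2f}")
```

Only differences between models matter; the absolute values of the criteria have no interpretation on their own.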

Criteria are useful for comparing non-nested models and selecting among many candidates. They do not directly test hypotheses.

Likelihood Ratio Tests

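Likelihood ratio tests compare nested models. The test statistic equals -2 times the log of the likelihood ratio (restricted model over unrestricted model). Under the null hypothesis and regularity conditions, it asymptotically follows a chi-square distribution with degrees of freedom equal to the number of restricted parameters.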
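
The following sketch carries out such a test for two nested Gaussian linear models that differ by one coefficient. The data are synthetic and the helper `max_gaussian_loglik` is a hypothetical name. Because the models differ by a single parameter, the chi-square tail probability with 1 degree of freedom can be computed from the standard library alone.

```python
import math
import numpy as np

def max_gaussian_loglik(X, y):
    # Maximized Gaussian log-likelihood of a linear model.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    n = y.size
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)

# Synthetic data (illustrative only): x2 has a real effect.
rng = np.random.default_rng(6)
n = 120
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.0 * x1 + 0.6 * x2 + rng.normal(size=n)

X_reduced = np.column_stack([np.ones(n), x1])       # null model: no x2
X_full = np.column_stack([np.ones(n), x1, x2])      # one extra parameter

ll_reduced = max_gaussian_loglik(X_reduced, y)
ll_full = max_gaussian_loglik(X_full, y)

# -2 times the log-likelihood ratio (reduced over full).
lr_stat = -2.0 * (ll_reduced - ll_full)

# Chi-square upper tail with 1 degree of freedom: P(Z^2 > s) = erfc(sqrt(s / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(f"LR statistic {lr_stat:.2f}, p-value {p_value:.2g}")
```

For more than one restricted parameter, a general chi-square survival function (for example from a statistics library) would replace the `erfc` shortcut.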

The test can only compare models where one is a special case of the other. For non-nested models, other approaches are needed.

Nested model testing provides a framework for stepwise model building. Starting from a minimal model, adding terms can be tested formally.

Cross-Validation

Cross-validation evaluates predictive performance by holding out test data. This provides realistic estimates of how the model will perform on new data.

K-fold cross-validation divides data into k parts. The model is fit k times, each time using k-1 parts for training and 1 part for testing. Results are averaged.
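
The procedure can be sketched in a few lines. The function `k_fold_mse` below is a hypothetical helper that scores OLS fits by held-out mean squared error; the data are synthetic, and the polynomial degrees compared are arbitrary illustrative choices.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out mean squared error of OLS over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))       # shuffle before splitting
    folds = np.array_split(indices, k)
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on k-1 folds, evaluate on the held-out fold.
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        errors = y[test_idx] - X[test_idx] @ coef
        fold_mse.append(float(np.mean(errors**2)))
    return float(np.mean(fold_mse))

# Synthetic data (illustrative only): a smooth non-linear signal.
rng = np.random.default_rng(7)
n = 100
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(1.5 * x) + rng.normal(scale=0.2, size=n)

cv_scores = {
    degree: k_fold_mse(np.vander(x, degree + 1), y, k=5)
    for degree in (1, 3, 9)
}
print(cv_scores)
```

Calling the same function with `k` equal to the number of observations holds out one observation at a time.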

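Leave-one-out cross-validation is the special case k = n: each observation is held out once while the remaining n-1 observations are used for training. This is computationally intensive for most models but provides nearly unbiased estimates of prediction error.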

Prediction and Inference

Statistical models serve two primary purposes: predicting future outcomes and making inferences about underlying relationships.

Prediction

Prediction uses the fitted model to estimate outcomes for new predictor values. Point predictions provide single values; prediction intervals provide ranges likely to contain future values.

Prediction uncertainty comes from two sources: parameter estimation uncertainty and inherent outcome variability. Both contribute to prediction intervals.
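
The two sources can be separated explicitly in a linear model. In the sketch below, the variance of the estimated mean at a new point reflects parameter uncertainty, and the residual variance reflects outcome variability. The data are synthetic, and the multiplier 1.96 is a normal approximation assumed for simplicity; a t quantile is more exact for small samples.

```python
import numpy as np

# Synthetic data (illustrative only).
rng = np.random.default_rng(8)
n = 150
x = rng.uniform(0.0, 10.0, size=n)
y = 3.0 + 0.7 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])

x_new = np.array([1.0, 5.0])    # intercept term and predictor value 5.0
point = float(x_new @ beta_hat)

# Source 1: uncertainty in the estimated coefficients.
var_mean = sigma2_hat * float(x_new @ XtX_inv @ x_new)
# Source 2: inherent variability of a single new outcome.
var_noise = sigma2_hat

z = 1.96  # assumed normal approximation to the 97.5% quantile
ci_half = z * np.sqrt(var_mean)                 # interval for the mean response
pi_half = z * np.sqrt(var_mean + var_noise)     # interval for a new outcome

print(f"prediction {point:.2f}, CI +/- {ci_half:.2f}, PI +/- {pi_half:.2f}")
```

The prediction interval is always wider than the confidence interval for the mean, and it does not shrink to zero as the sample grows.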

Generalization to new situations depends on whether new situations resemble training data. Extrapolation beyond observed predictor ranges is risky.

Inference

Inference draws conclusions about population parameters from sample estimates. Hypothesis tests evaluate specific claims. Confidence intervals provide ranges for parameters.

Causal inference is particularly challenging. Association does not imply causation. Stronger designs (randomization) or careful analysis (controlling for confounders) are needed.

Effect sizes quantify relationship magnitudes beyond statistical significance. Practical significance considers real-world importance, not just statistical detectability.

Regularization and Shrinkage

Regularization addresses overfitting by constraining coefficient estimates. This is valuable when many predictors exist relative to observations.

Ridge Regression

Ridge regression adds a penalty proportional to the sum of squared coefficients. This shrinks coefficients toward zero, reducing variance at the cost of some bias.

The shrinkage parameter controls penalty strength and is commonly chosen by cross-validation. Ridge handles multicollinearity well because the penalty stabilizes estimates for correlated predictors.

Ridge cannot set coefficients exactly to zero. All predictors remain in the model, though some might have negligible effects.
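
Ridge has a closed-form solution, (X'X + lambda I)^-1 X'y, which the sketch below applies to centered synthetic data containing two nearly collinear predictors. The function name `ridge_fit` and the grid of penalty values are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients for centered data: (X'X + lam * I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (illustrative only): two nearly collinear predictors
# plus one independent predictor.
rng = np.random.default_rng(9)
n = 60
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.01 * rng.normal(size=n),
    z + 0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
X = X - X.mean(axis=0)          # centering removes the need for an intercept
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=n)
y = y - y.mean()

coefs = {lam: ridge_fit(X, y, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
norms = {lam: float(np.linalg.norm(b)) for lam, b in coefs.items()}

for lam, b in coefs.items():
    print(f"lambda={lam:6.1f} coefficients={np.round(b, 3)}")
```

As the penalty grows the coefficient vector shrinks, but no entry becomes exactly zero.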

Lasso Regression

Lasso adds a penalty proportional to the sum of absolute coefficient values. Unlike the ridge penalty, this can shrink coefficients to exactly zero.

As a result, the lasso performs automatic variable selection. It is particularly useful when many predictors might be irrelevant.
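
The exact zeros come from the soft-thresholding step in the lasso solution. The sketch below uses cyclic coordinate descent, one common way to fit the lasso, on standardized synthetic data. The function names, the penalty value 0.3, and the fixed iteration count are illustrative assumptions, not a tuned implementation.

```python
import numpy as np

def soft_threshold(value, threshold):
    # Shrink toward zero; values within the threshold become exactly zero.
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution removed.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta

# Synthetic data (illustrative only): 2 relevant and 6 irrelevant predictors.
rng = np.random.default_rng(10)
n, p = 100, 8
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=n)
y = y - y.mean()

beta_lasso = lasso_coordinate_descent(X, y, lam=0.3)
n_zero = int(np.sum(beta_lasso == 0.0))

print("coefficients:", np.round(beta_lasso, 3))
print("coefficients set exactly to zero:", n_zero)
```

The retained coefficients are shrunk toward zero as well, which is the bias the penalty trades for lower variance.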

Elastic net combines ridge and lasso penalties, capturing benefits of both when predictors are correlated.

Hierarchical and Mixed Models

Mixed models incorporate both fixed and random effects. They are appropriate for data with grouped structure or multiple sources of variation.

Fixed and Random Effects

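Fixed effects are parameters of interest estimated directly from the data. Random effects represent group-level deviations that are treated as draws from a population distribution, because the observed groups are a sample from a larger set of possible groups.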

Mixed models estimate both fixed effects and variance components for random effects. Random effects are often used for grouping (subjects, schools, clinics).

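Mixed models properly account for within-group correlation. Ignoring grouping typically leads to standard errors that are too small, because correlated observations carry less information than the same number of independent ones.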
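
The effect on standard errors can be seen without fitting a full mixed model. The sketch below simulates a balanced random-intercept structure and compares the naive standard error of the overall mean with one based on group means. The group counts and variances are illustrative assumptions, and this is a simplified demonstration rather than a mixed model fit.

```python
import numpy as np

# Simulated random-intercept data: y_ij = mu + u_i + e_ij (illustrative only).
rng = np.random.default_rng(11)
n_groups, per_group = 20, 25
group_effects = rng.normal(scale=1.0, size=n_groups)           # u_i
noise = rng.normal(scale=1.0, size=(n_groups, per_group))      # e_ij
y = 5.0 + group_effects[:, None] + noise

# Naive standard error treats all 500 values as independent.
naive_se = y.std(ddof=1) / np.sqrt(y.size)

# With balanced groups, the group means are independent, so the
# standard error of the grand mean follows from their spread.
group_means = y.mean(axis=1)
cluster_se = group_means.std(ddof=1) / np.sqrt(n_groups)

print(f"naive SE {naive_se:.3f}, group-aware SE {cluster_se:.3f}")
```

The naive figure understates the uncertainty because observations within a group share the same group effect.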

Applications

Mixed models are used for longitudinal data where repeated measures come from the same subjects, for multilevel data with observations nested in groups, and for designs with both within-subject and between-subject factors.

The models can be fit with maximum likelihood, restricted maximum likelihood (REML), or Bayesian methods. Implementations are available in most statistical software.

Key Takeaways

  1. Statistical models are simplified representations that capture essential relationships while ignoring irrelevant detail
  2. Model specification defines relationships among variables through choice of variables and functional forms
  3. Different estimation methods (OLS, MLE, Bayesian) suit different situations
  4. Model diagnostics evaluate assumptions and identify problems requiring attention
  5. Model comparison uses information criteria, hypothesis tests, or cross-validation
  6. Regularization addresses overfitting when many predictors exist relative to observations
