
January 1, 1970

2086 Lecture 8 Model Selection And Penalized Regression


Lecture Note: Lecture 8 Notes.pdf


Underfitting, Overfitting and MSPE

  • Underfitting: model too simple, misses real signal, high bias.
  • Overfitting: model too complex, learns noise, poor generalization.

For test data $(x'_i, y'_i)$, the mean-squared prediction error is:

$$\mathrm{MSPE}(\hat M)=\frac{1}{n'}\sum_{i=1}^{n'}\left(y'_i-\hat y_i(\hat M)\right)^2$$
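As a minimal sketch of the formula above, the MSPE is just the average squared gap between held-out responses and the model's predictions. The function name `mspe` and the numbers are illustrative assumptions, not part of the lecture.

```python
def mspe(y_test, y_pred):
    """Mean-squared prediction error over the n' held-out observations."""
    if len(y_test) != len(y_pred):
        raise ValueError("y_test and y_pred must have the same length")
    n_test = len(y_test)
    return sum((y - yhat) ** 2 for y, yhat in zip(y_test, y_pred)) / n_test


# Hypothetical held-out responses and the predictions of some fitted model.
y_test = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mspe(y_test, y_pred))  # squared errors 1, 0, 4 averaged over 3 points
```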

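The expected prediction error can be decomposed as:

$$\mathrm{EMSPE}=\text{bias}^2+\text{variance}$$

When the error is measured against noisy test responses, as above, the expectation also contains an irreducible noise variance. That term does not depend on the model, so it does not affect comparisons between models.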

Goal of model selection: trade off bias and variance to minimize prediction error.

Selecting Predictors

Hypothesis testing and multiple testing

For each predictor:

$$H_0:\beta_j=0 \quad\text{vs}\quad H_A:\beta_j\neq0$$

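If we test many predictors at the same significance level $\alpha$, the chance of at least one false positive grows with the number of tests.
The Bonferroni correction declares a predictor significant only if:

$$p\text{-value}<\frac{\alpha}{p}$$

where $p$ is the number of tests.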
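The Bonferroni rule can be sketched in a few lines. The function name `bonferroni_select` and the p-values below are made up for illustration.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return indices of predictors whose p-value is below alpha / p."""
    p = len(p_values)          # number of tests
    threshold = alpha / p      # Bonferroni-corrected threshold
    return [j for j, pv in enumerate(p_values) if pv < threshold]


# Hypothetical p-values for p = 5 predictors.
p_values = [0.001, 0.03, 0.012, 0.2, 0.009]
print(bonferroni_select(p_values, alpha=0.05))            # corrected threshold 0.05 / 5
print([j for j, pv in enumerate(p_values) if pv < 0.05])  # uncorrected, for comparison
```

With these numbers the corrected threshold keeps only the predictors at indices 0 and 4, while the uncorrected test would also keep indices 1 and 2.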

Information criteria

Score each model by its negative log-likelihood plus a complexity penalty, and prefer the model with the smallest score:

$$L(y\mid \hat M)+\text{penalty}$$

Common forms:

$$\mathrm{AIC}(\hat M)=L(y\mid \hat M)+k_M$$

$$\mathrm{KIC}(\hat M)=L(y\mid \hat M)+\frac{3}{2}k_M$$

$$\mathrm{BIC}(\hat M)=L(y\mid \hat M)+\frac{k_M}{2}\log n$$

$$\mathrm{RIC}(\hat M)=L(y\mid \hat M)+k_M\log p$$

$k_M$ is the number of predictors in model $M$, $n$ is the sample size, and $p$ is the total number of candidate predictors.
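A small sketch of how these scores might be computed and compared, using the penalty forms exactly as written in these notes. The function name, the candidate models, and their negative log-likelihood values are hypothetical.

```python
import math


def information_criteria(neg_log_lik, k, n, p):
    """Penalized scores as defined in these notes (smaller is better).

    neg_log_lik: negative log-likelihood L(y | M) of the fitted model
    k: number of predictors in the model
    n: number of observations
    p: total number of candidate predictors
    """
    return {
        "AIC": neg_log_lik + k,
        "KIC": neg_log_lik + 1.5 * k,
        "BIC": neg_log_lik + 0.5 * k * math.log(n),
        "RIC": neg_log_lik + k * math.log(p),
    }


# Hypothetical candidates: (name, negative log-likelihood, number of predictors).
candidates = [("small", 120.0, 2), ("medium", 112.0, 5), ("large", 110.5, 9)]
n, p = 100, 10
for name, nll, k in candidates:
    scores = information_criteria(nll, k, n, p)
    print(name, {c: round(v, 2) for c, v in scores.items()})

best_bic = min(candidates, key=lambda c: information_criteria(c[1], c[2], n, p)["BIC"])
print("BIC picks:", best_bic[0])
```

The larger model always has the lower negative log-likelihood here, so the penalty is what stops the largest model from winning automatically.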

Cross-validation (CV)

Core idea: simulate future prediction with repeated train/test splits.

Basic steps:

  1. Split data into training and testing sets.
  2. Fit model on training set.
  3. Evaluate prediction error on testing set.
  4. Repeat and average errors.

Common variants:

  • $K$-fold CV (usually $K=10$),
  • Leave-one-out CV (LOO CV).
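The basic steps above can be sketched with the standard library alone. This is an illustrative sketch, not the lecture's code: the model is a one-predictor least-squares line, the data are synthetic, and all function names are assumptions. Leave-one-out CV is the special case where the number of folds equals the number of observations.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x (one predictor)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def cv_error(x, y, k=10, seed=0):
    """Average held-out squared error over k folds."""
    folds = k_fold_indices(len(x), k, seed)
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        # Fit on the training part only, then score on the held-out fold.
        b0, b1 = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        err = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(err)
    return sum(fold_errors) / k


# Synthetic data: y = 1 + 2x + Gaussian noise (illustrative only).
rng = random.Random(1)
x = [i / 10 for i in range(50)]
y = [1 + 2 * xi + rng.gauss(0, 0.5) for xi in x]
print("10-fold CV error:", cv_error(x, y, k=10))
print("LOO CV error:", cv_error(x, y, k=len(x)))
```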

Penalized Regression

Instead of hard include/exclude decisions, shrink coefficients:

$$(\hat\beta_0,\hat\beta_\lambda)= \arg\min_{\beta_0,\beta} \left\{ \mathrm{RSS}(\beta_0,\beta)+ \lambda\sum_{j=1}^p g(\beta_j) \right\}$$

  • $\lambda$ controls penalty strength.
  • $\lambda\to0$: close to least squares.
  • larger $\lambda$: stronger shrinkage, lower complexity.
  • usually do not penalize the intercept $\beta_0$.

Predictors should be standardized before penalization, so that the penalty treats every coefficient on the same scale:

$$\sum_{i=1}^n x_{i,j}=0,\qquad \sum_{i=1}^n x_{i,j}^2=n$$
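A minimal sketch of this standardization for a single predictor column. Note that the scale divides by $n$ rather than $n-1$, which is what makes the sum of squares equal $n$. The function name and data are illustrative assumptions.

```python
import math


def standardize(column):
    """Centre a predictor and scale it so that sum(x) = 0 and sum(x^2) = n."""
    n = len(column)
    mean = sum(column) / n
    centred = [v - mean for v in column]
    # Root-mean-square of the centred values (divides by n, not n - 1).
    scale = math.sqrt(sum(v * v for v in centred) / n)
    if scale == 0:
        raise ValueError("a constant predictor cannot be standardized")
    return [v / scale for v in centred]


# Hypothetical predictor measured on an arbitrary scale.
x = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(x)
print(sum(z), sum(v * v for v in z))  # close to 0 and to n = 5, up to rounding
```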

This framework also extends to logistic regression (penalized likelihood).

Ridge regression

$$(\hat\beta_0,\hat\beta_\lambda)= \arg\min_{\beta_0,\beta} \left\{ \mathrm{RSS}(\beta_0,\beta)+ \lambda\sum_{j=1}^p \beta_j^2 \right\}$$

Strengths:

  • very stable,
  • handles correlated predictors well,
  • computationally efficient.

Weakness:

  • coefficients are shrunk but usually not exactly zero (no direct variable selection).
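For reference, the ridge objective has a closed-form minimizer. The notation here is an assumption added for illustration: $X$ is the $n\times p$ matrix of standardized (hence centred) predictors, $y$ the response vector, $\bar y$ its mean, and $I_p$ the $p\times p$ identity. Because the columns of $X$ sum to zero, the unpenalized intercept is $\hat\beta_0=\bar y$, and setting the gradient of the objective with respect to $\beta$ to zero gives:

$$-2X^\top(y-\bar y\mathbf{1}-X\beta)+2\lambda\beta=0 \quad\Longrightarrow\quad \hat\beta_\lambda=\left(X^\top X+\lambda I_p\right)^{-1}X^\top y$$

For any $\lambda>0$ the matrix $X^\top X+\lambda I_p$ is positive definite and therefore invertible, even when predictors are strongly correlated or $p>n$. This is one way to see the stability listed above, and also why the solution generally has no exact zeros.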

Lasso regression

$$(\hat\beta_0,\hat\beta_\lambda)= \arg\min_{\beta_0,\beta} \left\{ \mathrm{RSS}(\beta_0,\beta)+ \lambda\sum_{j=1}^p |\beta_j| \right\}$$

Strengths:

  • stable,
  • can set some coefficients exactly to zero,
  • performs variable selection + estimation together.

Weaknesses:

  • may bias large coefficients downward,
  • can be less robust than ridge under strong predictor correlation,
  • may still overfit in some real datasets.
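The notes do not say how the lasso is computed. One common approach is cyclic coordinate descent with soft-thresholding, sketched below under the assumption that the columns of `X` are centred. Because the objective uses RSS without a factor of 1/2, the threshold works out to $\lambda/2$. The function names and the tiny dataset are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise RSS + lam * sum(|beta_j|) by cyclic coordinate descent.

    X is a list of rows with centred columns; the intercept is mean(y).
    """
    n, p = len(X), len(X[0])
    b0 = sum(y) / n
    beta = [0.0] * p
    resid = [yi - b0 for yi in y]  # residuals for the current beta
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Inner product of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_ss[j] * beta[j]
            new_bj = soft_threshold(rho, lam / 2) / col_ss[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return b0, beta


# Hypothetical standardized design: two orthogonal columns with sum 0, sum of squares 4.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [1 + 3 * r[0] + 0.2 * r[1] for r in X]  # true coefficients 3 and 0.2
print(lasso_cd(X, y, lam=0.0))  # no penalty: approximately least squares
print(lasso_cd(X, y, lam=4.0))  # large coefficient shrunk, small one set to zero
```

With $\lambda=4$ the first coefficient is pulled from about 3 down to about 2.5 and the second becomes exactly zero, which illustrates both the variable selection and the downward bias on large coefficients noted above.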

Choosing $\lambda$

Standard approach:

  1. Define a grid of $\lambda$ values.
  2. Use CV to estimate error for each $\lambda$.
  3. Choose the $\lambda$ with the smallest CV error.
  4. Refit the final model on all data using this $\lambda$.
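The four steps can be sketched end to end with a one-predictor ridge fit, chosen here only because its solution fits in one line. The grid, the synthetic data, and all function names are assumptions for illustration. The same folds are reused for every grid value so that the CV errors are directly comparable.

```python
import random


def ridge_1d(x, y, lam):
    """One-predictor ridge fit: intercept unpenalized, slope shrunk by lam."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / (sxx + lam)
    return ybar - b1 * xbar, b1


def cv_error(x, y, lam, k=5, seed=0):
    """K-fold CV estimate of squared prediction error for one lambda."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)  # fixed seed: identical folds for every lambda
    folds = [idx[f::k] for f in range(k)]
    total = 0.0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in idx if i not in held_out]
        b0, b1 = ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - b0 - b1 * x[i]) ** 2 for i in test_idx) / len(test_idx)
    return total / k


def choose_lambda(x, y, grid, k=5):
    """Return (lambda with smallest CV error, CV errors for each grid value)."""
    errors = [cv_error(x, y, lam, k) for lam in grid]
    return grid[errors.index(min(errors))], errors


# Synthetic data with a weak signal (illustrative only).
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(30)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]        # step 1
best_lam, errors = choose_lambda(x, y, grid)  # steps 2 and 3
b0, b1 = ridge_1d(x, y, best_lam)           # step 4: refit on all data
print("CV errors:", [round(e, 3) for e in errors])
print("chosen lambda:", best_lam, "refit:", b0, b1)
```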

Bias-Variance View of Penalization

Least squares can have low bias but high variance.
Penalization introduces some bias but can greatly reduce variance.
A well-chosen $\lambda$ lowers total prediction error by exploiting this trade-off.
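A worked example makes the trade-off concrete. The setting is an assumption chosen for simplicity, not taken from the lecture: a single standardized predictor ($\sum_i x_i=0$, $\sum_i x_i^2=n$), no intercept, and a true model $y_i=\beta x_i+\varepsilon_i$ with noise variance $\sigma^2$. The ridge estimate is then a scaled copy of the least-squares estimate:

$$\hat\beta_\lambda=\frac{\sum_i x_i y_i}{n+\lambda}=\frac{n}{n+\lambda}\,\hat\beta_{\mathrm{LS}}$$

Since $\hat\beta_{\mathrm{LS}}$ is unbiased with variance $\sigma^2/n$:

$$\text{bias}(\hat\beta_\lambda)=-\frac{\lambda}{n+\lambda}\,\beta,\qquad \text{variance}(\hat\beta_\lambda)=\frac{n\sigma^2}{(n+\lambda)^2}$$

$$\mathrm{MSE}(\lambda)=\text{bias}^2+\text{variance}=\frac{\lambda^2\beta^2+n\sigma^2}{(n+\lambda)^2}$$

At $\lambda=0$ this is the least-squares value $\sigma^2/n$. The derivative with respect to $\lambda$ has the sign of $\lambda\beta^2-\sigma^2$, so the error falls as $\lambda$ increases from zero and is smallest at $\lambda^*=\sigma^2/\beta^2$, where $\mathrm{MSE}(\lambda^*)=\sigma^2/(n+\sigma^2/\beta^2)<\sigma^2/n$. In practice $\beta$ and $\sigma^2$ are unknown, which is why $\lambda$ is chosen by cross-validation.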
