
2086 Lecture 8 Model Selection And Penalized Regression



Underfitting, Overfitting and MSPE

Underfitting: model too simple, misses real signal, high bias.
Overfitting: model too complex, learns noise, poor generalization.

For test data $(x'_i,y'_i)$, $i=1,\dots,n'$, the mean-squared prediction error of a selected model $\hat M$ is:

\mathrm{MSPE}(\hat M)=\frac{1}{n'}\sum_{i=1}^{n'}\left(y'_i-\hat y_i(\hat M)\right)^2
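The formula above can be sketched in a few lines of Python. The function name `mspe` and the numbers are illustrative assumptions, not part of the lecture.

```python
def mspe(y_test, y_pred):
    """Mean-squared prediction error over n' held-out observations."""
    if len(y_test) != len(y_pred):
        raise ValueError("y_test and y_pred must have the same length")
    n_test = len(y_test)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_test, y_pred)) / n_test


# Hypothetical test targets and model predictions, for illustration only.
y_test = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
error = mspe(y_test, y_pred)  # squared errors 1, 0, 4 averaged over 3 points
print(error)
```

The key point is that `y_pred` must come from a model fitted without access to `y_test`; otherwise the value measures training fit rather than prediction error.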

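The expected prediction error decomposes as:

\mathrm{EMSPE}=\text{bias}^2+\text{variance}+\sigma^2

where $\sigma^2$ is the irreducible noise variance, which does not depend on the chosen model.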

Goal of model selection: trade off bias and variance to minimize prediction error.

Selecting Predictors

Hypothesis testing and multiple testing

For each predictor:

H_0:\beta_j=0 \quad\text{vs}\quad H_A:\beta_j\neq0

Testing many predictors at level $\alpha$ inflates the chance of at least one false positive.
The Bonferroni correction compensates by declaring a predictor significant only when:

p\text{-value}<\frac{\alpha}{p}

where $p$ is the number of tests.
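A minimal sketch of the Bonferroni rule, assuming the p-values have already been computed elsewhere. The function name `bonferroni_select` and the p-values are made up for illustration.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return indices of predictors whose p-value is below alpha / p."""
    p = len(p_values)        # number of tests
    threshold = alpha / p    # Bonferroni-corrected threshold
    return [j for j, pv in enumerate(p_values) if pv < threshold]


# Hypothetical p-values for four predictors (illustration only).
p_values = [0.001, 0.02, 0.04, 0.30]

uncorrected = [j for j, pv in enumerate(p_values) if pv < 0.05]
selected = bonferroni_select(p_values, alpha=0.05)

print(uncorrected)  # predictors passing the naive 0.05 threshold
print(selected)     # predictors passing the corrected 0.05 / 4 threshold
```

With four tests the corrected threshold is 0.0125, so only the first predictor survives, whereas three would pass the uncorrected threshold.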

Information criteria

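Score each candidate model by its minimized negative log-likelihood $L(y\mid \hat M)$ plus a complexity penalty, and prefer the model with the smallest score: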

L(y\mid \hat M)+\text{penalty}

Common forms:

\mathrm{AIC}(\hat M)=L(y\mid \hat M)+k_M

\mathrm{KIC}(\hat M)=L(y\mid \hat M)+\frac{3}{2}k_M

\mathrm{BIC}(\hat M)=L(y\mid \hat M)+\frac{k_M}{2}\log n

\mathrm{RIC}(\hat M)=L(y\mid \hat M)+k_M\log p

Here $k_M$ is the number of predictors in model $M$, $n$ is the sample size, and $p$ is the total number of candidate predictors.
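To see how the penalties differ in practice, the four criteria can be computed side by side. This is a sketch under assumed inputs: the function name `information_criteria` and the negative log-likelihood values are hypothetical, chosen only so that the criteria disagree.

```python
import math


def information_criteria(neg_log_lik, k, n, p):
    """Scores on the negative log-likelihood scale; smaller is better."""
    return {
        "AIC": neg_log_lik + k,
        "KIC": neg_log_lik + 1.5 * k,
        "BIC": neg_log_lik + 0.5 * k * math.log(n),
        "RIC": neg_log_lik + k * math.log(p),
    }


# Hypothetical fitted values for two candidate models (illustration only).
n, p = 100, 10
small = information_criteria(neg_log_lik=120.0, k=2, n=n, p=p)
large = information_criteria(neg_log_lik=116.0, k=5, n=n, p=p)

for name in small:
    preferred = "small" if small[name] < large[name] else "large"
    print(f"{name}: small={small[name]:.2f} large={large[name]:.2f} -> {preferred}")
```

In this made-up case AIC prefers the larger model, while the heavier penalties of KIC, BIC and RIC prefer the smaller one.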

Cross-validation (CV)

Core idea: simulate future prediction with repeated train/test splits.

Basic steps:

Split data into training and testing sets.
Fit model on training set.
Evaluate prediction error on testing set.
Repeat and average errors.

Common variants:

$K$-fold CV (usually $K=10$),
leave-one-out CV (LOO CV), which is $K$-fold CV with $K=n$.
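The steps above can be sketched in plain Python. All names here (`k_fold_indices`, `cross_validate`, the two toy models) are illustrative assumptions; the data are noise-free and synthetic so the outcome is easy to reason about.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=10, seed=0):
    """Average held-out squared error over k folds."""
    n = len(ys)
    fold_errors = []
    for test_idx in k_fold_indices(n, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        sq_err = [(ys[i] - predict(model, xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(sq_err) / len(sq_err))
    return sum(fold_errors) / k


def fit_mean(xs, ys):
    """Underfitting candidate: ignore x and predict the training mean."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def fit_line(xs, ys):
    """Simple least-squares line; returns (intercept, slope)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x


# Synthetic, noise-free data on a straight line (illustration only).
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

cv_mean = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
cv_line = cross_validate(xs, ys, fit_line, predict_line, k=5)
loo_line = cross_validate(xs, ys, fit_line, predict_line, k=len(ys))  # LOO CV
print(cv_mean, cv_line, loo_line)
```

Because the data lie exactly on a line, the line model has essentially zero CV error while the mean-only model does not, so CV would select the line.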

Penalized Regression

Instead of hard include/exclude decisions, shrink coefficients:

(\hat\beta_0,\hat\beta_\lambda)= \arg\min_{\beta_0,\beta} \left\{ \mathrm{RSS}(\beta_0,\beta)+ \lambda\sum_{j=1}^p g(\beta_j) \right\}

$\lambda$ controls the penalty strength.
$\lambda\to0$: the fit approaches least squares.
Larger $\lambda$: stronger shrinkage, lower complexity.
The intercept $\beta_0$ is usually not penalized.

Predictors should be standardized before penalization, so that the penalty treats every coefficient on the same scale:

\sum_{i=1}^n x_{i,j}=0,\qquad \sum_{i=1}^n x_{i,j}^2=n
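A small sketch of this standardization for one predictor column. The function name `standardize` is an assumption. Note that the scale divides by $n$ rather than $n-1$, which is what makes the sum of squares equal exactly $n$.

```python
import math


def standardize(column):
    """Centre and scale one predictor so that sum(x) = 0 and sum(x^2) = n."""
    n = len(column)
    mean = sum(column) / n
    centred = [x - mean for x in column]
    # Divide by n (not n - 1) so the squared values sum to exactly n.
    scale = math.sqrt(sum(c * c for c in centred) / n)
    if scale == 0.0:
        raise ValueError("constant column cannot be standardized")
    return [c / scale for c in centred]


# Illustrative raw predictor values.
z = standardize([1.0, 2.0, 3.0, 4.0])
print(sum(z), sum(v * v for v in z))
```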

This framework also extends to logistic regression (penalized likelihood).

Ridge regression

(\hat\beta_0,\hat\beta_\lambda)= \arg\min_{\beta_0,\beta} \left\{ \mathrm{RSS}(\beta_0,\beta)+ \lambda\sum_{j=1}^p \beta_j^2 \right\}

Strengths:

very stable,
handles correlated predictors well,
computationally efficient.

Weakness:

coefficients are shrunk but usually not exactly zero (no direct variable selection).
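For the squared penalty the minimizer has a closed form: after centring, $\hat\beta_\lambda$ solves $(X^\top X+\lambda I)\beta=X^\top y$. The sketch below uses NumPy, with `ridge_fit` as a hypothetical helper name and a tiny hand-built design whose columns already satisfy the standardization above.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge estimate with an unpenalized intercept (columns are centred here)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # centring removes the intercept from the solve
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    beta0 = y_mean - x_mean @ beta       # recover the intercept afterwards
    return beta0, beta


# Tiny orthogonal design: each column sums to 0 and has sum of squares n = 4.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, -1.0],
              [-1.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
y = 1.0 + X @ np.array([3.0, -2.0, 0.1])   # noise-free, illustration only

beta0_ls, beta_ls = ridge_fit(X, y, lam=0.0)   # lam = 0 gives least squares
beta0, beta = ridge_fit(X, y, lam=4.0)
print(beta_ls, beta)
```

With $\lambda=4$ every coefficient is halved in this design, so the small coefficient shrinks toward zero without ever reaching it.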

Lasso regression

(\hat\beta_0,\hat\beta_\lambda)= \arg\min_{\beta_0,\beta} \left\{ \mathrm{RSS}(\beta_0,\beta)+ \lambda\sum_{j=1}^p |\beta_j| \right\}

Strengths:

stable,
can set some coefficients exactly to zero,
performs variable selection + estimation together.

Weaknesses:

may bias large coefficients downward,
can be less robust than ridge under strong predictor correlation,
may still overfit in some real datasets.
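The lasso has no closed form in general. One common approach, used here as an illustrative choice and not necessarily the method from the lecture, is cyclic coordinate descent with soft-thresholding. For the objective $\mathrm{RSS}+\lambda\sum_j|\beta_j|$ each coordinate update thresholds at $\lambda/2$. The names `soft_threshold` and `lasso_fit` are assumptions.

```python
import numpy as np


def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)


def lasso_fit(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent; the intercept is not penalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    beta = np.zeros(p)
    col_ss = (Xc ** 2).sum(axis=0)       # sum of squares of each centred column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back in.
            partial_resid = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    beta0 = y_mean - x_mean @ beta
    return beta0, beta


# Tiny orthogonal design: each column sums to 0 and has sum of squares n = 4.
X = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, -1.0],
              [-1.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
y = 1.0 + X @ np.array([3.0, -2.0, 0.1])   # noise-free, illustration only

beta0, beta = lasso_fit(X, y, lam=2.0)
print(beta0, beta)
```

In this design the two large coefficients are each pulled toward zero by 0.25 (the downward bias noted above), while the small third coefficient is set exactly to zero.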

Choosing $\lambda$

Standard approach:

Define a grid of $\lambda$ values.
Use CV to estimate the prediction error for each $\lambda$.
Choose the $\lambda$ with the smallest CV error.
Refit the final model on all data using this $\lambda$.
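The four steps above can be sketched for ridge regression with NumPy. Everything named here (`ridge_fit`, `cv_error`, `choose_lambda`, the grid, and the simulated data) is an assumption made for illustration. The same fold assignment is reused for every $\lambda$ so the errors are comparable.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return y_mean - x_mean @ beta, beta


def cv_error(X, y, lam, k=5, seed=0):
    """K-fold CV estimate of prediction error for one value of lambda."""
    idx = np.random.default_rng(seed).permutation(len(y))
    fold_errors = []
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        beta0, beta = ridge_fit(X[train_idx], y[train_idx], lam)
        pred = beta0 + X[test_idx] @ beta
        fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(fold_errors))


def choose_lambda(X, y, grid, k=5):
    """Return the grid value with the smallest CV error, plus all errors."""
    errors = [cv_error(X, y, lam, k) for lam in grid]
    return grid[int(np.argmin(errors))], errors


# Simulated data (illustration only): 2 real predictors, 6 pure-noise predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = 1.0 + X @ beta_true + rng.normal(size=60)

grid = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]        # step 1: grid of lambda values
best_lam, errors = choose_lambda(X, y, grid)          # steps 2 and 3
final_beta0, final_beta = ridge_fit(X, y, best_lam)   # step 4: refit on all data
print(best_lam, errors)
```

A logarithmically spaced grid is a common choice because the useful range of $\lambda$ often spans several orders of magnitude.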

Bias-Variance View of Penalization

Least squares can have low bias but high variance.
Penalization introduces some bias but can greatly reduce variance.
A good $\lambda$ reduces total prediction error by this trade-off.
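As a worked illustration (a simplified setting assumed here, not taken from the lecture), take a single standardized predictor with $\sum_i x_i^2=n$, no intercept, and $y_i=\beta x_i+\varepsilon_i$ with independent noise of variance $\sigma^2$. The ridge estimate is then a rescaled least-squares estimate:

\hat\beta_\lambda=\frac{\sum_{i=1}^n x_i y_i}{n+\lambda}=\frac{n}{n+\lambda}\,\hat\beta_{\mathrm{LS}}

so its bias and variance are:

\mathrm{bias}(\hat\beta_\lambda)=-\frac{\lambda}{n+\lambda}\,\beta,\qquad \mathrm{Var}(\hat\beta_\lambda)=\frac{n\,\sigma^2}{(n+\lambda)^2}

and their combined contribution to the error is:

\mathrm{bias}^2+\mathrm{variance}=\frac{\lambda^2\beta^2+n\sigma^2}{(n+\lambda)^2}

At $\lambda=0$ this equals the least-squares value $\sigma^2/n$. It decreases as $\lambda$ moves away from zero and, in this setting, is smallest at $\lambda=\sigma^2/\beta^2$, after which the growing bias dominates.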
