
January 1, 1970

2086 Lecture 7 - Classification and Logistic Regression


Lecture Note: Lecture 7 Notes.pdf


Classification

A classifier is a supervised learning model for categorical targets.
Instead of predicting a continuous value, we model:

$$P(Y=y \mid X_1=x_1,\dots,X_p=x_p)$$

Direct construction (categorical predictors)

If all predictors are categorical, we can use:

$$P(Y=y\mid X_1=x_1,\dots,X_p=x_p) = \frac{P(Y=y,X_1=x_1,\dots,X_p=x_p)}{P(X_1=x_1,\dots,X_p=x_p)}$$

with

$$P(X_1=x_1,\dots,X_p=x_p)=\sum_y P(Y=y,X_1=x_1,\dots,X_p=x_p)$$

In practice we estimate these probabilities by empirical proportions.
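The counting estimate above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function name `conditional_probs` and the toy data are assumptions made up for the example.

```python
from collections import Counter

def conditional_probs(rows, labels):
    """Estimate P(Y=y | X=x) by empirical proportions.

    rows:   list of tuples of categorical predictor values
    labels: list of class labels, same length as rows
    """
    joint = Counter(zip(rows, labels))   # counts of each (x, y) combination
    marginal = Counter(rows)             # counts of each x, i.e. the sum over y
    return {(x, y): c / marginal[x] for (x, y), c in joint.items()}

# Hypothetical toy data: two binary predictors, binary target.
rows = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0)]
labels = [1, 1, 0, 0, 0]
probs = conditional_probs(rows, labels)
print(probs[((0, 1), 1)])  # 2 of the 3 rows with x = (0, 1) have y = 1
```

Combinations that never occur in the data get no entry at all, which is one practical symptom of the counting approach running out of data.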

Limitation of direct counting

If the target and all $p$ predictors are binary, the joint table has $2^{p+1}$ entries to estimate.
This grows too large quickly (already over two million entries at $p=20$), so we model the conditional probability directly instead.

Logistic Regression

For binary target $Y\in\{0,1\}$, define linear predictor:

$$\eta_i=\beta_0+\sum_{j=1}^p \beta_j x_{i,j}$$

Map $\eta_i$ to probability with logistic function:

$$P(Y_i=1\mid x_i)=\frac{1}{1+e^{-\eta_i}}$$
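As a rough sketch of the two formulas above, the snippet below computes the linear predictor and passes it through the logistic function. The names `sigmoid` and `predict_proba` are illustrative choices, and the two-branch form is an added numerical safeguard rather than something the notes prescribe.

```python
import math

def sigmoid(eta):
    """Logistic function 1 / (1 + exp(-eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)          # eta < 0 here, so exp(eta) cannot overflow
    return z / (1.0 + z)

def predict_proba(beta0, beta, x):
    """P(Y=1 | x) for intercept beta0, coefficient list beta, feature list x."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return sigmoid(eta)

print(sigmoid(0.0))                                 # eta = 0 maps to probability 0.5
print(predict_proba(0.0, [1.0, -1.0], [2.0, 2.0]))  # eta = 2 - 2 = 0
```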

Odds and log-odds

Odds in favor of class 1:

$$O_i=\frac{P(Y_i=1\mid x_i)}{P(Y_i=0\mid x_i)}$$

Logistic regression is equivalent to:

$$\log O_i=\log\frac{P(Y_i=1\mid x_i)}{P(Y_i=0\mid x_i)} =\beta_0+\sum_{j=1}^p\beta_j x_{i,j}$$

Interpretation:

  • $\beta_0$: log-odds when all predictors are 0.
  • $\beta_j$: change in log-odds per one-unit increase in $x_j$, with the other predictors held fixed.
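Exponentiating turns the additive effect on the log-odds into a multiplicative effect on the odds. A short worked step (the coefficient value 0.7 is made up for illustration):

$$\frac{O_i\big|_{x_{i,j}+1}}{O_i\big|_{x_{i,j}}}=\frac{\exp\!\left(\beta_0+\sum_{k\ne j}\beta_k x_{i,k}+\beta_j(x_{i,j}+1)\right)}{\exp\!\left(\beta_0+\sum_{k\ne j}\beta_k x_{i,k}+\beta_j x_{i,j}\right)}=e^{\beta_j}$$

So with $\beta_j=0.7$, a one-unit increase in $x_j$ multiplies the odds by $e^{0.7}\approx 2.01$.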

Estimating Logistic Regression (MLE)

Let

$$\theta_i(\beta_0,\beta)=\frac{1}{1+\exp\!\left(-\beta_0-\sum_{j=1}^p\beta_j x_{i,j}\right)}$$

Then likelihood:

$$p(y\mid \beta_0,\beta)=\prod_{i=1}^n \theta_i^{y_i}(1-\theta_i)^{1-y_i}$$

Negative log-likelihood:

$$L(\beta_0,\beta)= -\sum_{i=1}^n\left[y_i\log\theta_i+(1-y_i)\log(1-\theta_i)\right] =\sum_{i=1}^n\left[-y_i\eta_i+\log(1+e^{\eta_i})\right]$$

We minimize $L$ numerically to get $\hat\beta_0,\hat\beta$.
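One way to do the numerical minimization is plain gradient descent, using the gradient $\partial L/\partial\beta_j=\sum_i(\theta_i-y_i)x_{i,j}$. The sketch below is only an illustration under assumptions: the notes do not say which optimizer is used, and the learning rate, step count, function names, and toy data are all made up.

```python
import math

def sigmoid(eta):
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def nll(beta0, beta, X, y):
    """Negative log-likelihood: sum of -y*eta + log(1 + exp(eta))."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * v for b, v in zip(beta, xi))
        # log(1 + exp(eta)) computed stably as max(eta, 0) + log1p(exp(-|eta|))
        total += -yi * eta + max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    return total

def fit(X, y, lr=0.1, steps=2000):
    """Gradient descent on L / n; lr and steps are illustrative choices."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            eta = beta0 + sum(b * v for b, v in zip(beta, xi))
            r = sigmoid(eta) - yi              # theta_i - y_i
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        # Dividing by n rescales the step but leaves the minimizer unchanged.
        beta0 -= lr * g0 / n
        beta = [b - lr * gj / n for b, gj in zip(beta, g)]
    return beta0, beta

# Hypothetical, non-separable toy data with one predictor.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]
b0, b = fit(X, y)
print(nll(0.0, [0.0], X, y), nll(b0, b, X, y))  # loss at zero vs. after fitting
```

At $\beta_0=0,\beta=0$ every $\theta_i$ is 0.5, so the starting loss is $n\log 2$; any sensible fit should end below that.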

Model Assessment and Selection

Unlike linear regression, logistic regression has no direct $R^2$ counterpart.
Common goodness-of-fit measure (the minimized negative log-likelihood; smaller is better):

$$L(y\mid\hat\beta_0,\hat\beta)$$

Improvement over intercept-only model:

$$L(y\mid \hat\beta_0)-L(y\mid \hat\beta_0,\hat\beta)$$

Information criterion form:

$$L(y\mid\hat\beta_0,\hat\beta)+k\alpha_n$$

where $k$ is the number of predictors, and $\alpha_n=1$ (AIC), $\alpha_n=3/2$ (KIC), $\alpha_n=\frac12\log n$ (BIC).
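To make the penalty term concrete, the sketch below evaluates the three criteria for two candidate models. The function name `info_criteria` and all numbers are hypothetical; only the formula and the three penalty values come from the notes.

```python
import math

def info_criteria(nll_value, k, n):
    """Penalized scores L + k * alpha_n for the three penalties in the notes."""
    return {
        "AIC": nll_value + k * 1.0,
        "KIC": nll_value + k * 1.5,
        "BIC": nll_value + k * 0.5 * math.log(n),
    }

# Hypothetical fits on n = 100 samples: the larger model has a slightly
# lower negative log-likelihood but uses three more predictors.
small = info_criteria(nll_value=52.0, k=2, n=100)
large = info_criteria(nll_value=50.5, k=5, n=100)

for name in ("AIC", "KIC", "BIC"):
    better = "small" if small[name] < large[name] else "large"
    print(name, small[name], large[name], "->", better)
```

With these made-up numbers, the gain of 1.5 in fit does not pay for three extra predictors under any of the three penalties, so the smaller model is preferred each time.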

Hypothesis test for one predictor:

$$H_0:\beta_j=0 \quad \text{vs}\quad H_A:\beta_j\neq0$$

Prediction

For new feature vector $x'$:

$$\hat\eta=\hat\beta_0+\sum_{j=1}^p\hat\beta_j x'_j$$

$$\hat P(Y'=1\mid x')=\frac{1}{1+\exp(-\hat\eta)},\qquad \hat O=e^{\hat\eta}$$

Default class decision:

  • predict class 1 if $\hat P(Y'=1\mid x')>0.5$ (equivalently $\hat\eta>0$),
  • else class 0.

More generally, use threshold $T\in(0,1)$:

  • class 1 if $\hat P(Y'=1\mid x')\ge T$,
  • class 0 otherwise.
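The prediction steps above can be sketched as a single function. The name `classify` and the coefficient values are hypothetical; the rule used is the general "class 1 if probability is at least $T$" form.

```python
import math

def classify(beta0, beta, x, threshold=0.5):
    """Return (label, probability, odds) for one new feature vector x."""
    eta = beta0 + sum(b * v for b, v in zip(beta, x))
    # Direct formula; a very negative eta would need a stabler sigmoid.
    prob = 1.0 / (1.0 + math.exp(-eta))
    odds = math.exp(eta)
    label = 1 if prob >= threshold else 0
    return label, prob, odds

# Hypothetical fitted coefficients and a new observation.
label, prob, odds = classify(-1.0, [0.8], [2.0])
print(label, prob, odds)

# The same observation under a stricter threshold.
strict_label, _, _ = classify(-1.0, [0.8], [2.0], threshold=0.7)
print(strict_label)
```

Raising the threshold trades false positives for false negatives: the estimated probability here is about 0.65, so the observation is class 1 at $T=0.5$ but class 0 at $T=0.7$.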

Evaluating Classifiers

Confusion matrix terms

  • TP: predicted 1, true 1
  • TN: predicted 0, true 0
  • FP: predicted 1, true 0
  • FN: predicted 0, true 1

Metrics

Classification accuracy:

$$CA=\frac{TP+TN}{TP+TN+FP+FN}$$

Sensitivity (true positive rate):

$$TPR=\frac{TP}{TP+FN}$$

Specificity (true negative rate):

$$TNR=\frac{TN}{TN+FP}$$
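A small sketch of how the four counts and the three metrics fit together. The function names and the toy labels are assumptions for illustration, and the code assumes both classes appear in `y_true` so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    """Classification accuracy, sensitivity (TPR), and specificity (TNR)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "CA": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
    }

# Hypothetical true labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (TP, TN, FP, FN)
print(metrics(y_true, y_pred))
```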

ROC and AUC

The model outputs a predicted score, such as $P(Y=1 \mid X)$.
We choose a threshold $T$ to convert the score into a class prediction.

$$\hat{Y} = \begin{cases} 1, & \text{if score} \ge T \\ 0, & \text{if score} < T \end{cases}$$

By changing the threshold $T$, we get different values of sensitivity and specificity.

$$\text{sensitivity} = \frac{TP}{TP+FN}, \qquad \text{specificity} = \frac{TN}{TN+FP}$$

The ROC curve plots:

$$\text{sensitivity} \quad \text{against} \quad 1 - \text{specificity}$$

where $1 - \text{specificity}$ is the false positive rate (FPR).

AUC is the area under the ROC curve.

AUC can be interpreted as the probability that a randomly chosen class-1 sample receives a higher predicted score than a randomly chosen class-0 sample.

For example, if:

$$AUC = 0.8$$

then the model has an 80% chance of ranking a random positive sample higher than a random negative sample.

In simple terms, AUC measures how well the model separates class 1 from class 0 across all possible thresholds.
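The ranking interpretation suggests a direct, if slow, way to compute AUC: compare every positive score with every negative score. The sketch below does that and also evaluates one ROC point at a chosen threshold. The function names and toy scores are hypothetical, and counting ties as one half is a common convention assumed here rather than something stated in the notes.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5    # assumed convention: a tie counts as half a win
    return wins / (len(pos) * len(neg))

def roc_point(y_true, scores, threshold):
    """(FPR, TPR) obtained by predicting class 1 when score >= threshold."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if t == 1 and pred == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical scores: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_by_ranking(y_true, scores)
point = roc_point(y_true, scores, 0.65)
print(auc)    # 8 of the 9 positive/negative pairs are ordered correctly
print(point)  # one point on the ROC curve
```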

Logarithmic loss

Per sample:

$$\ell_i= \begin{cases} -\log \hat P(Y_i=1\mid x_i), & y_i=1\\ -\log \hat P(Y_i=0\mid x_i), & y_i=0 \end{cases}$$

Total:

$$L=\sum_i \ell_i$$

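Smaller log-loss means the predicted probabilities put more weight on the classes that actually occurred; confident wrong predictions are penalized most heavily.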
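A minimal sketch of the total log-loss, assuming `probs` holds the predicted $\hat P(Y_i=1\mid x_i)$ for each sample. The function name and the clipping constant `eps` are additions for illustration (clipping only guards against taking the log of zero).

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Sum of per-sample losses -log(P_hat of the observed class)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)   # added safeguard, not in the notes
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total

# Hypothetical predictions for two samples with true classes 1 and 0.
print(log_loss([1, 0], [0.8, 0.1]))

# A confident wrong prediction costs far more than a hesitant wrong one.
print(log_loss([1], [0.01]), log_loss([1], [0.4]))
```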
