
2086 Lecture 7 Classification And Logistic Regression



Classification

A classifier is a supervised learning model for categorical targets.
Instead of predicting a continuous value, we model:

P(Y=y \mid X_1=x_1,\dots,X_p=x_p)

Direct construction (categorical predictors)

If all predictors are categorical, we can use:

P(Y=y\mid X_1=x_1,\dots,X_p=x_p) = \frac{P(Y=y,X_1=x_1,\dots,X_p=x_p)} {P(X_1=x_1,\dots,X_p=x_p)}

with

P(X_1=x_1,\dots,X_p=x_p)=\sum_y P(Y=y,X_1=x_1,\dots,X_p=x_p)

In practice we estimate these probabilities by empirical proportions.
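A minimal sketch of the counting estimate, assuming the data arrive as a list of predictor tuples and a parallel list of labels. The function name and the toy data are hypothetical.

```python
from collections import Counter

def conditional_probability(rows, labels, x, y):
    """Estimate P(Y=y | X=x) by empirical proportions.

    rows   : list of tuples of categorical predictor values
    labels : list of class labels, same length as rows
    """
    joint = Counter(zip(rows, labels))   # counts of each (x, y) combination
    marginal = Counter(rows)             # counts of each x, i.e. the sum over y
    if marginal[x] == 0:
        return None                      # x never observed: estimate undefined
    # The sample size n cancels, so a ratio of counts equals a ratio of proportions.
    return joint[(x, y)] / marginal[x]

# Hypothetical toy data: two binary predictors and a binary target.
rows = [(0, 1), (0, 1), (0, 1), (1, 0), (1, 0)]
labels = [1, 1, 0, 0, 0]
p = conditional_probability(rows, labels, (0, 1), 1)   # 2 of the 3 rows with x=(0,1) have y=1
```

Returning `None` for an unseen combination is one possible design choice; it makes the sparsity problem visible rather than hiding it behind a default value.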

Limitation of direct counting

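If the target and all $p$ predictors are binary, the joint distribution has $2^{p+1}$ cells, so the number of probabilities to estimate grows exponentially in $p$.
With realistic sample sizes most cells contain few or no observations, so we instead model the conditional probability directly.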

Logistic Regression

For a binary target $Y\in\{0,1\}$, define the linear predictor:

\eta_i=\beta_0+\sum_{j=1}^p \beta_j x_{i,j}

Map $\eta_i$ to a probability with the logistic function:

P(Y_i=1\mid x_i)=\frac{1}{1+e^{-\eta_i}}
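The two formulas above can be sketched in a few lines. The function names are hypothetical, and the branch on the sign of $\eta$ is an implementation choice, not part of the lecture, made so that `exp` is never called on a large positive number.

```python
import math

def linear_predictor(beta0, beta, x):
    """eta = beta0 + sum_j beta_j * x_j."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def logistic(eta):
    """Map eta to P(Y=1 | x) = 1 / (1 + exp(-eta))."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    # For negative eta use the algebraically equal form exp(eta) / (1 + exp(eta)).
    z = math.exp(eta)
    return z / (1.0 + z)

# Hypothetical coefficients and feature vector.
eta = linear_predictor(-1.0, [2.0, 0.5], [0.5, 0.0])   # -1 + 2*0.5 + 0.5*0 = 0
prob = logistic(eta)
```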

Odds and log-odds

Odds in favor of class 1:

O_i=\frac{P(Y_i=1\mid x_i)}{P(Y_i=0\mid x_i)}

Logistic regression is equivalent to a linear model for the log-odds:

\log O_i=\log\frac{P(Y_i=1\mid x_i)}{P(Y_i=0\mid x_i)} =\beta_0+\sum_{j=1}^p\beta_j x_{i,j}

Interpretation:

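$\beta_0$: log-odds of class 1 when all predictors are 0.
$\beta_j$: change in log-odds per one-unit increase in $x_j$ with the other predictors held fixed; equivalently, the odds are multiplied by $e^{\beta_j}$.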
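A small numeric check of this interpretation, using hypothetical coefficients: increasing $x$ by one unit multiplies the odds by $e^{\beta_1}$, whatever the starting value of $x$.

```python
import math

def odds(eta):
    """Odds in favor of class 1: P / (1 - P), which equals exp(eta)."""
    return math.exp(eta)

beta0, beta1 = -1.0, 0.7   # hypothetical coefficients, one predictor
x = 2.0

odds_before = odds(beta0 + beta1 * x)
odds_after = odds(beta0 + beta1 * (x + 1.0))
odds_ratio = odds_after / odds_before   # should match exp(beta1)

# Cross-check against the definition P / (1 - P).
p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
odds_from_p = p / (1.0 - p)
```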

Estimating Logistic Regression (MLE)

Let

\theta_i(\beta_0,\beta)=\frac{1}{1+\exp\!\left(-\beta_0-\sum_{j=1}^p\beta_j x_{i,j}\right)}

Then likelihood:

p(y\mid \beta_0,\beta)=\prod_{i=1}^n \theta_i^{y_i}(1-\theta_i)^{1-y_i}

Negative log-likelihood:

L(\beta_0,\beta)= -\sum_{i=1}^n\left[y_i\log\theta_i+(1-y_i)\log(1-\theta_i)\right] =\sum_{i=1}^n\left[-y_i\eta_i+\log(1+e^{\eta_i})\right]

There is no closed-form solution, so we minimize $L$ numerically to get $\hat\beta_0,\hat\beta$.
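One way to carry out the numerical minimization is plain gradient descent, using $\partial L/\partial\eta_i=\theta_i-y_i$. This is only a sketch under stated assumptions: the lecture does not say which optimizer is used, and the step size, iteration count, function names, and toy data below are all illustrative.

```python
import math

def sigmoid(eta):
    """Numerically stable logistic function."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def neg_log_likelihood(beta0, beta, X, y):
    """L = sum_i [ -y_i * eta_i + log(1 + exp(eta_i)) ]."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * v for b, v in zip(beta, xi))
        # log(1 + e^eta) written as max(eta, 0) + log1p(e^-|eta|) to avoid overflow
        total += -yi * eta + max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
    return total

def fit_logistic(X, y, step=0.1, iters=5000):
    """Gradient descent on L / n; step and iters are illustrative choices."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            eta = beta0 + sum(b * v for b, v in zip(beta, xi))
            r = sigmoid(eta) - yi            # dL/d(eta_i) = theta_i - y_i
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        beta0 -= step * g0 / n
        beta = [b - step * gj / n for b, gj in zip(beta, g)]
    return beta0, beta

# Hypothetical toy data with one predictor; the classes overlap,
# so the maximum likelihood estimate is finite.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y = [0, 0, 1, 0, 1, 1]
b0_hat, b_hat = fit_logistic(X, y)
```

If the classes were perfectly separable, the coefficients would grow without bound and the loop would simply stop after `iters` steps, which is one reason to inspect the fitted values rather than trust them blindly.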

Model Assessment and Selection

Unlike linear regression, logistic regression has no direct $R^2$ counterpart.
A common goodness-of-fit measure is the minimized negative log-likelihood (smaller is better):

L(y\mid\hat\beta_0,\hat\beta)

Improvement over intercept-only model:

L(y\mid \hat\beta_0)-L(y\mid \hat\beta_0,\hat\beta)

Information criterion form:

L(y\mid\hat\beta_0,\hat\beta)+k\alpha_n

where $k$ is the number of predictors and $\alpha_n=1$ gives AIC, $\alpha_n=3/2$ gives KIC, and $\alpha_n=\frac12\log n$ gives BIC.
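The penalized score can be computed directly from a fitted model's negative log-likelihood. The helper below is a hypothetical sketch and the inputs are made-up numbers, not results from any dataset.

```python
import math

def information_criterion(nll, k, n, kind="AIC"):
    """Return nll + k * alpha_n, with nll the minimized negative log-likelihood.

    k : number of predictors, n : sample size.
    """
    alpha = {
        "AIC": 1.0,
        "KIC": 1.5,
        "BIC": 0.5 * math.log(n),
    }[kind]
    return nll + k * alpha

# Made-up inputs for illustration only.
aic = information_criterion(10.0, k=2, n=100, kind="AIC")
kic = information_criterion(10.0, k=2, n=100, kind="KIC")
bic = information_criterion(10.0, k=2, n=100, kind="BIC")
```

On this scale the values are half of the conventional definitions (for instance AIC is often written as $2L+2k$), which does not change the ranking of models.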

Hypothesis test for one predictor:

H_0:\beta_j=0 \quad \text{vs}\quad H_A:\beta_j\neq0

Prediction

For a new feature vector $x'$:

\hat\eta=\hat\beta_0+\sum_{j=1}^p\hat\beta_j x'_j

\hat P(Y'=1\mid x')=\frac{1}{1+\exp(-\hat\eta)},\qquad \hat O=e^{\hat\eta}

Default class decision:

predict class 1 if $\hat P(Y'=1\mid x')>0.5$ (equivalently $\hat\eta>0$),
else class 0.

More generally, use a threshold $T\in(0,1)$:

class 1 if $\hat P(Y'=1\mid x')\ge T$,
class 0 otherwise.
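The prediction steps can be sketched as follows, using the $\ge T$ rule throughout. The coefficients are hypothetical placeholders for fitted values, and the function names are not from the lecture.

```python
import math

def predict_proba(beta0, beta, x_new):
    """Estimated P(Y'=1 | x') from fitted coefficients."""
    eta = beta0 + sum(b * v for b, v in zip(beta, x_new))
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)          # stable form for negative eta
    return z / (1.0 + z)

def predict_class(prob, threshold=0.5):
    """Class 1 if prob >= threshold, else class 0."""
    return 1 if prob >= threshold else 0

# Hypothetical fitted coefficients.
beta0_hat, beta_hat = -1.0, [2.0]

p_new = predict_proba(beta0_hat, beta_hat, [0.4])    # eta = -0.2, so p_new < 0.5
label_default = predict_class(p_new)                 # threshold 0.5
label_lenient = predict_class(p_new, threshold=0.3)  # lower threshold flags more cases
```

Lowering the threshold trades false negatives for false positives, which is useful when missing a class-1 case is the costlier error.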

Evaluating Classifiers

Confusion matrix terms

TP: predicted 1, true 1
TN: predicted 0, true 0
FP: predicted 1, true 0
FN: predicted 0, true 1

Metrics

Classification accuracy:

CA=\frac{TP+TN}{TP+TN+FP+FN}

Sensitivity (true positive rate):

TPR=\frac{TP}{TP+FN}

Specificity (true negative rate):

TNR=\frac{TN}{TN+FP}
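A minimal sketch that tallies the four confusion-matrix counts and the three metrics above. The labels are hypothetical, and the code assumes both classes appear in `y_true` so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TPR) and specificity (TNR)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "CA": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tp / (tp + fn),   # assumes at least one true class-1 sample
        "TNR": tn / (tn + fp),   # assumes at least one true class-0 sample
    }

# Hypothetical true and predicted labels.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = classification_metrics(y_true, y_pred)
```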

ROC and AUC

The model outputs a predicted score, such as $P(Y=1 \mid X)$.
We choose a threshold $T$ to convert the score into a class prediction:

\hat{Y} = \begin{cases} 1, & \text{if score} \ge T \\ 0, & \text{if score} < T \end{cases}

By changing the threshold $T$, we get different values of sensitivity and specificity.

\text{sensitivity} = \frac{TP}{TP+FN}

\text{specificity} = \frac{TN}{TN+FP}

The ROC curve plots, as $T$ varies:

\text{sensitivity (TPR)} \quad \text{against} \quad 1 - \text{specificity}

where $1 - \text{specificity}$ is the false positive rate (FPR).
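The points of the curve can be traced by sweeping the threshold over the observed scores. This is a sketch with hypothetical scores and labels; it assumes both classes are present.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct threshold, from strict to lenient."""
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]   # threshold above every score: nothing is predicted 1
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores for three class-1 and two class-0 samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
curve = roc_points(scores, labels)
```

The curve always starts at $(0,0)$ and ends at $(1,1)$, where every sample is predicted as class 1.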

AUC is the area under the ROC curve.

AUC can be interpreted as the probability that a randomly chosen class-1 sample receives a higher predicted score than a randomly chosen class-0 sample.

For example, if $\text{AUC} = 0.8$, the model has an 80% chance of ranking a random positive sample higher than a random negative sample.

In simple terms, AUC measures how well the model separates class 1 from class 0 across all possible thresholds.
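The ranking interpretation gives a direct, if slow, way to compute AUC: compare every class-1 score with every class-0 score. In this sketch, counting ties as one half is a common convention rather than something stated in the lecture, and the data are hypothetical.

```python
def auc_pairwise(scores, labels):
    """Fraction of (class-1, class-0) pairs in which the class-1 score is higher."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5   # assumption: a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical scores for three class-1 and two class-0 samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = auc_pairwise(scores, labels)   # 5 of the 6 pairs are ranked correctly
```

The double loop costs time proportional to the number of pairs, so it is meant for illustration on small samples only.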

Logarithmic loss

Per sample:

\ell_i= \begin{cases} -\log \hat P(Y_i=1\mid x_i), & y_i=1\\ -\log \hat P(Y_i=0\mid x_i), & y_i=0 \end{cases}

Total:

L=\sum_i \ell_i

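This total equals the negative log-likelihood evaluated at the fitted coefficients. A smaller log-loss means the predicted probabilities put more weight on the classes that actually occurred; confident but wrong predictions are penalized heavily.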
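A short sketch of the total log-loss. Clipping probabilities by `eps` is an added safeguard against $\log 0$, not part of the definition above, and the example values are hypothetical.

```python
import math

def log_loss(y_true, p_hat, eps=1e-15):
    """Total log-loss: sum of -log of the probability given to the observed class.

    p_hat holds the predicted P(Y_i = 1 | x_i).
    """
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)   # assumption: clip to avoid log(0)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total

# Hypothetical predictions for two samples with true labels 1 and 0.
uninformative = log_loss([1, 0], [0.5, 0.5])     # equals 2 * log(2)
confident_right = log_loss([1, 0], [0.9, 0.1])
confident_wrong = log_loss([1, 0], [0.1, 0.9])
```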
