Lecture Note: Lecture 7 Notes.pdf
Previous: 2086 Lecture 6 - Linear Regression
Next: 2086 Lecture 8 - Model Selection and Penalized Regression
Classification
A classifier is a supervised learning model for categorical targets.
Instead of predicting a continuous value, we model:
$$P(Y = y \mid X_1 = x_1, \ldots, X_p = x_p)$$
Direct construction (categorical predictors)
If all predictors are categorical, we can use:
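$$P(Y = y \mid X_1 = x_1, \ldots, X_p = x_p) = \frac{P(Y = y, X_1 = x_1, \ldots, X_p = x_p)}{P(X_1 = x_1, \ldots, X_p = x_p)}$$
with
$$P(X_1 = x_1, \ldots, X_p = x_p) = \sum_{y} P(Y = y, X_1 = x_1, \ldots, X_p = x_p)$$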
In practice we estimate these probabilities by empirical proportions.
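A minimal sketch of the counting estimate, assuming the data arrive as (predictor tuple, label) pairs. The function name `conditional_prob` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter

def conditional_prob(rows, y, x):
    """Estimate P(Y = y | X = x) by empirical proportions.

    rows: list of (x_tuple, y_label) pairs; x: tuple of predictor values.
    """
    joint = Counter(rows)  # counts of each (x, y) combination
    # Marginal count of x: sum the joint counts over all labels.
    marginal = sum(c for (xi, _), c in joint.items() if xi == x)
    if marginal == 0:
        raise ValueError("predictor combination never observed")
    return joint[(x, y)] / marginal

# Hypothetical toy data: two binary predictors, binary target.
data = [((1, 0), 1), ((1, 0), 1), ((1, 0), 0),
        ((0, 1), 0), ((0, 1), 0), ((1, 1), 1)]
p = conditional_prob(data, y=1, x=(1, 0))
print(p)
```

Any predictor combination that never appears in the data has no estimate at all, which is the practical side of the limitation discussed next.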
Limitation of direct counting
If the target and all p predictors are binary, the number of joint probabilities grows as $2^{p+1}$.
This becomes too large quickly, so we need direct conditional models.
Logistic Regression
For binary target $Y \in \{0, 1\}$, define linear predictor:
$$\eta_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j}$$
Map $\eta_i$ to probability with logistic function:
$$P(Y_i = 1 \mid x_i) = \frac{1}{1 + e^{-\eta_i}}$$
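The two steps (linear predictor, then logistic map) can be sketched as follows. The coefficient values are hypothetical, and the branch on the sign of eta is only a numerical-stability choice, not something the notes prescribe.

```python
import math

def linear_predictor(beta0, beta, x):
    # eta = beta0 + sum_j beta_j * x_j
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def logistic(eta):
    # 1 / (1 + e^{-eta}), written so exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

beta0, beta = -1.0, [2.0, -0.5]  # hypothetical coefficients
x = [1.0, 2.0]
eta = linear_predictor(beta0, beta, x)  # -1 + 2 - 1 = 0
prob = logistic(eta)                    # eta = 0 maps to probability 0.5
print(eta, prob)
```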
Odds and log-odds
Odds in favor of class 1:
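$$O_i = \frac{P(Y_i = 1 \mid x_i)}{P(Y_i = 0 \mid x_i)}$$
Logistic regression is equivalent to:
$$\log O_i = \log \frac{P(Y_i = 1 \mid x_i)}{P(Y_i = 0 \mid x_i)} = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j}$$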
Interpretation:
- $\beta_0$: log-odds when all predictors are 0.
- $\beta_j$: change in log-odds per one-unit increase in $x_j$, holding the other predictors fixed.
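A quick numerical check of this interpretation: because the log-odds are linear in the predictors, raising one predictor by one unit multiplies the odds by e raised to that predictor's coefficient. The coefficients below are hypothetical.

```python
import math

def odds(beta0, beta, x):
    # O = P(Y=1|x) / P(Y=0|x) = e^{eta} under the logistic model
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return math.exp(eta)

beta0, beta = 0.3, [0.7, -1.2]  # hypothetical coefficients
x = [1.0, 2.0]
x_plus = [2.0, 2.0]             # first predictor increased by one unit

ratio = odds(beta0, beta, x_plus) / odds(beta0, beta, x)
print(ratio, math.exp(beta[0]))  # the two values agree up to rounding
```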
Estimating Logistic Regression (MLE)
Let
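$$\theta_i(\beta_0, \beta) = \frac{1}{1 + \exp\left(-\beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j}\right)}$$
Then likelihood:
$$p(y \mid \beta_0, \beta) = \prod_{i=1}^{n} \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}$$
Negative log-likelihood:
$$L(\beta_0, \beta) = -\sum_{i=1}^{n} \left[ y_i \log \theta_i + (1 - y_i) \log(1 - \theta_i) \right] = \sum_{i=1}^{n} \left[ -y_i \eta_i + \log\left(1 + e^{\eta_i}\right) \right]$$
We minimize $L$ numerically to get $\hat{\beta}_0, \hat{\beta}$.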
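One way to do the numerical minimization is plain gradient descent, using the gradient of L: the sum over samples of (θ_i − y_i) for the intercept and of (θ_i − y_i) x_{i,j} for each slope. This is a sketch under assumptions: the notes do not name an optimizer (statistical packages commonly use Newton-type methods instead), and the step size, iteration count, and toy data are hypothetical.

```python
import math

def sigmoid(eta):
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def softplus(eta):
    # log(1 + e^{eta}) computed without overflow
    return max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))

def neg_log_lik(beta0, beta, X, y):
    # L = sum_i [ -y_i * eta_i + log(1 + e^{eta_i}) ]
    total = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * v for b, v in zip(beta, xi))
        total += -yi * eta + softplus(eta)
    return total

def fit(X, y, lr=0.1, steps=5000):
    # Gradient descent on L / n (assumed optimizer, not from the notes).
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            eta = beta0 + sum(b * v for b, v in zip(beta, xi))
            r = sigmoid(eta) - yi          # theta_i - y_i
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        beta0 -= lr * g0 / n
        beta = [b - lr * gj / n for b, gj in zip(beta, g)]
    return beta0, beta

# Hypothetical toy data: one predictor, classes overlap so the MLE is finite.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]
b0_hat, b_hat = fit(X, y)
print(b0_hat, b_hat, neg_log_lik(b0_hat, b_hat, X, y))
```

If the classes were perfectly separable, L would have no finite minimizer and the coefficients would keep growing with more iterations.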
Model Assessment and Selection
Unlike linear regression, logistic regression has no direct $R^2$ counterpart.
Common goodness-of-fit measure:
$$L(y \mid \hat{\beta}_0, \hat{\beta})$$
Improvement over intercept-only model:
$$L(y \mid \hat{\beta}_0) - L(y \mid \hat{\beta}_0, \hat{\beta})$$
Information criterion form:
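$$L(y \mid \hat{\beta}_0, \hat{\beta}) + k \alpha_n$$
where $k$ is predictor count, $n$ is sample size, $\alpha_n = 1$ (AIC), $\alpha_n = 3/2$ (KIC), $\alpha_n = \frac{1}{2} \log n$ (BIC).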
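The penalized score is simple to compute once the minimized negative log-likelihood is known. A small sketch, with hypothetical values for the fitted negative log-likelihood, predictor count, and sample size:

```python
import math

def information_criterion(nll, k, n, kind="AIC"):
    """Return nll + k * alpha_n for the chosen criterion.

    nll: minimized negative log-likelihood; k: predictor count; n: sample size.
    """
    alpha = {"AIC": 1.0, "KIC": 1.5, "BIC": 0.5 * math.log(n)}[kind]
    return nll + k * alpha

nll, k, n = 40.0, 3, 100  # hypothetical fitted values
for kind in ("AIC", "KIC", "BIC"):
    print(kind, information_criterion(nll, k, n, kind))
```

Lower scores are preferred. The BIC penalty per predictor exceeds the AIC penalty once (1/2) log n > 1, that is for n of 8 or more.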
Hypothesis test for one predictor:
$$H_0: \beta_j = 0 \quad \text{vs} \quad H_A: \beta_j \neq 0$$
Prediction
For new feature vector x′:
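$$\hat{\eta} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x'_j$$
$$\hat{P}(Y' = 1 \mid x') = \frac{1}{1 + \exp(-\hat{\eta})}, \qquad \hat{O} = e^{\hat{\eta}}$$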
Default class decision:
- predict class 1 if $\hat{P}(Y' = 1 \mid x') > 0.5$ (equivalently $\hat{\eta} > 0$),
- else class 0.
More generally, use threshold $T \in (0, 1)$:
- class 1 if $\hat{P}(Y' = 1 \mid x') \geq T$,
- class 0 otherwise.
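The prediction steps can be sketched as below. The fitted coefficients and the new feature vector are hypothetical, and the code follows the general "≥ T" rule for every threshold, including T = 0.5.

```python
import math

def predict_proba(beta0_hat, beta_hat, x_new):
    # eta_hat = beta0_hat + sum_j beta_hat_j * x'_j, then the logistic map
    eta_hat = beta0_hat + sum(b * v for b, v in zip(beta_hat, x_new))
    return 1.0 / (1.0 + math.exp(-eta_hat))

def predict_class(prob, threshold=0.5):
    # class 1 if the estimated probability reaches the threshold T
    return 1 if prob >= threshold else 0

beta0_hat, beta_hat = -2.0, [0.8, 0.5]  # hypothetical fitted coefficients
x_new = [3.0, 1.0]

prob = predict_proba(beta0_hat, beta_hat, x_new)  # eta_hat is about 0.9
odds_hat = prob / (1.0 - prob)                    # equals e^{eta_hat}
print(prob, odds_hat)
print(predict_class(prob), predict_class(prob, threshold=0.8))
```

Raising T makes the classifier more reluctant to predict class 1, which trades sensitivity for specificity.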
Evaluating Classifiers
Confusion matrix terms
- TP: predicted 1, true 1
- TN: predicted 0, true 0
- FP: predicted 1, true 0
- FN: predicted 0, true 1
Metrics
Classification accuracy:
$$CA = \frac{TP + TN}{TP + TN + FP + FN}$$
Sensitivity (true positive rate):
$$TPR = \frac{TP}{TP + FN}$$
Specificity (true negative rate):
$$TNR = \frac{TN}{TN + FP}$$
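A short sketch that tallies the four confusion-matrix counts and computes the three metrics. The labels and predictions are hypothetical, and the code assumes both classes occur in the true labels so no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
result = metrics(*counts)
print(counts, result)
```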
ROC and AUC
The model outputs a predicted score, such as P(Y=1∣X).
We choose a threshold T to convert the score into a class prediction.
$$\hat{Y} = \begin{cases} 1, & \text{if score} \geq T \\ 0, & \text{if score} < T \end{cases}$$
By changing the threshold (T), we get different values of sensitivity and specificity.
$$\text{sensitivity} = \frac{TP}{TP + FN}$$
$$\text{specificity} = \frac{TN}{TN + FP}$$
The ROC curve plots sensitivity (TPR) against $1 - \text{specificity}$, the false positive rate (FPR), as $T$ varies.
AUC is the area under the ROC curve.
AUC can be interpreted as the probability that a randomly chosen class-1 sample receives a higher predicted score than a randomly chosen class-0 sample.
For example, if $\text{AUC} = 0.8$, then the model has an 80% chance of ranking a random positive sample higher than a random negative sample.
In simple terms, AUC measures how well the model separates class 1 from class 0 across all possible thresholds.
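Both views of AUC can be checked on a small example: the area under the ROC curve traced by sweeping T, and the fraction of (class-1, class-0) pairs ranked correctly. This is a sketch with hypothetical scores. Counting tied scores as half a win and integrating with the trapezoid rule are conventions assumed here, not taken from the notes.

```python
def roc_points(y_true, scores):
    # One (FPR, TPR) point per distinct threshold, from strictest to loosest.
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted 1
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def auc_pairwise(y_true, scores):
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A positive "wins" a pair if it scores higher; ties count as half.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
pts = roc_points(y_true, scores)
print(auc_trapezoid(pts), auc_pairwise(y_true, scores))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, and the area computation gives the same value.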
Logarithmic loss
Per sample:
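$$\ell_i = \begin{cases} -\log \hat{P}(Y_i = 1 \mid x_i), & y_i = 1 \\ -\log \hat{P}(Y_i = 0 \mid x_i), & y_i = 0 \end{cases}$$
Total:
$$L = \sum_{i} \ell_i$$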
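Smaller log-loss means the predicted probabilities put more weight on the classes that actually occurred; confident wrong predictions are penalized heavily. When the probabilities come from a logistic regression model, the total log-loss is exactly the negative log-likelihood used for fitting.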
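A minimal sketch of the total log-loss, using hypothetical predictions. Clipping probabilities away from 0 and 1 with `eps` is an added safeguard against log(0), not part of the definition above.

```python
import math

def log_loss(y_true, p_hat, eps=1e-15):
    """Total log-loss; p_hat[i] is the predicted P(Y_i = 1 | x_i)."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)  # assumed safeguard against log(0)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total

y_true = [1, 0, 1]
mostly_right = [0.9, 0.1, 0.8]      # hypothetical predictions
confident_wrong = [0.9, 0.1, 0.01]  # last sample badly mispredicted

print(log_loss(y_true, mostly_right))
print(log_loss(y_true, confident_wrong))
```

Changing a single prediction from 0.8 to 0.01 for a true class-1 sample raises that sample's loss from −log 0.8 to −log 0.01, which dominates the total.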