首页 关于我 项目 博客 图谱 简历 联系 English
返回列表

1970年1月1日

2086 Lecture 9 Trees And Nearest Neighbour Methods

No description yet.

Lecture Note: Lecture 9 Notes.pdf

Previous: 2086 Lecture 8 - Model Selection and Penalized Regression

Machine Learning and CV Revisited

Use cross-validation to choose model complexity parameter γ\gamma (e.g. number of leaves, regularization strength).
General KK-fold CV idea:

  1. split into KK folds,
  2. train on K1K-1 folds, test on held-out fold,
  3. repeat over folds and repetitions,
  4. choose complexity with smallest average prediction error.

Decision Trees

[!NOTE] Other Unit In FIT3152, we also mentioned deeper content at 3152 Lecture 6 - Decision Tree

A decision tree recursively splits predictor space into disjoint regions:

R1,,RL,RiRj= (ij)R_1,\dots,R_L,\qquad R_i\cap R_j=\varnothing\ (i\neq j)

Each leaf stores a simple local model:

  • regression: average/normal model in leaf,
  • classification: class probability (e.g. Bernoulli parameter).

Tree complexity is mainly controlled by number of leaves LL.

Forward growing (greedy)

  1. Start with root node.
  2. Try candidate splits on predictors.
  3. Choose split with best score improvement.
  4. Repeat until no useful split.

Common scoring idea: negative log-likelihood (or information criterion / CV score).

Splitting criterion for binary targets

In a leaf with n1n_1 ones and n0n_0 zeros:

p(yθ)=θn1(1θ)n0,θ^=n1np(y\mid\theta)=\theta^{n_1}(1-\theta)^{n_0},\qquad \hat\theta=\frac{n_1}{n}

Minimized negative log-likelihood:

logp(yθ^)=n1log(n1n)n0log(n0n)-\log p(y\mid\hat\theta) =-n_1\log\left(\frac{n_1}{n}\right)-n_0\log\left(\frac{n_0}{n}\right)

Purity interpretation:

  • highest uncertainty around n1/n=0.5n_1/n=0.5,
  • pure leaf (n1=0n_1=0 or n1=nn_1=n) gives lower loss.

For numeric predictors, split by threshold:

xjcvsxj>cx_j\le c \quad \text{vs}\quad x_j>c

choose cc that gives best score.

Trees + CV (grow then prune)

Typical practice:

  1. Grow a large overfitted tree.
  2. Prune back to candidate sizes L=1,,LmaxL=1,\dots,L_{\max}.
  3. Use CV to choose best LL.
  4. Fit final tree with chosen LL on full data.

Tree strengths / weaknesses

Strengths:

  • interpretable,
  • handles nonlinearities and interactions naturally,
  • works with continuous and categorical predictors,
  • embedded variable selection (unused variables are excluded).

Weaknesses:

  • unstable (small data perturbations can change tree),
  • search space is huge (greedy procedure may miss global optimum),
  • can be inefficient for simple linear relationships.

Random Forests

Random forest = ensemble of many trees.

Training idea

  • Grow qq trees.
  • At each split, only a random subset of predictors is considered.
  • This reduces correlation between trees and mitigates greedy instability.

Prediction

  • Regression: average predictions across trees.
  • Classification: average class probabilities (or majority vote).

Why it works:

  • single tree: low bias, high variance;
  • averaging many trees: keeps low bias, reduces variance.

Pros / Cons

Pros:

  • strong predictive accuracy,
  • much more stable than one tree,
  • handles nonlinear and mixed-type data well.

Cons:

  • much less interpretable than a single tree,
  • many hyperparameters and algorithm choices.

k-Nearest Neighbours (k-NN)

k-NN is a non-parametric method: it does not fit an explicit global model.

Given training pairs (xi,yi)(x_i,y_i) and new point xx':

  1. Compute distances
di=d(xi,x),i=1,,nd_i=d(x_i,x'),\quad i=1,\dots,n
  1. Sort by distance.
  2. Take kk nearest targets y(1),,y(k)y_{(1)},\dots,y_{(k)}.
  3. Aggregate:
y^=f ⁣(y(1),,y(k))\hat y'=f\!\left(y_{(1)},\dots,y_{(k)}\right)

For regression (simple average):

y^=1ki=1ky(i)\hat y'=\frac{1}{k}\sum_{i=1}^k y_{(i)}

For classification: majority vote (or averaged probabilities).

Distance and weighting

Common distance: Euclidean

d(x,x)=(j=1p(xjxj)2)1/2d(x,x')=\left(\sum_{j=1}^p(x_j-x'_j)^2\right)^{1/2}

Can use weighted aggregation so closer neighbours have larger weights.

Practical tuning

Need to tune:

  • neighborhood size kk,
  • distance metric,
  • weighting/kernel function.

Standard approach: use CV (often LOO-CV) to choose settings with smallest prediction error.

k-NN strengths / weaknesses

Strengths:

  • conceptually simple,
  • weak assumptions,
  • flexible with suitable distance functions.

Weaknesses:

  • many tuning choices,
  • variable selection is not automatic,
  • no interpretability (no explicit model parameters).

反向链接