January 1, 1970

2086 Lecture 6 Linear Regression

Lecture Note: Lecture 6 Notes.pdf

Linear Regression

Linear regression describes the relationship between an outcome $y$ and a factor $x$. The factor is also called a predictor or regressor.

Single Linear Regression

The single (simple) linear regression formula consists of an intercept and one coefficient:

$$\mathbb{E}[Y \mid x] = \hat{y} = \beta_0 + \beta_1 x$$

$\beta_0$ is the intercept: the predicted outcome when the predictor $x = 0$. $\beta_1$ is the slope: the change in the predicted outcome for a one-unit increase in $x$.

We also use the residual error to show how well the model fits each data point:

$$
\begin{aligned}
e_i &= y_i - \hat{y}_i, \\
&= y_i - \beta_0 - \beta_1 x_i.
\end{aligned}
$$
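As a minimal sketch of the two formulas above, the snippet below evaluates the fitted line and the residuals for a handful of points. The coefficient values and the data are made up purely for illustration; they are not taken from the lecture.

```python
# Hypothetical coefficients, chosen only for illustration.
beta0 = 1.0  # intercept: the prediction when x = 0
beta1 = 2.0  # slope: change in the prediction per unit increase in x


def predict(x, beta0, beta1):
    """Return y_hat = beta0 + beta1 * x for a single predictor value."""
    return beta0 + beta1 * x


# Made-up sample data (x_i, y_i).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.5, 2.5, 5.0, 7.5]

y_hats = [predict(x, beta0, beta1) for x in xs]
# Residual e_i = y_i - y_hat_i: positive when the line under-predicts.
residuals = [y - y_hat for y, y_hat in zip(ys, y_hats)]

for x, y, y_hat, e in zip(xs, ys, y_hats, residuals):
    print(f"x={x:.1f}  y={y:.1f}  y_hat={y_hat:.1f}  e={e:+.1f}")
```

A residual of zero means the line passes exactly through that sample.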

Multiple Linear Regression

Multiple linear regression uses several predictors $x_1, x_2, \ldots, x_p$, with one coefficient per predictor. The formula of multiple linear regression is:

$$
\begin{aligned}
\mathbb{E}[Y_i \mid x_{i,1}, \ldots, x_{i,p}] &= \hat{y}_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p}, \\
&= \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j}.
\end{aligned}
$$

$\beta_0$ is still the intercept: the predicted outcome when all predictors are zero, $x_1 = x_2 = \cdots = x_p = 0$.
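The sum in the formula above is just a dot product between the coefficients and one row of predictor values. A small sketch, assuming hypothetical coefficients and made-up rows with $p = 3$:

```python
def predict(x_row, beta0, betas):
    """Return beta0 + sum_j betas[j] * x_row[j] for one sample."""
    if len(x_row) != len(betas):
        raise ValueError("need exactly one coefficient per predictor")
    return beta0 + sum(b * x for b, x in zip(betas, x_row))


# Hypothetical intercept and coefficients (p = 3), for illustration only.
beta0 = 0.5
betas = [2.0, -1.0, 0.25]

# Each row holds the predictor values x_{i,1}, ..., x_{i,p} of one sample.
X = [
    [0.0, 0.0, 0.0],  # all predictors zero: the prediction is the intercept
    [1.0, 2.0, 4.0],
    [3.0, 1.0, 8.0],
]

y_hats = [predict(row, beta0, betas) for row in X]
print(y_hats)
```

The first row shows the role of the intercept directly: with every predictor at zero, the prediction equals `beta0`.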

Least Squares

Just as we minimize the SSE when choosing an estimator, linear regression has its own measure of how well the model fits across the samples.

The residual sum of squares (RSS) measures the goodness of fit:

$$
\begin{aligned}
\mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) &= \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2, \\
&= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
&= \sum_{i=1}^{n} e_i^2.
\end{aligned}
$$

It is the sum of squared differences between the predicted values and the observed sample values. The smaller the RSS, the more closely the model fits the samples.

The principle of least squares states that the best model is the one that minimizes the RSS, i.e.

$$(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \arg \min_{\beta_0, \beta_1, \ldots, \beta_p} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2 \right\}$$
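One way to carry out this minimization numerically is NumPy's least-squares solver. The sketch below is an illustration under assumptions, not the lecture's own code: the helper names `fit_least_squares` and `rss` are hypothetical, and the data are generated without noise so the true coefficients are known.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return (beta0_hat, betas_hat) minimizing the RSS for an (n, p) matrix X."""
    n = X.shape[0]
    # A leading column of ones lets the solver estimate the intercept.
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]


def rss(X, y, beta0, betas):
    """Residual sum of squares for the given coefficients."""
    residuals = y - (beta0 + X @ betas)
    return float(residuals @ residuals)


# Made-up data that follow y = 1 + 2*x1 + 3*x2 exactly (no noise).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

beta0_hat, betas_hat = fit_least_squares(X, y)
print("intercept:", beta0_hat, "coefficients:", betas_hat)
print("RSS at the estimate:", rss(X, y, beta0_hat, betas_hat))
# Any other coefficient choice gives an RSS at least as large.
print("RSS at a perturbed guess:", rss(X, y, beta0_hat + 0.5, betas_hat))
```

Because the data are noise-free, the estimate recovers the generating coefficients up to floating-point error and the RSS is essentially zero; with real data the minimum RSS is positive.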

$R^2$ Score

We can use the RSS to judge how good our model is: the smaller the better. But there is a problem: the RSS depends on the units of measurement. Say we measure a person's height: the actual height is 184 cm and the predicted value is 180 cm.

Then the RSS in cm is $(184 - 180)^2 = 16$, while the RSS in mm is $(1840 - 1800)^2 = 1600$. The prediction error is the same, but the number is 100 times larger, so the raw RSS is hard to interpret on its own.

The way to solve this is to compare the RSS against the sum of squared deviations from the sample mean, which we call the total sum of squares (TSS):

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

This sets a baseline: it is the RSS of a model that ignores the predictors and predicts the sample mean $\bar{y}$ for every sample.

The $R^2$ value is then defined as

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

This compares the RSS (using the model) with the TSS (not using the model, only the sample mean). If $R^2$ is close to 0, the RSS is close to the TSS: the model does little better than the sample mean and does not describe the data well. If $R^2$ is close to 1, the RSS is much smaller than the TSS, so the ratio RSS/TSS is close to 0 and the model fits the data well.
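To see why the ratio removes the unit problem, the sketch below computes RSS and $R^2$ for the same predictions expressed in cm and in mm. The heights and predictions are made-up numbers (only the 184/180 pair echoes the example above), and the helper names are hypothetical.

```python
def rss(ys, y_hats):
    """Residual sum of squares."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))


def r_squared(ys, y_hats):
    """R^2 = 1 - RSS / TSS, where TSS uses the sample mean as baseline."""
    y_bar = sum(ys) / len(ys)
    tss = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - rss(ys, y_hats) / tss


# Made-up actual and predicted heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 184.0]
preds_cm = [152.0, 158.0, 171.0, 180.0]

# The same values converted to millimetres.
heights_mm = [10.0 * h for h in heights_cm]
preds_mm = [10.0 * p for p in preds_cm]

rss_cm, rss_mm = rss(heights_cm, preds_cm), rss(heights_mm, preds_mm)
r2_cm, r2_mm = r_squared(heights_cm, preds_cm), r_squared(heights_mm, preds_mm)

print("RSS:", rss_cm, "(cm) vs", rss_mm, "(mm)")  # differs by a factor of 100
print("R^2:", r2_cm, "(cm) vs", r2_mm, "(mm)")    # agrees up to rounding
```

Changing units multiplies both RSS and TSS by the same factor, so it cancels in the ratio.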

Choosing Predictor

Overfitting and Underfitting

Overfitting of a linear regression model means the model lacks generalization: it learns too much from the training data, including its random noise, and therefore cannot predict new data correctly.

Underfitting of a linear regression model means the model has not learned enough: without capturing enough of the structure in the data, it cannot make correct predictions.
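A rough way to see both effects is to fit models with too few and too many predictors and compare the error on the training data with the error on fresh data. The sketch below is an illustration under assumptions: it uses powers of a single variable as the predictors, synthetic data from a sine curve plus noise, and mean squared error (RSS divided by the number of samples) so that sets of different sizes are comparable. None of these choices come from the lecture.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic samples: a smooth signal plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y


def mse(model, x, y):
    """Mean squared error, i.e. RSS / n."""
    residuals = y - model(x)
    return float(residuals @ residuals) / len(y)


x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

results = {}
for degree in (0, 3, 10):
    # Degree d means predictors x, x^2, ..., x^d plus an intercept.
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

The training error can only go down as predictors are added, so it cannot reveal overfitting by itself; the gap between training and test error is the thing to watch.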

Hypothesis testing to remove predictor

We can use hypothesis testing to decide which predictors to keep in the model.

$$
\begin{aligned}
H_0 &: \beta_j = 0 \\
\text{vs} \\
H_A &: \beta_j \neq 0
\end{aligned}
$$

Say we want to check whether predictor $x_j$ is important; then we test its coefficient $\beta_j$.

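If $\beta_j = 0$, then $x_j$ does not contribute to the prediction. We compute a p-value under this null hypothesis and compare it with a significance level such as $\alpha = 0.05$: if the p-value is below $\alpha$, we reject $H_0$ and keep $x_j$; otherwise there is not enough evidence that $x_j$ is important, and it is a candidate for removal.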
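The notes do not say which test statistic is used, so the following is a sketch under assumptions: it uses the classical t-statistic $t_j = \hat{\beta}_j / \mathrm{se}(\hat{\beta}_j)$ for a least-squares fit with Gaussian noise, and it approximates the two-sided p-value with the standard normal distribution, which is reasonable only for large $n$ (an exact p-value would use the t-distribution with $n - p - 1$ degrees of freedom, for example via `scipy.stats.t.sf`). The function name and the synthetic data are hypothetical.

```python
import math
import numpy as np


def coefficient_tests(X, y):
    """Fit by least squares; return (beta_hat, t_stats, approx_p_values).

    Index 0 is the intercept; index j >= 1 belongs to predictor x_j.
    """
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta_hat
    # Unbiased estimate of the noise variance.
    sigma2_hat = residuals @ residuals / (n - p - 1)
    std_err = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(A.T @ A)))
    t_stats = beta_hat / std_err
    # Two-sided p-value under H0, using a large-sample normal approximation.
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2.0)) for t in t_stats])
    return beta_hat, t_stats, p_values


# Synthetic data: x_1 truly matters (coefficient 2), x_2 does not (coefficient 0).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=n)

beta_hat, t_stats, p_values = coefficient_tests(X, y)
alpha = 0.05
for j in (1, 2):
    decision = "keep" if p_values[j] < alpha else "candidate for removal"
    print(f"x_{j}: beta_hat={beta_hat[j]:+.3f}, t={t_stats[j]:+.2f}, "
          f"p~{p_values[j]:.3g} -> {decision}")
```

Note that with $\alpha = 0.05$ a predictor whose true coefficient is zero will still be kept about 5% of the time, so the decision for `x_2` can vary with the random seed.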

Information Criteria

Besides hypothesis testing, we can select a model using an information criterion:

$$L(y \mid \hat{\mathcal{M}}) + \alpha(n, k_{\mathcal{M}})$$

where $L(y \mid \hat{\mathcal{M}})$ is the minimized negative log-likelihood of model $\mathcal{M}$, that is, the negative log-likelihood evaluated at the maximum likelihood estimates of the model's parameters.

Maximizing the likelihood alone tends to overfit, because adding predictors never decreases the maximized likelihood, so the largest model always wins. This is why we add $\alpha(n, k_{\mathcal{M}})$, the complexity penalty: the more predictors $k_{\mathcal{M}}$, the higher the penalty. With AIC as the complexity penalty, $\alpha(n, k_{\mathcal{M}}) = k_{\mathcal{M}}$. We prefer the model with the smallest total.
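For a linear model with Gaussian noise, the minimized negative log-likelihood has a closed form in terms of the RSS: $\frac{n}{2}\left(\log(2\pi\hat{\sigma}^2) + 1\right)$ with $\hat{\sigma}^2 = \mathrm{RSS}/n$. The sketch below uses that form to score nested candidate models with the AIC penalty. It is an illustration under assumptions: the Gaussian noise model, the synthetic data, the helper names, and the choice to count the intercept plus the $p$ coefficients as $k_{\mathcal{M}}$ are not taken from the notes, and conventions for counting parameters differ.

```python
import numpy as np


def fit_rss(X, y):
    """RSS of the least-squares fit with an intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return float(residuals @ residuals)


def gaussian_min_nll(rss, n):
    """Negative log-likelihood at the maximum likelihood estimates."""
    sigma2_hat = rss / n  # ML estimate of the noise variance
    return 0.5 * n * (np.log(2.0 * np.pi * sigma2_hat) + 1.0)


# Synthetic data: only the first of three predictors affects y.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

scores = []
for p in (1, 2, 3):
    nll = gaussian_min_nll(fit_rss(X[:, :p], y), n)
    k = p + 1  # assumption: intercept plus p coefficients
    scores.append((p, nll, nll + k))  # AIC-style penalty: alpha(n, k) = k
    print(f"first {p} predictor(s): NLL={nll:.2f}, NLL + k={nll + k:.2f}")

best_p = min(scores, key=lambda s: s[2])[0]
print("model with the smallest criterion uses", best_p, "predictor(s)")
```

The negative log-likelihood column never increases as predictors are added, which is exactly why it needs the penalty term before it can be used to compare models.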
