January 1, 1970

2086 Lecture 6 Linear Regression

Linear Regression

Linear regression describes the relationship between an outcome $y$ and a factor $x$. The factor is also called a predictor or regressor.

Single Linear Regression

Single linear regression (more commonly called simple linear regression) uses one predictor. Its formula has one coefficient $\beta_1$ and the intercept $\beta_0$:

\mathbb{E}[Y \mid x] = \hat{y} = \beta_0 + \beta_1 x

$\beta_0$ is the intercept: the predicted outcome when the predictor $x = 0$. $\beta_1$ is the slope: the change in the predicted outcome per unit increase in $x$.

We use the residual error $e_i$ to show how well the model fits each data point:

\begin{aligned} e_i &= y_i - \hat{y}_i, \\ &= y_i - \beta_0 - \beta_1 x_i. \end{aligned}
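
The residual formula can be sketched in a few lines of Python. The coefficients and data below are hypothetical values chosen only for illustration, and the function names `predict` and `residuals` are not from the lecture.

```python
def predict(beta0, beta1, x):
    """Predicted outcome y_hat = beta0 + beta1 * x."""
    return beta0 + beta1 * x


def residuals(beta0, beta1, xs, ys):
    """Residual e_i = y_i - y_hat_i for every sample."""
    return [y - predict(beta0, beta1, x) for x, y in zip(xs, ys)]


# Hypothetical coefficients and data, for illustration only.
beta0, beta1 = 1.0, 2.0
xs = [0.0, 1.0, 2.0]
ys = [1.5, 2.5, 5.5]

print(residuals(beta0, beta1, xs, ys))  # [0.5, -0.5, 0.5]
```

A positive residual means the model predicted too low for that sample, and a negative residual means it predicted too high.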

Multiple Linear Regression

Multiple linear regression has multiple predictors $x_1, x_2, \ldots, x_p$. Each predictor has its own coefficient, so the formula for multiple linear regression is:

\begin{aligned} \mathbb{E}[Y_i \mid x_{i,1}, \ldots, x_{i,p}] &= \hat{y}_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p}, \\ &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j}. \end{aligned}

$\beta_0$ is still the intercept: the predicted outcome when all predictors are zero, $x_1 = x_2 = \cdots = x_p = 0$.
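
As a minimal sketch, the multiple regression prediction is the intercept plus a weighted sum over one row of predictor values. The function name `predict_multiple`, the coefficients, and the data rows are hypothetical.

```python
def predict_multiple(beta0, betas, row):
    """y_hat_i = beta0 + sum_j beta_j * x_ij for one sample (one row of predictors)."""
    if len(betas) != len(row):
        raise ValueError("need exactly one coefficient per predictor")
    return beta0 + sum(b * x for b, x in zip(betas, row))


# Hypothetical model with p = 3 predictors.
beta0 = 1.0
betas = [2.0, -1.0, 0.5]

# Two samples; the second has every predictor equal to zero.
X = [
    [1.0, 2.0, 4.0],
    [0.0, 0.0, 0.0],
]

print([predict_multiple(beta0, betas, row) for row in X])  # [3.0, 1.0]
```

The second sample shows the role of the intercept: with all predictors at zero, the prediction is exactly $\beta_0$.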

Least Squares

Just as we minimized the SSE when choosing an estimator, linear regression has its own way of measuring how well the model fits across the samples.

The residual sum of squares (RSS) represents the goodness of fit:

\begin{aligned} \mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) &= \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2, \\ &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\ &= \sum_{i=1}^{n} e_i^2. \end{aligned}

It is the sum of squared differences between the predicted values and the observed sample values. The smaller the RSS, the more closely the model fits the samples.

The principle of least squares states that the best model is the one that minimizes the RSS, i.e.

(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \arg \min_{\beta_0, \beta_1, \ldots, \beta_p} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2 \right\}
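
For a single predictor, this minimization has a well-known closed-form solution: $\hat{\beta}_1 = S_{xy} / S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. The notes do not state this formula, so the sketch below is an illustration of the least-squares principle rather than the lecture's own method. The function names and the data are hypothetical.

```python
def rss(beta0, beta1, xs, ys):
    """Residual sum of squares for the line y_hat = beta0 + beta1 * x."""
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Least-squares estimates for one predictor (assumes the xs are not all equal)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical noisy data scattered around a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 2.9, 5.1, 6.8, 9.0]

b0, b1 = fit_line(xs, ys)
best = rss(b0, b1, xs, ys)

# Moving either estimate away from the minimizer cannot decrease the RSS.
print(best <= rss(b0 + 0.1, b1, xs, ys))  # True
print(best <= rss(b0, b1 + 0.1, xs, ys))  # True
```

With more than one predictor the same principle applies, but the estimates come from solving a system of linear equations instead of two scalar formulas.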

$R^2$ Score

We can use the RSS to judge how good our model is: the smaller, the better. But there is a problem: the RSS depends on the units of measurement. Say we measure a person's height: the actual height is 184 cm and the predicted value is 180 cm.

In centimeters the squared error is $(184-180)^2 = 16$, but in millimeters it is $(1840-1800)^2 = 1600$. The prediction is equally good in both cases, so the raw RSS value is not meaningful on its own.

The way to solve this is to compare the RSS with a baseline, the total sum of squares (TSS), which is the sum of squared differences between each sample and the sample mean $\bar{y}$:

\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2

This sets a baseline: it is the RSS of a model that ignores the predictors and always predicts the sample mean.

The $R^2$ value is then defined as

R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}

This is a ratio comparing the RSS (using the model) with the TSS (without using the model). If $R^2$ is close to 0, the RSS is close to the TSS, so the model describes the data hardly better than the sample mean does. If $R^2$ is close to 1, the RSS is much smaller than the TSS and the ratio $\mathrm{RSS}/\mathrm{TSS}$ is close to 0, so the model fits the data well.
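
The following sketch checks the unit argument numerically: converting heights from centimeters to millimeters multiplies the RSS by 100 but leaves $R^2$ unchanged. The heights and predictions are hypothetical numbers (the first pair reuses the 184 cm vs 180 cm example), and the predictions are simply assumed, not produced by a fitted model.

```python
def rss(ys, y_hats):
    """Residual sum of squares between observed and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))


def tss(ys):
    """Total sum of squares around the sample mean."""
    y_bar = sum(ys) / len(ys)
    return sum((y - y_bar) ** 2 for y in ys)


def r_squared(ys, y_hats):
    """R^2 = 1 - RSS / TSS."""
    return 1.0 - rss(ys, y_hats) / tss(ys)


# Hypothetical observed and predicted heights in centimeters.
heights_cm = [184.0, 170.0, 165.0, 178.0]
predicted_cm = [180.0, 172.0, 166.0, 175.0]

# The same values expressed in millimeters.
heights_mm = [10 * h for h in heights_cm]
predicted_mm = [10 * h for h in predicted_cm]

print(rss(heights_cm, predicted_cm))  # 30.0
print(rss(heights_mm, predicted_mm))  # 3000.0

# R^2 is a ratio of two quantities in the same squared units, so the units cancel.
print(r_squared(heights_cm, predicted_cm))
print(r_squared(heights_mm, predicted_mm))
```

Both $R^2$ values come out at roughly 0.86, which is why $R^2$ can be compared across datasets measured in different units while the RSS cannot.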

Choosing Predictor

Overfitting and Underfitting

Overfitting means the model lacks generalization: it has learned too much from the training data, including its random noise, so it cannot predict new data correctly.

Underfitting means the model has not learned enough from the data: it has not captured enough of the underlying structure to make correct predictions.
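
A small sketch can make both failure modes concrete by comparing the RSS on training data with the RSS on new data. All data here are hypothetical: the training outcomes are $2x + 1$ plus hand-picked noise. Three models are compared: the sample mean (too simple), a least-squares line, and a degree-5 polynomial. The polynomial is a linear regression on the predictors $x, x^2, \ldots, x^5$; with six coefficients and six training points its least-squares fit passes through every point, so the code evaluates it with the Lagrange interpolation formula as a shortcut.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    return y_bar - beta1 * x_bar, beta1


def interpolate(train_x, train_y, x):
    """Lagrange form of the degree n-1 polynomial through all n training points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        weight = 1.0
        for j, xj in enumerate(train_x):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def rss(ys, y_hats):
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))


# Hypothetical data: y = 2x + 1 plus small hand-picked noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.5, 2.6, 5.3, 6.4, 9.4, 10.8]
test_x = [0.5, 4.5]
test_y = [2.1, 9.9]

y_bar = sum(train_y) / len(train_y)
b0, b1 = fit_line(train_x, train_y)

models = {
    "mean only (underfit)": lambda x: y_bar,
    "straight line": lambda x: b0 + b1 * x,
    "degree-5 polynomial (overfit)": lambda x: interpolate(train_x, train_y, x),
}

results = {}
for name, model in models.items():
    train_rss = rss(train_y, [model(x) for x in train_x])
    test_rss = rss(test_y, [model(x) for x in test_x])
    results[name] = (train_rss, test_rss)
    print(f"{name}: train RSS = {train_rss:.3f}, test RSS = {test_rss:.3f}")
```

In this setup the polynomial has the lowest training RSS (zero, because it reproduces every training point including its noise) but a worse test RSS than the line, while the mean-only model is poor on both sets.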

Hypothesis testing to remove predictor

We can also use hypothesis testing to decide whether a predictor can be removed.

\begin{aligned} H_0 &: \beta_j = 0 \\ \text{vs} \\ H_A &: \beta_j \neq 0 \end{aligned}

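Say we want to check whether the predictor $x_j$ is important. Then we should test its coefficient $\beta_j$.

If $\beta_j = 0$, then $x_j$ does not contribute to the prediction. We compute a p-value under this null hypothesis and compare it with a significance level such as $\alpha = 0.05$: if the p-value is below $\alpha$, we reject $H_0$ and keep $x_j$; otherwise there is not enough evidence that $x_j$ is important, and it is a candidate for removal.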
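
The notes do not say which test produces the p-value. One common choice, assumed here, is the t-test on the coefficient. The sketch below covers only the single-predictor case, where the standard error of the slope is $\sqrt{\hat{\sigma}^2 / S_{xx}}$ with $\hat{\sigma}^2 = \mathrm{RSS}/(n-2)$. To stay within the standard library it compares $|t|$ with the tabulated critical value instead of computing a p-value; the two decisions are equivalent at $\alpha = 0.05$. The data are hypothetical.

```python
import math


def slope_t_statistic(xs, ys):
    """t statistic for H0: beta_1 = 0 in a single-predictor regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    sigma2 = rss / (n - 2)            # estimated noise variance
    std_err = math.sqrt(sigma2 / s_xx)  # standard error of the slope estimate
    return beta1 / std_err


# Two-sided 5% critical value of Student's t with n - 2 = 8 degrees of freedom.
T_CRIT = 2.306

# Hypothetical data: one outcome depends on x, the other is noise only.
xs = [float(x) for x in range(1, 11)]
noise = [2.0, -1.0, 1.0, -2.0, 0.0, 2.0, -2.0, 1.0, -1.0, 0.0]
ys_related = [3.0 + 2.0 * x + e for x, e in zip(xs, noise)]
ys_unrelated = noise

for label, ys in [("related", ys_related), ("unrelated", ys_unrelated)]:
    t = slope_t_statistic(xs, ys)
    decision = "reject H0, keep x" if abs(t) > T_CRIT else "fail to reject H0"
    print(f"{label}: t = {t:.2f} -> {decision}")
```

With several predictors the standard error of each coefficient comes from the full design matrix, so this single-predictor formula should not be reused there.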

Information Criteria

Besides hypothesis testing, we can select a model by minimizing an information criterion of the form:

L(y \mid \hat{\mathcal{M}}) + \alpha(n, k_{\mathcal{M}})

where $L(y \mid \hat{\mathcal{M}})$ is the minimized negative log-likelihood of model $\mathcal{M}$, that is, the negative log-likelihood evaluated at the maximum likelihood estimates of its parameters.

Maximizing the likelihood alone tends to overfit, because adding predictors never makes the fit worse, so the largest model always wins. The complexity penalty $\alpha(n, k_{\mathcal{M}})$ counteracts this: the more predictors $k_{\mathcal{M}}$ the model has, the higher the penalty. For example, AIC uses $\alpha(n, k_{\mathcal{M}}) = k_{\mathcal{M}}$.
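
As a rough sketch of how the criterion trades fit against complexity, assume Gaussian noise with the variance estimated by maximum likelihood, $\hat{\sigma}^2 = \mathrm{RSS}/n$. The minimized negative log-likelihood is then $\frac{n}{2}\left(\log(2\pi\,\mathrm{RSS}/n) + 1\right)$. The Gaussian assumption, the function names, and the RSS values below are not from the notes; $k$ counts predictors, following the AIC penalty described above.

```python
import math


def neg_log_likelihood(rss, n):
    """Minimized Gaussian negative log-likelihood, using sigma^2 = RSS / n."""
    return 0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)


def information_criterion(rss, n, k):
    """Negative log-likelihood plus the AIC-style penalty alpha(n, k) = k."""
    return neg_log_likelihood(rss, n) + k


n = 10

# Hypothetical candidates: (RSS on the data, number of predictors k).
# Adding x lowers the RSS only slightly, from 20.0 to 19.2.
candidates = {
    "intercept only": (20.0, 0),
    "intercept + x": (19.2, 1),
}

scores = {
    name: information_criterion(rss, n, k)
    for name, (rss, k) in candidates.items()
}
best = min(scores, key=scores.get)
print(best)  # intercept only
```

Here the small drop in RSS does not pay for the extra predictor, so the simpler model is selected. Had the RSS fallen sharply instead, the larger model would score lower and win.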
