1 April 2026

3152 Lecture 4

Regression Modelling

FIT3152, Class Note, English

Previous: 3152 Lecture 3 - Data Manipulation and EDA
Next: 3152 Lecture 5 - Clustering

> [!NOTE] Other Unit
> FIT2086 also covers regression, in 2086 Lecture 6 - Linear Regression.

Regression

Regression models the relationship between two or more variables. It can be used to observe the effect of independent variables on the dependent variable, predict values for new data, and determine the relative importance of variables.

The dependent variable is the output. The independent variables are the inputs or predictors. Linear regression assumes a straight-line relationship, but other relationships can also be modelled.

Regression is a form of supervised learning because the model is learned from data with known inputs and outputs.

Simple Linear Regression

Simple linear regression has one predictor and one response variable. The relationship can be written as:

$$y \approx ax+b$$

Or with error term:

$$y=ax+b+\epsilon$$

Where:

$$\begin{aligned} y &= \text{response variable} \\ x &= \text{predictor variable} \\ a &= \text{slope / coefficient} \\ b &= \text{intercept} \\ \epsilon &= \text{error term} \end{aligned}$$

The slope $a$ tells us how much the predicted $y$ changes when $x$ increases by 1 unit. The intercept $b$ is the predicted value of $y$ when $x=0$.

Least Squares

Linear regression uses least squares to find the best line. The idea is to choose the slope and intercept that minimise the sum of squared errors between the observed values and the fitted values.

$$SSE=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$

Where:

$$\begin{aligned} y_i &= \text{observed value} \\ \hat{y}_i &= \text{fitted / predicted value} \\ SSE &= \text{sum of squared errors} \end{aligned}$$

The smaller the SSE, the closer the fitted line is to the observed data.
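As a minimal sketch of the least squares idea, the block below fits $y \approx ax+b$ with the standard closed-form solution and compares the SSE of the fitted line against a slightly different line. The data values and the helper names `fit_line` and `sse` are made up for illustration and are not from the lecture.

```python
# Minimal sketch: simple least squares using only the standard library.
# The data and function names are illustrative assumptions.

def fit_line(xs, ys):
    """Return (a, b) that minimise the sum of squared errors for y ≈ a*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form least squares solution for one predictor.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b


def sse(xs, ys, a, b):
    """Sum of squared errors between observed y and fitted a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up observations

a, b = fit_line(xs, ys)
print(f"slope a = {a:.3f}, intercept b = {b:.3f}")
print(f"SSE at the fitted line   = {sse(xs, ys, a, b):.4f}")
print(f"SSE with a nudged by 0.1 = {sse(xs, ys, a + 0.1, b):.4f}")
```

Any change to the slope or intercept away from the least squares solution gives a larger SSE, which is what "best line" means here.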

Assumptions

Simple least squares regression assumes that the relationship is approximately linear, that $x$ and $y$ are numerical variables, and that the errors are approximately normally distributed. The coefficients are the values that minimise the squared error.

Residuals are the differences between the observed values and the fitted values.

$$\text{Residual}=y-\hat{y}$$

Ideally, residuals should be approximately normally distributed and centred around 0.
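A short sketch of computing residuals for a fitted line, assuming made-up data. When the model includes an intercept, the least squares residuals sum to 0 (up to rounding), so "centred around 0" holds by construction; their shape is what needs checking.

```python
# Minimal sketch: residuals of a simple least squares fit (made-up data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 3.1, 6.4, 6.9, 10.8, 11.2]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
b = y_mean - a * x_mean

# Residual = observed value minus fitted value.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x = {x:.0f}  residual = {r:+.3f}")
print(f"sum of residuals = {sum(residuals):.2e}")
```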

Regression Diagnostics

Regression diagnostics help us check whether the model is reasonable.

A histogram of the residuals can show whether they are approximately normally distributed. A Q-Q plot is a better visual check of normality.

The model summary output gives the coefficients, p-values, residual standard error, $R^2$, adjusted $R^2$ and the F-statistic.
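The Q-Q check can be sketched without any plotting library: sort the standardised residuals and pair them with the normal quantiles expected at the same positions. The residual values below are made up, and the plotting positions $(i-0.5)/n$ are one common convention rather than the only choice.

```python
# Minimal sketch: the numbers behind a normal Q-Q plot (made-up residuals).
from statistics import NormalDist, mean, stdev

residuals = [-1.2, 0.4, 0.9, -0.3, 0.1, -0.7, 1.5, 0.2, -0.5, -0.4]
n = len(residuals)

# Standardise and sort the sample.
m, s = mean(residuals), stdev(residuals)
sample = sorted((r - m) / s for r in residuals)

# Theoretical standard normal quantiles at positions (i - 0.5) / n.
std_normal = NormalDist()
theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

print("theoretical  sample")
for t, q in zip(theoretical, sample):
    print(f"{t:11.2f}  {q:6.2f}")
```

If the residuals are close to normal, the pairs lie near the line where the sample quantile equals the theoretical quantile. Systematic curvature at the ends suggests skew or heavy tails.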

R-squared

$R^2$ measures the proportion of variability in the response variable that is explained by the model.

$$0 \leq R^2 \leq 1$$

Higher $R^2$ means the model explains more variation in the response.

For simple linear regression:

$$R^2=r^2$$

Where $r$ is the correlation between predictor and response.
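The identity can be checked numerically. This sketch assumes the usual definition $R^2 = 1 - SSE/SST$, where $SST$ (total sum of squares around the mean of $y$) is a symbol introduced here and not used elsewhere in these notes. The data are made up.

```python
# Minimal sketch: R^2 from sums of squares versus the squared correlation.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 3.1, 6.4, 6.9, 10.8, 11.2]  # made-up observations

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
syy = sum((y - y_mean) ** 2 for y in ys)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

# Least squares fit.
a = sxy / sxx
b = y_mean - a * x_mean

# R^2 as the share of total variation explained by the model.
sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
sst = syy
r_squared = 1 - sse / sst

# Pearson correlation between predictor and response.
r = sxy / math.sqrt(sxx * syy)

print(f"R^2 = {r_squared:.6f}")
print(f"r^2 = {r ** 2:.6f}")
```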

P-value

The p-value for a coefficient comes from a test of whether that coefficient is different from 0. If the p-value is smaller than 0.05, the predictor is usually considered statistically significant.

The F-statistic tests the overall significance of the regression. It tests whether at least one coefficient is not equal to 0.
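The test statistics behind these p-values can be sketched for simple linear regression. The block computes the t-statistic for the slope and the F-statistic for the whole model on made-up data, using the standard formulas (an assumption, since the notes do not state them). Turning them into p-values needs the t and F distributions, which the Python standard library does not provide, so that step is left to statistical software.

```python
# Minimal sketch: t-statistic for the slope and F-statistic for the model.
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 3.1, 6.4, 6.9, 10.8, 11.2]  # made-up observations

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
a = sxy / sxx
b = y_mean - a * x_mean

sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - y_mean) ** 2 for y in ys)

# Residual variance with n - 2 degrees of freedom (two estimated coefficients).
s2 = sse / (n - 2)
residual_standard_error = math.sqrt(s2)

# t-statistic: slope divided by its standard error.
se_a = math.sqrt(s2 / sxx)
t_stat = a / se_a

# F-statistic: explained variation per predictor over residual variance.
f_stat = ((sst - sse) / 1) / s2

print(f"residual standard error = {residual_standard_error:.4f}")
print(f"t-statistic for slope   = {t_stat:.4f}")
print(f"F-statistic             = {f_stat:.4f}")
```

With a single predictor, the F-statistic equals the square of the slope's t-statistic, so the two tests agree. They differ once there are several predictors.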

Prediction

The fitted regression model can be used to predict values for new data. The model can also produce confidence intervals or prediction intervals.

A confidence interval describes the uncertainty in the estimated mean response. A prediction interval describes the uncertainty for a new individual observation, so at the same confidence level it is wider than the confidence interval.
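For simple linear regression, the two intervals at a new predictor value $x_0$ are commonly written as follows. These are the standard textbook formulas rather than something stated in the lecture, and the symbols $x_0$, $\hat{y}_0$, $s$, $\bar{x}$, $S_{xx}$ and $t^{*}$ are introduced here for illustration.

$$\begin{aligned} \hat{y}_0 &= ax_0+b, \qquad s=\sqrt{\frac{SSE}{n-2}}, \qquad S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2 \\ \text{Confidence interval:}\quad & \hat{y}_0 \pm t^{*}\, s\sqrt{\frac{1}{n}+\frac{(x_0-\bar{x})^2}{S_{xx}}} \\ \text{Prediction interval:}\quad & \hat{y}_0 \pm t^{*}\, s\sqrt{1+\frac{1}{n}+\frac{(x_0-\bar{x})^2}{S_{xx}}} \end{aligned}$$

Here $t^{*}$ is the critical value of the t distribution with $n-2$ degrees of freedom at the chosen confidence level. The extra $1$ under the square root accounts for the error term $\epsilon$ of the new observation. Both intervals are narrowest at $x_0=\bar{x}$ and widen as $x_0$ moves away from the centre of the data.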

Multiple Linear Regression

Multiple linear regression has more than one predictor. The relationship can be written as:

$$y=a_1x_1+a_2x_2+a_3x_3+\cdots+b+\epsilon$$

Where each $a_i$ is the coefficient for predictor $x_i$.

The coefficient $a_i$ measures the effect of $x_i$ on $y$, holding other predictors constant. This is important because predictors may be related to each other.

Looking at each predictor individually may miss the combined effect of variables.
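A minimal sketch of fitting two predictors by least squares, solving the normal equations with plain Gaussian elimination. The helper names `solve` and `fit_multiple` and the data are illustrative assumptions. The responses are generated exactly from $y = 2x_1 - x_2 + 3$ so the recovered coefficients can be checked.

```python
# Minimal sketch: multiple linear regression via the normal equations.
# Standard library only; names and data are illustrative assumptions.

def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - known) / M[i][i]
    return z


def fit_multiple(X, y):
    """Return [b, a_1, ..., a_p] minimising the sum of squared errors."""
    rows = [[1.0] + list(row) for row in X]  # leading 1 for the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5), (6, 2)]
y = [2 * x1 - x2 + 3 for x1, x2 in X]  # exact, no noise

b, a1, a2 = fit_multiple(X, y)
print(f"b = {b:.3f}, a1 = {a1:.3f}, a2 = {a2:.3f}")

# Simple regression of y on x2 alone, ignoring x1.
x2 = [row[1] for row in X]
x2_mean = sum(x2) / len(x2)
y_mean = sum(y) / len(y)
slope_x2_alone = sum((u - x2_mean) * (v - y_mean) for u, v in zip(x2, y)) / sum(
    (u - x2_mean) ** 2 for u in x2
)
print(f"slope of x2 on its own = {slope_x2_alone:.3f}")
```

In this made-up data the coefficient of $x_2$ is $-1$ when $x_1$ is held constant, but the slope from regressing $y$ on $x_2$ alone is small and positive, because $x_1$ and $x_2$ tend to rise together.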

Qualitative Predictors

Qualitative predictors are categorical variables, such as eye colour, clarity, cut or colour. Regression needs numerical inputs, so categorical variables must be converted into indicator variables.

This is also called one-hot encoding (or dummy coding when the indicator for one level is dropped).

When a categorical variable has multiple levels, one level is used as the reference level and gets no indicator variable of its own. The coefficients of the other levels show the difference compared with the reference level. This is the same idea as treatment contrasts.
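A small sketch of indicator variables with a reference level. The function name `indicator_columns` and the level names `"Fair"`, `"Good"` and `"Ideal"` are hypothetical examples for a `cut` variable, not values taken from the lecture.

```python
# Minimal sketch: indicator variables with one reference level.
# Function name and level names are illustrative assumptions.

def indicator_columns(values, reference):
    """Return (column_levels, rows) with one 0/1 column per non-reference level."""
    levels = sorted(set(values))
    others = [level for level in levels if level != reference]
    rows = [[1 if v == level else 0 for level in others] for v in values]
    return others, rows


cut = ["Fair", "Good", "Ideal", "Good", "Fair", "Ideal"]
columns, encoded = indicator_columns(cut, reference="Fair")

print("columns:", columns)
for value, row in zip(cut, encoded):
    print(f"{value:<6} -> {row}")
```

A row of all zeros means the observation is at the reference level. In a fitted model, the coefficient on the `"Good"` column would then be the estimated difference in the response between `"Good"` and `"Fair"`, other predictors held constant.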

Non-linear Data

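If the relationship is not linear, transformation can help. For example, if both $x$ and $y$ grow exponentially, so that their values span several orders of magnitude, we can log-transform both variables and fit the regression to the transformed values.

$$\log(y) \text{ vs } \log(x)$$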

Log transformation can make a non-linear relationship closer to linear, which makes linear regression more suitable.
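As a sketch of why this works, suppose the data follow a power law $y = cx^k$ (an assumed form for illustration). Taking logs gives $\log(y) = k\log(x) + \log(c)$, which is a straight line with slope $k$ and intercept $\log(c)$. The block below uses made-up data with $c=2$ and $k=3$.

```python
# Minimal sketch: a power law becomes linear after a log-log transformation.
import math

xs = [1, 2, 4, 8, 16, 32]
ys = [2 * x ** 3 for x in xs]  # exact power law y = 2 * x^3, no noise

log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]

# Simple least squares on the transformed values.
n = len(xs)
lx_mean = sum(log_x) / n
ly_mean = sum(log_y) / n
slope = sum((u - lx_mean) * (v - ly_mean) for u, v in zip(log_x, log_y)) / sum(
    (u - lx_mean) ** 2 for u in log_x
)
intercept = ly_mean - slope * lx_mean

print(f"slope     = {slope:.4f}  (estimate of k)")
print(f"intercept = {intercept:.4f}  (estimate of log c)")
print(f"exp(intercept) = {math.exp(intercept):.4f}  (estimate of c)")
```

Predictions from such a model are on the log scale, so they need to be transformed back with the exponential function before being compared with the original $y$ values.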

Model Selection

When there are many predictors, a model may be improved by reducing the number of inputs. This can remove irrelevant variables, reduce variability in predictions and make the model easier to interpret.

Methods include subset selection, ridge regression, LASSO and dimension reduction.
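Ridge regression and LASSO can both be described as least squares with a penalty on the size of the coefficients. The following is a sketch in the notation of these notes; the tuning parameter $\lambda \geq 0$ and the number of predictors $p$ are symbols introduced here.

$$\begin{aligned} \text{Ridge:}\quad & \min_{a_1,\dots,a_p,\,b}\; SSE+\lambda\sum_{j=1}^{p}a_j^2 \\ \text{LASSO:}\quad & \min_{a_1,\dots,a_p,\,b}\; SSE+\lambda\sum_{j=1}^{p}|a_j| \end{aligned}$$

With $\lambda=0$ both reduce to ordinary least squares. As $\lambda$ grows, the coefficients are shrunk towards 0. LASSO can set some coefficients exactly to 0, so it also acts as a form of variable selection, while ridge regression generally keeps every predictor with a smaller coefficient.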
