1 April 2026
3152 Lecture 4
Regression Modelling
Lecture Slide: FIT3152 Lecture 04.pdf
Previous: 3152 Lecture 3 - Data Manipulation and EDA Next: 3152 Lecture 5 - Clustering
[!NOTE] Other Unit In FIT2086, regression was also covered in 2086 Lecture 6 - Linear Regression
Regression
Regression models the relationship between two or more variables. It can be used to observe the effect of independent variables on the dependent variable, predict values for new data, and determine the relative importance of variables.
The dependent variable is the output. The independent variables are the inputs or predictors. Linear regression assumes a straight-line relationship, but other relationships can also be modelled.
Regression is a form of supervised learning because the model is learned from data with known inputs and outputs.
Simple Linear Regression
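Simple linear regression has one predictor $x$ and one response variable $y$. The relationship can be written as:
$$y = \beta_0 + \beta_1 x$$
Or with error term:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Where $\beta_0$ is the intercept, $\beta_1$ is the slope and $\varepsilon$ is the error term.
The slope $\beta_1$ tells how much $y$ changes when $x$ increases by 1 unit. The intercept $\beta_0$ is the predicted value of $y$ when $x = 0$.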
Least Squares
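Linear regression uses least squares to find the best line. The idea is to minimise the sum of squared errors (SSE) between observed values and fitted values:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Where $y_i$ is the observed value and $\hat{y}_i$ is the fitted value for observation $i$.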
The smaller the SSE, the closer the fitted line is to the observed data.
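The least-squares idea can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: the function names `fit_line` and `sse` and the toy data are assumptions, and the fit uses the standard closed-form solution for one predictor.

```python
def fit_line(x, y):
    """Return (intercept, slope) minimising the sum of squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = S_xy / S_xx; the intercept makes the line pass through (x_bar, y_bar).
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, roughly y = 2x plus noise.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
best_sse = sse(x, y, b0, b1)

# Any other line gives a larger SSE than the least-squares line.
worse_sse = sse(x, y, b0, b1 + 0.1)
```

Nudging the slope away from the fitted value increases the SSE, which is what "best line" means here.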
Assumptions
Simple least squares regression assumes that the relationship is approximately linear, x and y are numerical variables, the coefficients are calculated to minimise squared error, and errors are approximately normally distributed.
Residuals are the differences between observed values and fitted values: $e_i = y_i - \hat{y}_i$.
Ideally, residuals should be approximately normally distributed and centred around 0.
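As a small sketch of what "centred around 0" means, the snippet below fits a line to hypothetical toy data and computes the residuals. The data and variable names are assumptions for illustration. For a least-squares line with an intercept, the residuals always sum to zero up to rounding, so the useful checks are on their shape and spread.

```python
# Hypothetical toy data, roughly y = 2x plus noise.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least-squares estimates for one predictor.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed minus fitted

residual_mean = sum(residuals) / n
```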
Regression Diagnostics
Regression diagnostics help us check whether the model is reasonable.
A residual histogram can show whether the residuals are approximately normally distributed. A Q-Q plot is a better visual reference for checking normality.
The summary output gives the coefficients, p-values, residual standard error, $R^2$, adjusted $R^2$ and the F-statistic.
R-squared
$R^2$ measures how much of the variability in the response variable is explained by the model.
A higher $R^2$ means the model explains more of the variation in the response.
For simple linear regression:
$$R^2 = r^2$$
Where $r$ is the correlation between predictor and response.
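The link between the two quantities can be checked numerically. The sketch below uses the standard definition $R^2 = 1 - SSE/SST$, where $SST$ is the total sum of squares of the response around its mean. That formula is the usual textbook one and is not quoted from the slides; the toy data and variable names are assumptions.

```python
import math

# Hypothetical toy data, roughly y = 2x plus noise.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares (SST)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Unexplained variation: squared residuals around the fitted line.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / s_yy              # proportion of variation explained
r = s_xy / math.sqrt(s_xx * s_yy)       # Pearson correlation between x and y
```

For one predictor, `r_squared` and `r ** 2` agree up to floating-point rounding.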
P-value
The p-value for a coefficient tests whether that coefficient is significantly different from 0. If the p-value is smaller than 0.05, the predictor is usually considered statistically significant.
The F-statistic tests the overall significance of the regression. It tests whether at least one coefficient, other than the intercept, is not equal to 0.
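The test statistics behind these p-values can be sketched for simple linear regression as follows. This is an illustration under assumptions: the toy data and variable names are hypothetical, and the formulas are the standard textbook ones. Turning the statistics into p-values needs the t and F distributions, which are not in the Python standard library, so that step is left out.

```python
import math

# Hypothetical toy data, roughly y = 2x plus noise.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
sst = sum((yi - y_bar) ** 2 for yi in y)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Residual variance uses n - 2 degrees of freedom (two estimated coefficients).
residual_variance = sse / (n - 2)

# t-statistic for the slope: estimate divided by its standard error.
slope_se = math.sqrt(residual_variance / s_xx)
t_stat = slope / slope_se

# F-statistic: explained variation per predictor over residual variance.
f_stat = ((sst - sse) / 1) / residual_variance
```

With a single predictor the two tests are equivalent, and `f_stat` equals `t_stat ** 2`.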
Prediction
The fitted regression model can be used to predict values for new data. The model can also produce confidence intervals or prediction intervals.
A confidence interval describes the uncertainty in the estimated mean response. A prediction interval describes the uncertainty for a new individual observation, so it is wider than the confidence interval at the same point.
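For simple linear regression, the two intervals at a new point $x_0$ have the following standard textbook form. The notation is an assumption and is not taken from the slides: $\hat{y}_0$ is the fitted value at $x_0$, $s = \sqrt{SSE / (n - 2)}$ is the residual standard error, and $t^*$ is the critical value of the $t$ distribution with $n - 2$ degrees of freedom.

$$\text{Confidence interval: } \hat{y}_0 \pm t^* \, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

$$\text{Prediction interval: } \hat{y}_0 \pm t^* \, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

The extra $1$ under the square root accounts for the error of a single new observation, which is why the prediction interval is always the wider of the two.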
Multiple Linear Regression
Multiple linear regression has more than one predictor. The relationship can be written as:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$
Where each $\beta_j$ is the coefficient for predictor $x_j$.
The coefficient $\beta_j$ measures the effect of $x_j$ on $y$, holding other predictors constant. This is important because predictors may be related to each other.
Looking at each predictor individually may miss the combined effect of variables.
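The difference between the two views can be sketched with a small example. The code below fits a multiple regression by solving the normal equations in pure Python, then fits the first predictor on its own. The helper names `solve` and `fit_multiple` and the data are hypothetical; the data is generated exactly from `y = 1 + 2*x1 - 3*x2` with correlated predictors.

```python
def solve(a, b):
    """Solve the square linear system a * coef = b by Gaussian elimination."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def fit_multiple(rows, y):
    """Least-squares coefficients [b0, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data: x1 and x2 tend to increase together.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (3, 2)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coefs = fit_multiple(rows, y)          # both predictors together
simple = fit_multiple([(x1,) for x1, _ in rows], y)  # x1 on its own
simple_slope = simple[1]
```

With both predictors in the model the coefficients 1, 2 and -3 are recovered. With `x1` alone the slope is only about 0.61, because `x1` also absorbs part of the negative effect of the correlated `x2`.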
Qualitative Predictors
Qualitative predictors are categorical variables, such as eye colour, clarity, cut or colour. Regression needs numerical inputs, so categorical variables must be converted into indicator variables.
This is also called one-hot encoding or dummy coding.
When a categorical variable has $k$ levels, one level is used as the reference level and each of the other $k - 1$ levels gets its own indicator variable. The coefficients show the difference compared with the reference level. This is the same idea as treatment contrasts.
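A minimal sketch of indicator coding with a reference level is shown below. The function name `treatment_code` and the example values are assumptions for illustration, and defaulting to the first level in sorted order is a choice made here, not something stated in the lecture.

```python
def treatment_code(values, reference=None):
    """Encode a categorical variable as indicator columns, omitting the reference level."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # assumption: default to the first level in sorted order
    others = [lv for lv in levels if lv != reference]
    # One row per observation, one 0/1 column per non-reference level.
    rows = [[1 if v == lv else 0 for lv in others] for v in values]
    return others, rows


# Hypothetical categorical predictor with three levels.
cut = ["Good", "Ideal", "Fair", "Ideal", "Good"]
columns, encoded = treatment_code(cut, reference="Fair")
```

An observation at the reference level ("Fair" here) is encoded as all zeros, so its effect is absorbed by the intercept and the other coefficients are read as differences from it.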
Non-linear Data
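If the relationship is not linear, a transformation can help. For example, if $y$ grows exponentially with $x$, regressing $\log(y)$ on $x$ gives a straight line, and if both $x$ and $y$ grow very quickly or span several orders of magnitude, we can take the log of both.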
Log transformation can make a non-linear relationship closer to linear, which makes linear regression more suitable.
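As a sketch of the idea, the snippet below generates hypothetical data from an exact exponential curve $y = 3 e^{0.5x}$, takes the log of the response, and fits a straight line. The constants, data and function name are assumptions chosen for illustration.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: exact exponential growth, y = 3 * exp(0.5 * x).
x = [0, 1, 2, 3, 4, 5]
y = [3.0 * math.exp(0.5 * xi) for xi in x]

# On the log scale the relationship is linear: log(y) = log(3) + 0.5 * x.
log_y = [math.log(yi) for yi in y]
intercept, slope = fit_line(x, log_y)

growth_rate = slope               # estimate of 0.5
scale = math.exp(intercept)       # estimate of 3, back on the original scale
```

Predictions made on the log scale need to be transformed back with `math.exp` before they are compared with the original response.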
Model Selection
When there are many predictors, a model may be improved by reducing the number of inputs. This can remove irrelevant variables, reduce variability in predictions and make the model easier to interpret.
Methods include subset selection, ridge regression, LASSO and dimension reduction.
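To give a feel for one of these methods, the sketch below shows ridge regression with a single predictor, where the penalised slope has a simple closed form. This is an illustrative assumption rather than the lecture's own example: the function name `ridge_slope`, the penalty values and the data are hypothetical, and the intercept is left unpenalised.

```python
def ridge_slope(x, y, lam):
    """Slope minimising sum((y - b0 - b1*x)^2) + lam * b1^2 for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    # The penalty lam is added to the denominator, pulling the slope towards 0.
    return s_xy / (s_xx + lam)


# Hypothetical toy data, roughly y = 2x plus noise.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

penalties = [0, 1, 10, 100]
slopes = [ridge_slope(x, y, lam) for lam in penalties]
```

With a penalty of 0 the result is the ordinary least-squares slope, and larger penalties shrink it towards 0. LASSO uses an absolute-value penalty instead, which can set coefficients exactly to 0.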