11 March 2026
3152 Lecture 1
Introduction to Data Science
Lecture Slide: FIT3152 Lecture 01.pdf
Next: 3152 Lecture 2 - Visualising Data
Data Science
Data Science is an interdisciplinary field that uses statistics, computing, scientific methods, visualisation, algorithms and domain knowledge to extract knowledge from data. The data can be structured, unstructured, noisy, incomplete, large, or from different sources. The main purpose of data science is to understand real world phenomena through data.
Data science has become important because digital technologies let us collect and store data at very large scale. Many datasets were not collected to answer the research question at hand, so they often need cleaning, transformation and interpretation before analysis.
Common Themes
Data science problems are usually complex problems of social, scientific or business interest. The datasets are often large, messy, incomplete, heterogeneous, or created for another purpose. Data analysis often needs to combine different data sources and then use statistical, visualisation or machine learning methods.
Good visualisation is important because it helps us understand the data and communicate results clearly. In many real data science tasks, the result is only useful if it can be explained to other people.
Data Science Methods
Common data science methods include classification, regression, similarity matching, clustering, co-occurrence, profiling, link prediction, data reduction and causal modelling.
Classification means predicting a category or class. Regression means predicting a numerical value. Clustering means grouping data without known labels. Link prediction means predicting connections between entities. Data reduction means making a large dataset smaller while keeping important information.
Skills for Data Scientists
A data scientist needs to understand the problem from the client’s perspective, collect and clean data, manage and combine data, understand the data using summaries and visualisation, analyse or model the data, and communicate the result clearly.
The technical skills are important, but domain knowledge and communication are also important because the analysis needs to answer a real problem.
Data Science Process
The data science process is usually iterative. It often starts with understanding the problem, then collecting data, cleaning data, exploring and visualising data, analysing or modelling data, interpreting the result, and communicating the result.
The process may need to be repeated many times because the first analysis may reveal new problems in the data or raise new questions.
Basic Statistics
Descriptive statistics summarise a single variable. Common examples include the mean, median, variance, standard deviation, quantiles, range, minimum and maximum. These statistics describe the centre and spread of a variable.
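As a minimal sketch of these summaries, the Python snippet below computes them for one variable using only the standard library. The plant heights are made-up values for illustration, and the variable names are assumptions, not taken from the lecture.

```python
import statistics

# Hypothetical sample: heights (cm) of eight plants
heights = [12.0, 15.0, 11.0, 14.0, 18.0, 13.0, 15.0, 14.0]

summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    # Sample variance and standard deviation (n - 1 in the denominator)
    "variance": statistics.variance(heights),
    "sd": statistics.stdev(heights),
    "min": min(heights),
    "max": max(heights),
    "range": max(heights) - min(heights),
    # Three cut points that split the data into four equal-sized parts
    "quartiles": statistics.quantiles(heights, n=4),
}

for name, value in summary.items():
    print(name, value)
```

The mean and median describe the centre, while the variance, standard deviation, range and quartiles describe the spread.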
For multiple groups, descriptive statistics can be calculated separately for each group. This is useful when we want to compare groups, such as different countries, species, or categories.
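One possible way to compute statistics per group is sketched below in Python. The species names and petal lengths are invented for illustration. The idea is to split the records by group first and then summarise each group on its own.

```python
import statistics
from collections import defaultdict

# Hypothetical (species, petal length in cm) records
records = [
    ("setosa", 1.4), ("setosa", 1.5), ("setosa", 1.3),
    ("versicolor", 4.7), ("versicolor", 4.5), ("versicolor", 4.9),
]

# Split the values by group
groups = defaultdict(list)
for species, length in records:
    groups[species].append(length)

# Summarise each group separately
group_summary = {
    species: {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }
    for species, values in groups.items()
}

for species, stats in group_summary.items():
    print(species, stats)
```

Placing the group summaries side by side makes differences in centre and spread between groups easy to see.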
Bivariate Data
Bivariate data means data with two variables. We can use correlation to measure the strength and direction of a linear relationship between two numerical variables.
Here r is the correlation coefficient, which always lies between -1 and 1. If r is close to 1, there is a strong positive linear relationship. If r is close to -1, there is a strong negative linear relationship. If r is close to 0, there is a weak or no linear relationship.
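The sketch below computes the correlation coefficient directly from the usual Pearson definition: the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the study-hours data are assumptions made for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of products and squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and exam score for five students
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

print(pearson_r(hours, scores))                    # close to 1: strong positive
print(pearson_r(hours, [-2 * h for h in hours]))   # exactly linear, decreasing
```

The coefficient only captures linear association, so a value near 0 does not rule out a strong non-linear relationship.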
Regression can be used to model the relationship between one response variable and one or more predictors.
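For the simplest case, one response and one predictor, a least-squares line can be fitted by hand. The Python sketch below assumes ordinary least squares, which the notes do not name explicitly. The function name fit_line and the data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: hours studied (predictor) and exam score (response)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

intercept, slope = fit_line(hours, scores)
predicted = intercept + slope * 6  # predicted score for 6 hours of study
print(intercept, slope, predicted)
```

The slope estimates how much the response changes for a one-unit increase in the predictor. With more than one predictor the same idea applies, but the fit is normally done with a statistics package.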
Hypothesis Testing
Hypothesis testing is used to test a claim about data. For example, we can compare means or generate confidence intervals.
The p-value is the probability of observing a result at least as extreme as the one actually observed, assuming the null hypothesis is true. A small p-value, conventionally below 0.05, is taken as evidence against the null hypothesis; the smaller the p-value, the stronger that evidence.
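To make the idea of a p-value concrete, the sketch below compares two group means with an exact permutation test. This is a swapped-in technique chosen because it needs no distribution tables. It is not necessarily the test used in the lecture, which may well use a t-test. The data and the function name are assumptions. The null hypothesis is that the group labels do not matter, so every way of splitting the pooled values into two groups is equally likely.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Two-sided exact permutation test for a difference in means."""
    pooled = a + b
    indices = range(len(pooled))
    observed = abs(mean(a) - mean(b))
    extreme = 0
    total = 0
    # Enumerate every way of choosing which values belong to the first group
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        new_a = [pooled[i] for i in chosen]
        new_b = [pooled[i] for i in indices if i not in chosen_set]
        # Count splits at least as extreme as the observed one
        # (small tolerance guards against floating-point rounding)
        if abs(mean(new_a) - mean(new_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical measurements for two groups
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.4, 4.8, 4.1, 4.6]

p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

Here the p-value is the fraction of all possible splits whose difference in means is at least as large as the observed difference. Enumerating every split is only practical for small samples. For larger ones a random sample of splits is normally used instead.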
Time Series
Time series data is data collected over time. For example, monthly sales data is a time series.
Time series data can contain trend, seasonal pattern and random noise. Trend shows long-term movement, seasonal pattern repeats at regular time intervals, and random noise is the irregular part of the data.
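The three components can be illustrated with a small synthetic example. The Python sketch below builds a quarterly series from a known trend, seasonal pattern and noise, then estimates the trend with a centred moving average over one seasonal period. All numbers and names are invented for illustration, and this simple additive approach is only one of several decomposition methods.

```python
def centred_moving_average(series, period):
    """Trend estimate: centred moving average over one (even) seasonal period."""
    if period % 2 != 0:
        raise ValueError("this sketch only handles an even seasonal period")
    half = period // 2
    trend = [None] * len(series)  # no estimate at the two ends of the series
    for t in range(half, len(series) - half):
        window = series[t - half : t + half + 1]
        # End points get half weight so the window spans exactly one period
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    return trend

# Hypothetical quarterly sales: linear trend + seasonal pattern + noise
period = 4
seasonal_pattern = [3.0, -1.0, -4.0, 2.0]  # sums to zero over one year
noise = [0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -0.3, 0.2, -0.1, 0.1, 0.0, -0.2]
series = [
    10.0 + 0.5 * t + seasonal_pattern[t % period] + noise[t]
    for t in range(12)
]

trend = centred_moving_average(series, period)

# Remove the trend, then average what is left within each quarter
detrended = [
    (t % period, y - m)
    for t, (y, m) in enumerate(zip(series, trend))
    if m is not None
]
seasonal_estimate = [
    sum(d for q, d in detrended if q == quarter)
    / sum(1 for q, _ in detrended if q == quarter)
    for quarter in range(period)
]

print(trend)
print(seasonal_estimate)
```

Averaging over a full seasonal period cancels the seasonal pattern, so what remains is the long-term movement. Whatever is left after removing both the trend and the seasonal estimate is treated as random noise.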