25 March 2026
3152 Lecture 3
Data Manipulation and EDA
Lecture Slide: FIT3152 Lecture 03.pdf
Data Manipulation
Data manipulation means changing data into a format that is easier to analyse. This can include summarising, grouping, creating new columns, selecting rows, combining data frames and reshaping data.
Before doing any analysis, it is useful to get a high-level view of the data. For example, we need to know which columns are numerical, which are categorical, and whether some variables naturally belong together.
Summarising by Groups
Sometimes we need to calculate statistics for each group. For example, we may want to calculate the average value for each species, country, gender, or category.
Grouped summaries are useful because the overall average can hide important differences between groups.
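A minimal sketch of a grouped summary in base R. The built-in iris dataset and the choice of Petal.Length are assumptions made for illustration, not taken from the lecture:

```r
# iris ships with R; it is used here only as example data.
# Overall mean of one numerical column.
overall_mean <- mean(iris$Petal.Length)

# Mean of the same column within each species.
group_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)

print(overall_mean)
print(group_means)
```

Comparing the single overall mean with the per-species means shows how much the groups differ.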
Correlation Matrix
A correlation matrix shows the pairwise correlations between several numerical variables. It is useful for quickly checking which variables are related.
If the correlation is close to 1, the two variables have a strong positive linear relationship. If it is close to -1, they have a strong negative linear relationship. If it is close to 0, the linear relationship is weak, although a non-linear relationship may still exist.
Correlation can also be calculated by group. For example, the correlation between two variables may be different for different species or different countries.
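The sketch below computes a correlation matrix and then a correlation within each group, using base R. The iris dataset and the pair of variables chosen are assumptions for illustration:

```r
# Correlation matrix of the four numerical columns of iris (example data).
cor_all <- cor(iris[, 1:4])
print(round(cor_all, 2))

# Correlation between two variables, computed separately for each species.
cor_by_species <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)
print(round(cor_by_species, 2))
```

In this example data the overall correlation between sepal length and sepal width is negative, while the correlation within every species is positive, so the grouped view can change the conclusion.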
Creating New Columns
New variables can be created from existing variables. This is useful when the original variables do not directly show the pattern we want.
For example, an aspect ratio can be calculated by dividing one dimension by another, such as length divided by width.
Creating new columns can make relationships easier to see, especially when a derived variable has more meaning than the original raw measurements.
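As a sketch of a derived column, the code below adds a ratio of length to width in base R. The column name Petal.Ratio and the use of the iris petal measurements are hypothetical choices for illustration; the lecture's own variables may differ:

```r
# Work on a copy so the built-in iris data stays unchanged.
flowers <- iris

# Derived variable: petal length divided by petal width.
flowers$Petal.Ratio <- flowers$Petal.Length / flowers$Petal.Width

print(head(flowers[, c("Species", "Petal.Length", "Petal.Width", "Petal.Ratio")]))
```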
Merging Data Frames
Data frames can be combined by matching rows on a common column, which acts as the key. This is useful when different summary results need to be combined into one table.
For example, if two summary tables both contain the same group variable, they can be merged by that group variable.
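A minimal sketch of merging two summary tables on their shared group column with base R's merge(). The tables and their column names are made up for illustration from the iris dataset:

```r
# Two separate summaries of the same variable, both keyed by Species.
mean_tbl <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
names(mean_tbl)[2] <- "Mean.Petal.Length"

sd_tbl <- aggregate(Petal.Length ~ Species, data = iris, FUN = sd)
names(sd_tbl)[2] <- "SD.Petal.Length"

# Merge on the common column so each species has one row with both statistics.
summary_tbl <- merge(mean_tbl, sd_tbl, by = "Species")
print(summary_tbl)
```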
dplyr
dplyr is a grammar of data manipulation. It gives a consistent set of verbs for common data manipulation tasks: selecting rows (filter), selecting columns (select), creating new columns (mutate), grouping rows (group_by), creating summaries (summarise) and sorting rows (arrange).
The main advantage is that the data manipulation process becomes easier to read and easier to repeat.
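The sketch below applies the main dplyr verbs one step at a time. It assumes the dplyr package is installed; the iris dataset, the threshold and the new column names are illustrative choices only:

```r
library(dplyr)

# filter(): keep rows that satisfy a condition.
big_sepals <- filter(iris, Sepal.Length > 5)

# select(): keep only some columns.
petals <- select(big_sepals, Species, Petal.Length, Petal.Width)

# mutate(): create a new column from existing ones.
with_ratio <- mutate(petals, Petal.Ratio = Petal.Length / Petal.Width)

# group_by() + summarise(): one summary row per group.
by_species <- group_by(with_ratio, Species)
ratio_summary <- summarise(by_species,
                           Mean.Ratio = mean(Petal.Ratio),
                           Count = n())

# arrange(): sort rows, here from largest to smallest mean ratio.
ratio_sorted <- arrange(ratio_summary, desc(Mean.Ratio))
print(ratio_sorted)
```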
Pipe
The pipe operator (%>%) connects multiple data manipulation steps together. It can be read as "then do this".
This makes the workflow more readable because the output of one step becomes the input of the next step.
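To illustrate, here is the same calculation written with nested calls and with the pipe. This is a sketch assuming dplyr is installed (it provides %>%); the iris dataset and the column name Mean.Sepal.Length are illustrative:

```r
library(dplyr)

# Nested form: has to be read from the inside out.
nested <- arrange(
  summarise(group_by(iris, Species), Mean.Sepal.Length = mean(Sepal.Length)),
  desc(Mean.Sepal.Length)
)

# Piped form: reads top to bottom as "take iris, then group, then summarise, then sort".
piped <- iris %>%
  group_by(Species) %>%
  summarise(Mean.Sepal.Length = mean(Sepal.Length)) %>%
  arrange(desc(Mean.Sepal.Length))

print(piped)
```

Both forms produce the same table; the pipe only changes how the steps are written.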
Tibble
A tibble is similar to a data frame, but it prints more cleanly. Some tidyverse tools return tibbles instead of ordinary data frames.
The main idea is not the object itself, but that a tidy table should make the structure clear: each variable is a column, each observation is a row, and each value is a single cell.
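A short sketch of converting a data frame to a tibble, assuming the tibble package is installed. The iris dataset is used only as example data:

```r
library(tibble)

# Convert an ordinary data frame to a tibble.
iris_tbl <- as_tibble(iris)

print(class(iris))      # "data.frame"
print(class(iris_tbl))  # "tbl_df" "tbl" "data.frame"

# Printing a tibble shows only the first rows, plus each column's type.
print(iris_tbl)
```

Because a tibble still inherits from data.frame, functions that expect a data frame generally accept it.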
Indexing and Subsetting
Indexing and subsetting mean selecting part of the data based on conditions. For example, we can select one group, combine several groups, or remove rows that do not satisfy a condition.
Logical conditions are important because they allow us to focus the analysis on a relevant subset of data.
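The three cases above can be sketched in base R as follows. The iris dataset, the species names and the threshold are assumptions for illustration:

```r
# Select one group: rows where the condition is TRUE are kept.
setosa <- iris[iris$Species == "setosa", ]

# Combine several groups with %in%.
two_species <- iris[iris$Species %in% c("setosa", "versicolor"), ]

# Remove rows that do not satisfy a condition (keep only petals of 1.5 or longer).
long_petals <- subset(iris, Petal.Length >= 1.5)

# Conditions can be combined with & (and) or | (or).
long_setosa <- iris[iris$Species == "setosa" & iris$Petal.Length >= 1.5, ]

print(nrow(setosa))
print(nrow(two_species))
print(nrow(long_petals))
print(nrow(long_setosa))
```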
Compact Graphics
Compact graphics can show multiple variables or multiple groups in one figure. Common examples include side-by-side boxplots, heatmaps and correlation matrix plots.
Side-by-side boxplots show the distribution of one or more variables for each group. A heatmap shows values using colour, usually across two categorical dimensions. A correlation matrix plot shows pairwise correlations, often using colour and size to represent the strength of each correlation.
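A rough sketch of two compact graphics using base R plotting functions. The iris dataset, titles and plot options are illustrative assumptions; the lecture may use a different plotting package:

```r
# Side-by-side boxplots: one box per species for a single variable.
bp <- boxplot(Petal.Length ~ Species, data = iris,
              main = "Petal length by species",
              ylab = "Petal length")

# Heatmap of a correlation matrix: colour encodes each pairwise correlation.
# Rowv = NA and Colv = NA switch off the clustering dendrograms,
# and scale = "none" keeps the correlations as they are.
cor_mat <- cor(iris[, 1:4])
heatmap(cor_mat, Rowv = NA, Colv = NA, scale = "none", symm = TRUE,
        main = "Correlation heatmap")
```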
Long Format
Some plots need data in long format. Long format means one column stores the variable name and another column stores the value.
This is useful because it allows one plot to show many variables using the same visual structure.
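A minimal sketch of reshaping from wide to long format, assuming the tidyr package is installed. The iris dataset and the new column names Measurement and Value are hypothetical choices:

```r
library(tidyr)

# Wide format: one column per measurement (4 measurement columns + Species).
print(dim(iris))

# Long format: one column holds the variable name, another holds the value.
iris_long <- pivot_longer(iris,
                          cols = -Species,
                          names_to = "Measurement",
                          values_to = "Value")

print(head(iris_long))
print(dim(iris_long))
```

Each original row becomes four rows, one per measurement, which is the shape that a grouped or faceted plot usually expects.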
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the process of using summaries, plots and simple models to understand data. EDA is usually an iterative cycle.
The cycle is: visualise, transform, model, then ask better questions. The first graph or model may not tell us much, but it can help refine the next question or the next graph.
Variation
Variation means how the values of one variable change from observation to observation. For one numerical variable, we can use a histogram or a boxplot. For one categorical variable, we can use a bar plot.
We should look for typical values and unusual values. Unusual values may be real outliers, data errors, or important cases that need further investigation.
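The sketch below looks at variation in one numerical and one categorical variable with base R. The iris dataset is example data, and flagging points with the boxplot rule (beyond 1.5 times the interquartile range from the box) is just one common convention, not necessarily the lecture's:

```r
x <- iris$Sepal.Width

# Typical values: centre and spread.
print(summary(x))
hist(x, main = "Sepal width", xlab = "Sepal width")

# Unusual values: points that a boxplot would draw beyond its whiskers.
unusual <- boxplot.stats(x)$out
print(unusual)

# One categorical variable: counts per category, shown as a bar plot.
species_counts <- table(iris$Species)
barplot(species_counts, main = "Observations per species")
```

Values flagged this way are candidates for a closer look; whether they are errors or genuine cases still needs judgement.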
Covariation
Covariation means how two variables change together. For one categorical and one numerical variable, boxplots or density plots by group can be useful. For two categorical variables, count tables or stacked bar plots can be useful. For two numerical variables, scatterplots can be useful.
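As a sketch of the categorical-categorical and numerical-numerical cases, the code below uses base R with the built-in mtcars dataset. The dataset and the choice of columns (cyl, gear, wt, mpg) are assumptions for illustration:

```r
# Two categorical variables: a count table of cylinders against gears.
counts <- table(Cylinders = mtcars$cyl, Gears = mtcars$gear)
print(counts)

# The same counts as a stacked bar plot (one bar per number of gears).
barplot(counts, xlab = "Gears", ylab = "Count", legend.text = TRUE)

# Two numerical variables: a scatterplot of weight against fuel economy.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# A single number summarising the linear part of that relationship.
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)
print(wt_mpg_cor)
```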
EDA helps us find patterns, relationships, missing values and possible outliers before doing formal modelling.