18 March 2026

3152 Lecture 2

Visualising Data

FIT3152, Class Note, English

Data visualisation is used to understand data and communicate results. A good graph should make the message in the data easier to see.

When reading a graph, ask what information is shown, what message the data conveys, how the information is encoded, how many variables are displayed, and how the space is used.

Visual elements such as position, colour, size, shape and transparency can each represent a variable. Choosing the right visual element helps the reader understand the data faster.

Visualization Zoo

The Visualization Zoo groups common visualisation types into major families, such as maps, time series, statistical distributions, hierarchies and networks.

Maps are useful for spatial data. Time series plots are useful for data over time. Statistical distribution plots are useful for showing spread, shape and outliers. Hierarchies are useful for tree-like structures. Networks are useful for showing connections between entities.

Getting to Know a Dataset

Before plotting or modelling, we need to understand the basic structure of the dataset. Important things to check include the number of rows and columns, the variable names, the data types, missing values, the range of each numerical variable, and the levels of each categorical variable.
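The checks above can be sketched in a few lines. The lecture works in R; this sketch uses plain Python with a made-up four-row dataset, and the function name describe_structure is hypothetical, not something from the lecture.

```python
# Hypothetical toy dataset: each dict is one row, None marks a missing value.
rows = [
    {"species": "setosa", "petal_length": 1.4, "petal_width": 0.2},
    {"species": "versicolor", "petal_length": 4.7, "petal_width": None},
    {"species": "virginica", "petal_length": 6.0, "petal_width": 2.5},
    {"species": "setosa", "petal_length": 1.3, "petal_width": 0.2},
]


def describe_structure(rows):
    """Return row/column counts, column names, types and missing-value counts."""
    columns = list(rows[0].keys())
    types = {}
    missing = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] is not None]
        # Use the type of the first non-missing value as the column type.
        types[col] = type(present[0]).__name__ if present else "unknown"
        missing[col] = len(rows) - len(present)
    return {
        "n_rows": len(rows),
        "n_cols": len(columns),
        "columns": columns,
        "types": types,
        "missing": missing,
    }


info = describe_structure(rows)
print(info)
```

Running this kind of check first shows, for example, that one petal_width value is missing before any plot is drawn.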

A summary of the data helps us understand the dataset quickly. For numerical variables, a summary usually includes the minimum, first quartile, median, mean, third quartile and maximum. For categorical variables, it usually gives the levels and the count of each level.
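As a rough illustration of what such a summary contains, here is a minimal Python sketch using only the standard library. The lecture itself uses R; the helper names numeric_summary and categorical_summary and the sample values are assumptions made for this example. Quartile conventions differ between tools, so results on other data may not match another package exactly.

```python
from collections import Counter
import statistics


def numeric_summary(values):
    """Minimum, quartiles, median, mean and maximum of a list of numbers."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


def categorical_summary(values):
    """Count how often each level occurs."""
    return dict(Counter(values))


lengths = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up numerical variable
species = ["a", "b", "a", "c", "a"]   # made-up categorical variable

print(numeric_summary(lengths))
print(categorical_summary(species))
```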

Selecting Rows and Columns

Rows and columns can be selected to inspect part of a dataset. This is useful when we only need some observations, some variables, or a specific subset.

Selecting rows means choosing observations, and selecting columns means choosing variables. In R's data[rows, columns] indexing, leaving the row or column position blank keeps all rows or all columns.
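The same idea can be sketched outside R. In the Python example below, the select function and the cars data are hypothetical. Passing None for either argument plays the role of a blank selection and keeps all rows or all columns.

```python
def select(rows, row_filter=None, columns=None):
    """Keep rows that pass row_filter and only the named columns.

    Leaving an argument as None keeps all rows or all columns.
    """
    kept = [r for r in rows if row_filter is None or row_filter(r)]
    if columns is None:
        return [dict(r) for r in kept]
    return [{c: r[c] for c in columns} for r in kept]


# Made-up dataset: one dict per observation.
cars = [
    {"model": "A", "cyl": 4, "mpg": 30.1},
    {"model": "B", "cyl": 6, "mpg": 21.0},
    {"model": "C", "cyl": 4, "mpg": 26.5},
]

four_cyl = select(cars, row_filter=lambda r: r["cyl"] == 4)    # some rows, all columns
model_mpg = select(cars, columns=["model", "mpg"])             # all rows, some columns
efficient = select(cars, lambda r: r["mpg"] > 25, ["model"])   # a specific subset
```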

Base Graphics

Base graphics is the basic graphing system in R. High-level graphics functions, such as plot() and hist(), create a new graph, while low-level graphics functions, such as points(), lines() and legend(), add details to an existing graph.

Common graph types include scatterplots, scatterplot matrices, histograms, stem-and-leaf plots, boxplots, bar plots and dot plots. Details such as a title, axis labels, colours, point shapes, line types and a legend can be added to make the graph easier to read.

Common Plots

A boxplot compares a numerical variable across groups. A scatterplot shows the relationship between two numerical variables. A histogram shows the distribution of one numerical variable. A bar plot shows counts or values for categories.

Colour and point shape can be used to show categorical variables. Jitter adds a small random displacement to each point so that overlapping points become visible. A legend explains what each colour, shape or other visual mapping means.
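To make the jitter idea concrete, here is a minimal Python sketch. It is not the lecture's R code. The function name jitter, the default amount of 0.2 and the sample ratings are assumptions chosen for illustration.

```python
import random


def jitter(values, amount=0.2, seed=None):
    """Add a small uniform random offset in [-amount, amount] to each value."""
    rng = random.Random(seed)  # a seed makes the jitter reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Made-up data: four observations share the same value and would overlap.
ratings = [3.0, 3.0, 3.0, 3.0, 4.0]
jittered = jitter(ratings, amount=0.2, seed=1)
print(jittered)
```

The offset should stay small relative to the spacing between real values, otherwise the jitter distorts the data instead of just separating the points.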

Grammar of Graphics

The Grammar of Graphics treats a graph as a combination of structured components: data, aesthetic mappings, geometric objects, scales, faceting, position adjustments and annotations.

This is useful because complex graphics can be built step by step by adding layers.

Aesthetic Mapping

Aesthetic mapping means mapping variables to visual elements. For example, one variable can be mapped to x position, another to y position, and a third to colour or size.

This allows one graph to show multiple variables at the same time.
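One way to picture an aesthetic mapping is as a function that turns each data row into a description of a mark. The Python sketch below is only an illustration of that idea, not how any plotting library implements it. The names map_aesthetics and PALETTE, the colour names and the data are all hypothetical.

```python
PALETTE = ["blue", "orange", "green"]  # hypothetical colour names


def map_aesthetics(rows, x, y, colour):
    """Map two numerical variables to position and one categorical variable to colour."""
    levels = sorted({r[colour] for r in rows})
    # Each level of the categorical variable gets its own colour.
    colour_of = {level: PALETTE[i % len(PALETTE)] for i, level in enumerate(levels)}
    return [{"x": r[x], "y": r[y], "colour": colour_of[r[colour]]} for r in rows]


# Made-up data with three variables.
people = [
    {"height": 170, "weight": 65, "sex": "F"},
    {"height": 182, "weight": 80, "sex": "M"},
    {"height": 165, "weight": 58, "sex": "F"},
]

marks = map_aesthetics(people, x="height", y="weight", colour="sex")
print(marks)
```

Each mark now carries three variables at once: two through position and one through colour.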

Geom

A geom is the geometric object used to draw the data, and it determines the type of plot. For example, points are used for scatterplots, bars are used for bar charts, boxes are used for boxplots, and tiles are used for heatmaps.

Choosing the correct geom depends on the type of variables and the question we want to answer.

Faceting

Faceting splits one plot into multiple small plots based on a categorical variable. This is useful when we want to compare the same pattern across different groups without putting too much information into one graph.
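Underneath, faceting is a grouping step: the rows are split by the categorical variable, and the same plot is then drawn once per group. A minimal Python sketch of the splitting step follows. The function name facet and the data are hypothetical, and no plotting is done.

```python
from collections import defaultdict


def facet(rows, by):
    """Split rows into one panel per level of the categorical variable `by`."""
    panels = defaultdict(list)
    for r in rows:
        panels[r[by]].append(r)
    return dict(panels)


# Made-up data: the same two variables measured in two groups.
measurements = [
    {"site": "north", "x": 1, "y": 2.0},
    {"site": "south", "x": 1, "y": 3.5},
    {"site": "north", "x": 2, "y": 2.4},
    {"site": "south", "x": 2, "y": 3.9},
]

panels = facet(measurements, by="site")
for level, panel_rows in panels.items():
    print(level, [(r["x"], r["y"]) for r in panel_rows])
```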

Compact Graphics

Some graphics can display many variables compactly. Examples include side-by-side boxplots, heatmaps and correlation matrices.

Side-by-side boxplots compare distributions across multiple factors. Heatmaps use colour to represent values over two dimensions. A correlation matrix displays the pairwise correlations between multiple variables.
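The numbers behind a correlation matrix can be computed directly. The Python sketch below uses the Pearson correlation coefficient, which is an assumption, since the notes do not say which correlation measure is meant. The function names and data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping variable name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up variables: y rises with x, z falls as x rises.
data = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

A plotted correlation matrix then maps each of these values to a colour, in the same way as a heatmap.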
