Solution: Data visualisation - The Complete Academic Support Hub

Jones, 2019
Data visualisation
Data visualisation is an important tool for spotting trends and patterns in your data, for handling
large datasets and, most importantly, for showing your data to others through a poster, scientific
report, infographic or other means. In a world of big data, being able to see and interpret your data
is extremely important. This workshop aims to help you learn the skills to visualise your data in a
number of ways. There are a few examples here, but there are many more which you can look at in
your own time.
This workshop demonstrates the code you need for various visualisation and data wrangling
techniques. However, there are also tips and extra things for you to do. Don’t just copy and paste the
code blindly; have a go at the extra bits added in the document, e.g. adding your own legends or
looking at other variables. After all, these are the techniques you will need for your poster assignment.
Code is highlighted in grey to make it easier for you and tips are shown in red.
This workshop is based on material from Joque & Simao (2017) and Thomas et al. (2013).
Quick R tips
Each command or function can have parameters defined by the user. A few quick tips:
• Any text preceded by a # is a comment to explain the code, and is not run by R.
• Use an R script to write your code. This can be saved and tidied up to keep nice, clean
code. Additionally, you can annotate your code using the # described above. You will be
able to, and should, print these R scripts out for the exam. I am NOT expecting you to
remember code!
• To find information about a function and its parameters in R, precede the function with ? in
the console to get help:
?sum
• When you run into errors with R, your best resource is often the global community of R
users. Simply copy and paste your error or question into your search bar and browse forums
where people have posed similar questions. (Stack Overflow is a popular one.)
• To begin getting a feel for R, run the following lines of helpful tricks one at a time in your R
environment:
colours() #list all of R’s colours
palette() #show the current order of colours
palette(rainbow(6)) #set colours
palette() #shows how current order of colours has been re-defined
palette(c("green", "brown", "red")) #set user defined colours
palette() #shows how current order of colours has been re-defined
Data visualisation workshop
Uploading data to R
To complete this tutorial, first download the census data in the CANVAS data visualisation module.
Next, upload the data file using the methods described in workshops 1 & 2. You have learnt a number
of different ways to do this.
With the census data uploaded into the R environment, run the following lines of code to explore
your dataset:
View(census) #view the data table in its entirety
names(census) #see all header (column) names
head(census) #see the headers plus the first six rows of data within each header
summary(census) #see some summary statistics of each column
To select a column of data, preface the name of the desired column with the name of its data set (in
this case, census) and a $:
census$rent
census$region
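A column can also be selected with double square brackets, which is handy when the column name is stored in a variable. The sketch below uses a small made-up data frame, toy_census (an assumption for illustration, not the workshop data), so it can be run before the real file is loaded.

```r
# Toy data frame standing in for the census data (all values are made up)
toy_census <- data.frame(
  region = c("upper", "central", "lower", "lower"),
  rent   = c(450, 600, 750, 900)
)

rent_dollar   <- toy_census$rent        # select a column with $
rent_brackets <- toy_census[["rent"]]   # select the same column with [[ ]]

col_name     <- "rent"                  # the column name can be held in a variable
rent_by_name <- toy_census[[col_name]]

identical(rent_dollar, rent_brackets)   # both approaches return the same vector
mean(toy_census$rent)                   # a selected column can be passed to any function
```

Both forms return a plain vector, so either can be used inside the plotting functions later in this workshop.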
Before visualisation: Understanding your data
Before visualizing data, it’s important to:
a. define your research question,
b. understand the underlying data you’re working with, and
c. prepare the appropriate data as needed for visualization.
(a.) The first step of defining your research question may seem trivial, but it guides many of your
visualization decisions. Knowing your research question allows you to identify the variables relevant
to your visualization or analysis, as well as how you think they’ll relate to each other.
(b.) Once you’ve defined your research question, you can identify the number of variables you’re
interested in comparing—i.e. the dimensionality of your analysis—as well as the type of
data contained in each variable of interest—e.g. is it continuous or discrete?
(c.) Sometimes an extra step is required, where the data need to be organised or re-structured before
they can be visualised.
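One way to carry out step (b.) in R is to ask for the type of every column. The sketch below is a minimal example on a made-up data frame (toy_census and its column names are assumptions, not the workshop data); str(), class() and is.numeric() are base R functions.

```r
# Toy data frame with a mix of column types (all values are made up)
toy_census <- data.frame(
  county     = c("A", "B", "C", "D"),
  region     = factor(c("upper", "central", "lower", "lower")),
  population = c(12000L, 54000L, 230000L, 98000L),
  rent       = c(450.5, 600, 750.25, 900),
  stringsAsFactors = FALSE
)

str(toy_census)                                # one line per column, showing its type
col_types  <- sapply(toy_census, class)        # named vector of column classes
is_numeric <- sapply(toy_census, is.numeric)   # TRUE for columns that hold numbers
table(toy_census$region)                       # counts per category of a discrete variable
```

Numeric columns are candidates for histograms and scatterplots, while factor or character columns are usually used to group the data.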
Organising and re-structuring data in R
Often the data you receive is not formatted for visualisation, or needs to be transformed. In such
cases, R has a variety of functions that can help.
i. Mapping to a Different Scale
Returning to the census data set you’ve uploaded, let’s say we’re interested in visualizing the rent
data on a scale of 1-10. If our rent data is not already on this scale, we can easily map it to this scale
in R.
To check the current scale of the rent data, run the following code to find the current maximum and
minimum, or simply the range, of the data.
max(census$rent) #find maximum value of rent
min(census$rent) #find minimum value of rent
range(census$rent) #find range of rent
The following formula shows how a value is recalculated on the new scale:
# (new_max - new_min) * ([value] - lowest_value) / (highest_value -
# lowest_value) + new_min
Therefore, the code you would use to re-scale based on a 1-10 scale would be:
(10 - 1) * ((census$rent - min(census$rent))/(max(census$rent) - min(census$rent))) + 1
The following code now adds our re-scaled rent data as a new column labelled ‘rent_10’ to
our census data set:
# it is easy to add a new column: census$newcol <- [formula]
census$rent_10 <- (10 - 1) * ((census$rent - min(census$rent))/(max(census$rent) - min(census$rent))) + 1
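If you need to re-scale more than one column, the formula can be wrapped in a small function. The sketch below is one possible way to do this; rescale_to is a hypothetical helper name (not a built-in R function) and the rent values are made up.

```r
# Hypothetical helper: linearly map x onto the interval [new_min, new_max]
rescale_to <- function(x, new_min = 1, new_max = 10) {
  old_min <- min(x, na.rm = TRUE)   # na.rm = TRUE ignores missing values
  old_max <- max(x, na.rm = TRUE)
  (new_max - new_min) * (x - old_min) / (old_max - old_min) + new_min
}

toy_rent    <- c(400, 550, 700, 1000)   # made-up rent values
toy_rent_10 <- rescale_to(toy_rent)     # re-scaled to 1-10
range(toy_rent_10)                      # the smallest value maps to 1, the largest to 10
```

With the real data this would be called as rescale_to(census$rent). Note that if every value in the column is identical, the formula divides by zero and returns NaN.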
ii. Selecting Subsets
Another very useful data re-structuring technique involves extracting a subset of existing data. Try
the following to subset the census data into 3 different unique data sets, each containing
information about one ‘region’:
upper <- subset(census, region == "upper") #subsets for where region equals upper
View(upper) #compare how newly created 'upper' differs from 'census' data
central <- subset(census, region == "central")
View(central)
lower <- subset(census, region == "lower")
View(lower)
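When there are many groups, writing one subset() call per group becomes repetitive. As an alternative sketch, base R's split() creates all the subsets in one step. The data frame below is made up for illustration (toy_census is an assumption, not the workshop data).

```r
# Toy data frame standing in for the census data (all values are made up)
toy_census <- data.frame(
  county = c("A", "B", "C", "D", "E"),
  region = c("upper", "central", "lower", "lower", "upper"),
  rent   = c(450, 600, 750, 900, 500)
)

by_region <- split(toy_census, toy_census$region)  # named list, one data frame per region
names(by_region)                                   # the region names found in the data
toy_upper <- by_region[["upper"]]                  # same rows as subset(toy_census, region == "upper")
nrow(toy_upper)                                    # number of counties in the upper region
```

Each element of the list is an ordinary data frame, so by_region[["lower"]]$rent can be used wherever lower$rent is used in this workshop.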
iii. Scaling and normalising
R provides a variety of ways to transform data. Below are a few examples:
A) Transforming and adding data
census$population_transf <- (census$population)^5 #Raises population to the power of 5
census$population_transf <- exp(census$population) #Raises the constant e to the power of population
census$population_transf <- cos(census$population) #Calculates the cosine of population
census$population_transf <- abs(census$population) #Finds the absolute value of population
census$population_transf <- (census$population) * 10 #Multiplies population by 10
N.B. These will continue to over-write each other unless you change the name of the new
column header e.g. population_transf1, population_transf2 etc.
B) Log transformations
log(census$population) #Computes log base e of population
log(census$population, 2) #Computes log base 2 of population
log(census$population, 10) #Computes log base 10 of population
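R also has shorthand functions for the two most common bases, and log1p() for data that contain zeros. The sketch below uses made-up values (toy_population is an assumption, not the workshop data) to show how they behave.

```r
toy_population <- c(0, 10, 1000, 100000)   # made-up values, including a zero

log(toy_population)                  # log(0) is -Inf, which cannot be drawn on a plot
log10_pop <- log10(toy_population)   # shorthand for log(x, 10)
log2_pop  <- log2(toy_population)    # shorthand for log(x, 2)
log1p_pop <- log1p(toy_population)   # log(1 + x): stays finite when x is 0
```

Whether log1p() is an appropriate substitute depends on your data and research question, so treat it as an option to consider rather than a default.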
C) Creating new variables from old variables
Create a variable of population density by dividing population size by area
census$pop_density <- census$population/census$area
Log transform your new variable
log(census$pop_density)
Visualising ‘simple’ data (1-3 variables)
Before starting to visualize our census data, a quick recap of the questions you should have
answered before visualising:
• Have you defined your research question?
• (so that you know your variables of interest)
• Do you know the dimensionality and type(s) of data in your data set?
• (so that you’re aware of the types of visualization available to you)
• Have you organized your data as needed?
• (so that your data is ready to be used)
If so, you’re ready to visualize away!
i. Visualising 1 variable
Often it’s useful to visualise a single variable to understand the size and shape of its distribution,
and to observe any outliers. Histograms and boxplots are common ways to visualise single
continuous variables, both of which are easily done in R.
#Histograms
Histogram plots are produced with the hist() function in R, and can be made more readable by
adding additional parameters within the function. Run the lines below to see the differences:
hist(census$unemploy) # default graph, without labels
hist(census$unemploy, xlab = "Unemployment", main = "Histogram of Unemployment by County\nin Michigan") # inserting better x-axis label (xlab parameter) and title (main parameter)
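An additional useful parameter is breaks, which allows users to increase or decrease the bin-size of
the histogram. When breaks is a single number, R treats it as a suggestion and chooses rounded break
points close to it, so the number of bins drawn may not match exactly. Notice how the choice of
breaks alters the visualisation. When looking at your data, you will need to decide what an
appropriate ‘binning’ size is.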
hist(census$unemploy, xlab = "Unemployment", main = "Histogram of Unemployment by County; 6 breaks", breaks = 6)
hist(census$unemploy, xlab = "Unemployment", main = "Histogram of Unemployment by County; 20 breaks", breaks = 20)
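To see which bins R actually uses for a given value of breaks, you can ask hist() to return its calculations without drawing anything. The sketch below uses randomly generated values (toy_unemploy is an assumption, not the workshop data); plot = FALSE is a standard argument of hist().

```r
set.seed(1)                                    # make the made-up data reproducible
toy_unemploy <- runif(80, min = 4, max = 16)   # hypothetical unemployment rates (%)

h6  <- hist(toy_unemploy, breaks = 6,  plot = FALSE)   # compute the bins without drawing
h20 <- hist(toy_unemploy, breaks = 20, plot = FALSE)

h6$breaks            # the bin edges R chose for breaks = 6
length(h6$counts)    # number of bins, which may differ from the number requested
length(h20$counts)
sum(h6$counts)       # every observation falls into exactly one bin
```

Comparing h6$counts with h20$counts shows how the same data are spread over fewer or more bins, which can help when deciding on a binning size.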
#Boxplots
Boxplots are another useful way to show measures of distribution and can be called with the
boxplot() function in R. Boxplots show the median (horizontal line in bold), the interquartile range
(top and bottom edges of the rectangle), the lowest and highest values that lie within 1.5 × the
interquartile range of those edges (lower and upper whiskers extending from the rectangle), and any
outliers beyond the whiskers (shown as dots). Note that a variable must be continuous to view its
distribution with a boxplot.
boxplot(census$hh_income, main = "Median Household Income\nacross Michigan Counties\nACS 2006-2010") # '\n' signals a break into a new line
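The numbers behind a boxplot can be inspected with boxplot.stats(), the base R function that boxplot() uses for its calculations. The sketch below is a minimal example on made-up incomes (toy_income is an assumption, not the workshop data).

```r
toy_income <- c(30, 32, 35, 36, 38, 40, 41, 43, 45, 90)   # made-up incomes (thousands)

stats <- boxplot.stats(toy_income)
stats$stats   # lower whisker, lower edge of box, median, upper edge of box, upper whisker
stats$out     # values beyond the whiskers, drawn as individual dots
```

Here the box runs from 35 to 43, so the whiskers can reach at most 1.5 × 8 = 12 units beyond the box. The value 90 lies further out than that and is therefore reported as an outlier.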
Try picking another variable in the census data and comparing its histogram and boxplot.
ii. Visualising 2-3 variables
Once you start including two variables in a visualization, you can finally start to chart relationships
between variables.
#Boxplots
While we’ve introduced boxplots of one variable in the previous section, boxplots can also be used
to show more than one variable. As mentioned previously, to use boxplots the response variable
needs to be continuous; but now, when considering two variables, the predictor variable can be
discrete.
Note the differences in syntax of your input variables in the following two examples:
Example 1 Syntax: (continuous variable 1 ~ factor(discrete variable 2))
boxplot(hh_income ~ factor(region), data = census, col = palette(), main = "Median Income by Region\nby County ACS 2006-2010") #this will give us household income by region, coloured using the current palette
Note that the boxplot colours are the green, brown and red you set up as your colour palette at the
very beginning of this exercise. These can be changed if you want to change the colours by altering
them as above (Quick R tips). Have a go at changing these yourself.
Example 2 Syntax: (continuous variable 1, continuous variable 2)
boxplot(census[, 12], census[, 13], col = "lightblue", names = c("men", "women"),
main = "Median Income by Sex For Population\nThat Worked Full-time Last 12 Months\nby County ACS 2006-2010")
# selecting columns 12 and 13 of the data frame 'census'. This is another way of doing this,
# as opposed to the $ sign. Here we have also specified that we want the boxplots to be
# light blue in colour.
Do it yourself: How would you add axis labels to these figures?
#Scatterplots
Scatterplots are extremely common for plotting 2 continuous variables and can be called by using
the plot() function. In R, enter the variable intended for the x-axis first, followed by the variable
intended for the y-axis.
plot(census$unemploy, census$hh_income, cex = 1.5) # cex controls the size of the points (1 is default) - see additional tips below.
You can transform a variable within the plot function itself:
plot(census$hh_income, log(census$population, 2)) # we can plot the log of a variable by simply adding it to the plot command
You can also calculate a new variable from existing ones in the same way:
plot(census$hh_income, census$population/census$area) # for the y axis this is the population normalised by area (population density)
We can also build up a plot by using the points function. In the example below, we’re using the
upper, central and lower subsets of data you created earlier:
Scatterplot - building by layers (run this one line at a time so you can see what it’s doing).
plot(c(0, max(census$rent)), c(0, max(census$hh_income)), type = "n", xlab = "Rent", ylab = "Median Household Income") #create a blank canvas ('type' parameter) and add axis labels ('xlab' and 'ylab' parameters)
points(upper$rent, upper$hh_income, pch = 20, col = "blue") #now build up your layers of data
points(central$rent, central$hh_income, pch = 20, col = "red")
points(lower$rent, lower$hh_income, pch = 20, col = "green")
legend("topleft", c("upper", "central", "lower"), pch = 20, col = c("blue", "red", "green"), title = "Region", bty = "n") # run '?legend' to figure out what these commands mean or check the tips section below.
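The same layered plot can be built with a loop, which avoids repeating the points() line for every region. The sketch below is one possible approach on made-up data; toy_census and region_cols are assumptions for illustration, and pdf(NULL) is only there so the code runs without opening a plot window.

```r
# Toy data frame standing in for the census data (all values are made up)
toy_census <- data.frame(
  region    = c("upper", "central", "lower", "lower", "upper", "central"),
  rent      = c(450, 600, 750, 900, 500, 650),
  hh_income = c(38000, 45000, 52000, 61000, 40000, 47000)
)

region_cols <- c(upper = "blue", central = "red", lower = "green")   # one colour per region

pdf(NULL)   # draw to a null device; remove this line and dev.off() to see the plot
plot(toy_census$rent, toy_census$hh_income, type = "n",
     xlab = "Rent", ylab = "Median Household Income")      # blank canvas
for (r in names(region_cols)) {
  layer <- toy_census[toy_census$region == r, ]             # rows belonging to this region
  points(layer$rent, layer$hh_income, pch = 20, col = region_cols[[r]])
}
legend("topleft", legend = names(region_cols), pch = 20,
       col = region_cols, title = "Region", bty = "n")
dev.off()

# Look up the colour of every row by its region name
point_cols <- unname(region_cols[as.character(toy_census$region)])
```

The point_cols vector could also be passed straight to plot() through the col parameter, which draws all regions in a single call instead of layer by layer.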
Do it yourself:
• You want to look at whether rent varies across the different regions. What sort of figure
would you use for this? Can you produce it in R?
• You want to have a look at whether there’s a relationship between area and population.
What sort of figure would you use for this? Can you produce it in R?
References:
Joque, J. & Simao, M.-C. (2017). Datavis r.
Thomas, R., Vaughan, I. & Lello, J. (2013). Data analysis with R statistical software. Eco-explore,
Newport printing.