Learning Objectives

  • Understand the philosophy of Tidy Data
  • Get to know some of the Tidyverse packages


The tidyverse is a suite of packages that includes libraries such as dplyr and ggplot2. These packages are designed for data science and share underlying principles, grammar and data structures. There are many ways to do the same thing in R, but following the philosophy of tidy data and using the tidyverse packages will keep your datasets organised and make analysis easier in the long run.


Tidy data


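Data can be represented in many different ways across multiple tables, but the tidyverse packages are specifically designed to work with tidy datasets. Tidy data conforms to the following criteria:

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell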

This structure works best with the tidyverse packages and keeps your datasets consistent. Getting your data into R and wrangling it into the correct format is usually the first step in your analysis. Fortunately, the tidyr package contains many functions to tidy up your dataset.
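As a minimal sketch of what tidying can look like, the example below reshapes a small made-up table with tidyr's pivot_longer function. The table counts_wide and its column names are hypothetical and are not part of the surveys data used in this lesson.

```r
library(tidyr)
library(tibble)

# Hypothetical untidy table: there is one column per year, so the
# "year" variable is stored in the column names rather than in a column.
counts_wide <- tibble(
  plot_id = c(1, 2),
  `1978` = c(10, 4),
  `1979` = c(12, 7)
)

# Gather the year columns into two new columns:
# "year" holds the old column names, "n_animals" holds the cell values.
counts_tidy <- pivot_longer(
  counts_wide,
  cols = c(`1978`, `1979`),
  names_to = "year",
  values_to = "n_animals"
)

counts_tidy
```

After reshaping, each row describes one plot in one year, which matches the tidy layout that the other tidyverse packages expect.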

We will start by reading in a dataset. The readr package has functions for importing data as tibbles. Tibbles are the tidyverse-compatible version of an R data frame. They are stricter about subsetting, print more compactly, and support grouping of variables, as we will see in the next section.
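One way to see the stricter behaviour is to compare how a base data frame and a tibble respond to the same subsetting call. This is only an illustrative sketch; the objects df and tb are made up for the comparison.

```r
library(tibble)

# A small made-up data frame and its tibble equivalent
df <- data.frame(species_id = c("NL", "DM"), weight = c(204, 40))
tb <- as_tibble(df)

# Selecting a single column with [ , ] silently simplifies a
# data frame to a plain vector ...
class(df[, "weight"])

# ... whereas a tibble always stays a tibble
class(tb[, "weight"])
```

Because the result type does not depend on how many columns are selected, code written against tibbles tends to behave more predictably.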

library(tidyverse)

# If you have already downloaded the data to your computer, you can read it from a file:
surveys <- read_csv("data/surveys_complete.csv")
# Otherwise, you can read it from a URL:
surveys <- read_csv("http://bifx-core3.bio.ed.ac.uk/training/R_dplyr_and_ggplot2/data/surveys_complete.csv")

Discussion

  • Look at the options available for the read_csv function and compare them with those of the read.table function we saw earlier.
  • What other readr functions can you find?
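As a starting point for the discussion, here is a small sketch of two read_csv options, col_types and na. The inline text csv_text is a hypothetical stand-in for a file, and wrapping it in I() to mark it as literal data assumes readr version 2.0 or later.

```r
library(readr)

# Hypothetical inline CSV used in place of a file on disk;
# I() tells readr to treat the string as data rather than a path.
csv_text <- I("record_id,species_id,weight\n1,NL,204\n2,DM,NA\n")

small <- read_csv(
  csv_text,
  # Declare the column types instead of letting readr guess them
  col_types = cols(
    record_id = col_integer(),
    species_id = col_character(),
    weight = col_double()
  ),
  # Strings that should be read as missing values
  na = c("", "NA")
)

small
```

For the surveys data read in above, spec(surveys) reports the column types that read_csv guessed by default.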


This dataset contains observations from a field survey of different organisms at different sites (plots). Let’s inspect the data.

# Type an R object's name to print its contents
surveys
# Use the View function to open the data in a spreadsheet-style viewer
View(surveys)
# We can look at the structure of the dataset
str(surveys)
## spc_tbl_ [30,463 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ record_id      : num [1:30463] 845 1164 1261 1756 1818 ...
##  $ month          : num [1:30463] 5 8 9 4 5 7 10 11 1 5 ...
##  $ day            : num [1:30463] 6 5 4 29 30 4 25 17 16 18 ...
##  $ year           : num [1:30463] 1978 1978 1978 1979 1979 ...
##  $ plot_id        : num [1:30463] 2 2 2 2 2 2 2 2 2 2 ...
##  $ species_id     : chr [1:30463] "NL" "NL" "NL" "NL" ...
##  $ sex            : chr [1:30463] "M" "M" "M" "M" ...
##  $ hindfoot_length: num [1:30463] 32 34 32 33 32 32 33 30 33 31 ...
##  $ weight         : num [1:30463] 204 199 197 166 184 206 274 186 184 87 ...
##  $ genus          : chr [1:30463] "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
##  $ species        : chr [1:30463] "albigula" "albigula" "albigula" "albigula" ...
##  $ taxa           : chr [1:30463] "Rodent" "Rodent" "Rodent" "Rodent" ...
##  $ plot_type      : chr [1:30463] "Control" "Control" "Control" "Control" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   record_id = col_double(),
##   ..   month = col_double(),
##   ..   day = col_double(),
##   ..   year = col_double(),
##   ..   plot_id = col_double(),
##   ..   species_id = col_character(),
##   ..   sex = col_character(),
##   ..   hindfoot_length = col_double(),
##   ..   weight = col_double(),
##   ..   genus = col_character(),
##   ..   species = col_character(),
##   ..   taxa = col_character(),
##   ..   plot_type = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
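A few other functions can give a quicker overview than str. The sketch below uses a tiny stand-in tibble, surveys_demo, built from the first few values shown in the output above, so that it runs without the full file; the same calls should work on surveys itself.

```r
library(dplyr)
library(tibble)

# Stand-in for the surveys tibble, using values from the str() output
surveys_demo <- tibble(
  record_id = c(845, 1164, 1261),
  species_id = c("NL", "NL", "NL"),
  weight = c(204, 199, 197)
)

# Number of rows and columns
dim(surveys_demo)

# Compact overview: one line per column, with its type and first values
glimpse(surveys_demo)

# Number of observations for each species
species_counts <- count(surveys_demo, species_id)
species_counts
```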

Further Resources

  • There are cheatsheets available for many tidyverse and RStudio packages that will help you to choose the correct functions.
  • Take a look at these slides or www.tidyverse.org for more information on the tidyverse.


Key points

  • The tidyverse is a suite of R packages
  • Stick to the principles and philosophy of tidy data
  • Use the readr package to import data as tibbles
  • Use further tidyverse packages to tidy, re-format and visualise data