This workshop does not provide an introduction to R and is not aimed at beginners. If you are new to R, we recommend first completing our introductory workshop.
We recommend using RStudio for this workshop. RStudio is an Integrated Development Environment (IDE) for R. It can be accessed in several ways, for example by downloading RStudio and installing it on your own computer.
There are 4 main panes, each with several tabs.
You can customise the appearance of RStudio under the Tools -> Global Options… menu.
There is a drop-down project menu at the top right of RStudio. Click this, select “New Project…” and create one in a new directory. Make sure you have write permission for the directory you choose.
Once you have done this, the project directory will be your working directory. Files will be saved to (or loaded from) this directory by default unless you specify a full path. You can change your working directory under the Session menu at the top.
Using RStudio has the advantage that everything you do can be saved between RStudio sessions.
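The effect of the working directory on relative file names can be sketched in a few lines of base R. This is only an illustration: it uses R's temporary directory as a stand-in for a project directory (an assumption made for this example), and the file name example.csv is hypothetical.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Assumption for this sketch: R's temporary directory stands in for a
# project directory
project_dir <- tempdir()
setwd(project_dir)

# A relative file name is resolved against the working directory
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
saved_ok <- file.exists(file.path(project_dir, "example.csv"))
print(saved_ok)

# Restore the original working directory
setwd(old_wd)
```

Inside an RStudio project you would not normally need setwd() at all, since opening the project sets the working directory for you.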
You can work in 3 different ways in RStudio.
Commands can be typed directly into the console, but to keep track of your work it’s best to write them into a script as you go (File->New File->R Script). From here you can use a shortcut to run the command on the line where your cursor is (Ctrl+Enter on Windows and Linux, Cmd+Return on macOS).
You can also use the Tab key to autocomplete names of functions and objects as you type them into your script.
Hint: When using the console, the Up/Down arrow keys can be used to cycle through previous commands.
In the console you should always see a > prompt; if you can’t see this, R may still be working. There is a red Stop light at the top right of the console when a command is running. If you see a + instead of >, R is waiting for more input. Sometimes this means you have forgotten to close a bracket or quotation.
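One way to see why R waits for more input is to ask its parser whether a piece of code is complete. The sketch below uses base R's parse(); the helper name parses_ok is hypothetical, and it returns FALSE for any syntax error, not only for an unclosed bracket.

```r
# Hypothetical helper: TRUE if the text parses as complete R code,
# FALSE if the parser raises any syntax error
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

complete_cmd   <- "sum(c(1, 2, 3))"  # all brackets closed
incomplete_cmd <- "sum(c(1, 2, 3)"   # final bracket missing

parses_ok(complete_cmd)
parses_ok(incomplete_cmd)
```

Typing the second command at the console would leave you at the + prompt until the missing bracket is supplied, or until you press Esc to cancel.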
Using R Markdown is a great way to annotate your code and present it at the end. It’s worth learning but will add a further level of complication for novice users.
Libraries (packages) provide additional functions in R and can be downloaded from several sources, including CRAN, Bioconductor and GitHub.
Install the packages we need for these lessons by running the code below in the R console:
# Install from CRAN with install.packages()
install.packages(c("gprofiler2", "ggplot2"))
# Install from Bioconductor with BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install(c(
  "GenomicRanges", "genomation", "BSgenome.Hsapiens.UCSC.hg19",
  "org.Hs.eg.db", "TxDb.Hsapiens.UCSC.hg19.knownGene", "biomaRt",
  "AnnotationHub", "BSgenome.Mmusculus.UCSC.mm10", "ChIPseeker",
  "clusterProfiler", "profileplyr", "soGGi"
))
# Install from GitHub with the devtools package.
# We don't need these packages; this is just for demonstration.
# install.packages("devtools")
# devtools::install_github("thomasp85/patchwork")
To load a specific package within an R session, use the “library” function:
library(ggplot2)
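Before loading packages it can help to check which ones are actually installed. The following is a minimal sketch using base R's requireNamespace(); the two package names are taken from the install commands above, and the variable names are only for illustration.

```r
# Packages to check (a subset of those installed above)
pkgs <- c("ggplot2", "GenomicRanges")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still need installing
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)
```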
We will work with several different genomic datasets including genome sequences, gene models, ChIP-seq peak co-ordinates and aligned sequencing reads. Most of these datasets exist in databases outside of R in standardised text formats. It will be useful to familiarise yourself with these formats and the type of data stored within each.
In this course we will use Fasta, Bed, wiggle and GFF files. A brief summary of file formats is provided below as well as in this presentation.
Fasta is the standard format for sequence data (DNA/RNA or protein). A single file can store multiple sequences, each introduced by a unique header line.
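To make the Fasta layout concrete, here is a small base R sketch that splits a few Fasta lines into named sequences. The records are made up for illustration, and in practice a dedicated package would normally be used rather than hand-written parsing.

```r
# Made-up Fasta content: header lines start with ">", and a sequence
# may span several lines
fasta_lines <- c(">seq1 example", "ACGTACGT", "GGCC",
                 ">seq2 example", "TTGACA")

is_header <- startsWith(fasta_lines, ">")
record_id <- cumsum(is_header)  # which record each line belongs to

# Join the sequence lines belonging to each record
seqs <- vapply(split(fasta_lines[!is_header], record_id[!is_header]),
               paste, character(1), collapse = "")
names(seqs) <- sub("^>", "", fasta_lines[is_header])

print(seqs)
print(nchar(seqs))  # sequence lengths
```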
Fastq is similar to Fasta but is used to store sequencing reads and includes extra lines to encode quality scores.
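A single Fastq record can be sketched as four lines: an identifier, the sequence, a separator and a quality string. The example below is made up, and it assumes the common Phred+33 encoding (quality = ASCII code minus 33); older data sets may use a different offset.

```r
# One made-up Fastq record: identifier, sequence, separator, qualities
fastq_record <- c("@read1", "ACGT", "+", "II5#")

bases <- strsplit(fastq_record[2], "")[[1]]

# Assumption: Phred+33 encoding, so subtract 33 from each ASCII code
quality <- utf8ToInt(fastq_record[4]) - 33L

print(data.frame(base = bases, quality = quality))
```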
Bed (browser extensible data) files store genomic co-ordinates to represent positions of features such as genes, ChIP-seq peaks or regulatory features (e.g. CpG islands). Bed files have at least 3 columns (chromosome, start, end) to encode regions of the genome.
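The three required Bed columns can be read into a data frame with base R. The regions below are made up for illustration. Note that Bed start positions are 0-based and end positions are exclusive, so the length of a region is simply end minus start.

```r
# Made-up Bed content with the three required columns
bed_text <- "chr1\t100\t200\nchr1\t500\t650\nchr2\t0\t50"

bed <- read.table(text = bed_text, sep = "\t",
                  col.names = c("chrom", "start", "end"),
                  stringsAsFactors = FALSE)

# Bed is 0-based and half-open, so width = end - start
bed$width <- bed$end - bed$start
print(bed)
```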
The wiggle or .wig format is used to represent signals or scores across the genome (e.g. sequencing read depth, GC% etc.). The closely related bedGraph layout has four columns (chromosome, start, end, score), while classic wiggle files list scores under fixedStep or variableStep declaration lines. There is a binary version called bigWig used to store and visualise large datasets.
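As a rough illustration of the four-column layout (chromosome, start, end, score), the sketch below reads two made-up intervals and computes a length-weighted mean score. It assumes Bed-style coordinates, where interval length is end minus start.

```r
# Made-up signal track in four-column form
signal_text <- "chr1\t0\t100\t2.0\nchr1\t100\t150\t4.0"

signal <- read.table(text = signal_text, sep = "\t",
                     col.names = c("chrom", "start", "end", "score"),
                     stringsAsFactors = FALSE)

# Weight each score by the number of bases it covers
widths <- signal$end - signal$start
mean_score <- sum(signal$score * widths) / sum(widths)
print(mean_score)
```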
The GFF (general feature format) is used to store rich information on genomic annotations. GTF (gene transfer format) is a derivative of GFF which stores gene transcript models, with detailed information on genetic features such as start codons, alternative transcripts, exons and CDS.
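A GTF line has nine tab-separated fields, the last of which holds key/value attributes. The sketch below parses one made-up exon line with base R; the gene and transcript identifiers are hypothetical. Unlike Bed, GFF/GTF coordinates are 1-based and inclusive, so the width is end minus start plus one.

```r
# One made-up GTF line (the identifiers are hypothetical)
gtf_line <- 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "geneA"; transcript_id "geneA.1";'

fields <- strsplit(gtf_line, "\t", fixed = TRUE)[[1]]
names(fields) <- c("seqname", "source", "feature", "start", "end",
                   "score", "strand", "frame", "attribute")

# Pull the gene_id value out of the attribute column
gene_id <- sub('.*gene_id "([^"]+)".*', "\\1", fields[["attribute"]])

# 1-based, inclusive coordinates
exon_width <- as.integer(fields[["end"]]) - as.integer(fields[["start"]]) + 1L

print(gene_id)
print(exon_width)
```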
The SAM (sequence alignment/map) format stores sequencing reads and quality scores as well as detailed information following alignment to a genome/transcriptome. There is a binary version called BAM used to store and visualise large datasets.
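Each SAM alignment line starts with eleven mandatory tab-separated fields. The sketch below splits one made-up alignment and decodes two bits of the FLAG field using base R's bitwAnd(): bit 4 marks an unmapped read and bit 16 marks a read aligned to the reverse strand.

```r
# One made-up SAM alignment line with the 11 mandatory fields
sam_line <- "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"

fields <- strsplit(sam_line, "\t", fixed = TRUE)[[1]]
names(fields) <- c("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                   "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

flag <- as.integer(fields[["FLAG"]])
is_unmapped <- bitwAnd(flag, 4L) != 0   # bit 4: read is unmapped
is_reverse  <- bitwAnd(flag, 16L) != 0  # bit 16: reverse strand

strand <- if (is_reverse) "-" else "+"
print(c(chrom = fields[["RNAME"]], pos = fields[["POS"]], strand = strand))
```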
The VCF (variant call format) is used to store positions of SNPs, INDELs and other genomic variations following variant calling analysis.
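The first eight VCF columns are fixed (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The sketch below reads two made-up variant lines and labels each one as a SNP or an INDEL by comparing allele lengths. This is a deliberately simple rule that ignores multi-allelic and structural variants.

```r
# Two made-up variant records (header lines omitted)
vcf_text <- "chr1\t12345\t.\tA\tG\t50\tPASS\t.\nchr1\t20000\t.\tAT\tA\t99\tPASS\t."

# Read everything as character so that an allele such as "T" is not
# converted to the logical value TRUE
vcf <- read.table(text = vcf_text, sep = "\t", colClasses = "character",
                  col.names = c("CHROM", "POS", "ID", "REF", "ALT",
                                "QUAL", "FILTER", "INFO"))

# Simple rule: single-base REF and ALT means SNP, anything else INDEL
vcf$type <- ifelse(nchar(vcf$REF) == 1 & nchar(vcf$ALT) == 1, "SNP", "INDEL")
print(vcf[, c("CHROM", "POS", "REF", "ALT", "type")])
```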