Learning Objectives

Learn to use the GenomicRanges package in R
Import genomic regions as GRanges objects
Perform operations on GRanges objects

1. GenomicRanges

The GenomicRanges package defines general purpose containers, known as GRanges objects, for storing and manipulating genomic intervals and variables defined along a genome.

Load the GenomicRanges library and use the GRanges function to generate a sample of genomic intervals. The IRanges function creates numeric intervals which are passed to GRanges and understood as genomic co-ordinates:

library(GenomicRanges)

## Create a genomicRanges object from scratch
gr<-GRanges(seqnames=c("chr2","chr2","chr6","chr6","chr10"),
            ranges=IRanges(start=c(100,150,300,350,500),
                           end=c(200,300,400,500,600)),
            names=c(1,2,3,4,5))
gr

## GRanges object with 5 ranges and 1 metadata column:
##       seqnames    ranges strand |     names
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr2   100-200      * |         1
##   [2]     chr2   150-300      * |         2
##   [3]     chr6   300-400      * |         3
##   [4]     chr6   350-500      * |         4
##   [5]    chr10   500-600      * |         5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

We can use the sample, rep and seq functions to generate co-ordinates. Try playing around with these functions.

## Create a second 
gr2<-GRanges(seqnames = paste0("chr",seq(2,10,2)),
            ranges = IRanges(start = sample(1000000,5) ,
            width = rep(500,5)),strand = c(rep("+",3),rep("-",2)),names=seq(1,5))
gr2

## GRanges object with 5 ranges and 1 metadata column:
##       seqnames        ranges strand |     names
##          <Rle>     <IRanges>  <Rle> | <integer>
##   [1]     chr2 921642-922141      + |         1
##   [2]     chr4 919858-920357      + |         2
##   [3]     chr6 422714-423213      + |         3
##   [4]     chr8 560031-560530      - |         4
##   [5]    chr10 937246-937745      - |         5
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

We can store several GRanges objects in a GRangeslist.

For instance, say we had two sets of candidate genes from two different, but related experiments. We can place these together in a list and apply the same operations on both sets at the same time.

## Create a GRanges list
gl<-GRangesList(gr,gr2)
gl

## GRangesList object of length 2:
## [[1]]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames    ranges strand |     names
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr2   100-200      * |         1
##   [2]     chr2   150-300      * |         2
##   [3]     chr6   300-400      * |         3
##   [4]     chr6   350-500      * |         4
##   [5]    chr10   500-600      * |         5
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths
## 
## [[2]]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames        ranges strand |     names
##          <Rle>     <IRanges>  <Rle> | <numeric>
##   [1]     chr2 921642-922141      + |         1
##   [2]     chr4 919858-920357      + |         2
##   [3]     chr6 422714-423213      + |         3
##   [4]     chr8 560031-560530      - |         4
##   [5]    chr10 937246-937745      - |         5
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

1.1 Functions to access GRanges objects

The GRanges object has several accessor functions to quickly extract information about the data:

## Get the number of intervals in a GRanges object
length(gr)

## [1] 5

## Get the number of intervals in a GRanges object
length(gr)

## [1] 5

## Get the names of sequences (e.g. chromosomes) represented in the GRanges object
seqnames(gr)

## factor-Rle of length 5 with 3 runs
##   Lengths:     2     2     1
##   Values : chr2  chr6  chr10
## Levels(3): chr2 chr6 chr10

## Get the start and end positions or the strand
start(gr2)

## [1] 921642 919858 422714 560031 937246

end(gr2)

## [1] 922141 920357 423213 560530 937745

strand(gr2)

## factor-Rle of length 5 with 2 runs
##   Lengths: 3 2
##   Values : + -
## Levels(3): + - *

## Get the length of each genomic interval
width(gr)

## [1] 101 151 101 151 101

The GRanges object also contains metadata columns with additional information about each genomic interval. We can assign and retrieve metadata with the mcols function.

## Create a new column to represent GC content
gr$GC<-seq(1,0,length=length(gr))
## Fetch the metadata columns as a DataFrame
mcols(gr)

## DataFrame with 5 rows and 2 columns
##       names        GC
##   <numeric> <numeric>
## 1         1      1.00
## 2         2      0.75
## 3         3      0.50
## 4         4      0.25
## 5         5      0.00

1.2 Perform vector operations on GRanges objects

GRanges objects work like a lot of other vector objects in R. This means you can use many of the base R functions to subset, combine, sort and compare GRanges.

## Select specific GRanges
gr3<-gr[c(5,4,3)]
gr3

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames    ranges strand |     names        GC
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]    chr10   500-600      * |         5      0.00
##   [2]     chr6   350-500      * |         4      0.25
##   [3]     chr6   300-400      * |         3      0.50
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

## Sort a Granges object
sort(gr3)

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames    ranges strand |     names        GC
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr6   300-400      * |         3      0.50
##   [2]     chr6   350-500      * |         4      0.25
##   [3]    chr10   500-600      * |         5      0.00
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

## Concatenate two GRanges
c(gr,gr2)

## GRanges object with 10 ranges and 2 metadata columns:
##        seqnames        ranges strand |     names        GC
##           <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
##    [1]     chr2       100-200      * |         1      1.00
##    [2]     chr2       150-300      * |         2      0.75
##    [3]     chr6       300-400      * |         3      0.50
##    [4]     chr6       350-500      * |         4      0.25
##    [5]    chr10       500-600      * |         5      0.00
##    [6]     chr2 921642-922141      + |         1        NA
##    [7]     chr4 919858-920357      + |         2        NA
##    [8]     chr6 422714-423213      + |         3        NA
##    [9]     chr8 560031-560530      - |         4        NA
##   [10]    chr10 937246-937745      - |         5        NA
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

## Subset a GRanges object
subset(gr2,strand(gr2)=="-")

## GRanges object with 2 ranges and 1 metadata column:
##       seqnames        ranges strand |     names
##          <Rle>     <IRanges>  <Rle> | <integer>
##   [1]     chr8 560031-560530      - |         4
##   [2]    chr10 937246-937745      - |         5
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

## Split GRanges into a GRangesList
split(gr,seqnames(gr))

## GRangesList object of length 3:
## $chr2
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |     names        GC
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr2   100-200      * |         1      1.00
##   [2]     chr2   150-300      * |         2      0.75
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
## 
## $chr6
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |     names        GC
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr6   300-400      * |         3      0.50
##   [2]     chr6   350-500      * |         4      0.25
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
## 
## $chr10
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     names        GC
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]    chr10   500-600      * |         5         0
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Challenge:

1.Extract the GRanges in gr with >=0.5 GC content and split them into a list for each chromosome.

Solution:

gr_subset=subset(gr,GC>=0.5)
split(gr_subset,seqnames(gr_subset))

## GRangesList object of length 3:
## $chr2
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames    ranges strand |     names        GC
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr2   100-200      * |         1      1.00
##   [2]     chr2   150-300      * |         2      0.75
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
## 
## $chr6
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     names        GC
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr6   300-400      * |         3       0.5
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
## 
## $chr10
## GRanges object with 0 ranges and 2 metadata columns:
##    seqnames    ranges strand |     names        GC
##       <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

2.Concatenate the GRanges in gr and gr2 and extract the intervals on chr2.

Solution:

grc<-c(gr,gr2)
subset(grc,seqnames(grc)=="chr2")

## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames        ranges strand |     names        GC
##          <Rle>     <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr2       100-200      * |         1      1.00
##   [2]     chr2       150-300      * |         2      0.75
##   [3]     chr2 921642-922141      + |         1        NA
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

1.3 Perform range based operations on GRanges objects

The GenomicRanges package also has several useful functions to perform operations on GRanges objects based on the genomic intervals.

## Get the promoter region around the start of the feature
promoters(gr2,upstream = 1000,downstream=1000)

## GRanges object with 5 ranges and 1 metadata column:
##       seqnames        ranges strand |     names
##          <Rle>     <IRanges>  <Rle> | <integer>
##   [1]     chr2 920642-922641      + |         1
##   [2]     chr4 918858-920857      + |         2
##   [3]     chr6 421714-423713      + |         3
##   [4]     chr8 559531-561530      - |         4
##   [5]    chr10 936746-938745      - |         5
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

## Reduce overlapping ranges into a single range
reduce(gr)

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2   100-300      *
##   [2]     chr6   300-500      *
##   [3]    chr10   500-600      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

## Shift all the co-ordinates along
gr.shift<-shift(gr,100)
gr.shift

## GRanges object with 5 ranges and 2 metadata columns:
##       seqnames    ranges strand |     names        GC
##          <Rle> <IRanges>  <Rle> | <numeric> <numeric>
##   [1]     chr2   200-300      * |         1      1.00
##   [2]     chr2   250-400      * |         2      0.75
##   [3]     chr6   400-500      * |         3      0.50
##   [4]     chr6   450-600      * |         4      0.25
##   [5]    chr10   600-700      * |         5      0.00
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

## Find the overlaps between 2 sets of GRanges
fo<-findOverlaps(query = gr,subject = gr.shift)
fo

## Hits object with 7 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1           1
##   [2]         2           1
##   [3]         2           2
##   [4]         3           3
##   [5]         4           3
##   [6]         4           4
##   [7]         5           5
##   -------
##   queryLength: 5 / subjectLength: 5

## Use the subjectHits or queryHits functions to extract columns from the output
subjectHits(fo)

## [1] 1 1 2 3 3 4 5

## Merge the regions between 2 sets of GRanges
union(gr.shift,gr)

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2   100-400      *
##   [2]     chr6   300-600      *
##   [3]    chr10   500-700      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

## Get the overlapping regions between 2 sets of GRanges
intersect(gr.shift,gr)

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2   200-300      *
##   [2]     chr6   400-500      *
##   [3]    chr10       600      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

## What happens if we set all the shifted ranges to the minus strand?
strand(gr.shift)="-"
intersect(gr.shift,gr)

## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

## Strandedness is used in comparisons by default but we can turn this off:
intersect(gr.shift,gr,ignore.strand=T)

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr2   200-300      *
##   [2]     chr6   400-500      *
##   [3]    chr10       600      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

We can also perform operations on all of the GRanges in a GRangesList object:

## We can perform operations on all items in the list
promoters(gl)

## GRangesList object of length 2:
## [[1]]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames    ranges strand |     names
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chr2 -1900-299      * |         1
##   [2]     chr2 -1850-349      * |         2
##   [3]     chr6 -1700-499      * |         3
##   [4]     chr6 -1650-549      * |         4
##   [5]    chr10 -1500-699      * |         5
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths
## 
## [[2]]
## GRanges object with 5 ranges and 1 metadata column:
##       seqnames        ranges strand |     names
##          <Rle>     <IRanges>  <Rle> | <numeric>
##   [1]     chr2 919642-921841      + |         1
##   [2]     chr4 917858-920057      + |         2
##   [3]     chr6 420714-422913      + |         3
##   [4]     chr8 560331-562530      - |         4
##   [5]    chr10 937546-939745      - |         5
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

GRangesList objects also work with base R list functions like lapply and unlist but we will not cover that here.

Challenge:

1.There are many more functions in the Genomic Ranges documentation as well as the How To document.

Read the documentation and see if you can figure out how to get the 1000 bases downstream of each of the ranges in gr2.

Solution:

flank(gr2,1000,start=F)

## GRanges object with 5 ranges and 1 metadata column:
##       seqnames        ranges strand |     names
##          <Rle>     <IRanges>  <Rle> | <integer>
##   [1]     chr2 922142-923141      + |         1
##   [2]     chr4 920358-921357      + |         2
##   [3]     chr6 423214-424213      + |         3
##   [4]     chr8 559031-560030      - |         4
##   [5]    chr10 936246-937245      - |         5
##   -------
##   seqinfo: 5 sequences from an unspecified genome; no seqlengths

2.Create a new GRanges object called genes and find which genes overlap with the ranges in gr2.

genes<-GRanges(seqnames=c("chr2","chr6","chr9"),ranges = IRanges(start=c(start(gr2)[c(1,3)],1000),width=100),strand="+",name=c("geneA","geneB","geneC"))

Solution:

overlaps<-findOverlaps(query = genes,subject=gr2)
qh<-queryHits(overlaps)
qh<-unique(qh) ## Not necessary here, but will prevent repeats if genes overlapped with more than one region
genes[qh]$name

## [1] "geneA" "geneB"

Note that it is also possible to string together GRanges operations with R pipes e.g.

qh<-findOverlaps(query = genes,subject=gr2) |> queryHits() |> unique()
genes[qh]$name

## [1] "geneA" "geneB"

2. Importing files as Genomic Ranges

The genomation package has several useful functions to import annotation and sequencing files into R as GRanges objects.

We are going to work with annotation files from the Human hg19 reference genome. The file Ensembl.GRCh37.74.edited.genes.strict.bed contains gene annotations downloaded from the Ensembl database in bed format.

Load the genomation library and use the readBed function to read in the file. You can download the file from the link above and save it in your working directory. Or, you can read directly from the URL.

library(genomation)
## Read in a bed file if it exists on your computer
#hg19.genes<-readBed("Ensembl.GRCh37.74.edited.genes.strict.bed")
## Read in a bed file from the web
hg19.genes<-readBed("http://bifx-core.bio.ed.ac.uk/genomes/human/hg19/annotation/Ensembl.GRCh37.74.edited.genes.strict.bed")
## Inspect the object
hg19.genes

## GRanges object with 57773 ranges and 2 metadata columns:
##           seqnames            ranges strand |     score            name
##              <Rle>         <IRanges>  <Rle> | <integer>     <character>
##       [1]    chr13 23551995-23552136      - |         0 ENSG00000223116
##       [2]    chr13 23708314-23708703      + |         0 ENSG00000233440
##       [3]    chr13 23726726-23726825      - |         0 ENSG00000207157
##       [4]    chr13 23743975-23744736      - |         0 ENSG00000229483
##       [5]    chr13 23791572-23791673      - |         0 ENSG00000252952
##       ...      ...               ...    ... .       ...             ...
##   [57769]     chr3 14313874-14345345      - |         0 ENSG00000255021
##   [57770]     chr3 14389952-14394068      - |         0 ENSG00000251576
##   [57771]     chr3 14436149-14436257      + |         0 ENSG00000199609
##   [57772]     chr3 14530620-14583588      - |         0 ENSG00000144596
##   [57773]     chr2 61704985-61765761      - |         0 ENSG00000082898
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

Note that the GRanges object has a “slot” for seqinfo which can store the lengths of each chromosome and the genome they are derived from. Some functions make use of seqinfo, for instance when calculating the gaps in coverage of a GRanges object we need to know where the end of each chromosome is. This information is currently empty but we will see how to update it later.

seqinfo(hg19.genes)

## Seqinfo object with 25 sequences from an unspecified genome; no seqlengths:
##   seqnames seqlengths isCircular genome
##   chr1           <NA>       <NA>   <NA>
##   chr2           <NA>       <NA>   <NA>
##   chr3           <NA>       <NA>   <NA>
##   chr4           <NA>       <NA>   <NA>
##   chr5           <NA>       <NA>   <NA>
##   ...             ...        ...    ...
##   chr21          <NA>       <NA>   <NA>
##   chr22          <NA>       <NA>   <NA>
##   chrX           <NA>       <NA>   <NA>
##   chrY           <NA>       <NA>   <NA>
##   chrM           <NA>       <NA>   <NA>

We can also use genomation to read in files that are not in a standard format using the readGeneric function. Let’s update our hg19.genes object by reading a file with more information (metadata) about each gene.

This file contains the same gene annotations along with additional columns of information. The file is not in the strict bed format so we need to use the readGeneric function.

The meta.cols argument allows us to specify the column numbers to be read as metadata and how to name these.

## This file contains the same gene annotations but has some extra columns in the table which we can also import.
hg19.genes<-readGeneric("http://bifx-core.bio.ed.ac.uk/genomes/human/hg19/annotation/Ensembl.GRCh37.74.edited.genes.bed",chr = 1,start = 2,end = 3,strand = 6,meta.cols = list(name=4,symbol=5,biotype=7))

## The meta.cols parameter is a list of metadata column numbers and the names we want to give them.
hg19.genes

## GRanges object with 57773 ranges and 3 metadata columns:
##           seqnames            ranges strand |            name       symbol
##              <Rle>         <IRanges>  <Rle> |     <character>  <character>
##       [1]    chr13 23551994-23552136      - | ENSG00000223116   AL157931.1
##       [2]    chr13 23708313-23708703      + | ENSG00000233440      HMGA1P6
##       [3]    chr13 23726725-23726825      - | ENSG00000207157       RNY3P4
##       [4]    chr13 23743974-23744736      - | ENSG00000229483    LINC00362
##       [5]    chr13 23791571-23791673      - | ENSG00000252952     RNU6-58P
##       ...      ...               ...    ... .             ...          ...
##   [57769]     chr3 14313873-14345345      - | ENSG00000255021 RP11-536I6.2
##   [57770]     chr3 14389951-14394068      - | ENSG00000251576 RP11-536I6.1
##   [57771]     chr3 14436148-14436257      + | ENSG00000199609    RNA5SP124
##   [57772]     chr3 14530619-14583588      - | ENSG00000144596        GRIP2
##   [57773]     chr2 61704984-61765761      - | ENSG00000082898         XPO1
##                        biotype
##                    <character>
##       [1]                miRNA
##       [2]           pseudogene
##       [3]             misc_RNA
##       [4]              lincRNA
##       [5]                snRNA
##       ...                  ...
##   [57769]              lincRNA
##   [57770]              lincRNA
##   [57771]                 rRNA
##   [57772] processed_transcript
##   [57773]       protein_coding
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

We can use the metadata to filter our GRanges object using the subset function.

##Get all protein coding genes
hg19.pc.genes<-subset(hg19.genes,biotype=="protein_coding")
hg19.pc.genes

## GRanges object with 20327 ranges and 3 metadata columns:
##           seqnames            ranges strand |            name       symbol
##              <Rle>         <IRanges>  <Rle> |     <character>  <character>
##       [1]    chr15 20737094-20747114      - | ENSG00000215405     GOLGA6L6
##       [2]    chr15 21004687-21005367      + | ENSG00000268343   AC012414.1
##       [3]    chr15 21040701-21071643      - | ENSG00000230031       POTEB2
##       [4]    chr15 49280673-49338760      - | ENSG00000138593    SECISBP2L
##       [5]    chr15 22011370-22012050      + | ENSG00000268531 DKFZP547L112
##       ...      ...               ...    ... .             ...          ...
##   [20323]    chr17 43224684-43229468      + | ENSG00000186834       HEXIM1
##   [20324]    chr17 43238067-43247407      + | ENSG00000168517       HEXIM2
##   [20325]    chr17 43298811-43324687      + | ENSG00000184922        FMNL1
##   [20326]    chr10 22823778-23003484      - | ENSG00000150867      PIP4K2A
##   [20327]     chr2 61704984-61765761      - | ENSG00000082898         XPO1
##                  biotype
##              <character>
##       [1] protein_coding
##       [2] protein_coding
##       [3] protein_coding
##       [4] protein_coding
##       [5] protein_coding
##       ...            ...
##   [20323] protein_coding
##   [20324] protein_coding
##   [20325] protein_coding
##   [20326] protein_coding
##   [20327] protein_coding
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

Further Learning

There are several other import functions in genomation:

readBam() ##Import sequencing alignments.
readNarrowPeak() and readBroadPeak() ##Import ChIP-seq peak regions from commonly used peak caller tools.
gffToGRanges() ##Import GFF annotation files.
readTranscriptFeatures() ##Import a bed file and split it into promoter, exon and intron regions.
More…

genomation also has several analysis functions:

Perform additional operations on GRanges objects
Annotate GRanges objects with other features
Extract and visualise data across GRanges
More…

Take a look at the genomation manual and documentation to learn more.

Key points

Store genomic co-ordinates and metadata in GRanges objects
Access information and perform operations on GRanges with functions from the GenomicRanges library.
The genomation library has functions to easily load annotation files and other genomics datasets.
All of the Bioconductor packages have very good documentation!