3. Fetching data from annotation packages and external databases

There are several annotation packages within Bioconductor with pre-built databases that are ready to use. These include gene annotations, transcript models and genome sequences, as well as micro-array probe annotations and functional annotations like KEGG (pathway maps) and GO (gene ontology terms).

First, let’s make sure you have the following datasets from the GenomicRanges lesson:

library(GenomicRanges)
library(genomation)
options(timeout=600) ## This will increase the time for downloading large files

gr2<-GRanges(seqnames = paste0("chr",seq(2,10,2)),
            ranges = IRanges(start = sample(1000000,5) ,
            width = rep(500,5)),strand = c(rep("+",3),rep("-",2)),names=seq(1,5))

hg19.genes<-readGeneric("http://bifx-core.bio.ed.ac.uk/genomes/human/hg19/annotation/Ensembl.GRCh37.74.edited.genes.bed",chr = 1,start = 2,end = 3,strand = 6,meta.cols = list(name=4,symbol=5,biotype=7))
hg19.pc.genes<-subset(hg19.genes,biotype=="protein_coding")

3.1 Genomic sequences (BSgenome)

BSgenome packages contain genome reference sequences. We can use these to fetch the underlying sequences in GRanges objects and in conjunction with other functions that require sequence information e.g. motif discovery, calculating GC content etc.

Load the BSgenome library for hg19 and assign it to an object hs.

## Load the hg19 BSgenome object
library(BSgenome.Hsapiens.UCSC.hg19)
hs<-BSgenome.Hsapiens.UCSC.hg19

The Biostrings package is automatically loaded along with any BSgenome and contains functions to work with genomic sequences.

## Fetch sequences for the genomic intervals in GRanges object gr2  
gr2.seq<-getSeq(hs,gr2)
gr2.seq

## DNAStringSet object of length 5:
##     width seq
## [1]   500 TAAGTACCTAAGGGTAAAGTGGCCTGATGGCTCC...AGAGCTGACAAGCTGGCCCTGTGGGCATTTATT
## [2]   500 CCTGTTGTCTAGTGAATTAGTAACTTTCTTTAAT...CCCTCCCTGCACAAGCACCCCCACCGCGGCTCT
## [3]   500 ATGTGCCAGTGCGGCCACCCAAACCATGCCCCTC...TTAGAATAAAAAATTCCACAATTGTCAAAGAAG
## [4]   500 TAATTGTTTGATCCCCTTCGCCCATATGGCCCAA...CCTTGAAAATGTTAGGTTGTTTGCTTGATCACA
## [5]   500 ACCTGCAGCGAGGTCTGTGAACCCGTCTCAGGTT...CCAAGGTGTCGGACACCGTGGTGGAGCCCTACA

## Calculate the dinucleotide frequencies for each sequence
dinucleotideFrequency(gr2.seq)

##      AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
## [1,] 75 39 45 27 50 20  3 26 37 22 34 24 24 18 35 20
## [2,] 61 23 35 20 29 34 19 41 23 33 33 29 26 32 31 30
## [3,] 44 24 28 35 39 38  7 35 20 20 18 35 27 37 41 51
## [4,] 18 25 20 25 36 67  9 49 18 24 28 33 17 45 46 39
## [5,] 17 35 31 14 32 36 19 45 35 29 59 36 13 32 50 16

All BSgenome databases contain a seqinfo object that describes the genome. We can add this information to any of our GRanges objects.

seqinfo(hs)

## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
##   seqnames           seqlengths isCircular genome
##   chr1                249250621      FALSE   hg19
##   chr2                243199373      FALSE   hg19
##   chr3                198022430      FALSE   hg19
##   chr4                191154276      FALSE   hg19
##   chr5                180915260      FALSE   hg19
##   ...                       ...        ...    ...
##   chr21_gl383580_alt      74652      FALSE   hg19
##   chr21_gl383581_alt     116690      FALSE   hg19
##   chr22_gl383582_alt     162811      FALSE   hg19
##   chr22_gl383583_alt      96924      FALSE   hg19
##   chr22_kb663609_alt      74013      FALSE   hg19

## Let's update our human protein coding genes GRanges object
seqinfo(hg19.pc.genes) <- seqinfo(hs)
hg19.pc.genes

## GRanges object with 20327 ranges and 3 metadata columns:
##           seqnames            ranges strand |            name       symbol
##              <Rle>         <IRanges>  <Rle> |     <character>  <character>
##       [1]    chr15 20737094-20747114      - | ENSG00000215405     GOLGA6L6
##       [2]    chr15 21004687-21005367      + | ENSG00000268343   AC012414.1
##       [3]    chr15 21040701-21071643      - | ENSG00000230031       POTEB2
##       [4]    chr15 49280673-49338760      - | ENSG00000138593    SECISBP2L
##       [5]    chr15 22011370-22012050      + | ENSG00000268531 DKFZP547L112
##       ...      ...               ...    ... .             ...          ...
##   [20323]    chr17 43224684-43229468      + | ENSG00000186834       HEXIM1
##   [20324]    chr17 43238067-43247407      + | ENSG00000168517       HEXIM2
##   [20325]    chr17 43298811-43324687      + | ENSG00000184922        FMNL1
##   [20326]    chr10 22823778-23003484      - | ENSG00000150867      PIP4K2A
##   [20327]     chr2 61704984-61765761      - | ENSG00000082898         XPO1
##                  biotype
##              <character>
##       [1] protein_coding
##       [2] protein_coding
##       [3] protein_coding
##       [4] protein_coding
##       [5] protein_coding
##       ...            ...
##   [20323] protein_coding
##   [20324] protein_coding
##   [20325] protein_coding
##   [20326] protein_coding
##   [20327] protein_coding
##   -------
##   seqinfo: 298 sequences (2 circular) from hg19 genome

## Print out seqinfo
seqinfo(hg19.pc.genes)

## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
##   seqnames           seqlengths isCircular genome
##   chr1                249250621      FALSE   hg19
##   chr2                243199373      FALSE   hg19
##   chr3                198022430      FALSE   hg19
##   chr4                191154276      FALSE   hg19
##   chr5                180915260      FALSE   hg19
##   ...                       ...        ...    ...
##   chr21_gl383580_alt      74652      FALSE   hg19
##   chr21_gl383581_alt     116690      FALSE   hg19
##   chr22_gl383582_alt     162811      FALSE   hg19
##   chr22_gl383583_alt      96924      FALSE   hg19
##   chr22_kb663609_alt      74013      FALSE   hg19

Challenge:

1.Find the GC content for each sequence in gr2.seq. Hint: Look at the letterFrequency or alphabetFrequency functions.

Solution:

letterFrequency(gr2.seq,letters = "GC",as.prob = T)

##        G|C
## [1,] 0.432
## [2,] 0.482
## [3,] 0.426
## [4,] 0.528
## [5,] 0.582

2.Attempt to translate the sequences in gr2.seq into amino acid sequences using the Biostrings package.

Solution:

translate(gr2.seq)

## AAStringSet object of length 5:
##     width seq
## [1]   166 *VPKGKVA*WLQLTLK*SEKIICTCTKRHEERWT...SHG*GTRPLTDTKGVP*GVTAAAELTSWPCGHL
## [2]   166 PVV**ISNFL*SLSFFLWFARYKLYFCYTSGYKM...RETLSQKKKKKKKKSPKSL*VCAPPCTSTPTAA
## [3]   166 MCQCGHPNHAPPRGPSPSSAGPPPMSSVVLIITP...GQF*KLFLSKNILFSF*HMKLIH*NKKFHNCQR
## [4]   166 *LFDPLRPYGPK*PSPS*IPPGLWFLLFHLPRCS...PSFGTAPCESG*VPSRLWALV*PLENVRLFA*S
## [5]   166 TCSEVCEPVSGLTPVLIPNRGSCCPVTLGEGVSS...SAQ*DPGGVPRQDHKHIQHPALAQGVGHRGGAL

Because the gr2 regions are randomly selected they don’t really represent transcripts. You may end up with an error if there are Ns in your sequences. You can use the if.fuzzy.codon argument to solve this.

translate(gr2.seq,if.fuzzy.codon = "solve")

## AAStringSet object of length 5:
##     width seq
## [1]   166 *VPKGKVA*WLQLTLK*SEKIICTCTKRHEERWT...SHG*GTRPLTDTKGVP*GVTAAAELTSWPCGHL
## [2]   166 PVV**ISNFL*SLSFFLWFARYKLYFCYTSGYKM...RETLSQKKKKKKKKSPKSL*VCAPPCTSTPTAA
## [3]   166 MCQCGHPNHAPPRGPSPSSAGPPPMSSVVLIITP...GQF*KLFLSKNILFSF*HMKLIH*NKKFHNCQR
## [4]   166 *LFDPLRPYGPK*PSPS*IPPGLWFLLFHLPRCS...PSFGTAPCESG*VPSRLWALV*PLENVRLFA*S
## [5]   166 TCSEVCEPVSGLTPVLIPNRGSCCPVTLGEGVSS...SAQ*DPGGVPRQDHKHIQHPALAQGVGHRGGAL

3.2 Annotation Maps (org.db)

The annotation databases within R are collectively known as AnnotationDb objects. They can all be queried using the keys(), keytypes(), columns() and select() functions. The org.db packages contain mappings between different gene identifiers and annotations for specific organsims.

## Load the human org.db object
library(org.Hs.eg.db)

## The columns() function allows you to see the annotations that can be extracted
columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

## The keytypes() are the columns that can be searched (in this case all of them)
keytypes(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

To search for annotations we first need a set of ‘keys’. Here we will use the protein coding Ensembl gene identifiers from our GRanges object hg19.pc.genes.

## Choose a set of "keys" or search terms
keys<-hg19.pc.genes$name

We can then use the select function to extract information from the annotationDB package.

## Use the select() function to extract the EntrezID, RefSeq ID and gene symbols for all of these Ensembl gene identifiers
anno<-select(org.Hs.eg.db,columns=c("ENSEMBL","ENTREZID","SYMBOL","UNIPROT"),keys=keys,keytype="ENSEMBL")
anno

There are also other annotation packages, such as GO.db which contains a set of annotation maps describing the entire Gene Ontology, that we can probe in the same manner as above.

3.3 Gene Models (TxDB)

We can import gene models from packages like TxDB which contain a database of transcript structures.

Load the TxDb.Hsapiens.UCSC.hg19.knownGene library and assign it to an object called txdb.

## Load the Human hg19 transcript database
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb<-TxDb.Hsapiens.UCSC.hg19.knownGene
## See which columns can be returned
columns(txdb)

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
## [21] "TXTYPE"

## See which columns can be searched
keytypes(txdb)

## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"

## Select columns from the first 10 entries in txdb
select(txdb,keys = head(keys(txdb)),
       columns=c('GENEID', 'TXCHROM', 'TXSTART', 'TXEND', 'TXID'),
       keytype="GENEID")

The org.db and TxDB databases are all centred upon Entrez gene IDs. However, we commonly use gene annotations from the Ensembl database. It is possible to make your own txdb objects.

## Fetch gene models from the Ensembl server
#txdb.ens<-makeTxDbFromEnsembl(organism = "Homo sapiens",release=74)

## We don't run this command in this workshop as it will require additional installation of RMaria database on your computer.

## Create a txdb from a GFF file
txdb.gff<-makeTxDbFromGFF(file = "http://bifx-core.bio.ed.ac.uk/genomes/human/hg19/annotation/Homo_sapiens.GRCh37.74.edited.gtf",format = "gtf",organism = "Homo sapiens")

columns(txdb.gff)

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"  
##  [6] "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"    
## [11] "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"    
## [16] "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"   
## [21] "TXSTRAND"   "TXTYPE"

##Let's look at the first few entries
select(txdb.gff,keys = head(keys(txdb.gff)),
       columns=c('GENEID', 'TXCHROM', 'TXSTART', 'TXEND', 'TXNAME'),
       keytype="GENEID")

Further Learning

There are several other methods to create and interact with txdb objects in the GenomicFeatures package including from GRanges objects or by connecting directly to the Ensembl, UCSC or BioMart databases.

There is a good tutorial here.

3.4 Fetching online data (biomaRt)

We can also fetch data housed in online databases using the biomaRt package. Biomart is a web interface for downloading data from sites such as Ensembl. You can play around with the web version here.

Load the biomaRt library and set it to use the ensembl biomart.

library(biomaRt)

## Let's use the Ensembl bioMart.
ensembl<-useMart("ensembl")
## Every species is a different dataset. Let's list the options.
listDatasets(ensembl)

You will see that there are a lot of Ensembl databases. We will specifically select the human database.

## Select the human dataset within the Ensembl biomart
ensembl<-useMart("ensembl",dataset="hsapiens_gene_ensembl")

By default useMart will always select the most recent release of the Ensembl database and therefore the current genome assembly, in this case hg38. Because we have so far used data from the previous assembly (hg19/hg37) we must select an archived version of Ensembl. The archived versions are listed here.

##use an archived version of biomart
hg19.ensembl <- useMart(biomart="ensembl",host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")

We can build a biomart query with attributes, filters and values similar to the AnnotationDB objects columns, keytypes and keys used above.

##List the attributes and filters of the human dataset
listAttributes(hg19.ensembl)

listFilters(hg19.ensembl)

We can query biomart using getBM. Let’s extract annotations for the gene MeCP2.

## Create biomart query for MeCP2
getBM(attributes =c("chromosome_name","start_position","end_position","hgnc_symbol","ensembl_gene_id","strand"),
      filters = "ensembl_gene_id",
      values = "ENSG00000169057", 
      mart = hg19.ensembl)

If we run the same query using the current version of Ensembl you can see that the co-ordinates differ. This why it’s important to ensure you are always working with a consistent reference assembly!

## Run the query with the current ensembl hg38 database
getBM(attributes =c("chromosome_name","start_position","end_position","hgnc_symbol","ensembl_gene_id","strand"),
      filters = "ensembl_gene_id",
      values = "ENSG00000169057", 
      mart = ensembl)

3.5 Annotation hubs (AnnotationHub)

AnnotationHub is another source of data within R. You can use the annotation hub to retrieve R objects and text files for multiple datasets and projects including Ensembl, UCSC, Broad Institute and the Epigenetics Roadmap project. You can even create your own AnnotationHub to provide access to your datasets.

Load the AnnotationHub library and create a hub connection called ah. It may ask you to create a cache folder in your home directory to store files locally. This will speed up access to this data in future.

library(AnnotationHub)
ah = AnnotationHub()
##View all of the data providers within AnnotationHub 
unique(ah$dataprovider)

##  [1] "UCSC"                                                                                                      
##  [2] "Ensembl"                                                                                                   
##  [3] "RefNet"                                                                                                    
##  [4] "Inparanoid8"                                                                                               
##  [5] "NHLBI"                                                                                                     
##  [6] "ChEA"                                                                                                      
##  [7] "Pazar"                                                                                                     
##  [8] "NIH Pathway Interaction Database"                                                                          
##  [9] "Haemcode"                                                                                                  
## [10] "BroadInstitute"                                                                                            
## [11] "PRIDE"                                                                                                     
## [12] "Gencode"                                                                                                   
## [13] "CRIBI"                                                                                                     
## [14] "Genoscope"                                                                                                 
## [15] "MISO, VAST-TOOLS, UCSC"                                                                                    
## [16] "UWashington"                                                                                               
## [17] "Stanford"                                                                                                  
## [18] "dbSNP"                                                                                                     
## [19] "BioMart"                                                                                                   
## [20] "GeneOntology"                                                                                              
## [21] "KEGG"                                                                                                      
## [22] "URGI"                                                                                                      
## [23] "EMBL-EBI"                                                                                                  
## [24] "MicrosporidiaDB"                                                                                           
## [25] "FungiDB"                                                                                                   
## [26] "TriTrypDB"                                                                                                 
## [27] "ToxoDB"                                                                                                    
## [28] "AmoebaDB"                                                                                                  
## [29] "PlasmoDB"                                                                                                  
## [30] "PiroplasmaDB"                                                                                              
## [31] "CryptoDB"                                                                                                  
## [32] "TrichDB"                                                                                                   
## [33] "GiardiaDB"                                                                                                 
## [34] "The Gene Ontology Consortium"                                                                              
## [35] "ENCODE Project"                                                                                            
## [36] "SchistoDB"                                                                                                 
## [37] "NCBI/UniProt"                                                                                              
## [38] "GENCODE"                                                                                                   
## [39] "http://www.pantherdb.org"                                                                                  
## [40] "RMBase v2.0"                                                                                               
## [41] "snoRNAdb"                                                                                                  
## [42] "tRNAdb"                                                                                                    
## [43] "NCBI"                                                                                                      
## [44] "DrugAge, DrugBank, Broad Institute"                                                                        
## [45] "DrugAge"                                                                                                   
## [46] "DrugBank"                                                                                                  
## [47] "Broad Institute"                                                                                           
## [48] "HMDB, EMBL-EBI, EPA"                                                                                       
## [49] "STRING"                                                                                                    
## [50] "OMA"                                                                                                       
## [51] "OrthoDB"                                                                                                   
## [52] "PathBank"                                                                                                  
## [53] "EBI/EMBL"                                                                                                  
## [54] "NCBI,DBCLS"                                                                                                
## [55] "FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,ENSEMBL,CELLPHONEDB,BADERLAB,SINGLECELLSIGNALR,HOMOLOGENE"
## [56] "WikiPathways"                                                                                              
## [57] "UCSC Jaspar"                                                                                               
## [58] "VAST-TOOLS"                                                                                                
## [59] "pyGenomeTracks "                                                                                           
## [60] "NA"                                                                                                        
## [61] "UoE"                                                                                                       
## [62] "mitra.stanford.edu/kundaje/akundaje/release/blacklists/"                                                   
## [63] "ENCODE"                                                                                                    
## [64] "TargetScan,miRTarBase,USCS,ENSEMBL"                                                                        
## [65] "TargetScan"                                                                                                
## [66] "QuickGO"                                                                                                   
## [67] "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"

We can use collections of search terms to query the annotation hub. Let’s find the Ensembl hg19 gene annotations first.

query(ah,c("GRCh37"))[1:10]

## AnnotationHub with 10 records
## # snapshotDate(): 2022-04-25
## # $dataprovider: Ensembl, dbSNP
## # $species: Homo sapiens
## # $rdataclass: GRanges, VcfFile
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH7558"]]' 
## 
##             title                              
##   AH7558  | Homo_sapiens.GRCh37.70.gtf         
##   AH7619  | Homo_sapiens.GRCh37.69.gtf         
##   AH7666  | Homo_sapiens.GRCh37.71.gtf         
##   AH7726  | Homo_sapiens.GRCh37.72.gtf         
##   AH7790  | Homo_sapiens.GRCh37.73.gtf         
##   AH8753  | Homo_sapiens.GRCh37.74.gtf         
##   AH10684 | Homo_sapiens.GRCh37.75.gtf         
##   AH57956 | clinvar_20160203.vcf.gz            
##   AH57957 | clinvar_20160203_papu.vcf.gz       
##   AH57958 | common_and_clinical_20160203.vcf.gz

Let’s get version 74 of the Ensembl database as a GRanges object.

Ensembl.hg19.74.gr<-ah[["AH8753"]]

If we wanted to, we could create a txdb object with makeTxDbFromGRanges. Let’s download a different dataset from the annotation hub.

The EpigenomeRoadMap data is hosted online and can be searched here. The data is also provided by the Broad Institute as an annotation hub.

## Find H3K4me3 peak files from the EpigenomeRoadMap datasets for experiment E010 (Cultured neuronal cells).
epiFiles <- query(ah, c("EpigenomeRoadMap","E010","narrowPeak","H3K4me3"))
epiFiles

## AnnotationHub with 1 record
## # snapshotDate(): 2022-04-25
## # names(): AH30032
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2015-05-06
## # $title: E010-H3K4me3.narrowPeak.gz
## # $description: Narrow ChIP-seq peaks for consolidated epigenomes from Epige...
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: BED
## # $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidat...
## # $sourcesize: 721817
## # $tags: c("EpigenomeRoadMap", "peaks", "consolidated", "narrowPeak",
## #   "E010", "ES-deriv", "ESDR.H9.NEUR", "H9 Derived Neuron Cultured
## #   Cells") 
## # retrieve record with 'object[["AH30032"]]'

There is a single hit for this search with the name AH30032. Let’s import the H3K4me3 ChIP-seq peaks as GRanges.

##These files will be imported as GRanges objects
H3K4me3<-epiFiles[["AH30032"]]
H3K4me3

## GRanges object with 37783 ranges and 6 metadata columns:
##           seqnames              ranges strand |        name     score
##              <Rle>           <IRanges>  <Rle> | <character> <numeric>
##       [1]     chr2 200775200-200777693      * |      Rank_1      2708
##       [2]    chr10   30636343-30638665      * |      Rank_2      2681
##       [3]     chr4 142051175-142055073      * |      Rank_3      2660
##       [4]    chr14   71065888-71067491      * |      Rank_4      2634
##       [5]    chr21   30671163-30673514      * |      Rank_5      2590
##       ...      ...                 ...    ... .         ...       ...
##   [37779]     chr8   71156568-71156757      * |  Rank_37779        20
##   [37780]     chrX     9868663-9868855      * |  Rank_37780        20
##   [37781]     chr2 171570785-171570924      * |  Rank_37781        20
##   [37782]     chr1 162666379-162666679      * |  Rank_37782        20
##   [37783]     chr7   27170047-27170196      * |  Rank_37783        20
##           signalValue    pValue    qValue      peak
##             <numeric> <numeric> <numeric> <numeric>
##       [1]     65.7391   270.837   261.347      1370
##       [2]     58.6667   268.179   259.533      1607
##       [3]     59.5730   266.007   257.693      1972
##       [4]     60.2837   263.434   255.692      1321
##       [5]     57.9809   259.074   251.621       769
##       ...         ...       ...       ...       ...
##   [37779]     1.85714   2.05416   0.20999         2
##   [37780]     2.13285   2.04594   0.20241        83
##   [37781]     2.18182   2.03936   0.19594        26
##   [37782]     1.66667   2.03201   0.18894        50
##   [37783]     2.15983   2.00804   0.16532        82
##   -------
##   seqinfo: 298 sequences (2 circular) from hg19 genome

Challenge:

Can you get the sequences of the 10 longest exons in the Mouse Ensembl gene models?

Hints:

There is a gtf file from Ensembl available at the URL below, or you could try to fetch the gene models using AnnotationHub. “http://bifx-core.bio.ed.ac.ukgenomes/mouse/mm10/annotation/Mus_musculus.GRCm38.78.edited.gtf”
Use the exons function to extract exons from a txdb object.
Reorder the exons by width with the order function.
Load the mm10 BSgenome library and get the reference sequence using the getSeq function.
Use the writeXStringSet function to export sequences in fasta format.

Solution:

## create a txdb object from the gtf file downloaded from Ensembl
txdb.mm10<-makeTxDbFromGFF(file = "http://bifx-core.bio.ed.ac.uk/genomes/mouse/mm10/annotation/Mus_musculus.GRCm38.78.edited.gtf")
## Get exons as GRanges
exons<-exons(txdb.mm10)
## Sort by width
exons.sorted<-exons[order(width(exons),decreasing=T),]
## Select the top 10
top.exons<-exons.sorted[1:10,]
## Set the GRanges names to the exon_id
names(top.exons)<-top.exons$exon_id
## Get the mm10 sequence
library(BSgenome.Mmusculus.UCSC.mm10)
mm10<-BSgenome.Mmusculus.UCSC.mm10
## Fetch exon sequences
top.exons.seq<-getSeq(mm10,top.exons)
## Output sequences as a fasta file
writeXStringSet(top.exons.seq, 'long_exons.fa')
top.exons.seq

## DNAStringSet object of length 10:
##      width seq                                              names               
##  [1] 84395 AGGAGGAACAGTTGCCTCAGCAC...GGAAAATCATTTTTGCAGCTAC 202265
##  [2] 74456 GGCCAAAAAGCAAGCTTAATTAG...AAAATCTATTCCTCTGATTCTT 93241
##  [3] 52270 GTACTTGGACTAGATGGCAGTTC...AAAAGCCATGAGGAAAAAGATA 313921
##  [4] 40044 GCGTGTGCAGCCTGGTGCCTAAG...ACTCATTAAAAAACATTTTAAC 23720
##  [5] 24280 GTGAGAAGCCCCACCAGTGCAGC...GTTTGAATATGAAGGAATAATT 355534
##  [6] 20771 AGGAGTTAGTGACAAGGAGGGCT...ACGTGAGCAGGAAAAAGCAAAA 398118
##  [7] 20717 GGTATCTACAGCTGTATCCACGG...ACCTACCACAAATTATCAGGGT 169422
##  [8] 17497 AGCGGGGATGATGAGGACTACCA...ATAAAGCTCTTTATTAACCCTG 393364
##  [9] 17485 GTAGAAGAGTCAGCAATGATGGG...AATAAAACGTTTAATACTCTTA 28144
## [10] 17106 ATCCCCCAGGACCACCTTCAAAT...AAGGAAGGAGTCAAAATAACAG 60509

Key points

Bioconductor has specialised packages for working with genomics data in R.
Genomic sequences, annotations and datasets can be imported by various means.
It is possible to create your own annotation objects.

Genomic annotation and sequences

Learning Objectives

3. Fetching data from annotation packages and external databases

3.1 Genomic sequences (BSgenome)

Challenge:

Solution:

Solution:

3.2 Annotation Maps (org.db)

3.3 Gene Models (TxDB)

Further Learning

3.4 Fetching online data (biomaRt)

3.5 Annotation hubs (AnnotationHub)

Challenge:

Solution:

Key points