There are several annotation packages within Bioconductor with pre-built databases that are ready to use. These include gene annotations, transcript models and genome sequences, as well as micro-array probe annotations and functional annotations like KEGG (pathway maps) and GO (gene ontology terms).
First, let’s make sure you have the following datasets from the GenomicRanges lesson:
library(GenomicRanges)
library(genomation)
options(timeout=600) ## This will increase the time for downloading large files
gr2<-GRanges(seqnames = paste0("chr",seq(2,10,2)),
ranges = IRanges(start = sample(1000000,5) ,
width = rep(500,5)),strand = c(rep("+",3),rep("-",2)),names=seq(1,5))
hg19.genes<-readGeneric("http://bifx-core.bio.ed.ac.uk/genomes/human/hg19/annotation/Ensembl.GRCh37.74.edited.genes.bed",chr = 1,start = 2,end = 3,strand = 6,meta.cols = list(name=4,symbol=5,biotype=7))
hg19.pc.genes<-subset(hg19.genes,biotype=="protein_coding")
BSgenome packages contain genome reference sequences. We can use these to fetch the underlying sequences in GRanges objects and in conjunction with other functions that require sequence information e.g. motif discovery, calculating GC content etc.
Load the BSgenome library for hg19 and assign it to an object hs.
## Load the hg19 BSgenome object
library(BSgenome.Hsapiens.UCSC.hg19)
hs<-BSgenome.Hsapiens.UCSC.hg19
The Biostrings package is automatically loaded along with any BSgenome and contains functions to work with genomic sequences.
## Fetch sequences for the genomic intervals in GRanges object gr2
gr2.seq<-getSeq(hs,gr2)
gr2.seq
## DNAStringSet object of length 5:
## width seq
## [1] 500 TAAGTACCTAAGGGTAAAGTGGCCTGATGGCTCC...AGAGCTGACAAGCTGGCCCTGTGGGCATTTATT
## [2] 500 CCTGTTGTCTAGTGAATTAGTAACTTTCTTTAAT...CCCTCCCTGCACAAGCACCCCCACCGCGGCTCT
## [3] 500 ATGTGCCAGTGCGGCCACCCAAACCATGCCCCTC...TTAGAATAAAAAATTCCACAATTGTCAAAGAAG
## [4] 500 TAATTGTTTGATCCCCTTCGCCCATATGGCCCAA...CCTTGAAAATGTTAGGTTGTTTGCTTGATCACA
## [5] 500 ACCTGCAGCGAGGTCTGTGAACCCGTCTCAGGTT...CCAAGGTGTCGGACACCGTGGTGGAGCCCTACA
## Calculate the dinucleotide frequencies for each sequence
dinucleotideFrequency(gr2.seq)
## AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
## [1,] 75 39 45 27 50 20 3 26 37 22 34 24 24 18 35 20
## [2,] 61 23 35 20 29 34 19 41 23 33 33 29 26 32 31 30
## [3,] 44 24 28 35 39 38 7 35 20 20 18 35 27 37 41 51
## [4,] 18 25 20 25 36 67 9 49 18 24 28 33 17 45 46 39
## [5,] 17 35 31 14 32 36 19 45 35 29 59 36 13 32 50 16
All BSgenome databases contain a seqinfo object that describes the genome. We can add this information to any of our GRanges objects.
seqinfo(hs)
## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
## seqnames seqlengths isCircular genome
## chr1 249250621 FALSE hg19
## chr2 243199373 FALSE hg19
## chr3 198022430 FALSE hg19
## chr4 191154276 FALSE hg19
## chr5 180915260 FALSE hg19
## ... ... ... ...
## chr21_gl383580_alt 74652 FALSE hg19
## chr21_gl383581_alt 116690 FALSE hg19
## chr22_gl383582_alt 162811 FALSE hg19
## chr22_gl383583_alt 96924 FALSE hg19
## chr22_kb663609_alt 74013 FALSE hg19
## Let's update our human protein coding genes GRanges object
seqinfo(hg19.pc.genes) <- seqinfo(hs)
hg19.pc.genes
## GRanges object with 20327 ranges and 3 metadata columns:
## seqnames ranges strand | name symbol
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr15 20737094-20747114 - | ENSG00000215405 GOLGA6L6
## [2] chr15 21004687-21005367 + | ENSG00000268343 AC012414.1
## [3] chr15 21040701-21071643 - | ENSG00000230031 POTEB2
## [4] chr15 49280673-49338760 - | ENSG00000138593 SECISBP2L
## [5] chr15 22011370-22012050 + | ENSG00000268531 DKFZP547L112
## ... ... ... ... . ... ...
## [20323] chr17 43224684-43229468 + | ENSG00000186834 HEXIM1
## [20324] chr17 43238067-43247407 + | ENSG00000168517 HEXIM2
## [20325] chr17 43298811-43324687 + | ENSG00000184922 FMNL1
## [20326] chr10 22823778-23003484 - | ENSG00000150867 PIP4K2A
## [20327] chr2 61704984-61765761 - | ENSG00000082898 XPO1
## biotype
## <character>
## [1] protein_coding
## [2] protein_coding
## [3] protein_coding
## [4] protein_coding
## [5] protein_coding
## ... ...
## [20323] protein_coding
## [20324] protein_coding
## [20325] protein_coding
## [20326] protein_coding
## [20327] protein_coding
## -------
## seqinfo: 298 sequences (2 circular) from hg19 genome
## Print out seqinfo
seqinfo(hg19.pc.genes)
## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
## seqnames seqlengths isCircular genome
## chr1 249250621 FALSE hg19
## chr2 243199373 FALSE hg19
## chr3 198022430 FALSE hg19
## chr4 191154276 FALSE hg19
## chr5 180915260 FALSE hg19
## ... ... ... ...
## chr21_gl383580_alt 74652 FALSE hg19
## chr21_gl383581_alt 116690 FALSE hg19
## chr22_gl383582_alt 162811 FALSE hg19
## chr22_gl383583_alt 96924 FALSE hg19
## chr22_kb663609_alt 74013 FALSE hg19
letterFrequency(gr2.seq,letters = "GC",as.prob = T)
## G|C
## [1,] 0.432
## [2,] 0.482
## [3,] 0.426
## [4,] 0.528
## [5,] 0.582
translate(gr2.seq)
## AAStringSet object of length 5:
## width seq
## [1] 166 *VPKGKVA*WLQLTLK*SEKIICTCTKRHEERWT...SHG*GTRPLTDTKGVP*GVTAAAELTSWPCGHL
## [2] 166 PVV**ISNFL*SLSFFLWFARYKLYFCYTSGYKM...RETLSQKKKKKKKKSPKSL*VCAPPCTSTPTAA
## [3] 166 MCQCGHPNHAPPRGPSPSSAGPPPMSSVVLIITP...GQF*KLFLSKNILFSF*HMKLIH*NKKFHNCQR
## [4] 166 *LFDPLRPYGPK*PSPS*IPPGLWFLLFHLPRCS...PSFGTAPCESG*VPSRLWALV*PLENVRLFA*S
## [5] 166 TCSEVCEPVSGLTPVLIPNRGSCCPVTLGEGVSS...SAQ*DPGGVPRQDHKHIQHPALAQGVGHRGGAL
Because the gr2 regions are randomly selected they don’t really represent transcripts. You may end up with an error if there are Ns in your sequences. You can use the if.fuzzy.codon argument to solve this.
translate(gr2.seq,if.fuzzy.codon = "solve")
## AAStringSet object of length 5:
## width seq
## [1] 166 *VPKGKVA*WLQLTLK*SEKIICTCTKRHEERWT...SHG*GTRPLTDTKGVP*GVTAAAELTSWPCGHL
## [2] 166 PVV**ISNFL*SLSFFLWFARYKLYFCYTSGYKM...RETLSQKKKKKKKKSPKSL*VCAPPCTSTPTAA
## [3] 166 MCQCGHPNHAPPRGPSPSSAGPPPMSSVVLIITP...GQF*KLFLSKNILFSF*HMKLIH*NKKFHNCQR
## [4] 166 *LFDPLRPYGPK*PSPS*IPPGLWFLLFHLPRCS...PSFGTAPCESG*VPSRLWALV*PLENVRLFA*S
## [5] 166 TCSEVCEPVSGLTPVLIPNRGSCCPVTLGEGVSS...SAQ*DPGGVPRQDHKHIQHPALAQGVGHRGGAL
The annotation databases within R are collectively known as AnnotationDb objects. They can all be queried using the keys(), keytypes(), columns() and select() functions. The org.db packages contain mappings between different gene identifiers and annotations for specific organsims.
## Load the human org.db object
library(org.Hs.eg.db)
## The columns() function allows you to see the annotations that can be extracted
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
## The keytypes() are the columns that can be searched (in this case all of them)
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
To search for annotations we first need a set of ‘keys’. Here we will use the protein coding Ensembl gene identifiers from our GRanges object hg19.pc.genes.
## Choose a set of "keys" or search terms
keys<-hg19.pc.genes$name
We can then use the select function to extract information from the annotationDB package.
## Use the select() function to extract the EntrezID, RefSeq ID and gene symbols for all of these Ensembl gene identifiers
anno<-select(org.Hs.eg.db,columns=c("ENSEMBL","ENTREZID","SYMBOL","UNIPROT"),keys=keys,keytype="ENSEMBL")
anno
There are also other annotation packages, such as GO.db which contains a set of annotation maps describing the entire Gene Ontology, that we can probe in the same manner as above.
We can import gene models from packages like TxDB which contain a database of transcript structures.
Load the TxDb.Hsapiens.UCSC.hg19.knownGene library and assign it to an object called txdb.
## Load the Human hg19 transcript database
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb<-TxDb.Hsapiens.UCSC.hg19.knownGene
## See which columns can be returned
columns(txdb)
## [1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSSTART"
## [6] "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID" "EXONNAME"
## [11] "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID" "TXCHROM"
## [16] "TXEND" "TXID" "TXNAME" "TXSTART" "TXSTRAND"
## [21] "TXTYPE"
## See which columns can be searched
keytypes(txdb)
## [1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID" "TXNAME"
## Select columns from the first 10 entries in txdb
select(txdb,keys = head(keys(txdb)),
columns=c('GENEID', 'TXCHROM', 'TXSTART', 'TXEND', 'TXID'),
keytype="GENEID")
The org.db and TxDB databases are all centred upon Entrez gene IDs. However, we commonly use gene annotations from the Ensembl database. It is possible to make your own txdb objects.
## Fetch gene models from the Ensembl server
#txdb.ens<-makeTxDbFromEnsembl(organism = "Homo sapiens",release=74)
## We don't run this command in this workshop as it will require additional installation of RMaria database on your computer.
## Create a txdb from a GFF file
txdb.gff<-makeTxDbFromGFF(file = "http://bifx-core.bio.ed.ac.uk/genomes/human/hg19/annotation/Homo_sapiens.GRCh37.74.edited.gtf",format = "gtf",organism = "Homo sapiens")
columns(txdb.gff)
## [1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSPHASE"
## [6] "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID"
## [11] "EXONNAME" "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID"
## [16] "TXCHROM" "TXEND" "TXID" "TXNAME" "TXSTART"
## [21] "TXSTRAND" "TXTYPE"
##Let's look at the first few entries
select(txdb.gff,keys = head(keys(txdb.gff)),
columns=c('GENEID', 'TXCHROM', 'TXSTART', 'TXEND', 'TXNAME'),
keytype="GENEID")
There are several other methods to create and interact with txdb objects in the GenomicFeatures package including from GRanges objects or by connecting directly to the Ensembl, UCSC or BioMart databases.
There is a good tutorial here.
We can also fetch data housed in online databases using the biomaRt package. Biomart is a web interface for downloading data from sites such as Ensembl. You can play around with the web version here.
Load the biomaRt library and set it to use the ensembl biomart.
library(biomaRt)
## Let's use the Ensembl bioMart.
ensembl<-useMart("ensembl")
## Every species is a different dataset. Let's list the options.
listDatasets(ensembl)
You will see that there are a lot of Ensembl databases. We will specifically select the human database.
## Select the human dataset within the Ensembl biomart
ensembl<-useMart("ensembl",dataset="hsapiens_gene_ensembl")
By default useMart will always select the most recent release of the Ensembl database and therefore the current genome assembly, in this case hg38. Because we have so far used data from the previous assembly (hg19/hg37) we must select an archived version of Ensembl. The archived versions are listed here.
##use an archived version of biomart
hg19.ensembl <- useMart(biomart="ensembl",host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
We can build a biomart query with attributes, filters and values similar to the AnnotationDB objects columns, keytypes and keys used above.
##List the attributes and filters of the human dataset
listAttributes(hg19.ensembl)
listFilters(hg19.ensembl)
We can query biomart using getBM. Let’s extract annotations for the gene MeCP2.
## Create biomart query for MeCP2
getBM(attributes =c("chromosome_name","start_position","end_position","hgnc_symbol","ensembl_gene_id","strand"),
filters = "ensembl_gene_id",
values = "ENSG00000169057",
mart = hg19.ensembl)
If we run the same query using the current version of Ensembl you can see that the co-ordinates differ. This why it’s important to ensure you are always working with a consistent reference assembly!
## Run the query with the current ensembl hg38 database
getBM(attributes =c("chromosome_name","start_position","end_position","hgnc_symbol","ensembl_gene_id","strand"),
filters = "ensembl_gene_id",
values = "ENSG00000169057",
mart = ensembl)
AnnotationHub is another source of data within R. You can use the annotation hub to retrieve R objects and text files for multiple datasets and projects including Ensembl, UCSC, Broad Institute and the Epigenetics Roadmap project. You can even create your own AnnotationHub to provide access to your datasets.
Load the AnnotationHub library and create a hub connection called ah. It may ask you to create a cache folder in your home directory to store files locally. This will speed up access to this data in future.
library(AnnotationHub)
ah = AnnotationHub()
##View all of the data providers within AnnotationHub
unique(ah$dataprovider)
## [1] "UCSC"
## [2] "Ensembl"
## [3] "RefNet"
## [4] "Inparanoid8"
## [5] "NHLBI"
## [6] "ChEA"
## [7] "Pazar"
## [8] "NIH Pathway Interaction Database"
## [9] "Haemcode"
## [10] "BroadInstitute"
## [11] "PRIDE"
## [12] "Gencode"
## [13] "CRIBI"
## [14] "Genoscope"
## [15] "MISO, VAST-TOOLS, UCSC"
## [16] "UWashington"
## [17] "Stanford"
## [18] "dbSNP"
## [19] "BioMart"
## [20] "GeneOntology"
## [21] "KEGG"
## [22] "URGI"
## [23] "EMBL-EBI"
## [24] "MicrosporidiaDB"
## [25] "FungiDB"
## [26] "TriTrypDB"
## [27] "ToxoDB"
## [28] "AmoebaDB"
## [29] "PlasmoDB"
## [30] "PiroplasmaDB"
## [31] "CryptoDB"
## [32] "TrichDB"
## [33] "GiardiaDB"
## [34] "The Gene Ontology Consortium"
## [35] "ENCODE Project"
## [36] "SchistoDB"
## [37] "NCBI/UniProt"
## [38] "GENCODE"
## [39] "http://www.pantherdb.org"
## [40] "RMBase v2.0"
## [41] "snoRNAdb"
## [42] "tRNAdb"
## [43] "NCBI"
## [44] "DrugAge, DrugBank, Broad Institute"
## [45] "DrugAge"
## [46] "DrugBank"
## [47] "Broad Institute"
## [48] "HMDB, EMBL-EBI, EPA"
## [49] "STRING"
## [50] "OMA"
## [51] "OrthoDB"
## [52] "PathBank"
## [53] "EBI/EMBL"
## [54] "NCBI,DBCLS"
## [55] "FANTOM5,DLRP,IUPHAR,HPRD,STRING,SWISSPROT,TREMBL,ENSEMBL,CELLPHONEDB,BADERLAB,SINGLECELLSIGNALR,HOMOLOGENE"
## [56] "WikiPathways"
## [57] "UCSC Jaspar"
## [58] "VAST-TOOLS"
## [59] "pyGenomeTracks "
## [60] "NA"
## [61] "UoE"
## [62] "mitra.stanford.edu/kundaje/akundaje/release/blacklists/"
## [63] "ENCODE"
## [64] "TargetScan,miRTarBase,USCS,ENSEMBL"
## [65] "TargetScan"
## [66] "QuickGO"
## [67] "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"
We can use collections of search terms to query the annotation hub. Let’s find the Ensembl hg19 gene annotations first.
query(ah,c("GRCh37"))[1:10]
## AnnotationHub with 10 records
## # snapshotDate(): 2022-04-25
## # $dataprovider: Ensembl, dbSNP
## # $species: Homo sapiens
## # $rdataclass: GRanges, VcfFile
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH7558"]]'
##
## title
## AH7558 | Homo_sapiens.GRCh37.70.gtf
## AH7619 | Homo_sapiens.GRCh37.69.gtf
## AH7666 | Homo_sapiens.GRCh37.71.gtf
## AH7726 | Homo_sapiens.GRCh37.72.gtf
## AH7790 | Homo_sapiens.GRCh37.73.gtf
## AH8753 | Homo_sapiens.GRCh37.74.gtf
## AH10684 | Homo_sapiens.GRCh37.75.gtf
## AH57956 | clinvar_20160203.vcf.gz
## AH57957 | clinvar_20160203_papu.vcf.gz
## AH57958 | common_and_clinical_20160203.vcf.gz
Let’s get version 74 of the Ensembl database as a GRanges object.
Ensembl.hg19.74.gr<-ah[["AH8753"]]
If we wanted to, we could create a txdb object with makeTxDbFromGRanges. Let’s download a different dataset from the annotation hub.
The EpigenomeRoadMap data is hosted online and can be searched here. The data is also provided by the Broad Institute as an annotation hub.
## Find H3K4me3 peak files from the EpigenomeRoadMap datasets for experiment E010 (Cultured neuronal cells).
epiFiles <- query(ah, c("EpigenomeRoadMap","E010","narrowPeak","H3K4me3"))
epiFiles
## AnnotationHub with 1 record
## # snapshotDate(): 2022-04-25
## # names(): AH30032
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2015-05-06
## # $title: E010-H3K4me3.narrowPeak.gz
## # $description: Narrow ChIP-seq peaks for consolidated epigenomes from Epige...
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: BED
## # $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidat...
## # $sourcesize: 721817
## # $tags: c("EpigenomeRoadMap", "peaks", "consolidated", "narrowPeak",
## # "E010", "ES-deriv", "ESDR.H9.NEUR", "H9 Derived Neuron Cultured
## # Cells")
## # retrieve record with 'object[["AH30032"]]'
There is a single hit for this search with the name AH30032. Let’s import the H3K4me3 ChIP-seq peaks as GRanges.
##These files will be imported as GRanges objects
H3K4me3<-epiFiles[["AH30032"]]
H3K4me3
## GRanges object with 37783 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr2 200775200-200777693 * | Rank_1 2708
## [2] chr10 30636343-30638665 * | Rank_2 2681
## [3] chr4 142051175-142055073 * | Rank_3 2660
## [4] chr14 71065888-71067491 * | Rank_4 2634
## [5] chr21 30671163-30673514 * | Rank_5 2590
## ... ... ... ... . ... ...
## [37779] chr8 71156568-71156757 * | Rank_37779 20
## [37780] chrX 9868663-9868855 * | Rank_37780 20
## [37781] chr2 171570785-171570924 * | Rank_37781 20
## [37782] chr1 162666379-162666679 * | Rank_37782 20
## [37783] chr7 27170047-27170196 * | Rank_37783 20
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <numeric>
## [1] 65.7391 270.837 261.347 1370
## [2] 58.6667 268.179 259.533 1607
## [3] 59.5730 266.007 257.693 1972
## [4] 60.2837 263.434 255.692 1321
## [5] 57.9809 259.074 251.621 769
## ... ... ... ... ...
## [37779] 1.85714 2.05416 0.20999 2
## [37780] 2.13285 2.04594 0.20241 83
## [37781] 2.18182 2.03936 0.19594 26
## [37782] 1.66667 2.03201 0.18894 50
## [37783] 2.15983 2.00804 0.16532 82
## -------
## seqinfo: 298 sequences (2 circular) from hg19 genome
Can you get the sequences of the 10 longest exons in the Mouse Ensembl gene models?
Hints:
## create a txdb object from the gtf file downloaded from Ensembl
txdb.mm10<-makeTxDbFromGFF(file = "http://bifx-core.bio.ed.ac.uk/genomes/mouse/mm10/annotation/Mus_musculus.GRCm38.78.edited.gtf")
## Get exons as GRanges
exons<-exons(txdb.mm10)
## Sort by width
exons.sorted<-exons[order(width(exons),decreasing=T),]
## Select the top 10
top.exons<-exons.sorted[1:10,]
## Set the GRanges names to the exon_id
names(top.exons)<-top.exons$exon_id
## Get the mm10 sequence
library(BSgenome.Mmusculus.UCSC.mm10)
mm10<-BSgenome.Mmusculus.UCSC.mm10
## Fetch exon sequences
top.exons.seq<-getSeq(mm10,top.exons)
## Output sequences as a fasta file
writeXStringSet(top.exons.seq, 'long_exons.fa')
top.exons.seq
## DNAStringSet object of length 10:
## width seq names
## [1] 84395 AGGAGGAACAGTTGCCTCAGCAC...GGAAAATCATTTTTGCAGCTAC 202265
## [2] 74456 GGCCAAAAAGCAAGCTTAATTAG...AAAATCTATTCCTCTGATTCTT 93241
## [3] 52270 GTACTTGGACTAGATGGCAGTTC...AAAAGCCATGAGGAAAAAGATA 313921
## [4] 40044 GCGTGTGCAGCCTGGTGCCTAAG...ACTCATTAAAAAACATTTTAAC 23720
## [5] 24280 GTGAGAAGCCCCACCAGTGCAGC...GTTTGAATATGAAGGAATAATT 355534
## [6] 20771 AGGAGTTAGTGACAAGGAGGGCT...ACGTGAGCAGGAAAAAGCAAAA 398118
## [7] 20717 GGTATCTACAGCTGTATCCACGG...ACCTACCACAAATTATCAGGGT 169422
## [8] 17497 AGCGGGGATGATGAGGACTACCA...ATAAAGCTCTTTATTAACCCTG 393364
## [9] 17485 GTAGAAGAGTCAGCAATGATGGG...AATAAAACGTTTAATACTCTTA 28144
## [10] 17106 ATCCCCCAGGACCACCTTCAAAT...AAGGAAGGAGTCAAAATAACAG 60509