[USERNAME]@bifx-core1:~/course$ head -5 bioinformatics_on_the_command_line_files/yeast_genes.bed
chrI 334 649 YAL069W . +
chrI 537 792 YAL068W-A . +
chrI 1806 2169 YAL068C . -
chrI 2479 2707 YAL067W-A . +
chrI 7234 9016 YAL067C . -
[USERNAME]@bifx-core1:~/course$ head -5 bioinformatics_on_the_command_line_files/yeast_genome.fasta
>I
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
[USERNAME]@bifx-core1:~/course$
Analysis
In this section you will learn how to work with common Next Generation Sequencing (NGS) data formats on the command line.
Bioinformatics data formats and tools
Key Points
- Many standard formats for storing high throughput sequencing data take the form of structured text files, which are easy to manipulate using standard GNU utilities (built in Unix commands)
- Many powerful specialist tools for bioinformatics analysis have been developed to use these formats
Many of the most common types of file that you will have to work with as a bioinformatician take the form of structured text files, examples include:
- Standard formats for representing raw sequences
- Tabular formats for representing aligned reads or genome features
Some other compressed formats, such as BAM, which is a compressed version of the SAM format, can easily be converted to human readable text.
Furthermore, numerous specialist bioinformatics tools have been specifically developed for working with these file formats. For example:
- bedtools and bedops, which work with BED files
- Various aligners, such as STAR, hisat2, and others, which take raw sequences in FASTQ or FASTA format and align them to the genome
- samtools, which works with SAM files and BAM files
- BAM files can be viewed in SAM format using the
samtools view
command
- BAM files can be viewed in SAM format using the
Note: These tools are not included in most Linux distributions as standard and typically have to be installed separately.
Working with NGS data using GNU tools
Key Points
- It is possible to perform a wide range of complex analysis tasks on NGS data files using standard GNU utilities (built in Unix commands)
- In this section we illustrate the application of these tools to NGS data
In previous sections, we extracted an archive named bioinformatics_on_the_command_line_files.tar.gz
into the ~/course
directory, creating a directory called ~/course/bioinformatics_on_the_command_line_files
.
This directory contains a file called yeast_genes.bed
, which lists the genomic co-ordinates of 7,126 yeast genes in BED format, and another file called yeast_genome.fasta
, which contains the yeast EF4 genome sequence in FASTA format:
Checking chromosome names in BED and FASTA files
Looking at the above outputs, we can see that the naming convention for the chromosomes in the BED file (shown in the first column) appears to be different to the naming convention for the chromosomes in the FASTA file (shown in the first line after the >
character). In order to confirm this we would like to produce a sorted list of chromosome names in each file and compare them.
The example below demonstrates the following new commands:
cut
, which we use to select specified fields (or columns) with the-f
option, and specified characters with the-c
optionsort
, which sorts the lines it receives from STDIN. The-u
flag tells it to remove duplicate lines from the outputdiff
, which compares two text files, and outputs the differences between them
We can do this as follows:
## Move to the relevant directory
cd ~/course/bioinformatics_on_the_command_line_files
## Select the first column of the bed file, sort entries and remove duplicates. Redirect output to a file.
cut -f1 yeast_genes.bed | sort -u > yeast_genes_bed_chromosomes.list
## Select the sequence ID lines from the fasta file. Sort and remove duplicates and select everything from the second character. Redirect to a file.
grep '^>' yeast_genome.fasta | sort -u | cut -c2- > yeast_genome_fasta_chromosomes.list
## Check if both files have the same number of chromosomes
wc -l *_chromosomes.list
## Check for differences in the files
diff yeast_genes_bed_chromosomes.list yeast_genome_fasta_chromosomes.list
We can see that there is a mismatch between the chromosome names in the BED and FASTA files. Each chromosome name in the BED file is equivalent to the corresponding name in the FASTA file with ‘chr’ added to the start.
Before using these files in a bioinformatics analysis, we need to update one of them so that the names match. In the following example we fix the BED file by removing ‘chr’ from the start of each line.
The example below uses the sed
command, which is used to perform a search and replace style substitution on each line in the BED file using a regular expression. Here, as in the previous examples using grep
, the ‘^’ character is a regular expression character representing the start of the line. We then confirm that the chromosomes are now the same using the diff
command. This produces no output, which means that there are no differences between the chromosome lists.
## Remove "chr" from the start of every line in the bed file
sed 's/^chr//' yeast_genes.bed > yeast_genes.fixed.bed
## Extract the chromsome names
cut -f1 yeast_genes.fixed.bed | sort -u > yeast_genes_fixed_bed_chromosomes.list
## Check for differences
diff yeast_genes_fixed_bed_chromosomes.list yeast_genome_fasta_chromosomes.list
## Remove the temporary list files
rm -i *.list
Manipulating BED files with awk
and sort
Many standard formats for representing NGS data take the form of tabular files, in which each line contains a number of fields, separated by a particular character (generally a tab character). The yeast_genes.fixed.bed
file generated in the previous example fits this pattern.
This example demonstrates how to use awk
and sort
to find the name and length of the longest gene on chromosome ‘I’ in the yeast_genes.fixed.bed
file:
## Use the unix programming language awk for text manipulation
awk -F'\t' -v OFS='\t' '$1=="I" {print $4,$3-$2}' yeast_genes.fixed.bed | sort -k2,2nr | head -1
The output should be YAR050W 4614
.
The above example showcases the power of awk
in dealing with tabular data. The command awk -F'\t' -v OFS='\t' '$1=="I" {print $4,$3-$2}'
can be deconstructed as follows:
-F'\t'
tellsawk
that the field separator in the input lines is the tab character (\t
)-v OFS='\t'
tellsawk
that the tab character should also be used to separate the fields in the output file'$1=="I" {print $4,$3-$2}'
is anawk
program consisting of a single line. Each line of anawk
program is a pair with the structure condition {action}, and the program is run on each line of the input. In this example:- The condition is
$1=="I"
, which tellsawk
that the action should be performed if the first field is equal to “I” - The action is
print $4,$3-$2
, which tellsawk
to output a line to STDOUT in which the first field is the 4th field of the input line, and the second field is the result of subtracting the 2nd field from the 3rd. In this case the result is the gene length
- The condition is
Note: A good resource to learn more about awk
is Effective AWK Programming.
The sort
command in the above example includes the option -k2,2nr
. -k
tells sort
to sort by a key, which comprises a start and stop column number, followed by two options, n
, which tells sort
that the column contains numbers, and r
, which tells sort
that the lines should be sorted in reverse (i.e. descending) order.
Challenge:
You could also remove ‘chr’ from the start of each line of yeast_genes.bed
by removing the first three characters of every line. See if you can achieve this with the cut
command.
Solution
Solution.
Solution:
Since we know that ‘chr’ is at the start of every line, cut -c 4- yeast_genes.bed
would also work.
Challenge:
How would you find the shortest gene on the plus strand of chromosome ‘II’ in yeast_genes.fixed.bed
using awk
?
Solution
Solution.
Solution:
You could run awk -F'\t' -v OFS='\t' '$1=="II"&&$6=="+" {print $4,$3-$2}' yeast_genes.fixed.bed | sort -k2,2n | head -1
Case study: a simple RNA-Seq analysis workflow
Key Points
- It is possible to perform a simple bioinformatics analysis from end to end using only the bash command line
- This section presents a simple case study using the
STAR
aligner andbedtools
In this section, we will work through a simple pipeline for analysing RNA-Seq data, which involves the following steps:
- Start with unaligned reads in FASTQ format
- Align the reads against an index generated from a target genome, whose sequence is stored in FASTA format, obtaining an output file in BAM format
- Here we use the yeast EF4 genome, and align the reads using STAR
- Compute the genome coverage of the aligned reads, obtaining an output file in bedGraph format, which can then be viewed in a genome browser
- Here we generate the bedGraph file directly from the BED file using bedtools
- Count the overlaps between the aligned reads and genomic features, stored in BED format, obtaining an output file in BED format, and find the genes with the largest number of overlapping reads
- Here we use bedtools to compute the intersection between the genes and the reference, and use standard GNU utilities to find the genes with the most hits
This analysis uses the files in the ‘bioinformatics_on_the_command_line_files’ directory that we have already been working with, and can be performed as follows:
## Move to the course directory
cd ~/course
## Make a new folder called analysis and move into it
mkdir analysis
cd analysis
## Make a new folder 00_source_data and move into it
mkdir 00_source_data
cd 00_source_data
## Link the yeast fastq file to this folder
ln -s ../../bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq
## Move back to the analysis folder and inspect the directory
cd ..
tree
## Make a directory for the STAR aligner index files and link the yeast fasta file
mkdir 01_star_index
cd 01_star_index
ln -s ../../bioinformatics_on_the_command_line_files/yeast_genome.fasta
## Run STAR genomeGenerate to create a mapping index
STAR --runThreadN 5 --runMode genomeGenerate --genomeDir . --genomeFastaFiles ./yeast_genome.fasta --genomeSAindexNbases 10
cd ..
tree
## Make a directory for the alignment outputs
mkdir 02_aligned_reads
cd 02_aligned_reads
## Run STAR to map the reads to the yeast index
STAR --genomeDir ../01_star_index/ --readFilesIn ../00_source_data/raw_yeast_rnaseq_data.fastq --runThreadN 2 --outFileNamePrefix raw_yeast_rnaseq_data. --outSAMtype BAM SortedByCoordinate
cd ..
tree
## Make a directory for coverage files
mkdir 03_coverage
cd 03_coverage
## Use bedtools genomecov to convert bam files to bedgraph
bedtools genomecov -ibam ../02_aligned_reads/raw_yeast_rnaseq_data.Aligned.sortedByCoord.out.bam -bg > raw_yeast_rnaseq_data.genomecov.bg
cd ..
tree
## Make a directory for counting reads over genes
mkdir 04_gene_overlap_counts
cd 04_gene_overlap_counts
## Link the yeast gene bed file
ln -s ../../bioinformatics_on_the_command_line_files/yeast_genes.fixed.bed
## Use bedtools to count intersection of genes and sequencing reads
bedtools intersect -c -a yeast_genes.fixed.bed -b ../02_aligned_reads/raw_yeast_rnaseq_data.Aligned.sortedByCoord.out.bam | awk -F'\t' '$7>0' | sort -k7,7nr > raw_yeast_overlap_data.gene_overlap_counts.bed
cd ..
tree
## Output the 10 genes with the highest read counts
head -10 04_gene_overlap_counts/raw_yeast_overlap_data.gene_overlap_counts.bed
cd ..
The output should look like this:
XII 460922 466869 RDN37-2 . - 3730
XII 451785 457732 RDN37-1 . - 3582
XII 468812 468931 RDN5-2 . + 3063
XII 468826 468958 YLR154C-H . - 3063
XII 472464 472583 RDN5-3 . + 3060
XII 472478 472610 YLR156C-A . - 3060
XII 482044 482163 RDN5-4 . + 3055
XII 482058 482190 YLR157C-C . - 3055
XII 485696 485815 RDN5-5 . + 3052
XII 485710 485842 YLR159C-A . - 3052
Note: In this analysis we have done everything from scratch, including creating the genome index. For real analyses it is a good idea to use a pre-generated index, as indices for larger genomes take up a lot of space on the server.
Challenge:
We saw earlier that the longest gene on chromosome I is YAR050W. Use bedtools getfasta
to find the nucleotide sequence of this gene in FASTA format.
Solution
Solution.
Solution:
Running bedtools getfasta --help
shows us that we need to specify an input DNA FASTA file and a BED file with the feature co-ordinates.
We can extract the feature co-ordinates of YAR050W from yeast_genes.fixed.bed
using grep or awk
, and use the output of this command with bedtools getfasta
. An example is illustrated in the following example:
cd ~/course
grep "YAR050W" bioinformatics_on_the_command_line_files/yeast_genes.fixed.bed > YAR050W.bed
bedtools getfasta -fi bioinformatics_on_the_command_line_files/yeast_genome.fasta -bed YAR050W.bed
The output should look like this:
>I:203402-208016
ATGACAATGCCTCATCGCTATATGTTTTTGGCAGTCTTTACACTTCTGGCACTAACTA...