Analysis

In this section you will learn how to work with common Next Generation Sequencing (NGS) data formats on the command line.

Bioinformatics data formats and tools

Key Points

Many standard formats for storing high throughput sequencing data take the form of structured text files, which are easy to manipulate using standard GNU utilities (built in Unix commands)
Many powerful specialist tools for bioinformatics analysis have been developed to use these formats

Many of the most common types of file that you will have to work with as a bioinformatician take the form of structured text files, examples include:

Standard formats for representing raw sequences
- FASTA
- FASTQ
Tabular formats for representing aligned reads or genome features
- BED
- SAM
- bedGraph

Some other compressed formats, such as BAM, which is a compressed version of the SAM format, can easily be converted to human readable text.

Furthermore, numerous specialist bioinformatics tools have been specifically developed for working with these file formats. For example:

bedtools and bedops, which work with BED files
Various aligners, such as STAR, hisat2, and others, which take raw sequences in FASTQ or FASTA format and align them to the genome
samtools, which works with SAM files and BAM files
- BAM files can be viewed in SAM format using the samtools view command

Note: These tools are not included in most Linux distributions as standard and typically have to be installed separately.

Working with NGS data using GNU tools

Key Points

It is possible to perform a wide range of complex analysis tasks on NGS data files using standard GNU utilities (built in Unix commands)
In this section we illustrate the application of these tools to NGS data

In previous sections, we extracted an archive named bioinformatics_on_the_command_line_files.tar.gz into the ~/course directory, creating a directory called ~/course/bioinformatics_on_the_command_line_files.

This directory contains a file called yeast_genes.bed, which lists the genomic co-ordinates of 7,126 yeast genes in BED format, and another file called yeast_genome.fasta, which contains the yeast EF4 genome sequence in FASTA format:

[USERNAME]@bifx-core1:~/course$ head -5 bioinformatics_on_the_command_line_files/yeast_genes.bed
chrI    334 649 YAL069W .   +
chrI    537 792 YAL068W-A   .   +
chrI    1806    2169    YAL068C .   -
chrI    2479    2707    YAL067W-A   .   +
chrI    7234    9016    YAL067C .   -
[USERNAME]@bifx-core1:~/course$ head -5 bioinformatics_on_the_command_line_files/yeast_genome.fasta
>I
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
[USERNAME]@bifx-core1:~/course$

Checking chromosome names in BED and FASTA files

Looking at the above outputs, we can see that the naming convention for the chromosomes in the BED file (shown in the first column) appears to be different to the naming convention for the chromosomes in the FASTA file (shown in the first line after the > character). In order to confirm this we would like to produce a sorted list of chromosome names in each file and compare them.

The example below demonstrates the following new commands:

cut, which we use to select specified fields (or columns) with the -f option, and specified characters with the -c option
sort, which sorts the lines it receives from STDIN. The -u flag tells it to remove duplicate lines from the output
diff, which compares two text files, and outputs the differences between them

We can do this as follows:

## Move to the relevant directory
cd ~/course/bioinformatics_on_the_command_line_files

## Select the first column of the bed file, sort entries and remove duplicates. Redirect output to a file.
cut -f1 yeast_genes.bed | sort -u > yeast_genes_bed_chromosomes.list 

## Select the sequence ID lines from the fasta file. Sort and remove duplicates and select everything from the second character. Redirect to a file.
grep '^>' yeast_genome.fasta | sort -u | cut -c2- > yeast_genome_fasta_chromosomes.list

## Check if both files have the same number of chromosomes
wc -l *_chromosomes.list

## Check for differences in the files
diff yeast_genes_bed_chromosomes.list yeast_genome_fasta_chromosomes.list

We can see that there is a mismatch between the chromosome names in the BED and FASTA files. Each chromosome name in the BED file is equivalent to the corresponding name in the FASTA file with ‘chr’ added to the start.

Before using these files in a bioinformatics analysis, we need to update one of them so that the names match. In the following example we fix the BED file by removing ‘chr’ from the start of each line.

The example below uses the sed command, which is used to perform a search and replace style substitution on each line in the BED file using a regular expression. Here, as in the previous examples using grep, the ‘^’ character is a regular expression character representing the start of the line. We then confirm that the chromosomes are now the same using the diff command. This produces no output, which means that there are no differences between the chromosome lists.

## Remove "chr" from the start of every line in the bed file
sed 's/^chr//' yeast_genes.bed > yeast_genes.fixed.bed

## Extract the chromsome names
cut -f1 yeast_genes.fixed.bed | sort -u > yeast_genes_fixed_bed_chromosomes.list

## Check for differences
diff yeast_genes_fixed_bed_chromosomes.list yeast_genome_fasta_chromosomes.list

## Remove the temporary list files
rm -i *.list

Manipulating BED files with `awk` and `sort`

Many standard formats for representing NGS data take the form of tabular files, in which each line contains a number of fields, separated by a particular character (generally a tab character). The yeast_genes.fixed.bed file generated in the previous example fits this pattern.

This example demonstrates how to use awk and sort to find the name and length of the longest gene on chromosome ‘I’ in the yeast_genes.fixed.bed file:

## Use the unix programming language awk for text manipulation
awk -F'\t' -v OFS='\t' '$1=="I" {print $4,$3-$2}' yeast_genes.fixed.bed | sort -k2,2nr | head -1

The output should be YAR050W 4614.

The above example showcases the power of awk in dealing with tabular data. The command awk -F'\t' -v OFS='\t' '$1=="I" {print $4,$3-$2}' can be deconstructed as follows:

-F'\t' tells awk that the field separator in the input lines is the tab character (\t)
-v OFS='\t' tells awk that the tab character should also be used to separate the fields in the output file
'$1=="I" {print $4,$3-$2}' is an awk program consisting of a single line. Each line of an awk program is a pair with the structure condition {action}, and the program is run on each line of the input. In this example:
- The condition is $1=="I", which tells awk that the action should be performed if the first field is equal to “I”
- The action is print $4,$3-$2, which tells awk to output a line to STDOUT in which the first field is the 4th field of the input line, and the second field is the result of subtracting the 2nd field from the 3rd. In this case the result is the gene length

Note: A good resource to learn more about awk is Effective AWK Programming.

The sort command in the above example includes the option -k2,2nr. -k tells sort to sort by a key, which comprises a start and stop column number, followed by two options, n, which tells sort that the column contains numbers, and r, which tells sort that the lines should be sorted in reverse (i.e. descending) order.

Challenge:

You could also remove ‘chr’ from the start of each line of yeast_genes.bed by removing the first three characters of every line. See if you can achieve this with the cut command.

Solution

Solution.

Solution:

Since we know that ‘chr’ is at the start of every line, cut -c 4- yeast_genes.bed would also work.

Challenge:

How would you find the shortest gene on the plus strand of chromosome ‘II’ in yeast_genes.fixed.bed using awk?

Solution

Solution.

Solution:

You could run awk -F'\t' -v OFS='\t' '$1=="II"&&$6=="+" {print $4,$3-$2}' yeast_genes.fixed.bed | sort -k2,2n | head -1

Case study: a simple RNA-Seq analysis workflow

Key Points

It is possible to perform a simple bioinformatics analysis from end to end using only the bash command line
This section presents a simple case study using the STAR aligner and bedtools

In this section, we will work through a simple pipeline for analysing RNA-Seq data, which involves the following steps:

Start with unaligned reads in FASTQ format
Align the reads against an index generated from a target genome, whose sequence is stored in FASTA format, obtaining an output file in BAM format
- Here we use the yeast EF4 genome, and align the reads using STAR
Compute the genome coverage of the aligned reads, obtaining an output file in bedGraph format, which can then be viewed in a genome browser
- Here we generate the bedGraph file directly from the BED file using bedtools
Count the overlaps between the aligned reads and genomic features, stored in BED format, obtaining an output file in BED format, and find the genes with the largest number of overlapping reads
- Here we use bedtools to compute the intersection between the genes and the reference, and use standard GNU utilities to find the genes with the most hits

This analysis uses the files in the ‘bioinformatics_on_the_command_line_files’ directory that we have already been working with, and can be performed as follows:


## Move to the course directory
cd ~/course

## Make a new folder called analysis and move into it
mkdir analysis
cd analysis

## Make a new folder 00_source_data and move into it
mkdir 00_source_data
cd 00_source_data

## Link the yeast fastq file to this folder
ln -s ../../bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq 

## Move back to the analysis folder and inspect the directory
cd ..
tree

## Make a directory for the STAR aligner index files and link the yeast fasta file
mkdir 01_star_index
cd 01_star_index
ln -s ../../bioinformatics_on_the_command_line_files/yeast_genome.fasta

## Run STAR genomeGenerate to create a mapping index
STAR --runThreadN 5 --runMode genomeGenerate --genomeDir . --genomeFastaFiles ./yeast_genome.fasta --genomeSAindexNbases 10
cd ..
tree

## Make a directory for the alignment outputs
mkdir 02_aligned_reads
cd 02_aligned_reads

## Run STAR to map the reads to the yeast index
STAR --genomeDir ../01_star_index/ --readFilesIn ../00_source_data/raw_yeast_rnaseq_data.fastq --runThreadN 2 --outFileNamePrefix raw_yeast_rnaseq_data. --outSAMtype BAM SortedByCoordinate
cd ..
tree

## Make a directory for coverage files
mkdir 03_coverage
cd 03_coverage

## Use bedtools genomecov to convert bam files to bedgraph
bedtools genomecov -ibam ../02_aligned_reads/raw_yeast_rnaseq_data.Aligned.sortedByCoord.out.bam -bg > raw_yeast_rnaseq_data.genomecov.bg
cd ..
tree

## Make a directory for counting reads over genes
mkdir 04_gene_overlap_counts
cd 04_gene_overlap_counts

## Link the yeast gene bed file
ln -s ../../bioinformatics_on_the_command_line_files/yeast_genes.fixed.bed

## Use bedtools to count intersection of genes and sequencing reads
bedtools intersect -c -a yeast_genes.fixed.bed -b ../02_aligned_reads/raw_yeast_rnaseq_data.Aligned.sortedByCoord.out.bam | awk -F'\t' '$7>0' | sort -k7,7nr > raw_yeast_overlap_data.gene_overlap_counts.bed
cd ..
tree

## Output the 10 genes with the highest read counts
head -10 04_gene_overlap_counts/raw_yeast_overlap_data.gene_overlap_counts.bed
cd ..

The output should look like this:

XII 460922  466869  RDN37-2 .   -   3730
XII 451785  457732  RDN37-1 .   -   3582
XII 468812  468931  RDN5-2  .   +   3063
XII 468826  468958  YLR154C-H   .   -   3063
XII 472464  472583  RDN5-3  .   +   3060
XII 472478  472610  YLR156C-A   .   -   3060
XII 482044  482163  RDN5-4  .   +   3055
XII 482058  482190  YLR157C-C   .   -   3055
XII 485696  485815  RDN5-5  .   +   3052
XII 485710  485842  YLR159C-A   .   -   3052

Note: In this analysis we have done everything from scratch, including creating the genome index. For real analyses it is a good idea to use a pre-generated index, as indices for larger genomes take up a lot of space on the server.

Challenge:

We saw earlier that the longest gene on chromosome I is YAR050W. Use bedtools getfasta to find the nucleotide sequence of this gene in FASTA format.

Solution

Solution.

Solution:

Running bedtools getfasta --help shows us that we need to specify an input DNA FASTA file and a BED file with the feature co-ordinates.
We can extract the feature co-ordinates of YAR050W from yeast_genes.fixed.bed using grep or awk, and use the output of this command with bedtools getfasta. An example is illustrated in the following example:


cd ~/course

grep "YAR050W" bioinformatics_on_the_command_line_files/yeast_genes.fixed.bed > YAR050W.bed

bedtools getfasta -fi bioinformatics_on_the_command_line_files/yeast_genome.fasta -bed YAR050W.bed

The output should look like this:

>I:203402-208016
ATGACAATGCCTCATCGCTATATGTTTTTGGCAGTCTTTACACTTCTGGCACTAACTA...

Bioinformatics data formats and tools

Key Points

Working with NGS data using GNU tools

Key Points

Checking chromosome names in BED and FASTA files

Manipulating BED files with awk and sort

Challenge:

Solution:

Challenge:

Solution:

Case study: a simple RNA-Seq analysis workflow

Key Points

Challenge:

Solution:

Manipulating BED files with `awk` and `sort`