Scripts


So far in this course we have learned how to work interactively on the command line to analyse data. This section introduces bash shell scripting, along with the concept of reproducible data analysis, and shows how using scripts to analyse data can make your analyses reproducible.

Shell scripts in bash

Key points

  • A shell script is a text file containing multiple commands, which can then be run from the command line as a single command

Writing and running your own scripts

In its most basic form, a shell script is simply a text file containing a list of commands that are run in sequence from top to bottom. A script of this kind can be run by providing it as an argument to the bash command, as in the following example:

[USERNAME]@bifx-core1:~$ date
Thu 12 Nov 14:21:16 GMT 2020
[USERNAME]@bifx-core1:~$ echo hello
hello
[USERNAME]@bifx-core1:~$ ls /library/training/Intro_to_Linux/genomes/mouse/
GRCm38  mm10  mm9  UCSC
[USERNAME]@bifx-core1:~$ echo date > my_script.sh
[USERNAME]@bifx-core1:~$ echo 'echo hello' >> my_script.sh
[USERNAME]@bifx-core1:~$ echo 'ls /library/training/Intro_to_Linux/genomes/mouse/' >> my_script.sh
[USERNAME]@bifx-core1:~$ cat my_script.sh
date
echo hello
ls /library/training/Intro_to_Linux/genomes/mouse/
[USERNAME]@bifx-core1:~$ bash my_script.sh
Thu 12 Nov 14:26:29 GMT 2020
hello
GRCm38  mm10  mm9  UCSC
[USERNAME]@bifx-core1:~$

It is also possible to create a script that can be run as a standalone program rather than as an argument to bash. This involves two steps:

  1. Add the line #!/bin/bash (known as a shebang line) as the first line of the file using a text editor
  2. Use the chmod command to make the file executable
    • chmod a+x my_script.sh will change the permissions on the script so that any user can run it
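
For example, both steps can be performed from the command line on the my_script.sh file created above (the sed one-liner below requires GNU sed, which is standard on Linux; you could equally add the line in a text editor):

## Insert the shebang line at the top of the script
sed -i '1i #!/bin/bash' my_script.sh

## Make the script executable for all users
chmod a+x my_script.sh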

Once these steps have been followed, it will be possible to execute the script using its path:

[USERNAME]@bifx-core1:~$ cat my_script.sh
#!/bin/bash
date
echo hello
ls /library/training/Intro_to_Linux/genomes/mouse/
[USERNAME]@bifx-core1:~$ ls -l ./my_script.sh
-rwxrwxr-x 1 [USERNAME] [USERNAME] 79 Nov 13 21:40 my_script.sh
[USERNAME]@bifx-core1:~$ ./my_script.sh
Fri 13 Nov 21:41:08 GMT 2020
hello
GRCm38  mm10  mm9  UCSC
[USERNAME]@bifx-core1:~$ 

Note: If you want to be able to run your script just by typing its name, you need to move it to a directory that is included in the $PATH environment variable.

  • To check which directories are listed in $PATH, you can type echo $PATH
  • To add a directory to $PATH permanently, add the line export PATH=[YOUR DIRECTORY]:$PATH to the end of the ~/.bashrc file, then run the command source ~/.bashrc
  • If you run a program from a directory in $PATH, it can be useful to check the full path to that program to make sure you’re not inadvertently running another program with the same name. You can do this by using the which command.
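
For example, a common convention is to keep your scripts in a directory called ~/bin and add that to $PATH (the directory name here is just a convention; any directory works):

## Create a personal scripts directory and move the script into it
mkdir -p ~/bin
mv my_script.sh ~/bin/

## Add the directory to $PATH in ~/.bashrc, then reload the file
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

## Check which my_script.sh will actually be run
which my_script.sh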

Editing scripts

You can edit your scripts using a text editor such as emacs or vim.

  • emacs -nw opens emacs in the terminal
    • you can then take the tutorial by typing Ctrl+h t
  • vim opens vim
    • you can take a tutorial by running the vimtutor command at the bash prompt

Reproducible data analysis with scripts

Key points

  • Data analysis should be reproducible
    • You should be able to recreate all of the steps in your analysis
    • Other researchers should also be able to recreate your analysis to verify your results
  • The bash shell makes it easy to analyse data interactively. However, if data is analysed this way, it can be difficult to keep track of exactly which steps were taken to produce a given result
  • This problem can be solved by creating a script that contains all of the commands needed to produce the results
    • You and other researchers can then recreate your analysis by running the script
  • This section presents a bash script that replicates the case study performed in the previous section


In the previous section we showed how to perform a simple analysis of some RNA-Seq data from scratch. To turn this into a reproducible analysis workflow, it is necessary to put all of the commands that make up the analysis into a script.

Creating a simple data analysis script

The simplest way to create a script is to put all of the commands that made up the analysis into a file in the order in which they were run, creating a file that looks like this:

mkdir analysis
cd analysis
mkdir 00_source_data
cd 00_source_data
ln -s ../../bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq
cd ..
tree
mkdir 01_star_index
cd 01_star_index
ln -s ../../bioinformatics_on_the_command_line_files/yeast_genome.fasta
nice STAR --runThreadN 5 --runMode genomeGenerate --genomeDir . --genomeFastaFiles ./yeast_genome.fasta --genomeSAindexNbases 10
cd ..
tree
mkdir 02_aligned_reads
cd 02_aligned_reads
nice STAR --genomeDir ../01_star_index/ --readFilesIn ../00_source_data/raw_yeast_rnaseq_data.fastq --runThreadN 5 --outFileNamePrefix raw_yeast_rnaseq_data. --outSAMtype BAM SortedByCoordinate
cd ..
tree
mkdir 03_coverage
cd 03_coverage
bedtools genomecov -ibam ../02_aligned_reads/raw_yeast_rnaseq_data.Aligned.sortedByCoord.out.bam > raw_yeast_rnaseq_data.genomecov.bg
cd ..
tree
mkdir 04_gene_overlap_counts
cd 04_gene_overlap_counts
ln -s ../../bioinformatics_on_the_command_line_files/yeast_genes.fixed.bed
bedtools intersect -c -a yeast_genes.fixed.bed -b ../02_aligned_reads/raw_yeast_rnaseq_data.Aligned.sortedByCoord.out.bam | awk -F'\t' '$7>0' | sort -k7,7nr > raw_yeast_overlap_data.gene_overlap_counts.bed
cd ..
tree
head -10 04_gene_overlap_counts/raw_yeast_overlap_data.gene_overlap_counts.bed
cd ..

We can see that this works by running the above file (which is saved on the server as ‘/library/training/Intro_to_Linux/analysis_raw.sh’):

## Rename the analysis folder as interactive_analysis
mv analysis interactive_analysis

## Run the bash script
bash /library/training/Intro_to_Linux/analysis_raw.sh

## Inspect the output
tree analysis

You should see that the script runs successfully, prints messages to the screen as it goes and creates a number of output files.
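
If you did not keep a record of your commands while working interactively, the history built-in can help you reconstruct them. For example, you can dump your recent commands to a file and edit it down into a script (the number 50 is arbitrary; note that each line will be prefixed with a history number, which you will need to delete):

## Save the last 50 commands as a starting point for a script
history | tail -n 50 > draft_script.sh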

Creating an improved data analysis script

In this section we improve our script by doing the following:

  • Adding a shebang line to the script so that it can be run as a standalone program
  • Ensuring that the script stops running as soon as one of the commands within it fails
    • This is what the set -euo pipefail command at the top of the script does: -e exits the script when a command fails, -u treats the use of an unset variable as an error, and -o pipefail makes a pipeline fail if any command within it fails (a short demonstration follows this list)
  • Deleting unnecessary commands
  • Adding spacing and comments to make the script easier to read
    • In bash, lines starting with # are taken to be comments
  • Specifying inputs as variables in order to make the script more general and easier to maintain, and using paths relative to the home directory rather than the current working directory, so that the script can be run from any directory and will still find its input files
    • The script uses $HOME rather than ~ for these paths, because ~ is not expanded inside quotes
  • Adding a few lines to the script to create a run report, which specifies the software versions used for the analysis
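
To see what set -euo pipefail does, consider the following toy script (a minimal sketch; nonexistent_file.txt is a deliberately missing file):

#!/bin/bash
set -euo pipefail

ls nonexistent_file.txt    # this command fails, so -e stops the script here
echo 'this line is never reached'

Without the set line, bash would print an error message for the failed ls and then carry on regardless, potentially producing incomplete or misleading results.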

The updated script is shown here:

#!/bin/bash
# This is a simple RNA-Seq analysis workflow that does the following:
# - Creates a STAR index for the genome fasta file specified in the variable GENOME_FASTA
# - Aligns the raw reads specified in the fastq file referenced by the variable RAW_FASTQ
# - Computes the genome coverage of the aligned reads, producing a bedGraph file
# - Counts the overlaps over the genes specified in the file referenced by the variable GENES_BED

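# Exit on the first failing command (-e), treat unset variables as errors (-u),
# and make a pipeline fail if any command within it fails (pipefail)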
set -euo pipefail


# Global variables ----

RAW_FASTQ="$HOME/course/bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq"
GENOME_FASTA="$HOME/course/bioinformatics_on_the_command_line_files/yeast_genome.fasta"
GENES_BED="$HOME/course/bioinformatics_on_the_command_line_files/yeast_genes.fixed.bed"

# Check that the input files exist - if any are missing, ls prints an error and set -e stops the script
ls "$RAW_FASTQ" "$GENOME_FASTA" "$GENES_BED" > /dev/null

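# Derive name stems for output files from the input paths
# (e.g. .../raw_yeast_rnaseq_data.fastq gives raw_yeast_rnaseq_data)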
RAW_FASTQ_BASENAME_PREFIX=$(basename "$RAW_FASTQ" .fastq)
GENOME_FASTA_BASENAME=$(basename "$GENOME_FASTA")

THREADS=5


# Pipeline commands ----

# Create a directory for the results of the analysis

mkdir analysis
cd analysis

# Link to the fastq containing the raw sequences in '00_source_data'

mkdir 00_source_data
cd 00_source_data
ln -s "$RAW_FASTQ"
cd ..

# Create a STAR index for the genome fasta file, storing all of the results in '01_star_index'

echo 'Creating STAR index...'

mkdir 01_star_index
cd 01_star_index
ln -s "$GENOME_FASTA"
nice STAR --runThreadN "$THREADS" --runMode genomeGenerate --genomeDir . --genomeFastaFiles "$GENOME_FASTA_BASENAME" --genomeSAindexNbases 10
cd ..

# Align the raw sequences using STAR, storing all of the results in '02_aligned_reads'

echo 'Aligning raw reads...'

mkdir 02_aligned_reads
cd 02_aligned_reads
nice STAR --genomeDir ../01_star_index/ --readFilesIn ../00_source_data/"$RAW_FASTQ_BASENAME_PREFIX".fastq --runThreadN "$THREADS" --outFileNamePrefix "$RAW_FASTQ_BASENAME_PREFIX". --outSAMtype BAM SortedByCoordinate
cd ..

# Create a genome coverage file in bedGraph format, and store it in '03_coverage'

echo 'Creating genome coverage file...'

mkdir 03_coverage
cd 03_coverage
bedtools genomecov -ibam ../02_aligned_reads/"$RAW_FASTQ_BASENAME_PREFIX".Aligned.sortedByCoord.out.bam > "$RAW_FASTQ_BASENAME_PREFIX".genomecov.bg
cd ..

# Compute the gene overlap counts, and store them as a bed file in '04_gene_overlap_counts'

echo 'Computing gene overlap counts...'

mkdir 04_gene_overlap_counts
cd 04_gene_overlap_counts
ln -s "$GENES_BED"
bedtools intersect -c -a "$GENES_BED" -b ../02_aligned_reads/"$RAW_FASTQ_BASENAME_PREFIX".Aligned.sortedByCoord.out.bam | awk -F'\t' '$7>0' | sort -k7,7nr > "$RAW_FASTQ_BASENAME_PREFIX".gene_overlap_counts.bed
cd ..

# Create a run report

echo "$(realpath "$0") run completed successfully." > run_report.txt
date >> run_report.txt
echo 'Software versions:' >> run_report.txt
echo 'STAR' >> run_report.txt
STAR --version >> run_report.txt
echo 'bedtools' >> run_report.txt
bedtools --version | sed 's/^bedtools //' >> run_report.txt
echo 'sort' >> run_report.txt
sort --version | sed 's/^sort //' | head -1 >> run_report.txt
echo 'awk' >> run_report.txt
awk --version | head -1 >> run_report.txt

# If we've got here the pipeline has completed successfully

echo 'Pipeline completed successfully.'

Again, we can see that this works by running the above file (which is saved on the server as ‘/library/training/Intro_to_Linux/analysis_improved.sh’):

## Rename the analysis folder as analysis_raw
mv analysis analysis_raw

## Run the improved bash script
bash /library/training/Intro_to_Linux/analysis_improved.sh

## Read the output report
cat analysis/run_report.txt

## Inspect the output
tree analysis

Managing your scripts

Once you start writing scripts, it is a good idea to use a version control system to keep track of the changes you make to them. A good choice for this is git. The following commands are enough to get started (an example session is shown after this list):

  • git init to initialise a git repository in the current working directory
  • git add -A; git commit -m "Latest updates" to commit any changes to files in the current working directory to the repository
  • git status to see if any changes have been made since the last commit
  • git diff to see the differences between the files in the current working directory and the last commit
  • git show HEAD:[NAME OF FILE IN GIT] to view the most recent version of a file in the git repository
    • You can save the output of git show to a file by redirecting its output using >
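
For example, a typical session might look like this (the commit message and file name are just illustrations):

## Turn the current directory into a git repository (only needed once)
git init

## See what has changed since the last commit
git status
git diff

## Record the current state of all files in the repository
git add -A
git commit -m "Tidy up analysis script"

## View the most recent committed version of a script
git show HEAD:my_script.sh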

To learn more about git, you can look at the Software Carpentry course on version control with git, which covers it in much more detail.

Note: git should only be used for small text files such as scripts. It is not designed to work with large data files.

Closing thoughts: practical workflows on the command line

In this section we created a data analysis script in bash that satisfies two major requirements for reproducible data analysis:

  • The script provides a complete and accurate record of the steps that were taken to produce the output files, along with a record of the versions of the software tools used to manipulate the data
  • The script is also in a form that would be easy to share with other researchers, allowing them to replicate your analysis easily

The main drawback of using bash scripts to represent workflows is that they can be impractical. The example presented here uses an extremely small input FASTQ file and an organism with a relatively small genome, so it can be run from scratch in under a minute. Running the same pipeline with a realistically sized input FASTQ file and a larger genome could take hours, making it impractical to re-run the pipeline every time you make a change. A workaround is to comment out the parts of the script that you do not want to re-run; however, this is considered bad practice, as it introduces the possibility of human error in selecting which parts of the pipeline need to be re-run when the pipeline is updated.

If you are writing analysis pipelines, you should look into using a modern workflow system such as Snakemake or Nextflow. Workflows written for these systems can be run repeatedly, and the system will automatically work out which steps need to be re-run based on the timestamps of the input and output files. They also provide a number of other useful features:

  • They work out which steps of the analysis workflow can be run in parallel, and assign steps to different cores when possible, which can speed up the analysis considerably
  • They simplify the process of running workflows on a computing cluster