## Ensure you are in the course directory
cd ~/course
## Inspect the contents
tree
## The cat command prints everything to screen
cat bioinformatics_on_the_command_line_files/README
## The less command prints one page at a time. Use space to tab through pages and `q` to quit.
less bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq
## Print the first 5 lines of a file
head -5 bioinformatics_on_the_command_line_files/yeast_genome.fasta
## Print the final 5 lines of a file
tail -5 bioinformatics_on_the_command_line_files/yeast_genome.fasta
## How many lines are in the file?
wc -l bioinformatics_on_the_command_line_files/yeast_genes.bed
Files
In this section you will learn how to explore and manipulate files in bash using simple commands, and compound commands using pipes.
Exploring files
Key Points
- Regular files in Linux can be classified as text files, which contain human readable text, and binary files, that contain data that is not human readable
- The
cat
command can be used to show the contents of a file- The
less
command allows you to page through a large file- Hit
q
to exit theless
command
- Hit
- The
- The
head
andtail
commands can be used to show the first or last few lines of a file- These can be useful for large text files
- The
wc -l
command counts the number of lines in a file - The
grep
command allows you to filter a text file - Text files can be compressed using the
gzip
command, which converts them to a binary format that takes up less space- Many of the above commands for working with text files have equivalents for gzipped files
- These include
zcat
,zless
, andzgrep
The following commands can be used to explore text files:
The grep
command can search and filter files:
grep expression filename
returns all the lines with the word ‘expression’ in the file called ‘filename’.grep -i
is case insensitivegrep -c
will count the number of lines matching an expressiongrep -v
returns all of the lines that do not match an expression
## Print lines containing the word 'format'
grep format bioinformatics_on_the_command_line_files/README
## Print lines not containing the word 'format'
grep -v format bioinformatics_on_the_command_line_files/README
## Count the lines beginning with ">". All sequence ID lines start with a ">" in a fasta file. So we can use this command to count the number of sequences.
grep -c '^>' bioinformatics_on_the_command_line_files/yeast_genome.fasta
Note: grep
allows you to use a regular expressions to specify a pattern that grep
will look for rather than a fixed string. Conceptually, regular expressions are similar to glob patterns, although their syntax is different. Some characters have a special meaning in regular expressions. For example:
^
represents the start of a string$
represents the end of a string
The -E
and -P
flags can be used to make more regular expressions available to grep
.
Files can be zipped (compressed) and unzipped (decompressed) with the gzip
command:
gzip -k
will compress a file and keep the original uncompressed versiongzip -d
will decompress a gzip’d filezless
,zcat
andzgrep
work on compressed files
## Compress the file while keeping the original uncompressed version
gzip -k bioinformatics_on_the_command_line_files/yeast_genome.fasta
## Inspect both files
ls -lh bioinformatics_on_the_command_line_files/yeast_genome.fasta*
## Use zgrep to search a compressed file
zgrep -c -E '^>' bioinformatics_on_the_command_line_files/yeast_genome.fasta.gz
Challenge:
How would you check that every line of the ‘yeast_genes.bed’ file starts with the string ‘chr’ without looking through the whole file?
Solution
Solution.
Solution:
Run grep -v -c -E '^chr' bioinformatics_on_the_command_line_files/yeast_genes.bed
to count the number of lines that don’t start with ‘chr’. We can see that this is zero, so every line must start with ‘chr’.
Shell redirection
Key Points
- The shell can manage where programs receive inputs from and where they send outputs to
- It provides three I/O channels for programs to use. These are:
- Standard input, or STDIN, which provides input to the program
- Standard output, or STDOUT, which receives output from the program
- Standard error, or STDERR, which receives error messages from the program
- Program authors don’t have to use these I/O channels, but most command line tools designed for Linux, such as the GNU coreutils, do use them
- By default, STDIN comes from the keyboard, and STDOUT and STDERR go to the terminal, but each of these channels can be redirected
>
redirects STDOUT to an output file, overwriting its contents>>
redirects STDOUT to an output file, appending to its contents2>
redirects STDERR to an output file, overwriting its contents2>>
redirects STDERR to an output file, appending to its contents<
reads each line from an input file and feeds it to STDIN2>&1
redirects STDERR to STDOUT
The following examples demonstrate how shell redirection works:
## Direct the output of the echo command to a new file called output.txt
echo zero > output.txt
## Direct the file as input to the cat command. You could also just run: cat output.txt
cat < output.txt
## Overwrite the file with the new output
echo one > output.txt
cat < output.txt
## Append output to the existing file
echo two >> output.txt
cat < output.txt
## Redirect std output and std error to different files
cat bioinformatics_on_the_command_line_files/README > cat_readme.out 2> cat_readme.err
## Check the contents of these files. Did the command run without errors?
head -2 cat_readme.*
## Redirect std output and std error to different files
zcat bioinformatics_on_the_command_line_files/README > zcat_readme.out 2> zcat_readme.err
## Check the contents of these files. Did the command run without errors?
head -2 zcat_readme.*
## Redirect std output and std error to the same file
zcat bioinformatics_on_the_command_line_files/README > zcat_readme.all 2>&1
cat zcat_readme.all
## Remove the temporary files you just created
rm -i *cat_readme.* output.txt
Creating compound commands using pipes
Key Points
- Because the shell provides standard input and output channels, it is possible to chain together simple commands to perform complex tasks
- This can be done using a ‘pipe’, represented by the pipe character
|
So far we have discussed simple commands, which consist of a single command name followed by some options and arguments. However, a lot of the flexibility of the tools accessible via bash comes from the ability to combine them to form compound commands, using pipes. This allows the user to perform complex tasks by joining together simple commands.
Motivating example: How do you count how many of the first 40 lines in a FASTQ file contain the sequence ACTG?
Here’s how you could do it using simple commands:
## Get the first 40 lines of the file and save to a temporary file
head -40 bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq > first_40_lines.tmp
## Count the number of lines containing ACTG
grep -c ACTG first_40_lines.tmp
## Remove the temporary file
rm -i first_40_lines.tmp
The answer should be 5. Here’s how you can do it in a single command using a pipe:
head -40 bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq | grep -c ACTG
Pipes are particularly useful for working with large files, as they remove the need to create large intermediate files, which may take up space. They can also save time, as commands can sometimes start working on the data produced by commands preceding them in the pipeline before they have finished running, and in some cases preceding commands can be terminated early if further outputs are no longer needed.
Example
In the example below we measure the time it takes to find the first line that contains the character ‘A’ in a large genome file. We first write all of the matches to a temporary file and then find the first line using head
as follows:
[USERNAME]@bifx-core1:~/course$ time grep A /datastore/home/genomes/mouse/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa > grepA.tmp
real 0m36.182s
user 0m7.831s
sys 0m6.377s
[USERNAME]@bifx-core1:~/course$ time head -1 grepA.tmp
gcttcagaataatcatattattctcaaattttgtatcaatataaaaaaaA
real 0m0.008s
user 0m0.001s
sys 0m0.003s
[USERNAME]@bifx-core1:~/course$ rm -ri grepA.tmp
rm: remove regular file 'grepA.tmp'? y
[USERNAME]@bifx-core1:~/course$
We can see that the operation takes just over 36 seconds. In the example below, we pipe the output of the grep command into the head command.
[USERNAME]@bifx-core1:~/course$ time grep A /datastore/home/genomes/mouse/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa | head -1
gcttcagaataatcatattattctcaaattttgtatcaatataaaaaaaA
real 0m0.011s
user 0m0.002s
sys 0m0.012s
As we can see the operation completes in about 0.01 seconds, so using a pipe is considerably faster. This is because the pipeline stops when it has found the first match, so the grep
command doesn’t have to go through the whole file.