Files


In this section you will learn how to explore and manipulate files in bash, using both simple commands and compound commands built with pipes.

Exploring files


Key Points

  • Regular files in Linux can be classified as text files, which contain human-readable text, and binary files, which contain data that is not human-readable
  • The cat command can be used to show the contents of a file
    • The less command allows you to page through a large file
      • Hit q to exit the less command
  • The head and tail commands can be used to show the first or last few lines of a file
    • These can be useful for large text files
  • The wc -l command counts the number of lines in a file
  • The grep command allows you to filter a text file
  • Text files can be compressed using the gzip command, which converts them to a binary format that takes up less space
    • Many of the above commands for working with text files have equivalents for gzipped files
    • These include zcat, zless, and zgrep


The following commands can be used to explore text files:

## Ensure you are in the course directory
cd ~/course

## Inspect the contents
tree

## The cat command prints everything to screen
cat bioinformatics_on_the_command_line_files/README

## The less command prints one page at a time. Use the space bar to move forward a page and `q` to quit.
less bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq

## Print the first 5 lines of a file
head -5 bioinformatics_on_the_command_line_files/yeast_genome.fasta

## Print the final 5 lines of a file
tail -5 bioinformatics_on_the_command_line_files/yeast_genome.fasta

## How many lines are in the file?
wc -l bioinformatics_on_the_command_line_files/yeast_genes.bed

The grep command can search and filter files:

  • grep expression filename returns all the lines in the file called ‘filename’ that contain the word ‘expression’.

  • grep -i is case insensitive

  • grep -c will count the number of lines matching an expression

  • grep -v returns all of the lines that do not match an expression

## Print lines containing the word 'format'
grep format bioinformatics_on_the_command_line_files/README

## Print lines not containing the word 'format'
grep -v format bioinformatics_on_the_command_line_files/README

## Count the lines beginning with ">". In a FASTA file every sequence ID line starts with ">", so this counts the number of sequences in the file.
grep -c '^>' bioinformatics_on_the_command_line_files/yeast_genome.fasta
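
The -i flag is not demonstrated above. As a quick illustration (the exact count will depend on the contents of the file):

## Count lines containing 'format' regardless of case, e.g. 'format', 'Format' or 'FORMAT'
grep -i -c format bioinformatics_on_the_command_line_files/README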

Note: grep allows you to use a regular expression to specify a pattern to look for, rather than a fixed string. Conceptually, regular expressions are similar to glob patterns, although their syntax is different. Some characters have a special meaning in regular expressions. For example:

  • ^ represents the start of a string
  • $ represents the end of a string

The -E flag enables extended regular expressions and the -P flag enables Perl-compatible regular expressions, both of which make additional pattern syntax available to grep.
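
As a small illustration of these patterns (the counts depend on the contents of the README, so treat the numbers as illustrative):

## Count empty lines in the README: '^$' matches a line with nothing between its start and end
grep -c '^$' bioinformatics_on_the_command_line_files/README

## With -E, alternation is available: count lines mentioning either 'fasta' or 'fastq'
grep -c -E 'fasta|fastq' bioinformatics_on_the_command_line_files/README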

Files can be zipped (compressed) and unzipped (decompressed) with the gzip command:

  • gzip -k will compress a file and keep the original uncompressed version

  • gzip -d will decompress a gzip’d file

  • zless, zcat and zgrep work on compressed files

## Compress the file while keeping the original uncompressed version
gzip -k bioinformatics_on_the_command_line_files/yeast_genome.fasta

## Inspect both files
ls -lh bioinformatics_on_the_command_line_files/yeast_genome.fasta*

## Use zgrep to search a compressed file
zgrep -c -E '^>' bioinformatics_on_the_command_line_files/yeast_genome.fasta.gz
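
The zless command was listed above but not demonstrated; it lets you page through a compressed file without decompressing it first:

## Page through the compressed genome (space to move forward a page, q to quit)
zless bioinformatics_on_the_command_line_files/yeast_genome.fasta.gz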

Challenge:

How would you check that every line of the ‘yeast_genes.bed’ file starts with the string ‘chr’ without looking through the whole file?

Solution:

Run grep -v -c -E '^chr' bioinformatics_on_the_command_line_files/yeast_genes.bed to count the number of lines that don’t start with ‘chr’. We can see that this is zero, so every line must start with ‘chr’.
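
An equivalent cross-check is to compare the number of lines that do start with ‘chr’ against the total number of lines; if every line starts with ‘chr’, the two counts should match:

## These two counts should be equal if every line starts with 'chr'
grep -c '^chr' bioinformatics_on_the_command_line_files/yeast_genes.bed
wc -l bioinformatics_on_the_command_line_files/yeast_genes.bed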

Shell redirection


Key Points

  • The shell can manage where programs receive inputs from and where they send outputs to
  • It provides three I/O channels for programs to use. These are:
    • Standard input, or STDIN, which provides input to the program
    • Standard output, or STDOUT, which receives output from the program
    • Standard error, or STDERR, which receives error messages from the program
  • Program authors don’t have to use these I/O channels, but most command line tools designed for Linux, such as the GNU coreutils, do use them
  • By default, STDIN comes from the keyboard, and STDOUT and STDERR go to the terminal, but each of these channels can be redirected
    • > redirects STDOUT to an output file, overwriting its contents
    • >> redirects STDOUT to an output file, appending to its contents
    • 2> redirects STDERR to an output file, overwriting its contents
    • 2>> redirects STDERR to an output file, appending to its contents
    • < redirects STDIN so that the program reads its input from a file
    • 2>&1 redirects STDERR to STDOUT


The following examples demonstrate how shell redirection works:

## Direct the output of the echo command to a new file called output.txt
echo zero > output.txt

## Direct the file as input to the cat command. You could also just run: cat output.txt
cat < output.txt

## Overwrite the file with the new output
echo one > output.txt
cat < output.txt

## Append output to the existing file
echo two >> output.txt
cat < output.txt

## Redirect std output and std error to different files
cat bioinformatics_on_the_command_line_files/README > cat_readme.out 2> cat_readme.err
## Check the contents of these files. Did the command run without errors?
head -2 cat_readme.*

## Redirect std output and std error to different files
zcat bioinformatics_on_the_command_line_files/README > zcat_readme.out 2> zcat_readme.err
## Check the contents of these files. Did the command run without errors?
head -2 zcat_readme.*

## Redirect std output and std error to the same file
zcat bioinformatics_on_the_command_line_files/README > zcat_readme.all 2>&1
cat zcat_readme.all

## Remove the temporary files you just created
rm -i *cat_readme.* output.txt
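
The 2>> form behaves like 2>, except that it appends to the output file instead of overwriting it. A small sketch using the failing zcat command from above (the log file name is just an example):

## Run the failing command twice, appending the error message to a log file each time
zcat bioinformatics_on_the_command_line_files/README 2>> zcat_errors.log
zcat bioinformatics_on_the_command_line_files/README 2>> zcat_errors.log

## The log should now contain the same error message twice
cat zcat_errors.log

## Remove the log file
rm -i zcat_errors.log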
 

Creating compound commands using pipes


Key Points

  • Because the shell provides standard input and output channels, it is possible to chain together simple commands to perform complex tasks
  • This can be done using a ‘pipe’, represented by the pipe character |


So far we have discussed simple commands, which consist of a single command name followed by some options and arguments. However, a lot of the flexibility of the tools accessible via bash comes from the ability to combine them to form compound commands, using pipes. This allows the user to perform complex tasks by joining together simple commands.

Motivating example: How do you count how many of the first 40 lines in a FASTQ file contain the sequence ACTG?

Here’s how you could do it using simple commands:

## Get the first 40 lines of the file and save to a temporary file
head -40 bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq > first_40_lines.tmp

## Count the number of lines containing ACTG
grep -c ACTG first_40_lines.tmp

## Remove the temporary file
rm -i first_40_lines.tmp

The answer should be 5. Here’s how you can do it in a single command using a pipe:

head -40 bioinformatics_on_the_command_line_files/raw_yeast_rnaseq_data.fastq | grep -c ACTG

Pipes are particularly useful for working with large files, as they remove the need to create large intermediate files, which take up space. They can also save time: a command can often start working on data as soon as the preceding command in the pipeline produces it, and a preceding command can be terminated early if its remaining output is no longer needed.

Example

In the example below we measure the time it takes to find the first line that contains the character ‘A’ in a large genome file. We first write all of the matches to a temporary file and then find the first line using head as follows:

[USERNAME]@bifx-core1:~/course$ time grep A /datastore/home/genomes/mouse/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa > grepA.tmp

real    0m36.182s
user    0m7.831s
sys     0m6.377s
[USERNAME]@bifx-core1:~/course$ time head -1 grepA.tmp
gcttcagaataatcatattattctcaaattttgtatcaatataaaaaaaA

real    0m0.008s
user    0m0.001s
sys     0m0.003s
[USERNAME]@bifx-core1:~/course$ rm -ri grepA.tmp
rm: remove regular file 'grepA.tmp'? y
[USERNAME]@bifx-core1:~/course$ 

We can see that the operation takes just over 36 seconds. In the example below, we pipe the output of the grep command into the head command.

[USERNAME]@bifx-core1:~/course$ time grep A /datastore/home/genomes/mouse/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa | head -1
gcttcagaataatcatattattctcaaattttgtatcaatataaaaaaaA

real    0m0.011s
user    0m0.002s
sys     0m0.012s

As we can see, the operation completes in about 0.01 seconds, so using a pipe is considerably faster. This is because head exits as soon as it has printed the first matching line, which terminates the pipeline, so the grep command does not have to read through the whole file.
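
Pipelines are not limited to two commands. As a final sketch, assuming the gzipped genome file created earlier in this section still exists, the following chains three of the commands we have covered to count the sequences in the compressed FASTA file (equivalent to the earlier zgrep example):

zcat bioinformatics_on_the_command_line_files/yeast_genome.fasta.gz | grep '^>' | wc -l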