We will use the Linux command line to run most of the tools in this workshop. If you are new to Linux, please complete the Intro to Command Line Workshop.
For a quick refresher of Linux commands take a look here.
We will use the WCB bioinformatics servers for this practical. If you do not have an account please contact a member of the core facility to get this set up. You will need access to the university network or VPN to access the server.
You will need to use a Terminal app on Mac, Linux or Windows, or an alternative app such as MobaXTerm. You can also log in via X2Go if you would like to use a graphical interface.
To login via a command line application:
ssh USER@bifx-core3.bio.ed.ac.uk
Once you have typed in your password, you should see some text and a prompt that looks something like this:
[USERNAME]@bifx-core3:~$
In order to view files created on the server, we need to create a public_html directory in our home folder.
After logging in you should be in your $HOME directory; check with:
pwd
This shows the path of your present working directory, which should now be your home directory as you have just logged in. You can return to this place at any time using the change directory command.
cd
You have permission to create files and directories under your home folder. Let's create the public_html directory to use later on.
mkdir ~/public_html
Here we have used ~/ as a shortcut for your $HOME directory. As you have created ~/public_html, the contents of this directory are now available online with any web browser.
To see it, enter the following URL, changing yourUserName to whatever your username is: https://bifx-core3.bio.ed.ac.uk/~yourUserName
Please install IGV on your own machine; alternatively, you can use the web application.
The data we will use in this workshop is part of a larger study described in Kenny PJ et al., Cell Rep 2014. The authors investigated interactions between various genes involved in Fragile X syndrome, a disease of aberrant protein production, which results in cognitive impairment and autistic-like features. They sought to show that RNA helicase MOV10 regulates the translation of RNAs involved in Fragile X syndrome.
The RNA was extracted from HEK293F cells that were transfected with a MOV10 transgene (MOV10 over-expression), MOV10 siRNA (MOV10 knockdown), or an irrelevant siRNA (Control). For now, we will just look at the over-expression and control samples.
If you are new to RNA-seq, this overview provides an explanation of how RNA-seq libraries are prepared and sequenced for short-read data.
If you are thinking about performing your own RNA-seq experiments then take a look at these experimental design considerations.
It is also important to collect metadata (information about the data) that describes the samples and the experimental preparation. Some key things to note about these datasets:
| Dataset | Description |
| --- | --- |
| Control_1 | Control, replicate 1 |
| Control_2 | Control, replicate 2 |
| Control_3 | Control, replicate 3 |
| MOV10_OE_1 | MOV10 over-expression, replicate 1 |
| MOV10_OE_2 | MOV10 over-expression, replicate 2 |
| MOV10_OE_3 | MOV10 over-expression, replicate 3 |
First, make a new directory for this tutorial and move into that directory. Then link the directory to your public_html folder, as we are going to make everything public in this tutorial.
cd
mkdir RNA-seq_workshop
cd RNA-seq_workshop
ln -s $PWD ~/public_html/
Next, create a subfolder called fastq for all of our sequence files and link the raw datasets to this folder:
mkdir fastq
ln -s /library/training/RNA-seq_analysis/data/lesson_1/fastq/*fq.gz fastq/
As well as the raw data we will also need access to annotations and reference sequences for the human genome/transcriptome. These are provided for this workshop and should also be linked to your folder:
ln -s /library/training/RNA-seq_analysis/annotation .
Sequencing data will typically be provided to you in fastq format (.fq or .fastq) or as a compressed gzipped fastq (.fq.gz) in order to save space. We can view a gzipped file with the zless command; let's take a look:
cd fastq # Move into the fastq directory (if you haven't already)
zless Control_1.fq.gz | head -n 12
Fastq files contain 4 lines per sequenced read:

1. A header line beginning with @, containing the read identifier
2. The sequence of bases
3. A separator line beginning with +, optionally repeating the identifier
4. The base quality scores: one ASCII-encoded Phred score per base
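For example, a single record might look like this (the identifier, sequence and qualities here are made up):

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65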
Next, we want to assess the quality of our sequencing data and check for any biases and contamination.
It is useful to know if your sequencing library contains the types of sequences you expect. FastQ Screen searches for contaminants by mapping a sample of your reads to different genomes. It allows you to specify a custom set of genomes, e.g. all of the organisms you work on, along with PhiX, vectors and other contaminants commonly seen in sequencing experiments. We will run a screen of our data against a few default genomes:
cd .. #Move up a directory again
fastq_screen --conf /homes/genomes/tool_configs/fastq_screen/fastq_screen.conf fastq/*fq.gz --outdir fastq # * is a wildcard character, so this includes all files ending in fq.gz
Once complete, take a look at the output images in your browser via your public_html folder. This shows that most of your reads align to the human genome and that no reads align uniquely to other organisms:
FastQC provides simple quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses to quickly assess your data before proceeding with analysis.
fastqc fastq/*.fq.gz
FastQC will create report files for each of your datasets which we can view in the browser. We will go through each of the images during the workshop. For future reference, specific guidance on how to interpret the output of each module is provided in the fastqc help pages.
An example of poor quality sequencing at the end of short reads:
The software gives a pass, fail or warning flag for each test based on what we would expect from a regular DNA-sequencing run. It is important to realise that FastQC does not understand the origin of your data and that different datasets will have different characteristics. For instance RNA sequencing often involves the use of random hexamer primers that are not as random as you might expect. The profile below in the first ~15 bases is perfectly normal for these samples but will be flagged as an error by FastQC:
Visit the QCFail website for more examples and advice on quality control for NGS datasets.
We can view summaries of multiple reports at once by using multiqc:
multiqc -o fastq fastq
MultiQC searches for report files in a directory and compiles them into a single report. Open the multiqc_report.html file in the fastq directory via a web browser to see how the raw datasets compare. Here we have the output of FastQ_screen and FastQC, but MultiQC works with the outputs of many other tools which we’ll see later.
If we look at the adapter content and overrepresented sequences sections we can see a small amount of contamination, particularly at the ends of the sequencing reads.
Up until now we have run command line tools on each one of our datasets in serial, meaning they run one after the other. In this tutorial we only have a few small datasets and the tools run relatively quickly, but this approach won't scale well to multiple large datasets. A more efficient approach is to create a script that will run all of our datasets in parallel.
Unix has a program called parallel which allows you to run tools on multiple datasets at the same time. The following command would list all of your gzipped fastq files and pipe “|” them into parallel.
ls fastq/*fq.gz | parallel -j 6 fastqc {} &
This will rerun the fastqc jobs and should be much quicker. The trailing & runs the jobs in the background so you can keep using the terminal; you can check on the running processes with:

ps f
Putting all of your commands into a script is good practice for keeping track of your analysis and for reproducibility. We want to write scripts so they can be used again on different datasets and avoid hardcoding the names of files.
First, let's create a file that lists our sample names so we can feed this into our pipeline. We could just type this manually or use a spreadsheet and export it as a tab-separated file. We will use a bit of unix text processing; don't worry too much if you don't understand the command.
ls fastq/*fq.gz | parallel basename | sed s/.fq.gz// > samples.txt
Here, the fastq file names are 'piped' into parallel as above, but we use a regular expression within sed to remove the file name suffix. The output file samples.txt now contains a list of our sample names. Take a look at the file with the less command. You can type q to exit.
less samples.txt
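Given the six datasets in the table above, samples.txt should contain one name per line:

Control_1
Control_2
Control_3
MOV10_OE_1
MOV10_OE_2
MOV10_OE_3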
Now let's create our pipeline script. We'll start by creating and opening a new file called pipeline.sh. The .sh extension stands for shell, which is essentially just a name for the command line environment.
emacs -nw pipeline.sh
Emacs is a Unix text editor and the -nw flag runs it inside the terminal rather than opening a new graphical window. Here we are going to paste all of the commands we have used so far:
## PREPARATION
## Create a web directory to view output
ln -s $PWD ~/public_html/
## Create a folder for fastq files
mkdir fastq
## Link the raw data to the fastq folder
ls /library/training/RNA-seq_analysis/data/lesson_1/fastq/*fq.gz | parallel -j 6 ln -s {} fastq/
## Link the annotation folder
ln -s /library/training/RNA-seq_analysis/annotation .
## QC
## FastQ Screen
fastq_screen --conf /homes/genomes/tool_configs/fastq_screen/fastq_screen.conf fastq/*fq.gz --outdir fastq
## FastQC
fastqc fastq/*.fq.gz
##Multi QC
multiqc -o fastq fastq
## PREPROCESSING
The # symbol indicates a comment: anything following a # on a line will not be run by the command line. It is good practice to comment your code.
See if you can adapt the QC portion of the script to use parallel.
Hint: cat samples.txt will print the names of all of the samples.
cat samples.txt | parallel -j 6 "fastq_screen --conf /homes/genomes/tool_configs/fastq_screen/fastq_screen.conf fastq/{}.fq.gz --outdir fastq"
cat samples.txt | parallel -j 6 "fastqc fastq/{}.fq.gz"
To save and exit the file press ctrl-x followed by ctrl-c, then y to confirm saving.
Let’s move on with our RNA-seq analysis. Every time we run a tool that creates new output we should add it to our pipeline.sh file. These code blocks will be highlighted in blue. We will learn how to run this script and the full analysis in one go later on.
Although shell scripts save all of your commands they do not necessarily track the versions of software that you use or the current state of your environment and other tool dependencies. They are also unaware of the workflow of your data. More advanced methods of pipelining and containerisation are recommended for fully reproducible analyses and we will see an example of this using NextFlow in the next lesson.
From the FastQC report we can see that the overall quality of our sequencing is good, however it is good practice to perform some pre-processing and filtering of reads. Poor quality sequencing can make a read less alignable, so it is good practice to quality trim the ends of reads until we get to the high quality portion. Trimming is not always necessary as some mapping programs will trim the reads for you or perform soft clipping where only part of a read is required to align, but studies have shown that pre-processing generally improves alignment rate if done correctly.
Sequencing libraries are normally constructed by ligating adapters to fragments of DNA or RNA. If your read length is longer than your fragment then sequenced reads will contain the adapter sequence. Adapter removal is a necessary consideration for your QC workflow, especially if adapters are detected by FastQC.
An example of adapter contamination at the end of reads:
Once reads have been trimmed they will vary in length. You may want to filter out reads that are now too short to be uniquely mapped; a cutoff of 20-30bp is standard.
In this workshop we will use Trim Galore! for adapter and quality trimming. It is a wrapper around the popular tool Cutadapt which finds and removes unwanted sequences from high-throughput sequencing reads.
Cutadapt can perform quality trimming, adapter removal and read filtering as well as many other operations to prepare your reads for optimal alignment. We will run Trim Galore! with the following parameters:
## Run this code then add it to your pipeline.sh file!
mkdir trim_galore ## Create a new folder for trimmed data
cat samples.txt | parallel -j 6 "trim_galore --fastqc -q 20 --illumina --length 20 --cores 4 -o trim_galore fastq/{}.fq.gz"
To view a trim_galore report:
less trim_galore/Control_1.fq.gz_trimming_report.txt
Let’s compare the fastqc reports. Run multiqc on the trimmed data and compare this with the reports for the raw data.
multiqc -f -o trim_galore trim_galore
Remember to add the trim_galore and multiqc steps to your pipeline.sh file.
Trim Galore simplifies the Cutadapt command and also runs fastqc on the trimmed data. However, it is worth looking through the full functionality of Cutadapt which has many options for filtering and trimming reads. Trimmomatic is another tool which has a very sensitive algorithm for detecting adapters, especially in paired-end reads.
It may be worth looking into a few other tools for quality control of RNA-seq data.
There are many tools available for mapping reads to genomes and transcriptomes, each with their own purposes and strengths. RNA-seq data requires a splice-aware aligner that can handle splice junctions in the sequencing reads. The two most popular aligners are STAR, which is very fast, and HISAT2, which is very accurate.
First, we need to select a reference genome to align to. Every time a reference genome is released or updated it is given a new name, often referred to as the genome build or assembly (e.g. hg18, hg19, hg38). It is important to realise that different builds of the same genome are different sequences and thus their co-ordinate systems are incomparable. For instance, position 10000000 on chr1 is T in hg19 and G in hg38.
We are going to map our reads to the latest release of the human genome (hg38 or GRCh38). We need to create an index file from the GRCh38 fasta sequence so that STAR can quickly access the reference sequences. Many of these indexes are pre-computed on our servers and stored under the /homes/genomes directory for everyone to use. The indexes for this workshop are available in the annotation folder we linked earlier.
The code below shows how these were generated with STAR. You do not need to run this again.
## Download sequence files and latest annotations from Ensembl
#wget http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ## Use primary assembly without Alt contigs but including scaffolds
#wget http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz
#gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
#gzip -d Homo_sapiens.GRCh38.106.gtf.gz
## Build STAR index
#STAR --runMode genomeGenerate --genomeDir STAR_index_hg38.ensembl106 --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile Homo_sapiens.GRCh38.106.gtf --sjdbOverhang 100 --runThreadN 20
Now that we have an index we can align our reads to the human genome with STAR.
mkdir STAR ## make a new folder for read alignments
cat samples.txt | parallel -j 6 "STAR --genomeDir annotation/STAR_index_hg38.ensembl106 --readFilesCommand zcat --readFilesIn trim_galore/{}_trimmed.fq.gz --runThreadN 4 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix STAR/{}. --sjdbGTFfile annotation/Homo_sapiens.GRCh38.106.gtf --outWigType wiggle --outWigNorm RPM"
We are using the following parameters:

- --genomeDir: the folder containing the STAR index
- --readFilesCommand zcat: decompress the gzipped fastq files on the fly
- --readFilesIn: the trimmed fastq file for each sample
- --runThreadN 4: the number of threads to use per job
- --outSAMtype BAM SortedByCoordinate: output a coordinate-sorted BAM file
- --outFileNamePrefix: prefix the output files with the sample name
- --sjdbGTFfile: the GTF file of gene annotations
- --outWigType wiggle: also output read coverage as wiggle files
- --outWigNorm RPM: normalise the wiggle coverage to reads per million

Note that we do not strictly need to include the --sjdbGTFfile here as it was already used in creating the index. We can provide the annotations at either step or provide a different file with additional annotation.
We recommend reading the STAR manual to fully understand all of the parameters and output options. Running alignment software with default parameters may not be the best option for your data.
When your alignment has completed, take a look at the output STAR report:
less STAR/Control_1.Log.final.out
The standard output for most mapping software is SAM (sequence alignment/map format). SAM files contain many columns that describe the position of each alignment as well as information on the quality of the alignment, mismatches, the number of times a read mapped, mapping of paired ends and other custom flags and statistics.
SAM files can be very large so there are compressed alternatives BAM and CRAM. The samtools package has many useful tools for viewing, filtering and manipulating files in SAM format. We will use some of these below.
Take a look at the SAM format specification and the first few lines of your BAM output using samtools.
samtools view STAR/Control_1.Aligned.sortedByCoord.out.bam | less
The second column is the SAM flag and contains coded information about each alignment.
Use the Explain SAM flags resource to find out more about the alignments in your file. They appear to contain the flags 0, 16, 256 and 272.
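samtools can also decode these flags for you with its flags subcommand; a flag of 0 simply means no bits are set (a forward-strand, primary alignment):

samtools flags 16 ## REVERSE: aligned to the reverse strand
samtools flags 256 ## SECONDARY: not the primary alignment
samtools flags 272 ## REVERSE,SECONDARY: 16 + 256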
We can also see the sam file header using the -H flag; the header contains information on the parameters and indexes used to create the file.
samtools view -H STAR/Control_1.Aligned.sortedByCoord.out.bam | less
We can index our bam files with samtools index. This creates an index file for quick programmatic access to the binary file. Indexing is required for some of the samtools programs to work, as well as for viewing the reads on genome browsers.
cat samples.txt | parallel -j 6 "samtools index STAR/{}.Aligned.sortedByCoord.out.bam" ## Index the bam files
The command samtools idxstats outputs the number of reads aligned to each sequence in our reference.
samtools idxstats STAR/Control_1.Aligned.sortedByCoord.out.bam
The third column represents the number of alignments.
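To total the alignments across all reference sequences, you can sum that column with a small awk one-liner:

samtools idxstats STAR/Control_1.Aligned.sortedByCoord.out.bam | awk '{sum += $3} END {print sum}' ## Sum column 3 over all sequences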
Now that we have aligned our reads we may want to do some filtering before any downstream analysis. Make sure you are aware of the alignments that are reported by your mapping program and the parameters used. For instance, are unmapped reads included in the BAM file? Are all alignments to repeats reported or just one? Are paired-end alignments still reported if only one end maps?
There are many ways to filter your BAM files with samtools and other programs to remove unwanted alignments that may negatively affect your downstream analysis. First, let's look at the alignments contained in one of our bam files with samtools flagstat:
samtools flagstat STAR/Control_1.Aligned.sortedByCoord.out.bam
Flagstat seems to report more alignments than the number of reads shown in our STAR report, how can this be? Our BAM file contains one record per alignment and some reads may align to multiple locations. STAR assigns each read a primary alignment, which is selected randomly from the top scoring alignments.
We can use samtools view -f to include, and -F to exclude, reads with a given SAM flag. The flag 260 is the sum of 4 (read is unmapped) and 256 (not primary alignment). This will filter unmapped reads and non-primary alignments to give us a single alignment for each mapped read.
cat samples.txt | parallel -j 6 "samtools view -F 260 -bh -o STAR/{}.primary.bam STAR/{}.Aligned.sortedByCoord.out.bam; samtools index STAR/{}.primary.bam"
The -bh option tells samtools view to output in BAM format and to include the header. If we run samtools flagstat on this filtered bam file we should now have one alignment per mapped read.
samtools flagstat STAR/Control_1.primary.bam
Multimap and duplicate reads are often confused so it is important to understand what these are and how they affect your data:

- Multimap reads are reads that align to more than one position in the reference, typically because they originate from repeats or paralogous sequences.
- Duplicate reads are distinct reads that align to exactly the same position, usually identical sequences arising from PCR amplification or from independent copies of the same fragment.
Multimap reads are difficult to analyse as their ambiguity can confound results. Many applications require the use of unique alignments only, thus multimap reads often need to be removed from your BAM file.
Alignment tools assign a mapping quality to each read (column 5 in BAM) between 0 and 255 that describes the confidence in the alignment position.
You should be aware that mapping qualities differ between mapping tools. STAR uses mapping quality scores of 0 and 255 to represent multimap and unique reads respectively. Filtering out reads with a mapping quality < 255 means that all remaining reads align to a single unique position. We can use samtools view -q to filter based on mapping quality.
cat samples.txt | parallel -j 6 "samtools view -b -q 255 STAR/{}.primary.bam -o STAR/{}.unique.bam; samtools index STAR/{}.unique.bam"
Duplicate reads are often observed as tall spikes in your read depth profile where reads are stacked directly on top of each other. A high level of duplication in your library may be due to over amplification by PCR or contamination. In RNA-seq experiments the chances of selecting identical sequences are much higher as our sample size is much smaller (length of transcriptome vs length of genome) and multiple copies of the same sequence (transcripts) exist independently. Thus, it is common practice to retain duplicate reads.
Learn more about duplication and removing duplicate reads:
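For example, if you did need to remove duplicates in another experiment, samtools provides a markdup workflow. This is only a sketch with a placeholder input name; the fixmate step matters mainly for paired-end data, and our workshop data does not need deduplication:

samtools sort -n -o namesort.bam input.bam ## Sort by read name
samtools fixmate -m namesort.bam fixmate.bam ## Add mate score tags used by markdup
samtools sort -o possort.bam fixmate.bam ## Re-sort by coordinate
samtools markdup -r possort.bam dedup.bam ## Mark and (-r) remove duplicates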
In some cases you may wish to remove reads that align to specific regions of the genome. For instance, if you still have a lot of ribosomal RNAs that will skew your downstream analysis.
It is always a good idea to visually inspect your data on a genome browser. Has the sequencing worked as expected? Are there noticeable differences between samples by eye? Do RNA-seq reads align to the expected strand?
BAM files contain information about each read, which is great if you want to look at individual alignments and splice junctions, but they are typically large and slow to work with. If we are only interested in read coverage we can convert our BAM files to graphs of sequencing depth per base.
One common format is the wiggle (WIG) file: it contains genomic intervals, represented by chromosome and position co-ordinates, along with a score. In this case the score represents read coverage: how many reads overlap with that position. A compressed version of this format, called bigWig, is used by genome browsers.
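In its simplest form a wiggle file looks something like this (illustrative co-ordinates and coverage values; the exact layout varies):

variableStep chrom=1
14362 1
14363 3
14364 3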
You may have noticed that we already asked STAR to output wiggle files. STAR gives us wiggle files for both unique and unique+multimap reads and also splits these up by strand. Let’s take a look at one of these files:
head STAR/Control_1.Signal.Unique.str1.out.wig
Let's convert the unique files to bigWig format for visualisation. The tool wigToBigWig requires three arguments: an input file, a file of chromosome lengths, and the name of the output file.
mkdir visualisation ## Create a folder for visualisation files
cat samples.txt | parallel -j 6 "wigToBigWig STAR/{}.Signal.Unique.str1.out.wig annotation/Homo_sapiens.GRCh38.dna.primary_assembly.len visualisation/{}.unique.r.bw; wigToBigWig STAR/{}.Signal.Unique.str2.out.wig annotation/Homo_sapiens.GRCh38.dna.primary_assembly.len visualisation/{}.unique.f.bw"
Normalisation is the process of correcting read coverage levels in order to compare expression levels between genes and different samples. Normalisation methods need to account for several factors in RNA-seq data:

- Sequencing depth: samples sequenced more deeply produce more reads per gene.
- Gene length: at the same expression level, longer genes produce more reads than shorter genes.
- RNA composition: a few highly expressed genes can dominate a library and skew comparisons between samples.

Several RNA-seq normalisation strategies exist:

- RPM/CPM (reads/counts per million): scales counts by sequencing depth only.
- RPKM/FPKM (reads/fragments per kilobase per million): scales by depth and gene length.
- TPM (transcripts per million): scales by gene length first and then depth, so TPM values sum to the same total in every sample.
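In symbols, for gene i with r_i assigned reads and length l_i in kilobases, where N is the total number of mapped reads in millions (these are the standard definitions, not tied to any particular tool):

CPM_i = r_i / N
RPKM_i = r_i / (N x l_i)
TPM_i = 10^6 x (r_i / l_i) / sum over j of (r_j / l_j)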
You may have noticed that we used RPM normalisation to create our wiggle files earlier. This is the only option in STAR and is fine for visual QC of your data.
Of the options above, TPM normalisation is most suitable for comparing expression levels between genes and between samples. If you plan to perform comparative analysis on aligned data, we recommend using deepTools bamCoverage, which can generate TPM normalised bigWig files from BAM files. The bamCoverage tool also has several options for filtering, smoothing and normalising read coverage profiles.
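For example (a sketch with a placeholder output name; deepTools calls its TPM-equivalent option BPM, bins per million):

bamCoverage -b STAR/Control_1.unique.bam -o visualisation/Control_1.tpm.bw --normalizeUsing BPM --binSize 10 -p 4 ## Requires the .bai index created earlier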
Some analysis pipelines use genomeCoverageBed but be aware this does not output normalised values unless scaling factors are supplied explicitly.
None of these normalisation strategies are appropriate for differential expression analysis with gene count data. You will see later on that differential expression packages employ their own normalisation methods to handle sequencing depth and RNA composition.
IGV (Integrative Genomics Viewer) is a powerful genome browser from the Broad Institute. It is perfect for viewing your sequencing files as loading data is easy, common formats are understood, views are customisable and navigation is quick.
Let's also add our bam files to the visualisation folder.
cd visualisation
ln -s ../STAR/*unique.bam* .
cd ../
It is best to download the desktop app for use on your own machine but if you are using a university managed computer you can access the web app at https://igv.org/app/.
Open IGV and set the genome to hg38, then find your visualisation folder online. In the desktop version you can drag and drop files into IGV. If you are using the webapp you will need to download the files you require and open them using the tracks menu.
Load the two bigWig tracks for MOV10_OE_1 into IGV:
First, let’s navigate the chromosome:
Now let’s customise our view:
There are two large spikes on chr1 (forward strand) and chr6 (reverse strand).
Now load a corresponding BAM file MOV10_OE_1.unique.bam for one of your tracks. Make sure the .bai index file is in the same folder as the BAM. Navigate to the MOV10 gene; you should be able to see all of the aligned reads.
Use the menu for the BAM file to customise the way reads are displayed.
We recommend IGV as it allows you to quickly explore your data. Several other genome browsers exist and the Ensembl and UCSC websites are particularly useful for viewing your data online alongside published experimental data and genomic annotations that exist in their databases.
Once we have inspected our data and are happy with our alignments, we can quantify expression levels for annotated transcripts. Pseudo-alignment tools, such as Salmon, map sequencing reads directly to transcripts. They are extremely fast, can handle multi-map reads, and take into account multiple isoforms, transcript lengths and sequencing biases when estimating abundance.
Like STAR, Salmon requires an index for alignment. The indexes require fasta sequences of all coding and non-coding RNAs as well as the genomic sequence. The Salmon index for this workshop has been pre-computed using the code below. You do not need to run this as it will already be in the annotation folder you linked previously.
## Download cdna and ncrna fasta files from Ensembl
#wget http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
#wget http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
## Create Salmon indexes
## Concatenate hg38 cdna and ncrna
#zcat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz > Homo_sapiens.GRCh38.cdna.ncrna.fa.tmp
## Make sure header is transcript name only and remove transcript version number
#perl -lane 'if(m/^>/){$id=(split " ",$_)[0];$id=(split "\\.",$id)[0];print $id;}else{print $_;}' Homo_sapiens.GRCh38.cdna.ncrna.fa.tmp > Homo_sapiens.GRCh38.cdna.ncrna.fa
#rm Homo_sapiens.GRCh38.cdna.ncrna.fa.tmp
## Get genome chr sequences in decoys.txt
#grep "^>" Homo_sapiens.GRCh38.dna.primary_assembly.fa | sed s/">"//g | cut -f 1 -d " " > decoys.txt
## Cat transcriptome and genome to make gentrome file:
#cat Homo_sapiens.GRCh38.cdna.ncrna.fa Homo_sapiens.GRCh38.dna.primary_assembly.fa > Homo_sapiens.GRCh38.cdna.ncrna.gentrome.fa
#gzip Homo_sapiens.GRCh38.cdna.ncrna.gentrome.fa
## Build Salmon index
#salmon index -t Homo_sapiens.GRCh38.cdna.ncrna.gentrome.fa.gz -d decoys.txt -p 24 -i hg38.cdna.ncrna.salmon.index -k 31
Creating Salmon indexes is more complicated than STAR but the bioinformatics team can help you out if you need to do this. Several pre-computed indexes exist in /homes/genomes. You can also read more about the method here.
We can now run Salmon to estimate expression levels of transcripts.
Biases in sequencing data can lead to over- or under-estimation of the expression of certain transcripts, and Salmon has some functionality to overcome this. We will use the salmon quant command with the following arguments:

- -i: the Salmon index
- -l SR: the library type (see below)
- -r: the fastq file of unpaired reads
- -p 5: the number of threads to use per job
- --seqBias: correct for sequence-specific biases
- --gcBias: correct for fragment GC bias
- -o: the output folder for each sample
mkdir salmon ## Make a folder for salmon output
cat samples.txt | parallel -j 6 "salmon quant -i annotation/hg38.cdna.ncrna.salmon.index -l SR -r trim_galore/{}_trimmed.fq.gz -p 5 --seqBias --gcBias -o salmon/{}"
The library type describes the type of reads used in the experiment. Here we have single (S), reverse-stranded (R) reads. Check the documentation for other arguments.
The main output of Salmon is a quant.sf file, take a look at one of these files:
less salmon/Control_1/quant.sf
For each transcript we have five columns: Name (the transcript ID), Length (the annotated transcript length), EffectiveLength (see below), TPM (transcripts per million) and NumReads (the estimated number of reads assigned to the transcript).
What exactly is the effective length?
The sequence composition of a transcript affects how many reads are sampled from it. While two transcripts might have identical length, depending on the sequence composition we are more likely to generate fragments from one versus the other. The transcript that has a higher likelihood of being sampled will end up with the larger effective length. The effective length is thus a corrected transcript length that accounts for sequence-specific and GC biases.
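As a sanity check, TPM can be recomputed from NumReads and EffectiveLength (an awk sketch, assuming the standard quant.sf column order of Name, Length, EffectiveLength, TPM, NumReads):

awk 'NR>1 && $3>0 {rate[$1]=$5/$3; total+=$5/$3} END {for (t in rate) printf "%s\t%.2f\n", t, rate[t]/total*1e6}' salmon/Control_1/quant.sf | head ## reads per effective base, rescaled to sum to one million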
We now have our alignments (BAM), visualisation files (bigWig) and transcript quantification (quant.sf) and this is normally a branching point for downstream analyses.
The last thing we want to do is run MultiQC on our folder to compile the reports from Trim Galore, STAR and Salmon:
multiqc -f .
Take a look at the multiqc report in your public folder.
Files are large and disk space is expensive, so remove any unwanted or temporary files from your folder. We should always keep the raw data (fastq) and our final processed datasets (BAM, bigWig, quant.sf etc) as well as the code we used to generate them. Where you can, convert files to compressed or binary formats to save space, e.g. fq to fq.gz, SAM to BAM, wig to bigWig.
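For example (illustrative file names):

gzip reads.fq ## fq -> fq.gz
samtools view -b -o aln.bam aln.sam ## SAM -> BAM

For this tutorial we just need to remove the intermediate files: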
rm trim_galore/*trim*fq.gz #Remove trimmed fastq temporary files
rm STAR/*.wig #Remove wig files once converted to bigWig
Your pipeline.sh script should look something like this (don’t run this code again!):
## PREPARATION
## Create a web directory to view output
ln -s $PWD ~/public_html/
## Create a folder for fastq files
mkdir fastq
## Link the raw data to the fastq folder
ls /library/training/RNA-seq_analysis/data/lesson_1/fastq/*fq.gz | parallel -j 6 ln -s {} fastq/
## Link the annotation folder
ln -s /library/training/RNA-seq_analysis/annotation .
## QC
## FastQ Screen
cat samples.txt | parallel -j 6 "fastq_screen --conf /homes/genomes/tool_configs/fastq_screen/fastq_screen.conf fastq/{}.fq.gz --outdir fastq"
## FastQC
cat samples.txt | parallel -j 6 "fastqc fastq/{}.fq.gz"
##Multi QC
multiqc -o fastq fastq
## PREPROCESSING
mkdir trim_galore ## Create a new folder for trimmed data
cat samples.txt | parallel -j 6 "trim_galore --fastqc -q 20 --illumina --length 20 --cores 4 -o trim_galore fastq/{}.fq.gz"
multiqc -f -o trim_galore trim_galore
## ALIGNMENT
mkdir STAR ## make a new folder for read alignments
## run STAR
cat samples.txt | parallel -j 6 "STAR --genomeDir annotation/STAR_index_hg38.ensembl106 --readFilesCommand zcat --readFilesIn trim_galore/{}_trimmed.fq.gz --runThreadN 4 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix STAR/{}. --sjdbGTFfile annotation/Homo_sapiens.GRCh38.106.gtf --outWigType wiggle --outWigNorm RPM"
## index STAR
cat samples.txt | parallel -j 6 "samtools index STAR/{}.Aligned.sortedByCoord.out.bam"
## filter for primary reads
cat samples.txt | parallel -j 6 "samtools view -F 260 -bh -o STAR/{}.primary.bam STAR/{}.Aligned.sortedByCoord.out.bam; samtools index STAR/{}.primary.bam"
## filter for unique reads
cat samples.txt | parallel -j 6 "samtools view -b -q 255 STAR/{}.primary.bam -o STAR/{}.unique.bam; samtools index STAR/{}.unique.bam"
## Visualisation
mkdir visualisation
## Convert wig to bigWig for both strands
cat samples.txt | parallel -j 6 "wigToBigWig STAR/{}.Signal.Unique.str1.out.wig annotation/Homo_sapiens.GRCh38.dna.primary_assembly.len visualisation/{}.unique.r.bw; wigToBigWig STAR/{}.Signal.Unique.str2.out.wig annotation/Homo_sapiens.GRCh38.dna.primary_assembly.len visualisation/{}.unique.f.bw"
## link BAM files to the visualisation folder (the bigWigs are already written there)
cd visualisation
ln -s ../STAR/*unique.bam* .
cd ../
## Transcript abundance
mkdir salmon ## Make a folder for salmon output
## run Salmon
cat samples.txt | parallel -j 6 "salmon quant -i annotation/hg38.cdna.ncrna.salmon.index -l SR -r trim_galore/{}_trimmed.fq.gz -p 5 --seqBias --gcBias -o salmon/{}"
## Final report
multiqc -f .
## Tidy up
rm trim_galore/*trim*fq.gz #Remove trimmed fastq temporary files
rm STAR/*.wig #Remove wig files once converted to bigWig
With this script we can now run everything from start to end in one go:
mkdir RNA-seq_workshop_tmp #Create a temporary directory
cd RNA-seq_workshop_tmp #Move into that directory
cp ../pipeline.sh ../samples.txt . #Copy the pipeline and samples file into the new directory
bash pipeline.sh & #Run the shell script (See Below)
cd .. #Move back to the main directory
We use the unix tool bash and the name of our script to run our commands through the shell. We use the & to run everything in the background so we can continue using the terminal. You can keep track of your pipeline by using ps.
ps f
This will take several minutes to run so it’s best to set it running then take a bit of a break. When the pipeline completes take a look and make sure all of the output has been created.