RNA Sequencing Analysis

Direct sequencing of a cell’s RNA content (RNA-Seq) enables the identification of novel genes and splice forms and the detection of low-abundance transcripts and sequence variations such as single-nucleotide polymorphisms (SNPs). For transcriptome analysis, RNA-Seq has largely supplanted microarrays because it does not suffer from cross-hybridization and can accurately detect expression at the lower end of the transcriptome's dynamic range.

A trial version of a pipeline for the analysis of RNA-Seq data is available through the Translational Research Institute (TRI). The input to the pipeline is a set of FASTQ files containing short reads, produced, for example, by an Illumina HiSeq instrument. The reads are mapped to a reference genome, counted per gene, and normalized for further statistical analysis. The output is a set of tables of gene counts. The pipeline targets human samples, but it can be extended to other organisms with known genomes.

Requirements and Availability
The pipeline is implemented in R, a free, open-source language. It requires the Rsamtools, GenomicFeatures, and GenomicRanges R packages and the standalone Bowtie aligner. It runs on both personal computers and high-performance computing (HPC) systems. An HPC system is recommended when many samples require processing. The pipeline is available on an HPC machine at the University of Arkansas at Little Rock.

The first part of the pipeline aligns the short reads from a standard FASTQ file to a reference genome. This task is performed by Bowtie, an ultrafast, memory-efficient short-read aligner. It aligns short DNA sequences to the human genome at a rate of over 25 million 35-bp reads per hour. The pipeline uses the pre-built “H. sapiens, UCSC hg19” human index. Bowtie writes its output in the standard SAM format.
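
To make the alignment output concrete, the following is a minimal sketch of reading mapped reads from SAM text. It is written in Python for illustration only and is not part of the R pipeline. The Alignment record and the parse_sam function are hypothetical names, and the three example records are invented. The sketch relies only on the SAM conventions that header lines start with "@", that each alignment line has at least 11 tab-separated fields, and that the FLAG bits 0x4 and 0x10 mark unmapped and reverse-strand reads.

```python
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record holding the SAM fields used later for counting."""
    read_name: str
    chrom: str
    start: int      # 1-based leftmost mapped position (SAM POS field)
    strand: str     # "+" or "-", derived from the FLAG field
    sequence: str


def parse_sam(lines: Iterable[str]) -> Iterator[Alignment]:
    """Yield one Alignment per mapped read; skip headers and unmapped reads."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # FLAG bit 0x4: read is unmapped
        strand = "-" if flag & 0x10 else "+"  # FLAG bit 0x10: reverse strand
        yield Alignment(fields[0], fields[2], int(fields[3]), strand, fields[9])


# Invented example records: one mapped, one unmapped, one reverse-strand.
example_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t255\t35M\t*\t0\t0\t" + "A" * 35 + "\t" + "I" * 35,
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t" + "C" * 35 + "\t" + "I" * 35,
    "read3\t16\tchr2\t500\t255\t35M\t*\t0\t0\t" + "G" * 35 + "\t" + "I" * 35,
]
alignments = list(parse_sam(example_sam))
for aln in alignments:
    print(aln.read_name, aln.chrom, aln.start, aln.strand)
```

In the pipeline itself this step is handled by Rsamtools on the BAM file; the sketch only shows which fields of each record matter for the counting step.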

The second part of the pipeline produces hit counts for known genes. To summarize aligned reads by gene, the SAM output of Bowtie is first converted into the compressed BAM format using the Rsamtools package. The GenomicFeatures package is then used to define the location of genes in chromosome coordinates and to group known exons by gene. This package can download gene information from the University of California, Santa Cruz (UCSC) Genome Browser. Finally, the GenomicRanges package is used to compute gene counts from the overlaps between each aligned read and the exons of each Ensembl gene. Reads that overlap more than one target are ignored.
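
The counting rule described above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the GenomicRanges implementation: reads and exons are represented as (chromosome, start, end) tuples with inclusive coordinates, a "target" is taken to mean a gene, and the gene identifiers and coordinates are invented. The function names overlaps and count_reads_per_gene are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Return True if two inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def count_reads_per_gene(
    reads: List[Interval],
    exons_by_gene: Dict[str, List[Interval]],
) -> Dict[str, int]:
    """Count reads per gene, ignoring reads that hit more than one gene."""
    counts = {gene: 0 for gene in exons_by_gene}
    for chrom, start, end in reads:
        # Collect every gene with at least one exon overlapping this read.
        hit_genes = {
            gene
            for gene, exons in exons_by_gene.items()
            if any(c == chrom and overlaps(start, end, s, e) for c, s, e in exons)
        }
        # A read is counted only when it maps to exactly one gene.
        if len(hit_genes) == 1:
            counts[hit_genes.pop()] += 1
    return counts


# Invented annotation: two neighbouring genes on chr1.
exons_by_gene = {
    "geneA": [("chr1", 90, 200)],
    "geneB": [("chr1", 210, 300), ("chr1", 400, 450)],
}
# Invented 35-bp reads.
reads = [
    ("chr1", 100, 134),  # inside geneA only
    ("chr1", 190, 224),  # spans geneA and geneB, so it is ignored
    ("chr1", 250, 284),  # inside geneB only
    ("chr2", 100, 134),  # no annotated gene on chr2
]
gene_counts = count_reads_per_gene(reads, exons_by_gene)
print(gene_counts)
```

The linear scan over all genes is chosen for readability. On real data, with millions of reads and tens of thousands of genes, an indexed interval structure would be needed, which is the kind of work the pipeline delegates to GenomicRanges.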

The third part of the pipeline normalizes the counts. The reads per kilobase of transcript per million mapped reads (RPKM) and transcripts per million (TPM) methods are most commonly applied.
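
As an illustration of the two normalization methods, here is a minimal Python sketch using their standard definitions. It is not the pipeline's R code. Two assumptions are made: the library size is taken as the sum of the reads assigned to genes (some implementations use all mapped reads instead), and the gene length is the total exon length in base pairs. The function names, gene identifiers, counts, and lengths are invented for the example.

```python
from typing import Dict


def rpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """RPKM = count * 1e9 / (gene length in bp * total counted reads)."""
    total = sum(counts.values())  # assumption: library size = reads assigned to genes
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: counts[gene] * 1e9 / (lengths_bp[gene] * total) for gene in counts}


def tpm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """TPM = (count / length) scaled so that all values sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rates[gene] / total_rate * 1e6 for gene in counts}


# Invented example: geneB has three times the reads but is three times longer.
example_counts = {"geneA": 100, "geneB": 300}
example_lengths = {"geneA": 1000, "geneB": 3000}

rpkm_values = rpkm(example_counts, example_lengths)
tpm_values = tpm(example_counts, example_lengths)
print(rpkm_values)
print(tpm_values)
```

In this example both genes receive the same RPKM and the same TPM, because the longer gene's higher raw count is fully explained by its length. Unlike RPKM, TPM values always sum to one million within a sample, which makes proportions easier to compare across samples.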