Chapter 4 Session 4

In this session we will run through the basic steps for analysing a simple RNA-seq experiment using the limma-voom workflow. This includes:

  • filtering out lowly expressed genes
  • normalisation
  • creating a multidimensional scaling (MDS) plot
  • creating a design matrix
  • fitting gene-wise linear models (with empirical Bayes moderation to more accurately estimate gene-wise variability)
  • performing statistical testing for differential expression

The aim of this session is to give you experience with a real-world RNA-seq analysis, and making extensive use of an external library. We will not cover the statistics in any depth. In general analysis packages will want your data in some specific format, so it’s important to be able to manipulate the data to fit the package’s requirements.

Much of the materials here are explained in greater detail in the limma user’s guide. You can view this by typing help("limma") and following the links.

4.1 DGEList

The data we are looking at comes from three cell populations (basal, luminal progenitor (LP) and mature luminal (ML)) sorted from the mammary glands of female virgin mice, each profiled in triplicate.

Let’s start by creating our DGEList object. As a reminder, this object contains 3 key pieces of data:

  • counts: the main data of this object, a matrix of count values with samples along the columns and features/genes along the rows.
  • samples: a data frame containing annotation for the samples. The rows in this table describe the corresponding column of the counts data.
  • genes: a data frame containing annotation for the genes in the counts matrix. The rows in this table describe the corresponding row in the counts matrix.
# load required packages
library(edgeR)
library(limma)

# vector of file names
files <- dir(path = "data/counts", pattern = "GSM")
group <- factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"))
samplenames <- c("10_6_5_11", "9_6_5_11", "purep53", "JMS8-2", "JMS8-3", "JMS8-4", 
    "JMS8-5", "JMS9-P7c", "JMS9-P8c")


# create DGEList object
dge <- readDGE(files, path = "data/counts", columns = c(1, 3), group = group, labels = samplenames)

# add gene annotation information
dge$genes <- read.delim("data/Ses3_geneAnnot.tsv", stringsAsFactors = FALSE)

You can index the DGEList object by treating it as if it were the counts matrix, the object will handle the extraction and ordering of the sample and gene annotation data frames.

4.2 Filtering

The first step is to filter out lowly expressed genes. There are two main problems with low abundant genes:

  • Technical variation is more problematic for low abundance genes. This variation is thought to be due to two factors; insufficient mixing and low sampling fraction (McIntyre et al. 2011).
    • Insufficient mixing of solutions during library preparation can result in uneven distribution of reads.
    • RNA sequencing can be thought of as sampling. Measurement errors will occur simply due to the random nature of the sampling process. This problem affects lowly abundant RNA species more because the relative error for small count values is larger than it would be for more highly abundant RNA species.
  • Biologically, genes that are expressed at low, biologically not meaningful, levels are not of interest.

Removing these highly variable, lowly expressed genes increases your ‘power’ to detect differentially expressed genes (Bourgon, Gentleman, and Huber 2010), where ‘power’ is your ability to detect true positives. In testing for differential expression, a statistical test is conducted for each gene. When a high number of statistical tests are performed, a portion of them will be significant purely due to random chance. A common procedure to control for the number of false positive is to perform ‘multiple testing correction’ on the p-values. This adjusts the p-value in a way that reduces the number of false positives but comes at the cost of reduced power to detect true positives. If we filter out uninteresting, lowly expressed genes, we need to perform fewer statistical tests and reduce the impact that multiple testing adjustment has on detection power.

The filterByExpr() function provides an automatic way to filter genes.

Roughly speaking, by default, it keeps genes with a count of 10 or more, in at least as many samples as the smallest experimental group. In our experiment, there are 3 phenotype groups each with 3 samples. Therefore we retain only genes that have 10 or more counts in 3 or more samples.

More specifically, the actual filtering is done on counts per million, with similar result to the above criteria. This is to prevent bias against samples with small library sizes.

The output of this function is a vector of logicals, indicating which genes (rows) should be kept and which filtered.

keep <- filterByExpr(dge)
table(keep)
## keep
## FALSE  TRUE 
## 10555 16624
dge <- dge[keep, , keep.lib.sizes = FALSE]
dim(dge$counts)
## [1] 16624     9

We can see that we now have 16624 genes. We started with 27179 genes - meaning that ~40% of genes have been filtered out.

4.3 Normalisation

The aim of normalisation is to remove systematic technical effects. There are two main factors that need to be normalised for in RNA-seq:

  • Sequencing depth/library size - technically, sequencing a sample to half the depth will give, on average, half the number of reads mapping to each gene (Robinson and Oshlack 2010).
  • RNA composition - if a large number of genes are unique to, or highly expressed in, only one experimental condition, the sequencing capacity available for the remaining genes in that sample is decreased. For example, if there are only five genes being studied in two experimental groups, if one gene is particularly high in group A, then with limited sequencing depth, that gene will reduce the counts of the remaining four genes. The effect of this is that the remaining four genes appear under-expressed in group A compared to group B when the true amount of gene product is actually equal for these 4 genes (Robinson and Oshlack 2010).

Sequencing depth is accounted for by calculating the counts per million (cpm). This metric is calculated by:

  1. taking the library size (sum of all counts for a sample),
  2. dividing this by 1,000,000 to get the ‘per million’ scaling factor,
  3. then dividing all read counts for each gene in that sample by the ‘per million’ scaling factor

RNA composition can be accounted for by using more sophisticated normalisation methodologies. We will use ‘trimmed mean of M-values’ (TMM), which estimates relative RNA levels from RNA-seq data (Robinson and Oshlack 2010). Under the assumption that most genes are not differentially expressed, TMM calculates a library size scaling factor for each library (sample). This is done using the following steps:

  1. calculate the gene expression log fold changes and absolute expression values for pair-wise samples (selecting one sample from the experiment as a reference)
  2. remove the genes with the highest and lowest fold changes and absolute expression values
  3. take a weighted mean of the remaining genes (where the weight is the inverse of the approximate asymptotic variances). This gives the normalisation factor for each library (sample)

Subsequent steps in this analysis will use log-cpm values, calculated using the normalisation factors, which scales each library size.

We can calculate the normalisation factors, specifying that we want to use the "TMM" method:

dge <- calcNormFactors(dge, method = "TMM")

This function calculates the normalisation factors for each library (sample) and puts this information in the samples data frame. Note that it takes dge (our DGEList object as input) and returns a DGEList object as well.

Let’s take a look at our normalisation factors:

dge$samples
##                              files group lib.size norm.factors
## 10_6_5_11 GSM1545535_10_6_5_11.txt    LP 32857304    0.8943956
## 9_6_5_11   GSM1545536_9_6_5_11.txt    ML 35328624    1.0250186
## purep53     GSM1545538_purep53.txt Basal 57147943    1.0459005
## JMS8-2       GSM1545539_JMS8-2.txt Basal 51356800    1.0458455
## JMS8-3       GSM1545540_JMS8-3.txt    ML 75782871    1.0162707
## JMS8-4       GSM1545541_JMS8-4.txt    LP 60506774    0.9217132
## JMS8-5       GSM1545542_JMS8-5.txt Basal 55073018    0.9961959
## JMS9-P7c   GSM1545544_JMS9-P7c.txt    ML 21305254    1.0861026
## JMS9-P8c   GSM1545545_JMS9-P8c.txt    LP 19955335    0.9839203

These normalisation factors are all close to 1 for all samples, suggesting minimal difference in relative RNA levels between samples.

boxplot(log(dge$counts + 0.5))

boxplot(cpm(dge$counts, log = TRUE))

Challenge 4.1

  1. Create a boxplot of the normalisation factors versus group.

  2. rowSums() can be used on a matrix to take sums across the rows. Use this to retrieve annotations of all genes with more than 1,000,000 total counts across all samples.

  3. What are the 5 most highly expressed genes within the ML samples?

4.4 MDS plots

Before we perform statistical tests, it’s useful to perform some exploratory visual analysis to get an overall idea of how our data is behaving.

MDS is a way to visualise distances between sets of data points (samples in our case). It is a dimensionality reduction technique, similar to principal components analysis (PCA). We treat gene expression in samples as if they were coordinates in a high-dimensional coordinate system, then we can find “distances” between samples as we do between points in space. Then the goal of the algorithm is to find a representation in lower dimensional space such that points that the distance of two objects from each other in high dimensional space is preserved in lower dimensions.

The plotMDS() from limma creates an MDS plot from a DGEList object.

plotMDS(dge)

Each point on the plot represents one sample and is ‘labelled’ using the sample name. The distances between each sample in the resulting plot can be interpreted as the typical log2-fold-change between the samples, for the most differentially expressed genes.

We can change the labelling to use the name of the group the sample belongs to instead:

plotMDS(dge, labels = group)

This shows us that the phenotype groups tend to cluster together, meaning that the gene expression profiles are similar for samples within a phenotype group. The ‘Basal’ type samples quite close together while the ‘LP’ (luminal progenitor) and ‘ML’ (mature luminal) type samples are further apart, signifying that their expression profiles are more variable.

To make the three phenotype groups more distinct in our graph, we are going to colour samples from each group differently. To do this, we will use the col argument in plotMDS(). col takes in a vector the same length as the number of points in the plot (9 in our case, as there are 9 samples). Each element of the vector should be a colour name (R understands over 600 colour names), indicating what colour that sample should be.

To make this more clear, take a look at the table below, which lists all the samples and the phenotype group they belong to:

Samples Group
10_6_5_11 LP
9_6_5_11 ML
purep53 Basal
JMS8-2 Basal
JMS8-3 ML
JMS8-4 LP
JMS8-5 Basal
JMS9-P7c ML
JMS9-P8c LP

For example, let’s say we wanted LP samples to be coloured green, ML samples to be coloured red and Basal samples to be coloured blue. The col argument would then require a vector that we can generate as follows

group_col <- dge$samples$group
levels(group_col) <- c("blue", "green", "red")
group_col <- as.character(group_col)
group_col
## [1] "green" "red"   "blue"  "blue"  "red"   "green" "blue"  "red"   "green"

We can also add a legend to the figure by running the legend() function immediately after a new figure is created. We have to specify where to position the legend as well as the labels and colours within the legend.

plotMDS(dge, labels = group, col = group_col)
legend("topright", legend = c("Basal", "LP", "ML"), fill = c("blue", "green", "red"))

4.5 Linear modelling

The next step of the limma-voom analysis is to fit a linear model for each gene. A linear model is a broad class of statistical models that predict a variable of interest using one or more ‘explanatory’ (also called ‘predictor’) variables. The most basic type is linear regression, which models the relationship between a continuous variable and continuous or categorical ‘explanatory’ variables. It uses an equation that looks like this:

\(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}...\)

This equation is saying that a response variable of interest \(Y\) is equal to a constant (\(\beta_{0}\)) plus the sum of the covariates (\(X_{i}\)) each multiplied by a constant coefficient (\(\beta_{i}\)).

Our experiment is quite simple, since there is only a single covariate, the cell type. The true benefit of using linear models is in its ability to accommodate more complex designs including multiple covariates.

To fit the linear models in the limma-voom framework we need two objects in addition to our data:

  • A design matrix, representing the covariates.
  • A contrast matrix, representing the specific comparison we wish to make.

4.5.1 Design matrix

The design matrix specifies the values of the covariates for each sample. This is represented as a matrix due to the mathematical convenience.

To generate a design matrix. We use the function model.matrix(), with the expression ~0 + group. This returns a matrix representing the design where there is no intercept term and group is the only covariate. If we omit the 0 then there would be an intercept in the model, and if we included more covariates then more columns would be generated.

design <- model.matrix(~0 + group, data = dge$samples)
design
##           groupBasal groupLP groupML
## 10_6_5_11          0       1       0
## 9_6_5_11           0       0       1
## purep53            1       0       0
## JMS8-2             1       0       0
## JMS8-3             0       0       1
## JMS8-4             0       1       0
## JMS8-5             1       0       0
## JMS9-P7c           0       0       1
## JMS9-P8c           0       1       0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

There are 9 rows, one for each sample. Along the columns are the names of the groups. The values in the cells denote membership of the particular sample for a particular group, as our groups in this case are mutually exclusive, each row contains only a single 1 to denote membership in a single group.

4.5.2 Contrasts

‘Contrasts’ let us ask specific questions between our experimental groups. In our data we have 3 experimental groups, if we are to test for differential expression, we are most likely interested in differences between only two of the groups at a time. Contrasts let us specify exactly what we’re testing for, and is also represented by a matrix just like the design.

A contrast matrix can be made using the makeContrasts() function. Within this function, we specify the name of each specific contrast and the formula for that contrast. For example, the BasalvsLP contrasts compares the difference between the Basal and LP groups. Note that the name of the phenotype groups must be written exactly as they are in the column names of our design matrix (see above).

In addition to the individual contrasts, the function must know about the design of the model. This is passed through the levels argument, which either accepts a matrix with the column names corresponding to levels of your experimental groups, or the levels themselves as a character vector

contr.matrix <- makeContrasts(BasalvsLP = "Basal - LP", BasalvsML = "Basal - ML", 
    LPvsML = "LP - ML", levels = design)  # alternatively 'levels = colnames(design)'
contr.matrix
##        Contrasts
## Levels  BasalvsLP BasalvsML LPvsML
##   Basal         1         1      0
##   LP           -1         0      1
##   ML            0        -1     -1

Note that the sum of all the numbers along each column is 0. The first column is the contrast for the difference between Basal (1) and LP (-1). This property is required for valid contrast matrices. An alternative test may be between one group and the average of the others which would look like c(1, -0.5, -0.5) down one of the columns.

4.5.3 Variance modelling with voom

We are now ready to fit our linear models. Limma fits linear models to the data with the assumption that the underlying data is normally distributed. Count data is generally not normally distributed, but log transforming count data gives it a roughly normal distribution sufficient for linear models to work well. To do this, limma transforms the raw count data to log-cpm using library sizes and the normalisation factors we calculated previously.

In addition to the normalisation steps, the limma-voom pipeline uses the voom() function to generate weights for the individual genes based on a modelled mean-variance relationship. This modelling allows use to get more information out of small sample sizes as the weights prevent our model from being more heavily influenced by more variable data points.

The voom() function takes as arguments, our DGEList object and our design matrix. It also optionally outputs a plot of the mean-variance relationship of our data, called the ‘voom-plot’.

v <- voom(dge, design, plot = TRUE)

The output of voom() (our variable v) is an EList object which contains the following elements:

  • genes - a data frame of gene annotation data.
  • targets - data frame of sample data.
  • E - numeric matrix of normalised log-cpm values.
  • weights - numeric matrix of precision weights.
  • design - the design matrix.

Challenge 4.2

  1. Create the MDS plot with legend, but with solid dots instead of text within the plot.

  2. Which of the elements of the EList is newly generated, which are taken from previous objects?

4.5.4 Fitting the linear model

We are now ready to fit our linear model with lmFit(), which calculates coefficients we defined in our design matrix design. The resulting object, vfit is a MArrayLM object. It contains a information about our genes (the same data frame as genes from our EList object v above), the design matrix and a number of statistical outputs. Of most interest to us is the coefficients, stored in an element called coefficients. The first rows of this matrix is shown below. Each gene is row and is labelled using the EntrezID. Each column gives coefficients for each of our phenotype groups. These coefficients are weighted averages of the log-cpm of each gene in each group.

vfit <- lmFit(v, design)
head(vfit$coefficients)
##             Basal        LP        ML
## 497097  3.0241632 -4.490392 -3.944477
## 20671   0.2681245 -2.488746 -2.024896
## 27395   4.3271126  3.901078  4.365378
## 18777   5.2069566  4.976083  5.654066
## 21399   5.2108711  4.901842  4.876380
## 58175  -1.9296994  3.581328  3.133985

We can then use contrasts.fit() to calculate coefficients for each contrast (or ‘comparison’) we specified in our contr.matrix. The output is also an object of the class MArrayLM (also known as an MArrayLM object). When we inspect the coefficients element now, we can see that each column is a contrast that we specified in our contrast matrix.

vfit <- contrasts.fit(vfit, contrasts = contr.matrix)
head(vfit$coefficients)
##         Contrasts
##           BasalvsLP   BasalvsML      LPvsML
##   497097  7.5145557  6.96864007 -0.54591559
##   20671   2.7568708  2.29302094 -0.46384989
##   27395   0.4260347 -0.03826548 -0.46430022
##   18777   0.2308732 -0.44710891 -0.67798213
##   21399   0.3090294  0.33449065  0.02546125
##   58175  -5.5110274 -5.06368468  0.44734272

4.6 Statistical testing

The next step is to carry out statistical testing to determine which genes are differentially expressed. The function eBayes() computes moderated t-statistics, moderated F-statistics and log-odds of differential expression for each gene, given a fitted linear model. ‘Moderated’ refers to empirical Bayes moderation, which borrows information across genes to obtain more accurate measures of variability for each gene. This also increases our power to detect differentially expressed genes.

efit <- eBayes(vfit)

We can now look at the number of differentially expressed genes using the decideTests() function. The output of this function is a matrix where each column is a contrast (comparison of interest) and each row is a gene. The numbers 1, -1 and 0 mean up-regulated, down-regulated or not significantly differentially expressed, respectively.

Note that decideTests() also accounts for multiple testing. The default method is Benjamini and Hochberg (Benjamini and Hochberg 1995) but several others are also available.

dt <- decideTests(efit)
dt
## TestResults matrix
##         Contrasts
##          BasalvsLP BasalvsML LPvsML
##   497097         1         1      0
##   20671          1         1      0
##   27395          0         0      0
##   18777          0        -1     -1
##   21399          0         1      0
## 16619 more rows ...

To obtain the total number of differentially expressed genes for each comparison, we can add the function summary():

summary(dt)
##        BasalvsLP BasalvsML LPvsML
## Down        4500      4850   2701
## NotSig      7307      6996  11821
## Up          4817      4778   2102

The function topTable() can be used to obtain more information on the differentially expressed genes for each contrast. topTable() takes as arguments the MArrayLM object output by eBayes() (efit), the contrast name of interest and the number of top differentially expressed genes to output. Note that the contrast name must be given in quotes and must be exactly as written in the contrast matrix contr.matrix.

It outputs a data frame with the following information:

  • Gene details - gene information, from the gene element of the MArrayLM object (efit).
  • logFC - the log2 fold change of the contrast.
  • AveExpr - the average log2 expression of that gene.
  • t - moderated t-statistic.
  • P.Value - p value.
  • adj.P.Val - adjusted p value.
  • B - log-odds that the gene is differentially expressed.
top <- topTable(efit, coef = "BasalvsLP", n = Inf)
head(top)
##        ENTREZID   SYMBOL TXCHROM     logFC  AveExpr         t      P.Value
## 12521     12521     Cd82    chr2 -4.095479 7.069637 -35.46821 3.383539e-12
## 22249     22249   Unc13b    chr4 -4.350553 5.663171 -32.38571 8.648521e-12
## 16324     16324    Inhbb    chr1 -4.721417 6.460922 -30.80826 1.447137e-11
## 14245     14245    Lpin1   chr12 -3.768977 6.294017 -29.94805 1.937313e-11
## 218518   218518 Marveld2   chr13 -5.215232 4.930008 -30.87689 1.414332e-11
## 12759     12759      Clu   chr14 -5.306847 8.856581 -29.55253 2.221529e-11
##           adj.P.Val        B
## 12521  4.660973e-08 18.41326
## 22249  4.660973e-08 17.42652
## 16324  4.660973e-08 17.06291
## 14245  4.660973e-08 16.85869
## 218518 4.660973e-08 16.76959
## 12759  4.660973e-08 16.70221

With that we can complete our analysis by writing out some results

write.csv(top, file = "BasalvsLP.csv")

4.7 MA Plot

The MA plot is a plot of log-fold-change (M-values) against log-expression averages (A-values), this is a common plot in RNA sequencing analysis to visualise the result of differential expression tests. It can be created using the plotMA() from the limma package. Creating this plot requires 3 pieces of information:

  • object = efit: The the fitted object containing the log-fold-change and log-expression averages
  • coef = 1: The column number of the contrast to plot since there are 3 different contrasts fitted within the object.
  • status = dt[, 1]: A vector of numerics denoting whether a gene is up-regulated or down-regulated.
plotMA(efit, coef = 1, status = dt[, "BasalvsLP"])

We can also save this plot programmatically as a PDF for further editing. To do this we use pdf() to turn on the pdf capture device, run the command that creates the plot, which is now captured by the pdf, and then turn the device off.

pdf(file = "BasalvsLP-MAPlot.pdf")
plotMA(efit, coef = 1, status = dt[, "BasalvsLP"])
dev.off()

Challenge 4.3

  1. What are the symbols of the top 5 most differentially expressed genes between basal and luminal progenitor samples in chromosome X based on adjusted p-value?

  2. Create a volcano (scatter) plot of log-fold-change on the x-axis and -log10(p-value) on y-axis.

References

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society: Series B (Methodological) 57 (1): 289–300.

Bourgon, Richard, Robert Gentleman, and Wolfgang Huber. 2010. “Independent Filtering Increases Detection Power for High-Throughput Experiments.” Proceedings of the National Academy of Sciences 107 (21): 9546–51.

McIntyre, Lauren M, Kenneth K Lopiano, Alison M Morse, Victor Amin, Ann L Oberg, Linda J Young, and Sergey V Nuzhdin. 2011. “RNA-Seq: Technical Variability and Sampling.” BMC Genomics 12 (1): 293.

Robinson, Mark D, and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of Rna-Seq Data.” Genome Biology 11 (3): R25.