This vignette goes through an example of a gene expression meta-analysis according to the method of Hughey and Butte (2015). Running this meta-analysis for the first time requires an internet connection, as some SOFT files must be fetched from GEO.

Before running the meta-analysis, you’ll need to download the data and install the custom cdf packages as described at vignette('prepare_example').

Introduction

The meta-analysis will use data from the following four studies:

Each study contains samples from various lung cancer subtypes. For this meta-analysis, we are going to train a predictive model for classifying a sample as adenocarcinoma (AD), squamous cell carcinoma (SQ), or small-cell lung carcinoma (SCLC), which will identify a set of genes whose expression varies consisently between subtypes across the multiple studies.

Set the parameters for the meta-analysis

Open R and set the working directory to the metapredict_example folder that you previously created. Then we’ll load the necessary packages and set the parameters for the meta-analysis. After running the meta-analysis once, you can change denovo to FALSE. In this example, we’re going to use three studies for cross-validation and the fourth as validation.

library('metapredict')
library('ggplot2')
library('RColorBrewer')

parentFolderPath = 'expression_data'
expressionDataFile = 'expression_data.rds'
denovo = TRUE

foldidColname = 'study'
className = 'class'
family = 'multinomial'
alpha = 0.9
discoveryStudyNames = c('Bhattacharjee', 'GSE11969', 'GSE29016')
classLevels = c('AD', 'SQ', 'SCLC')

Load the study and sample metadata

For this meta-analysis, the files of study metadata and sample metadata have already been created.

studyMetadataPath = system.file('extdata', 'study_metadata.csv', package = 'metapredict')
studyMetadata = read.csv(studyMetadataPath, stringsAsFactors = FALSE)

sampleMetadataPath = system.file('extdata', 'sample_metadata.csv', package = 'metapredict')
sampleMetadata = read.csv(sampleMetadataPath, stringsAsFactors = FALSE)

Load the gene expression data

Now we’ll load the expression data for each study, performing normalization and transformation and mapping probes to genes, where necessary. Running getStudyDataList will take a few minutes.

if (denovo) {
  esetList = getStudyDataList(parentFolderPath, studyMetadata)
  saveRDS(esetList, file = expressionDataFile)
} else {
  esetList = readRDS(expressionDataFile)}
#> Loading study Bhattacharjee...
#> Loading required package: hgu95av2hsentrezgcdf
#> Loading required package: AnnotationDbi
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which, which.max, which.min
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: IRanges
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> 
#> Loading study GSE11969...
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)
#> Parsed with column specification:
#> cols(
#>   .default = col_double()
#> )
#> See spec(...) for full column specifications.
#> File stored at:
#> /var/folders/tr/2dlhwxf107gcfr6vx0gpg4580000gp/T//RtmpLETwVW/GPL7015.soft
#> 
#> Loading study GSE29016...
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   ID_REF = col_character()
#> )
#> See spec(...) for full column specifications.
#> File stored at:
#> /var/folders/tr/2dlhwxf107gcfr6vx0gpg4580000gp/T//RtmpLETwVW/GPL6947.soft
#> Loading study GSE30219...
#> Loading required package: hgu133plus2hsentrezgcdf
#> 

Cross-study normalization of discovery datasets

Now we’ll merge the discovery datasets and use ComBat to perform cross-study normalization.

ematList = extractExpressionData(esetList, sampleMetadata)
ematDiscovery = mergeStudyData(ematList[discoveryStudyNames], sampleMetadata)
#> Found3batches
#> Adjusting for0covariate(s) or covariate level(s)
#> Standardizing Data across genes
#> Fitting L/S model and finding priors
#> Finding parametric adjustments
#> Adjusting the Data

Leave-one-study-out cross-validation

Before running cross-validation, we’ll use the sample metadata to create the weights and foldid vectors that will be passed to cv.glmnet. metapredictCv returns a list, with one element for each value of alpha. Since we’re doing multinomial classification (AD, SQ, and SCLC), cv.glmnet uses the multinomial deviance as the measure of goodness of fit (the lower the better).

glmnetArgs = makeGlmnetArgs(sampleMetadata[sampleMetadata$study %in% discoveryStudyNames,],
                            foldidColname = foldidColname)

cvFitList = metapredictCv(ematDiscovery, sampleMetadata, yName = className,
                          weights = glmnetArgs$weights, foldid = glmnetArgs$foldid,
                          alpha = alpha, family = family, lambda.min.ratio = 0.001,
                          keep = TRUE)
cvFit = cvFitList[[1]]

plot(cvFit)

Coefficients of the best model from cross-validation

Let’s look at the coefficients for the best model from cross-validation (ignore the warning about “Stacking not well defined…”). This plot shows which genes are used to predict lung cancer subtype. For example, increased expression of any of three keratins (KRT17, KRT5, KRT6A) increases the probability that a sample is SQ. The number in parentheses next to each gene symbol is the Entrez gene ID.

fitResult = cvFit$glmnet.fit
lambda = cvFit$lambda.min

plotCoefficients(fitResult, lambda, classLevels = classLevels) +
  scale_fill_brewer(type = 'qual', palette = 3) +
  theme(axis.text.y = element_text(size = 8))

Expression of selected genes in the discovery set

We can also examine the expression of those selected genes in the merged expression data. In this heatmap, hierarchical clustering is used to determine the order of the genes and the order of the samples within each class.

annoNames = c('study', 'class')
annoLevels = list(discoveryStudyNames, classLevels)
names(annoLevels) = annoNames
annoColors = list(brewer.pal(3, 'Dark2'), brewer.pal(3, 'Paired'))
names(annoColors) = annoNames
names(annoColors[[1]]) = annoLevels[[1]]
names(annoColors[[2]]) = annoLevels[[2]]

plotExpressionHeatmap(fitResult, lambda, ematDiscovery, sampleMetadata, annoLevels,
                      annoColors, classLevels = classLevels, fontsize_row = 7)

Predict subtype for samples in the validation dataset

Now we’ll test the model on the last dataset. Internally, metapredict runs mergeStudyData to combine the discovery datasets and the validation dataset, then trains a predictive model on samples from the discovery dataset and predicts the subtype for samples from the validation dataset.

predsList = metapredict(ematList, studyMetadata, sampleMetadata, discoveryStudyNames,
                        alpha = alpha, lambda = lambda, weights = glmnetArgs$weights,
                        family = family)
#> Found4batches
#> Adjusting for0covariate(s) or covariate level(s)
#> Standardizing Data across genes
#> Fitting L/S model and finding priors
#> Finding parametric adjustments
#> Adjusting the Data

Confusion matrix for validation samples

One way to check prediction accuracy of a multinomial classifier is with a confusion matrix. These predictions look pretty good.

calcConfusionValidation(predsList, lambda = lambda, sampleMetadata = sampleMetadata,
                        classLevels = classLevels)
#> $GSE30219
#>          predictedClass
#> trueClass AD SQ SCLC
#>      AD   85  0    0
#>      SQ    9 52    0
#>      SCLC  4  0   17

Class probabilities for validation samples

Instead of looking at the single predicted subtype, we can also look at the probabilities for all four subtypes. Both the confusion matrix and this plot suggest that metapredict can combine data from multiple studies to learn an accurate classifier of lung cancer subtype.

plotClassProbsValidation(predsList, sampleMetadata, className, classLevels, size = 1.5,
                         ggplotArgs = list(scale_color_brewer(type = 'qual', palette = 3)))