This vignette describes how to prepare data and metadata for a gene expression meta-analysis according to the method of Hughey and Butte (2015).

Find relevant datasets

The first step is to find microarray datasets relevant to what you’re interested in. Most datasets will be on NCBI GEO or ArrayExpress, but not all, so it’s a good idea to also search the literature.

Create the table of study metadata

The study metadata should be a comma-delimited text file, with one row for each study and at least the following columns:

Name Description
study Name of the study, which must be unique.
studyDataType Indicates how the expression data is stored (see below for details).
platformInfo Microarray platform, used for mapping probes to genes (see below for details).

There are currently five options for studyDataType:

studyDataType Description
affy_geo Raw Affymetrix data from a GEO study.
affy_custom Raw Affymetrix data from a non-GEO study (e.g., ArrayExpress).
affy_series_matrix Normalized, untransformed, probe-level Affymetrix data in a GEO series matrix file.
series_matrix Normalized, log-transformed (or equivalent) data in a GEO series matrix file.
eset_rds Normalized, log-transformed (or equivalent) data, already mapped to Entrez Gene IDs, saved as an ExpressionSet in an RDS file.

The options for platformInfo depend on the studyDataType:

studyDataType platformInfo
affy_geo, affy_custom, or affy_series_matrix Name of corresponding BrainArray custom cdf
series_matrix Corresponding GPL identifier
eset_rds ready

Download the gene expression data

The format of the gene expression data for each study should correspond to its studyDataType. All the folders and/or files with expression data should be in one parent folder.

studyDataType Format of expression data if name of study is GSE98765
affy_geo or affy_custom Folder named GSE98765 containing cel or cel.gz files.
affy_series_matrix or series_matrix File from GEO named GSE98765_series_matrix.txt.
eset_rds RDS file named GSE98765.rds containing a Bioconductor ExpressionSet.

If necessary, download packages and files for mapping Affy probe sets to Entrez Gene IDs

  1. For studies whose studyDataType is affy_geo or affy_custom, install the custom CDF package(s). See the installCustomCdfPackages documentation for details.

  2. For studies whose studyDataType is affy_series_matrix, download the custom CDF mapping(s). See the downloadCustomCdfMappings documentation for details.

Check that all studies whose studyDataType == 'series_matrix' are supported

  1. Open R and execute the following commands, replacing “<path to study metadata file>” as appropriate.

    studyMetadata = read.csv('<path to study metadata file>', stringsAsFactors = FALSE)
    metapredict::getUnsupportedPlatforms(studyMetadata)
  2. If any platforms are not supported but you still want to include those studies in the meta-analysis, you will need to edit the source of the metapredict package.
    1. In the getStudyData() function, add another else if statement that tells the function how to map probes to Entrez Gene IDs for that platform. Look at the code for currently supported platforms to see examples of how this is done.
    2. Add the GPL for the microarray platform to the character vector of supported platforms in the getSupportedPlatforms() function.
    3. Submit a pull request to https://github.com/hugheylab/metapredict, so other people can analyze data from that platform without going through the same hassle!

Create the table of sample metadata

The sample metadata should be a comma-delimited text file, with one row for each sample and at least the following columns:

Name Description
study Name of the corresponding study.
sample Name of the sample, which must be unique across all studies.
outcome, class, or something similar Variable that the meta-analysis will be trying to predict.

The format of the sample names depends on the studyDataType:

studyDataType Format of sample names
affy_geo or affy_custom Names of the .cel or .cel.gz files (excluding the file extension).
affy_series_matrix or series_matrix Names of the GSM identifiers from the series matrix file.
eset_rds colnames of the expression matrix in the ExpressionSet.