Creating Annotated Data Frames from GEO with the GEOquery package

In this post, we will go over how to use the GEOquery package to download a data matrix (or eset object) directly into R and append specific probe annotation information to this matrix for it to be exported as a csv file for easy manipulation in Excel or spreadsheet tools. This is especially useful for sharing data with collaborators who are not familiar with R and would rather look up there favorite genes in a spreadsheet format.

Advertisements

Mining gene expression data from publicly available databases is a great way to find evidence to support you working hypothesis that gene X is relevant in condition Y. You may also want to mine publicly available data to build on an existing hypothesis or simply to find additional support for your favorite gene in a different animal model or experimental condition. In this post, we will go over how to use the GEOquery package to download a data matrix (or eset object) directly into R and append specific probe annotation information to this matrix for it to be exported as a csv file for easy manipulation in Excel or spreadsheet tools. This is especially useful for sharing data with collaborators who are not familiar with R and would rather look up there favorite genes in a spreadsheet format.

First, let’s start by opening an R session and creating a function to return the eset (ExpressionSet) object or the original list object downloaded by the getGEO() function in R.


getGEOdataObjects <- function(x, getGSEobject=FALSE){
# Make sure the GEOquery package is installed
require("GEOquery")
# Use the getGEO() function to download the GEO data for the id stored in x
GSEDATA <- getGEO(x, GSEMatrix=T, AnnotGPL=FALSE)
# Inspect the object by printing a summary of the expression values for the first 2 columns
print(summary(exprs(GSEDATA[[1]])[, 1:2]))

# Get the eset object
eset <- GSEDATA[[1]]
# Save the objects generated for future use in the current working directory
save(GSEDATA, eset, file=paste(x, ".RData", sep=""))

# check whether we want to return the list object we downloaded on GEO or
# just the eset object with the getGSEobject argument
if(getGSEobject) return(GSEDATA) else return(eset)
}

We can test this function on a GEO dataset such as GSE73835 as follows:


# Store the dataset ids in a vector GEO_DATASETS just in case you want to loop through several GEO ids
GEO_DATASETS <- c("GSE73835")

# Use the function we created to return the eset object
eset <- getGEOdataObjects(GEO_DATASETS[1])
# Inspect the eset object to get the annotation GPL id
eset 

You will see the following output:

ExpressionSet (storageMode: lockedEnvironment)
assayData: 45281 features, 6 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM1904293 GSM1904294 … GSM1904298 (6 total)
varLabels: title geo_accession … data_row_count (35 total)
varMetadata: labelDescription
featureData
featureNames: ILMN_1212602 ILMN_1212603 … ILMN_3163582 (45281 total)
fvarLabels: ID Species … ORF (30 total)
fvarMetadata: Column Description labelDescription
experimentData: use ‘experimentData(object)’
Annotation: GPL6887

We will first need to download the annotation file for GPL6887. Then we can create a data frame with the probe annotation categories we are most interested in as follows:


# Get the annotation GPL id (see Annotation: GPL10558)
gpl <- getGEO('GPL6887', destdir=".")
Meta(gpl)$title

# Inspect the table of the gpl annotation object
colnames(Table(gpl))

# Get the gene symbol and entrez ids to be used for annotations
Table(gpl)[1:10, c(1, 6, 9, 12)]
dim(Table(gpl))

# Get the gene expression data for all the probes with a gene symbol
geneProbes <- which(!is.na(Table(gpl)$Symbol))
probeids <- as.character(Table(gpl)$ID[geneProbes])

probes <- intersect(probeids, rownames(exprs(eset)))
length(probes)

geneMatrix <- exprs(eset)[probes, ]

inds <- which(Table(gpl)$ID %in% probes)
# Check you get the same probes
head(probes)
head(as.character(Table(gpl)$ID[inds]))

# Create the expression matrix with gene ids
geneMatTable <- cbind(geneMatrix, Table(gpl)[inds, c(1, 6, 9, 12)])
head(geneMatTable)

# Save a copy of the expression matrix as a csv file
write.csv(geneMatTable, paste(GEO_DATASETS[1], "_DataMatrix.csv", sep=""), row.names=T)

Let’s take a look at the first 6 lines of the data frame we just created with the head() function.

example1As you can see once we export this data frame as a csv file, it is much easier for others to open this file as a spreadsheet and get useful information such as the gene symbol or entrez id with the expression values across the samples.

Hope this helps and happy collaborations!

Author: radiaj

I am a Research Scientist & Bioinformatician who specializes in Immunology and Cancer Biology. I routinely use R and other programing languages to explore genomic data of cancer cells to identify molecular changes and risk factors that contributes to cancer development.

3 thoughts on “Creating Annotated Data Frames from GEO with the GEOquery package”

  1. Excellent post! This helps a lot. I followed the exact same steps using GSE1159 and GPL96, and because of the different platform column names, I replaced:

    geneProbes <- which(!is.na(Table(gpl)$Symbol))
    — with —
    geneProbes head(geneMatTable)
    33.4 44.9 35.8 50.3 37.3 56.5 34.8 32.7 69.7 101.5 32.6 49.4 64.2 40.7 33.6 36 53.4 209.6 31.8 60.4 49.2 44.8
    1053_at 99.5 52.7 50.8 56.5 32.5 66.2 58.1 89.7 72.8 58.4 66.6 104.1 55.4 73.4 73.1 53.6 35.7 73.6 89.1 57.7 21.2 76.3
    117_at 50.0 17.1 53.4 66.7 19.6 32.3 29.3 32.8 11.7 28.9 82.8 32.2 23.3 21.8 13.8 110.2 21.5 68.5 14.1 36.8 106.3 14.3
    121_at 182.9 203.6 135.1 186.3 217.8 234.5 128.2 147.0 127.8 175.5 127.5 157.1 124.8 153.9 173.1 109.9 196.0 178.1 128.1 184.9 127.6 177.1

    The GSM identifiers are missing as the column names and it appears the first row of expression values has become the column names (instead of the GSM numbers). There are also several error messages in the console saying that the column names are duplicated:

    Warning message:
    Duplicated column names deduplicated: ‘35.8’ => ‘35.8_1’ [25], ‘28.9’ => ‘28.9_1’ [40], etc…

    Interestingly, the GPL column names are coming through:

    Sequence Type Representative Public ID ENTREZ_GENE_ID
    1053_at Exemplar sequence M87338 5982
    117_at Exemplar sequence X51757 3310
    121_at Exemplar sequence X69699 7849

    Any suggestions for what I might be doing wrong and how to get the GSM numbers to come through so that I can later match the columns by GSM to other tables?

    Like

    1. Hi Dennis, normally you would not need to change the geneProbes names if the GPL file matches the GSE id in the descriptions. I must admit this code has been up for more than 2 years so it might be revisited if the GEO query output has changed. I also recommend keeping your columns of interest and then filter the matrix to 1 probe by gene by taking the most variable probe and then match with other tables. I hope this helps.
      Best wishes,
      Radia

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s