Converting Gene Names in R with AnnotationDbi

There are many ways to convert gene accession numbers or IDs to gene symbols or other ID types in R, and several R/Bioconductor packages facilitate the process, including AnnotationDbi, annotate, and biomaRt. In this post, we will learn how to convert gene IDs with the AnnotationDbi and org.Hs.eg.db packages. The same code can be adapted to other species, for example mouse with the org.Mm.eg.db package.

For example, say we have a gene expression matrix, M1, created from an eset object downloaded from GEO. The study used in this example is "A Leukemic Stem Cell Expression Signature is Associated with Clinical Outcomes in Acute Myeloid Leukemia", deposited on GEO under the accession ID GSE24006. For the script that generates the expression set (eset) object, see the post "Retrieving Gene Expression Data Objects & Matrices From GEO".

# Convert your eset object to a matrix with the exprs() function
library(Biobase)
M1 <- exprs(eset)

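# Load the annotation packages and list the ID types available for mapping
library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)

# Map the row names (Entrez IDs) to gene symbols;
# multiVals="first" keeps the first symbol when an ID maps to more than one
geneSymbols <- mapIds(org.Hs.eg.db, keys=rownames(M1), column="SYMBOL", keytype="ENTREZID", multiVals="first")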
head(geneSymbols)

The mapIds() function from the AnnotationDbi package returns a named vector (the names are the Entrez IDs and the values are the gene symbols), making it simple to retrieve the gene symbol for a given Entrez ID as follows:

gene.to.search <- c("658", "1360")
geneSymbols[gene.to.search]

# returns the gene symbols for the given Entrez IDs
# "BMPR1B" "CPB1"
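
Because the result is an ordinary named vector, the usual base R indexing tools apply. The sketch below uses a small hand-made vector (toySymbols, a hypothetical stand-in for the output of mapIds(), with one made-up ID added to show the NA case) so it runs without any Bioconductor packages:

```r
# Toy stand-in for the named vector returned by mapIds():
# names are Entrez IDs, values are gene symbols.
# "999999999" is a made-up ID used only to illustrate an unmapped key.
toySymbols <- c("658" = "BMPR1B", "1360" = "CPB1", "999999999" = NA)

# Look up by name; the result keeps the Entrez IDs as names
toySymbols[c("658", "1360")]

# Drop the names when only the symbols are needed
plainSymbols <- unname(toySymbols[c("658", "1360")])

# A key that is not in the vector at all comes back as NA
missingLookup <- toySymbols["12345"]

# Count how many IDs have no symbol
nUnmapped <- sum(is.na(toySymbols))
```

Counting the NA values before going further is a quick way to see how many rows would be lost when the matrix is relabelled.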

We can create a function that returns a matrix with gene symbols instead of Entrez IDs as row names:

# df: expression matrix with Entrez IDs as row names
getMatrixWithSymbols <- function(df){
  # library() stops with an error if a package is missing
  library("AnnotationDbi")
  library("org.Hs.eg.db")

  geneSymbols <- mapIds(org.Hs.eg.db, keys=rownames(df), column="SYMBOL", keytype="ENTREZID", multiVals="first")

  # keep only the Entrez IDs that have a gene symbol, i.e. drop those that returned NA
  inds <- which(!is.na(geneSymbols))
  found_genes <- geneSymbols[inds]

  # subset the matrix to the mapped rows (drop=FALSE keeps it a matrix even with one row)
  df2 <- df[names(found_genes), , drop=FALSE]
  # replace the Entrez IDs with the gene symbols
  rownames(df2) <- found_genes
  return(df2)
}

# Now, let's use the function to create a matrix for the genes with gene symbols
M1symb <- getMatrixWithSymbols(M1)
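
After relabelling, it can be worth checking whether any row name now appears more than once, since several source IDs may map to the same symbol (more likely with key types other than Entrez IDs). The sketch below is one possible way to detect and collapse such rows. It uses a hypothetical toy matrix rather than M1symb, and averaging duplicated rows is an assumption made for illustration, not something this post prescribes:

```r
# Hypothetical toy matrix whose row names contain a repeated symbol
toy <- matrix(c(1, 3, 5,
                2, 4, 6),
              nrow = 3,
              dimnames = list(c("CPB1", "BMPR1B", "CPB1"), c("s1", "s2")))

# Which symbols occur more than once?
dupSymbols <- unique(rownames(toy)[duplicated(rownames(toy))])

# Collapse duplicated rows by averaging:
# rowsum() adds up rows that share a name, then we divide by the group size
sums <- rowsum(toy, group = rownames(toy))
counts <- table(rownames(toy))[rownames(sums)]
collapsed <- sums / as.vector(counts)
```

Other choices, such as keeping the row with the highest mean or variance, are just as reasonable; which one fits depends on the downstream analysis.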

We can generalize this function to go back and forth between gene symbols and entrez ids (or other ids) as follows:

# type: the ID type to return; keys: the ID type of the current row names.
# Both can be any of the values listed by columns(org.Hs.eg.db), as long as
# the row names of df really are in the format given by keys.
getMatrixWithSelectedIds <- function(df, type, keys){
  # library() stops with an error if a package is missing
  library("AnnotationDbi")
  library("org.Hs.eg.db")

  mappedIds <- mapIds(org.Hs.eg.db, keys=rownames(df), column=type, keytype=keys, multiVals="first")

  # keep only the row names that mapped to an ID, i.e. drop those that returned NA
  inds <- which(!is.na(mappedIds))
  found_ids <- mappedIds[inds]

  # subset the matrix to the mapped rows (drop=FALSE keeps it a matrix even with one row)
  df2 <- df[names(found_ids), , drop=FALSE]
  # replace the old row names with the new IDs
  rownames(df2) <- found_ids
  return(df2)
}

# for example, going from SYMBOL to ENTREZID
M1entrez <- getMatrixWithSelectedIds(M1symb, type="ENTREZID", keys="SYMBOL")
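
To adapt this to another species, one option is to pass the annotation package in as an argument instead of hard-coding org.Hs.eg.db. The following is a minimal sketch of that idea. The function name getMatrixWithIdsFromDb, its db argument, and the matrix M1mouse in the commented usage are hypothetical names introduced here, and the usage assumes org.Mm.eg.db is installed:

```r
# Same logic as above, but the annotation database is an argument
# (hypothetical helper; db is an OrgDb object such as org.Hs.eg.db or org.Mm.eg.db)
getMatrixWithIdsFromDb <- function(df, db, type, keys) {
  mappedIds <- AnnotationDbi::mapIds(db, keys = rownames(df), column = type,
                                     keytype = keys, multiVals = "first")

  # drop the row names that did not map to an ID
  found_ids <- mappedIds[!is.na(mappedIds)]

  # subset to the mapped rows and relabel them
  df2 <- df[names(found_ids), , drop = FALSE]
  rownames(df2) <- found_ids
  df2
}

# Usage sketch for mouse (M1mouse is a hypothetical matrix with mouse Entrez IDs as row names):
# library("org.Mm.eg.db")
# M1mouseSymb <- getMatrixWithIdsFromDb(M1mouse, org.Mm.eg.db, type = "SYMBOL", keys = "ENTREZID")
```

Calling mapIds() with the AnnotationDbi:: prefix means only the species package has to be loaded before the function is used.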

Stay tuned for more posts on converting gene names in R with the annotate and biomaRt packages.