Simulating genes and counts for DESeq2 analysis

Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNAseq expression data counts from a uniform distribution with a mininum = 0 and maximum = 1200.

Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNAseq expression data counts from a uniform distribution with a mininum = 0 and maximum = 1200.

# Get all human gene symbols from biomaRt
library("biomaRt")
mart <- useMart(biomart="ensembl", dataset = "hsapiens_gene_ensembl")
my_results <- getBM(attributes = c("hgnc_symbol"), mart=mart)
head(my_results)

# Simulate 100 gene names to be used for our cnts matrix
set.seed(32268)
my_genes <- with(my_results, sample(hgnc_symbol, size=100, replace=FALSE))
head(my_genes)

# Simulate a cnts matrix
cnts = matrix(runif(600, min=0, max=1200), ncol=6)
cnts = apply(cnts, c(1,2), as.integer)
head(cnts)
dim(cnts)

 

Now, say we run DESeq2 to look for differentially expressed genes between our two simulated groups.

# Running DESEQ2 based on https://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/RNA-seqWorkflow.pdf
library("DESeq2")
grp.idx <- rep(c("KO", "WT"), each=3)
coldat=DataFrame(grp=factor(grp.idx, levels=c("WT", "KO")))

# Add the column names and gene names
colnames(cnts) <- paste(grp.idx, 1:6, sep="_")
rownames(cnts) <- my_genes
head(cnts)

# Run DESeq2 analysis on the simulated counts
dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design = ~ grp)
dds <- DESeq(dds)
deseq2.res <- results(dds)
deseq2.fc=deseq2.res$log2FoldChange
names(deseq2.fc)=rownames(deseq2.res)
exp.fc=deseq2.fc

head(exp.fc)
#  SDAD1 SVOPL SRGAP2C MTND1P2 CNN2P8 IL13
# -0.48840808 0.32122109 -0.55584857 0.00184246 -0.15371042 0.11555792 

Now let’s see how many simulated genes had a log2 fold change greater than 1 by chance.


# Load the fold changes from DESeq2 analysis and order in decreasing order
geneList = sort(exp.fc, decreasing = TRUE) # log FC is shown
head(geneList)

gene <- geneList[abs(geneList) >= 1]
head(gene)

# C1orf216
#-1.129836

Now it’s your turn!  What other probability distributions could we simulate data from to perform a mock RNA seq experiment to determine how many genes could be different by chance? You can even use a bootstrap approach to calculate the p-value after running 1000 permutations of the code. Of course, to circumvent these problems we use adjusted p values but it is always nice to go back to basics and stress the importance of applying statistical methods when looking at differentially expressed genes. I encourage you all to leave your answers in the comment section below to inspire others.

Happy R programming!

Converting mouse to human gene names with biomaRt package

Converting mouse gene names to the human equivalent and vice versa is not always as straightforward as it seems, so I wrote a function to simplify the task. The function takes advantage of the getLDS() function from the biomaRt to get the hgnc symbol equivalent from the mgi symbol.

Converting mouse gene names to the human equivalent and vice versa is not always as straightforward as it seems, so I wrote a function to simplify the task. The function takes advantage of the getLDS() function from the biomaRt to get the hgnc symbol equivalent from the mgi symbol. For example, let’s convert the following mouse gene symbols, Hmmr, Tlx3, and Cpeb4, to their human equivalent.

musGenes <- c("Hmmr", "Tlx3", "Cpeb4")

# Basic function to convert mouse to human gene names
convertMouseGeneList <- function(x){

require("biomaRt")
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")

genesV2 = getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol", values = x , mart = mouse, attributesL = c("hgnc_symbol"), martL = human, uniqueRows=T)
humanx <- unique(genesV2[, 2])

# Print the first 6 genes found to the screen
print(head(humanx))
return(humanx)
}

We can just as easily write a function to go from human to mouse genes.

# Basic function to convert human to mouse gene names
convertHumanGeneList <- function(x){

require("biomaRt")
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")

genesV2 = getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = x , mart = human, attributesL = c("mgi_symbol"), martL = mouse, uniqueRows=T)

humanx <- unique(genesV2[, 2])

# Print the first 6 genes found to the screen
print(head(humanx))
return(humanx)
}

genes <- convertHumanGeneList(humGenes) 

If you have any other suggestions on how to convert mouse to human gene names in R, I would love to hear them just email me at info@rjbioinformatics.com.

Creating Annotated Data Frames from GEO with the GEOquery package

In this post, we will go over how to use the GEOquery package to download a data matrix (or eset object) directly into R and append specific probe annotation information to this matrix for it to be exported as a csv file for easy manipulation in Excel or spreadsheet tools. This is especially useful for sharing data with collaborators who are not familiar with R and would rather look up there favorite genes in a spreadsheet format.

Mining gene expression data from publicly available databases is a great way to find evidence to support you working hypothesis that gene X is relevant in condition Y. You may also want to mine publicly available data to build on an existing hypothesis or simply to find additional support for your favorite gene in a different animal model or experimental condition. In this post, we will go over how to use the GEOquery package to download a data matrix (or eset object) directly into R and append specific probe annotation information to this matrix for it to be exported as a csv file for easy manipulation in Excel or spreadsheet tools. This is especially useful for sharing data with collaborators who are not familiar with R and would rather look up there favorite genes in a spreadsheet format.

First, let’s start by opening an R session and creating a function to return the eset (ExpressionSet) object or the original list object downloaded by the getGEO() function in R.


getGEOdataObjects <- function(x, getGSEobject=FALSE){
# Make sure the GEOquery package is installed
require("GEOquery")
# Use the getGEO() function to download the GEO data for the id stored in x
GSEDATA <- getGEO(x, GSEMatrix=T, AnnotGPL=FALSE)
# Inspect the object by printing a summary of the expression values for the first 2 columns
print(summary(exprs(GSEDATA[[1]])[, 1:2]))

# Get the eset object
eset <- GSEDATA[[1]]
# Save the objects generated for future use in the current working directory
save(GSEDATA, eset, file=paste(x, ".RData", sep=""))

# check whether we want to return the list object we downloaded on GEO or
# just the eset object with the getGSEobject argument
if(getGSEobject) return(GSEDATA) else return(eset)
}

We can test this function on a GEO dataset such as GSE73835 as follows:


# Store the dataset ids in a vector GEO_DATASETS just in case you want to loop through several GEO ids
GEO_DATASETS <- c("GSE73835")

# Use the function we created to return the eset object
eset <- getGEOdataObjects(GEO_DATASETS[1])
# Inspect the eset object to get the annotation GPL id
eset 

You will see the following output:

ExpressionSet (storageMode: lockedEnvironment)
assayData: 45281 features, 6 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM1904293 GSM1904294 … GSM1904298 (6 total)
varLabels: title geo_accession … data_row_count (35 total)
varMetadata: labelDescription
featureData
featureNames: ILMN_1212602 ILMN_1212603 … ILMN_3163582 (45281 total)
fvarLabels: ID Species … ORF (30 total)
fvarMetadata: Column Description labelDescription
experimentData: use ‘experimentData(object)’
Annotation: GPL6887

We will first need to download the annotation file for GPL6887. Then we can create a data frame with the probe annotation categories we are most interested in as follows:


# Get the annotation GPL id (see Annotation: GPL10558)
gpl <- getGEO('GPL6887', destdir=".")
Meta(gpl)$title

# Inspect the table of the gpl annotation object
colnames(Table(gpl))

# Get the gene symbol and entrez ids to be used for annotations
Table(gpl)[1:10, c(1, 6, 9, 12)]
dim(Table(gpl))

# Get the gene expression data for all the probes with a gene symbol
geneProbes <- which(!is.na(Table(gpl)$Symbol))
probeids <- as.character(Table(gpl)$ID[geneProbes])

probes <- intersect(probeids, rownames(exprs(eset)))
length(probes)

geneMatrix <- exprs(eset)[probes, ]

inds <- which(Table(gpl)$ID %in% probes)
# Check you get the same probes
head(probes)
head(as.character(Table(gpl)$ID[inds]))

# Create the expression matrix with gene ids
geneMatTable <- cbind(geneMatrix, Table(gpl)[inds, c(1, 6, 9, 12)])
head(geneMatTable)

# Save a copy of the expression matrix as a csv file
write.csv(geneMatTable, paste(GEO_DATASETS[1], "_DataMatrix.csv", sep=""), row.names=T)

Let’s take a look at the first 6 lines of the data frame we just created with the head() function.

example1As you can see once we export this data frame as a csv file, it is much easier for others to open this file as a spreadsheet and get useful information such as the gene symbol or entrez id with the expression values across the samples.

Hope this helps and happy collaborations!

Converting Gene Names in R with AnnotationDbi

There are many ways to convert gene accession numbers or ids to gene symbols or other types of ids in R and several R/Bioconductor packages to facilitate this process including the AnnotationDbi, annotate, and biomaRt packages. In this post, we are going to learn how to convert gene ids with the AnnotationDbi and org.Hs.eg.db package.

There are many ways to convert gene accession numbers or ids to gene symbols or other types of ids in R and several R/Bioconductor packages to facilitate this process including the AnnotationDbi, annotate, and biomaRt packages. In this post, we are going to learn how to convert gene ids with the AnnotationDbi and org.Hs.eg.db package. You could potentially modify this code to work with other species such as mice with the org.Mm.eg.db package.

For example, say we have a gene expression matrix stored in M1 created from an eset object you downloaded from GEO. The study I will be using for this example is A Leukemic Stem Cell Expression Signature is Associated with Clinical Outcomes in Acute Myeloid Leukemia deposited on GEO with the accession id GSE24006. To view the script on how to generate the expression set (eset) object see the post – Retrieving Gene Expression Data  Objects & Matrices From GEO.

# Convert you eset object to a matrix with the exprs() function
library(Biobase)
M1 <- exprs(eset)

# Convert the row names to entrez ids
library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)

geneSymbols <- mapIds(org.Hs.eg.db, keys=rownames(M1), column="SYMBOL", keytype="ENTREZID", multiVals="first")
head(geneSymbols)

The mapIds() function from the AnnotationDbi package returns a named vector making it simple to retrieve entrez id for a given gene as follows:

gene.to.search <- c("658", "1360")
geneSymbols[gene.to.search]

# returns the gene symbols of the entrez
# "BMPR1B" "CPB1"

We can create a function to return a matrix with gene symbols instead of entrez ids as follows:

getMatrixWithSymbols <- function(df){
require("AnnotationDbi")
require("org.Hs.eg.db")

geneSymbols <- mapIds(org.Hs.eg.db, keys=rownames(df), column="SYMBOL", keytype="ENTREZID", multiVals="first")

# get the entrez ids with gene symbols i.e. remove those with NA's for gene symbols
inds <- which(!is.na(geneSymbols))
found_genes <- geneSymbols[inds]

# subset your data frame based on the found_genes
df2 <- df[names(found_genes), ]
rownames(df2) <- found_genes
return(df2)
}

# Now, let's use the function to create a matrix for the genes with gene symbols
M1symb <- getMatrixWithSymbols(M1)

We can generalize this function to go back and forth between gene symbols and entrez ids (or other ids) as follows:

We can generalize this function to go back and forth between gene symbols and entrez ids (or other ids) as follows:


# This function can take any of the columns(org.Hs.eg.db) as type and keys as long as the row names are in the format of the keys argument
getMatrixWithSelectedIds <- function(df, type, keys){
require("AnnotationDbi")
require("org.Hs.eg.db")

geneSymbols <- mapIds(org.Hs.eg.db, keys=rownames(df), column=type, keytype=keys, multiVals="first")

# get the entrez ids with gene symbols i.e. remove those with NA's for gene symbols
inds <- which(!is.na(geneSymbols))
found_genes <- geneSymbols[inds]

# subset your data frame based on the found_genes
df2 <- df[names(found_genes), ]
rownames(df2) <- found_genes
return(df2)
}

# for example, going from SYMBOL to ENTREZID
M1entrez <- getMatrixWithSelectedIds(M1symb, type="ENTREZID", keys="SYMBOL")

Stay tuned for more posts on Converting Gene Names in R with the annotation and biomaRt package.

Working with Venn Diagrams

In this post, we will learn how to create venn diagrams for gene lists and how to retrieve the genes present in each venn compartment with R.

In this post, we will learn how to create venn diagrams for gene lists and how to retrieve the genes present in each venn compartment with R.

In this particular example, we will generate random gene lists using the molbiotools gene set generator but you can use your own gene lists if you prefer. Specifically, we will generate a random list of 257 genes to represent those that are upregulated in condition and another list of 1570 genes to represent those that are upregulated in condition B.

Screen Shot 2016-06-21 at 2.01.15 PM

Then, we will sort and paste the gene lists in an excel document we will save as randomGeneLists.xlsx.

Now, let’s load the data into R using the gdata package.

library("gdata")
geneLists <- read.xls("randomGeneLists.xlsx", sheet=1, stringsAsFactors=FALSE, header=FALSE)
head(geneLists)

# Notice there are empty strings to complete the data frame in column 1 (V1)
tail(geneLists)

# To convert this data frame to separate gene lists with the empty strings removed we can use lapply() with our home made  function(x) x[x != ""]
geneLS <- lapply(as.list(geneLists), function(x) x[x != ""])

# If this is a bit confusing you can also write a function and then use it in lapply() 
removeEMPTYstrings <- function(x) {

 newVectorWOstrings <- x[x != ""]
 return(newVectorWOstrings)

}
geneLS2 <- lapply(as.list(geneLists), removeEMPTYstrings)

# You can print the last 6 entries of each vector stored in your list, as follows:
lapply(geneLS, tail)
lapply(geneLS2, tail) # Both methods return the same results

# We can rename our list vectors
names(geneLS) <- c("ConditionA", "ConditionB")

# Now we can plot a Venn diagram with the VennDiagram R package, as follows:
require("VennDiagram")

VENN.LIST <- geneLS
venn.plot <- venn.diagram(VENN.LIST , NULL, fill=c("darkmagenta", "darkblue"), alpha=c(0.5,0.5), cex = 2, cat.fontface=4, category.names=c("A", "B"), main="Random Gene Lists")

# To plot the venn diagram we will use the grid.draw() function to plot the venn diagram
grid.draw(venn.plot)

# To get the list of gene present in each Venn compartment we can use the gplots package
require("gplots")

a <- venn(VENN.LIST, show.plot=FALSE)

# You can inspect the contents of this object with the str() function
str(a)

# By inspecting the structure of the a object created, 
# you notice two attributes: 1) dimnames 2) intersections
# We can store the intersections in a new object named inters
inters <- attr(a,"intersections")

# We can summarize the contents of each venn compartment, as follows:
# in 1) ConditionA only, 2) ConditionB only, 3) ConditionA & ConditionB
lapply(inters, head) 

example2

Now you are ready, to review the genes in each section of the venn diagram separately. Alternatively, you can always use Venny web tool that is a great way to start looking at your data and then write a modified version of this R script to make a more exhaustive figure or facilitate downstream analysis in your script.

Feel free to leave comments or email me at info@rjbioinformatics.com.