Working with Venn Diagrams

In this post, we will learn how to create venn diagrams for gene lists and how to retrieve the genes present in each venn compartment with R.

In this post, we will learn how to create venn diagrams for gene lists and how to retrieve the genes present in each venn compartment with R.

In this particular example, we will generate random gene lists using the molbiotools gene set generator but you can use your own gene lists if you prefer. Specifically, we will generate a random list of 257 genes to represent those that are upregulated in condition and another list of 1570 genes to represent those that are upregulated in condition B.

Screen Shot 2016-06-21 at 2.01.15 PM

Then, we will sort and paste the gene lists in an excel document we will save as randomGeneLists.xlsx.

Now, let’s load the data into R using the gdata package.

geneLists <- read.xls("randomGeneLists.xlsx", sheet=1, stringsAsFactors=FALSE, header=FALSE)

# Notice there are empty strings to complete the data frame in column 1 (V1)

# To convert this data frame to separate gene lists with the empty strings removed we can use lapply() with our home made  function(x) x[x != ""]
geneLS <- lapply(as.list(geneLists), function(x) x[x != ""])

# If this is a bit confusing you can also write a function and then use it in lapply() 
removeEMPTYstrings <- function(x) {

 newVectorWOstrings <- x[x != ""]

geneLS2 <- lapply(as.list(geneLists), removeEMPTYstrings)

# You can print the last 6 entries of each vector stored in your list, as follows:
lapply(geneLS, tail)
lapply(geneLS2, tail) # Both methods return the same results

# We can rename our list vectors
names(geneLS) <- c("ConditionA", "ConditionB")

# Now we can plot a Venn diagram with the VennDiagram R package, as follows:

venn.plot <- venn.diagram(VENN.LIST , NULL, fill=c("darkmagenta", "darkblue"), alpha=c(0.5,0.5), cex = 2, cat.fontface=4, category.names=c("A", "B"), main="Random Gene Lists")

# To plot the venn diagram we will use the grid.draw() function to plot the venn diagram

# To get the list of gene present in each Venn compartment we can use the gplots package

a <- venn(VENN.LIST, show.plot=FALSE)

# You can inspect the contents of this object with the str() function

# By inspecting the structure of the a object created, 
# you notice two attributes: 1) dimnames 2) intersections
# We can store the intersections in a new object named inters
inters <- attr(a,"intersections")

# We can summarize the contents of each venn compartment, as follows:
# in 1) ConditionA only, 2) ConditionB only, 3) ConditionA & ConditionB
lapply(inters, head) 


Now you are ready, to review the genes in each section of the venn diagram separately. Alternatively, you can always use Venny web tool that is a great way to start looking at your data and then write a modified version of this R script to make a more exhaustive figure or facilitate downstream analysis in your script.

Feel free to leave comments or email me at

Pre-processing .CEL files in R

This post shows you how to compare data from two separate studies without the hassle of tackling batch effects, etc. By scaling and centring the data in both studies, you can look for trends in the data and look for gene expression changes that go in a similar direction.

One of the most efficient ways to pre-process microarray data in R, is to use the oligo R/Bioconductor package. In a few lines of code you can go from raw .CEL files to a normalized data matrix you can work with for downstream analysis. It is particularly useful, if you wish to reanalyse a subset of .CEL files from a previously published dataset.

For example, say we are interested in comparing relative gene expression levels of Atf1, Atf3, Brca1 i in the lung, liver, and bone marrow of CB17 mice to previously published mouse tissue expression data on GEO. From our study, we have a data matrix of normalized expression values we obtained from our microarray study stored in CB17mat object.

First, we will scale all gene expression values by row to obtain center and scale these values to have an idea which organ expresses the highest levels of Atf1, Atf3, and Brca1 relative to the others in our study. To do this we will use the scale() with the default settings center=TRUE and scale=TRUE. Since the scale default function scales and centers columns we will need to transpose our matrix before proceeding.

# From your gene expression matrix stored in CB17mat
genes = c("Atf1", "Atf3", "Brca1")
CB17scal <- t(apply(CB17mat[genes, ], 1, scale))

# We will also add the missing column names to our scaled matrix
colnames(CB17scal) <- colnames(CB17mat)

# You can also plot a heatmap to look at the effects of the scaling on the expression levels across the tissues using the gplots package

# Clusters the rows (and potentially columns) by Pearson correlation as distance method
corrdist = function(x) as.dist(1-cor(t(x), method="pearson"))

# and Ward method as the agglomeration method
hclust.avl = function(x) hclust(x, method="ward.D2")

# with dendrogram="row" and Colv=NA, we are only clustering the rows i.e. genes
png(filename="example1.png", width=10, height=10, units = 'in', res = 300)
heatmap.2(CB17scal, dendrogram="row", Colv=NA, scale="none", key.title="", col=rev(redblue(250)), trace='none', cexCol=1.2, cexRow=1.5, hclustfun=hclust.avl, distfun=corrdist, margins=c(8, 12))


As you can see Atf1 and Brca1 are relatively higher in bone marrow, whereas Atf3 is relatively higher in the lung compared to the other organs.

Now let’s analyze the liver, lung and bone marrow data from the Large-scale analysis of the human and mouse transcriptomes study from Su et al, 2002, PNAS Apr 2;99(7):4465-70. The individual files from the liver, lung and bone marrow were downloaded from GSE97 and the data was normalized with the oligo package as follows:

# Move the CEL files to ~/MY_WORKING_DIRECTORY/filesToAnalyse
# Set the working directory to the folder you would like to save your results

# Load the libraries needed for the analysis

# Load the packages needed for the analysis
# You can choose to save the CEL files for your tissues of interest
geneCELs <- list.celfiles("~/MY_WORKING_DIRECTORY/filesToAnalyse", full.names=TRUE)

affyGeneFS <- read.celfiles(geneCELs)

# RMA at the probet level
geneCore <- rma(affyGeneFS)

# Inspect the eset object

# For the featureData info for the array used in this study

probeids <- featureNames(geneCore)
geneAnnotation <- select(mgu74a.db, probeids, c("SYMBOL", "ENTREZID", "GENENAME"), multiVals="first")

# Save the ESET data and annotation
save(geneCore, geneAnnotation, file="geneCoreTissueV2.RData")
saveRDS(geneCore, "geneCore.rds")
saveRDS(geneAnnotation, "geneAnnotation.rds")

# Convert the eset object to a matrix to get the gene expression values
# for Atf1, Atf3, and Brca1
M1 <- exprs(geneCore)

Gene_Symbols <- sapply(rownames(M1), getGeneSymbol, df=geneAnnotation)
DM1 <- data.frame(M1, Gene_Symbols=Gene_Symbols[rownames(M1)], stringsAsFactors=FALSE)

# Use can create a function to select the probe with the top interquantile range with IQR()
# to represent the gene expression value or use the TopIqrSymbolMatrix()
# in the "Useful Functions to Work with Microarrays" post (coming soon!)

tissueM1 <- TopIqrSymbolMatrix(DM1)
tissueExprs <- tissueM1[, 1:ncol(M1)]
rownames(tissueExprs) <- tissueM1$Gene_Symbols
head(tissueExprs[, 1:4])

# To get the geo file ids
geoColnames <- gsub(".CEL", "", colnames(tissueExprs))

# Get the new colnames based on tissue
# Create an excel sheet with the GEO .CEL file id and tissue sample names you would like to use
# Load the excel sheet as a data frame using the gdata package
TissueNames <- read.xls("GSE97_FileSampleIdentifier.xlsx", sheet=1, stringsAsFactors=FALSE, row.names=1)
TissueSamples <- TissueNames[geoColnames, 2]

# Replace the tissue exprs with the tissue names
colnames(tissueExprs) <- TissueSamples
saveRDS(tissueExprs, "tissueExprs.rds")

# Now create a matrix for our 3 genes of interest
genes = c("Atf1", "Atf3", "Brca1")
genesMat <- as.matrix(M1[genes, ])

# Let's scale and center the expression values
genesMatScal <- t(apply(genesMat, 1, scale))

# Add the column names
colnames(genesMatScal) <- rep(c("liver", "lung", "bone marrow"), each=2)

png(filename="example1b.png", width=10, height=10, units = 'in', res = 300)
heatmap.2(genesMatScal, dendrogram="row", Colv=NA, scale="none", key.title="", col=rev(redblue(250)), trace='none', cexCol=1.2, cexRow=1.5, hclustfun=hclust.avl, distfun=corrdist, margins=c(8, 12))


By comparing both heatmaps, we can see that Atf1 and Brca1 have more similar expression patterns across the tissues than Atf3 in these mice compared to our CB17 mice tissue data.
Therefore, Brca1 and Atf1 tissue difference might be more consistent among the tissues from different mice.

Alternatively, it could just be an artifact of the small sample size and/or how the samples were processed before running the arrays. That being said, it’s a good thing to reanalyse studies and compare with your arrays to get an idea of sample variability and how the experimental design, pre- and post- processing affects the overall interpretation of the results.

Feel free to leave comments or email me at