Fundamentals of R Programming and Statistical Analysis Video Course

Here is the link to my new course at Packt Publishing. I apologize in advance for some of their video editing choices, but you will learn a lot and be able to work through a variety of practical examples to meet your bioinformatics needs. I will upload the R code to GitHub and post links to the files for all the videos in the course section of my website, https://rjbioinformatics.com/video-course/. So be sure to stay tuned!

Here is an overview of the course available at:

https://www.packtpub.com/big-data-and-business-intelligence/fundamentals-r-programming-and-statistical-analysis-video

Video Description

The R language is widely used among statisticians and data miners for developing statistical software and for data analysis.

In this course, we’ll start by diving into the different types of R data structures and you’ll learn how the R programming language handles data. Then we’ll look in-depth at manipulating different datasets in R. After that, we’ll dive into data visualization with R, using basic plots, heat maps, and networks. We’ll explore the different flow control loops of the R programming language, and you’ll learn how to debug your code.

In the second half of the course, you’ll get hands-on working with the various statistical methods in R programming. You’ll find out how to work with different probability distributions, various types of hypothesis testing, and statistical analysis with the R programming language.

By the end of this video course, you will be well-versed in the basics of R programming and the various concepts of statistical data analysis with R.

Style and Approach

This fast-paced, practical guide is filled with real-world examples that will take you on a journey through the various concepts and phases of statistical analysis using the R programming language.

Happy R programming :0)

Radia

The R code is available on GitHub at https://github.com/radiaj/fundaRprogStatistics.

Simulating genes and counts for DESeq2 analysis

Sometimes it is helpful to simulate gene expression data to test code or to see how your results look with simulated values from a particular probability distribution. Here I am going to show you how to simulate RNA-seq expression counts from a uniform distribution with a minimum of 0 and a maximum of 1200.

# Get all human gene symbols from biomaRt
library("biomaRt")
mart <- useMart(biomart="ensembl", dataset = "hsapiens_gene_ensembl")
my_results <- getBM(attributes = c("hgnc_symbol"), mart=mart)
head(my_results)

# Simulate 100 gene names to be used for our cnts matrix
set.seed(32268)
my_genes <- with(my_results, sample(hgnc_symbol, size=100, replace=FALSE))
head(my_genes)

# Simulate a cnts matrix
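# 600 uniform draws arranged as 100 genes (rows) by 6 samples (columns)
cnts <- matrix(runif(600, min=0, max=1200), ncol=6)
# Truncate the real-valued draws to whole counts, keeping the matrix shape
storage.mode(cnts) <- "integer"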
head(cnts)
dim(cnts)
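
If the Ensembl server is unreachable, the same kind of matrix can be built without biomaRt. This is only a minimal sketch: the placeholder labels `gene_1`, `gene_2`, ... and the names `sim_genes` and `sim_cnts` are hypothetical and are not part of the original code.

```r
# Offline sketch: placeholder gene names stand in for biomaRt symbols (assumption)
set.seed(32268)
sim_genes <- paste0("gene_", 1:100)

# Draw 600 uniform values between 0 and 1200 and truncate them to whole counts
sim_cnts <- matrix(as.integer(runif(600, min = 0, max = 1200)), ncol = 6)
rownames(sim_cnts) <- sim_genes

dim(sim_cnts)     # 100 genes by 6 samples
range(sim_cnts)   # smallest and largest simulated count
```

Because the draws are truncated rather than rounded, every count lands between 0 and 1199.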

Now, say we run DESeq2 to look for differentially expressed genes between our two simulated groups.

# Running DESeq2 based on https://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/RNA-seqWorkflow.pdf
library("DESeq2")
grp.idx <- rep(c("KO", "WT"), each=3)
coldat=DataFrame(grp=factor(grp.idx, levels=c("WT", "KO")))

# Add the column names and gene names
colnames(cnts) <- paste(grp.idx, 1:6, sep="_")
rownames(cnts) <- my_genes
head(cnts)

# Run DESeq2 analysis on the simulated counts
dds <- DESeqDataSetFromMatrix(cnts, colData=coldat, design = ~ grp)
dds <- DESeq(dds)
deseq2.res <- results(dds)
deseq2.fc=deseq2.res$log2FoldChange
names(deseq2.fc)=rownames(deseq2.res)
exp.fc=deseq2.fc

head(exp.fc)
#  SDAD1 SVOPL SRGAP2C MTND1P2 CNN2P8 IL13
# -0.48840808 0.32122109 -0.55584857 0.00184246 -0.15371042 0.11555792 
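
A note on the sign of these values: the order of the factor levels decides which group is treated as the reference. The base R sketch below (no DESeq2 needed; the variable name `grp` is just for illustration) shows why the levels were spelled out explicitly instead of relying on the default alphabetical order.

```r
grp.idx <- rep(c("KO", "WT"), each = 3)

# Default ordering is alphabetical, which would make KO the reference level
levels(factor(grp.idx))

# Listing WT first makes it the reference, so fold changes read as KO versus WT
grp <- factor(grp.idx, levels = c("WT", "KO"))
levels(grp)
table(grp)
```

With WT as the first level, a negative log2 fold change means lower simulated expression in the KO samples.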

Now let’s see how many simulated genes had an absolute log2 fold change of at least 1 (a two-fold change in either direction) purely by chance.


# Load the fold changes from DESeq2 analysis and order in decreasing order
geneList = sort(exp.fc, decreasing = TRUE) # log FC is shown
head(geneList)

gene <- geneList[abs(geneList) >= 1]
head(gene)

# C1orf216
#-1.129836
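
For a quick sanity check without DESeq2, the same thresholding step can be sketched with naive fold changes computed from group means. This is an illustration under stated assumptions: `cnts_demo`, `naive_lfc`, the placeholder gene names, and the pseudocount of 1 are hypothetical choices, and the values will not match DESeq2's estimates, which come from a normalized negative binomial model.

```r
set.seed(32268)
cnts_demo <- matrix(as.integer(runif(600, min = 0, max = 1200)), ncol = 6)
rownames(cnts_demo) <- paste0("gene_", 1:100)
grp <- rep(c("KO", "WT"), each = 3)

# Naive log2 fold change (KO over WT) from group means, with a pseudocount of 1
ko_mean <- rowMeans(cnts_demo[, grp == "KO"])
wt_mean <- rowMeans(cnts_demo[, grp == "WT"])
naive_lfc <- log2((ko_mean + 1) / (wt_mean + 1))

# Genes whose absolute log2 fold change is at least 1 purely by chance
hits <- naive_lfc[abs(naive_lfc) >= 1]
length(hits)
sort(hits, decreasing = TRUE)
```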

Now it’s your turn! What other probability distributions could we simulate data from to perform a mock RNA-seq experiment and determine how many genes could appear different by chance? You can even use a bootstrap approach to calculate a p-value after running 1000 permutations of the code. Of course, in practice we use adjusted p-values to guard against these chance findings, but it is always nice to go back to basics and stress the importance of applying statistical methods when looking at differentially expressed genes. I encourage you all to leave your answers in the comment section below to inspire others.
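
One possible answer, offered as a sketch and not a definitive recipe: counts can be drawn from a negative binomial distribution with `rnbinom()`, and the whole null experiment repeated many times to see how often large fold changes arise by chance. The mean (`mu = 600`), the dispersion (`size = 2`), the helper `count_chance_hits()`, and the naive group-mean fold change are all assumptions made for illustration. DESeq2 is left out here only to keep the loop fast.

```r
# Count genes with |log2FC| >= 1 in one null experiment (no true differences)
count_chance_hits <- function(n_genes = 100, n_per_grp = 3, mu = 600, size = 2) {
  cnts <- matrix(rnbinom(n_genes * 2 * n_per_grp, mu = mu, size = size),
                 ncol = 2 * n_per_grp)
  ko <- rowMeans(cnts[, 1:n_per_grp, drop = FALSE])
  wt <- rowMeans(cnts[, (n_per_grp + 1):(2 * n_per_grp), drop = FALSE])
  lfc <- log2((ko + 1) / (wt + 1))  # pseudocount of 1 avoids log2(0)
  sum(abs(lfc) >= 1)
}

set.seed(32268)
chance_hits <- replicate(1000, count_chance_hits())

summary(chance_hits)      # distribution of chance "hits" per experiment
mean(chance_hits) / 100   # average fraction of genes passing the cutoff
```

Changing `size` is a simple way to explore how noisier counts inflate the number of genes that cross the cutoff by chance.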

Happy R programming!