BioConductor

Bioconductor is a collection of R scripts and extensions for analyzing microarray data. There is even a package for Affymetrix chips.

Obtaining Bioconductor

There are two options in getting Bioconductor. First install it yourself on your own computer or use the one on packrat (which you need to get access too).
  • packrat is defunct as of 2007. currently not installed on any of my machines, but that could be remedied. contact KarelSF

Installing R and Bioconductor

First, you need to have R to install Bioconductor. You can get R from CRAN. For Jagaur (MacOSX 10.2) users, I highly recommend using Fink to install R and an X-windows system (I recommend apple's version). Panther (MacOSX 10.3) and Jagaur you can get a binary that uses the Aqua interface. This is fine if you're the only one using it. If you want to use it from X-windows from other machines (like getting a nice dual G5 with 8 GB for a lab), then you need to build it yourself, see the RAqua Developer Page for details. Some of the things you'll need are libreadline, g77, tcltk, teTeX suite (tetex-base from fink), and something to view PDFs (needed to see PDF the documention).

> ./configure --enable-R-shlib --with-blas='-framework vecLib' --with-lapack
> make
> make check (optional)
> make install

For OS X and X11, the path to R might not be there. Therefore, you will need to add /usr/local/bin to your path.

For tcsh:

> setenv PATH ${PATH}:/usr/local/bin

For bash:

> export PATH=$PATH:/usr/sbin/:/usr/local/bin

You might want to add these to your .login or .profile file.

Now you need to get Bioconductor. There is a script that you need to run and it will take a few minutes (remember, if you didn't install R you might need to have admin access to do this, see below).

sudo R
>source("getBioC.R")
>getBioC()
>q()    ## remember to quit, you are root now!

You can also try this:

sudo R
>source("http://www.bioconductor.org/getBioC.R")
>getBioC("affy","release") 
>q()

Adding CDFs to Bioconductor

In order to use the affy package with affymetrix array, you need to load in the CDF files that correspond to your chips.

Downloading cdf packages

The easy way it to get them from cdfenvs. Download the files and extract them. You then can permanently load the package into R by typing:
sudo R INSTALL celeganscdf

Making your own cdf package

If you have a custom chip or it is to new, you will need to make your own package. You need the disk that came with your chips (or grab it off of the array computer). Next, make sure you have the makecdfpackage (this is in development at this time) at http://www.bioconductor.org/repository/devel/package/html/index.html
1-get the package
2-install it by typing sudo R INSTALL makecdenv
3-then in R load it by typing library(makecdfname)
4-then try
 pname <- cleancdfname(whatcdf("C_elegans.CEL"))
 make.cdf.package("Celegans.CDF", packagename=pname)
5-finish up by installing your new package by typing sudo R INSTALL celeganscdf

Using R and bioconductor to analysis chips

R is a complex (to me) and powerfull system. Check out the many help sites like: http://cran.r-project.org/other-docs.html

Getting Started

You just need the .CEL files for each chips you want to analysis in the current working directory. You can easily do this by starting R in that directory or while in R:
>setwd("mycells/")

Next, you need to load the affy library

>library(affy)

To get help:

>?setwd

If you want to quit:

>q()
Save workspace image? [y/n/c]:  

Ansering 'y' above will save all your objects (that would be any CEL files you read in) to an invisible file under you home directory. This saves you time from retyping in and waiting for any processing that you did, however, it saves this. So, you will have a copy of all your very big CEL files and other data.

How to get expression values from your samples

For the affy library, CELs are loaded into an AffyBatch object, which contains both the PM and MM probe pairs.

If you used the same scanner then you don't need to worry about much. Just use one of quick functions to process you CEL files (see the most recent Vignette for more functions). For example, I like RMA:

>library(affy)
>myRMA <- justRMA()  ## this is the C-version of RMA, which is much faster
Background correcting
Normalizing
Calculating Expression
>

If you want MAS5 type of expression values (warning: this is slow)

>library(affy)
>myCELs <- ReadAffy()  ## reads in all the arrays in the current working directory
>myMAS <- mas5(myCELs)
background correction: mas 
PM/MM correction : mas 
expression values: mas 
background correcting...done.
22625 ids to be processed
.........
> 

If you want to uses different methods (also slow):

>library(affy)
>myCELs <- ReadAffy() 
>myExpr <- expresso(...)  ## see the Vignette or ?expresso for options

To see a list of arrays:

>sampleNames(myRMA)

NOTE: The [1] is the first position in the vector (array list).

Saving the expression values

R is a powerful statistic program and it is possible to do stats with the expression values, but it might be easier to use dCHIP (so that you can directly compare)

>exprs2excel(myRMA, file="myRMA.csv")   ## saves values in a comma-separated file 

If you use RMA, its expression values are log2. So you need to open up the file in excel and transform it. Or you can do this in R (see below, not fully tested). You can also load your data into GeneX -lite ( http://www.ncgr.org/genex/ )and a postgreSQL database.

tmp.exp <- exprs(myRMA)
tmp.exp2 <- c(nrow(tmp.exp), ncol(tmp.exp))
tmp.exp2 <- 2^tmp.exp[]
myRMA2 <- new("exprSet", exprs=tmp.exp2, phenoData = phenoData(myRMA))

Analyzing your data

Normalization

To see how your data looks, you can draw boxplots of all your chips (if on the Mac you need to doing this from X-windows). Before you normalize use:

>boxplot(myCELs, col = c(2,2,4,4))  
## col prints the first two arrays as red the second two as blue, see ?colors for more information

To save the image see below.

This makes a boxplot of the log base 2 of ONLY 5000 PM and MM pairs (ver 1.2?). Also, the outliers are included in the tails instead of at 1.5 IQR (i.e. range=0 not range=1.5)

The default normalization is quantiles (see below), but you should read the Vignette and normalize.methods(myData) to see a list of other methods.

>getOption("BioC")$affy$normalize.method
[1] "quantiles"

Now normalize your data (for all done on the same scanner) and view it:

>normalized.CELs <- normalize(myCELs) ## normalize.CELS is like normalize_CELs in C or Perl
>boxplot(normalized.CELs, col = c(2,2,4,4)

You should see a nice difference between the first and second boxplots. If you want them on the same image:

>par(mfrow=c(2,1))   ## top-bottom arrangment or c(1,2) for side-by-side
>boxplot(myCELs, col=c(2,2,4,4))  
>title("Unnormalized Arrays")
>boxplot(normalized.CELs, col=c(2,2,4,4))
>title("Normalized Arrays")

boxplots

For the brave, here is how to make different types of boxplots:
>boxplot(data.frame(log2(intensity(myCELs)))     ## plots all points length(intensity(myCELs)) 
>tmp.pm <- pm(myCELs)
>boxplot(data.frame(log2(tmp.pm)))
>tmp.index <- unlist(indexProbes(myCELs,which="pm")) ## get a list of all pm locations
>tmp.index <- tmp.index[seq(1,length(tmp.index), len=5000)] ## get the first 5000 from the list
>boxplot(data.frame(log2(intensity(myCELs)[tmp.index,]))) ## boxplot them

Scatter Plots and Correlation Values

After you get expression values (justRMA, etc...), you can compare the arrays to each other.

>cor.RMA <- cor(exprs(myRMA))    
>symnum(cor.RMA)           ## prints a symbolic table, see the last line for what the symbol values are
>cor.RMA                          ## prints the actually values
>write.table(cor.RMA, "RMA_cor.txt", sep = "\t")  ## saves it as a tab-delimited file

To visualize the arrays you can use my function corPlot (see corPlot.R).

>source("corPlot.R")
>corPlot(myRMA)

Repeat the above with the following to save it as either a jpeg or postscript file (PDFs are possible but you need Ghostscript installed). For the jpeg, you don't need to be in X-windows.

>jpeg(filename = "corRMA.jpeg", width=1000, height=1000)   ## or png
>corPlot(myRMA)
>dev.off()         ## saves the file

If you want to add a title, I don't know how. title(main="my plot") will not work because the last active plot is the bottom-right plot.

You might need to play with the size to get everything to fit nicely.

Filtering Genes

Say that you have a file of genenames (probe ids) that you have determined are interesting and they are in a tab delimited file as such:

probe.set            accession            gene
194124_s_at       WP:CE32791       C04H5.3 
194100_at          WP:CE11792       K06B4.6 
.
.

To read in the table and filter:

genelist <- read.table("gene_list.tab",sep='\t',header=T)
geneindex <- geneNames(myRMA) %in% genelist[,1]
myRMA.subset <- exprs(myRMA)[geneindex, ]

Clustering and Dendrograms

You should really look this up someplace else for more information about what you should use. However, here is a quick way to create them (for more info about the functions use ?)

Euclidean Clustering

For an Euclidean Cluster (from Sandrine Dudoit and Robert Gentleman):

ecTr <- dist(t(exprs(myRMA)), method = "euclidean")
hecTr <- hclust(hecTr, method = "average")
plot(hecTr, main = "Hierarchical clustering dendrogram")

You might try these for methods: maxium, manhattan, canberra or binary.

Now for nicer labels (from Adaikalavan Ramasamy) and such:

myRMA.matrix <- exprs(myRMA)
cn <- sampleNames(myRMA.matrix)
cn <- sapply( strsplit(cn, split="\/"), function(x) x[ length(x) ] )
cn <- sub( ".cel$|.CEL$", "", cn)
sampleNames(myRMA.matrix) <- cn

plot(hecTr, labels = cn, main = "Hierarchical clustering dendrogram", xlab = "", sub = "Average linkage, Euclidean distance for scaled arrays")

Coefficient of Correlation Clustering

Euclidian distance probably is not the best metric to reflect the biological concept of co-regulation. Pearson's coefficient of correlation better reflects this concept, but it gives an indication of similarity rather than a distance. A similarity can however be easily converted by a distance, with the folmula.

dist = max(sim) - sim

for the coefficient of correlation, the maximal value is 1. We can thus calculate the "correlation distance" accordingly.

myRMA.correlation <- as.dist(1 - cor(myRMA.matrix))
myRMA.hcorrelation <- hclust(myRMA.correlation)

The method cor() returns an object of class matrix. For the hierarchical clustering, we need an object of class dist. The method as.dist() is used to cast (reformat) a matrix into a dist object. For a Pearson's coefficient the distance has to be Zero-mean normalized and Unit-norm normalized. RMA only does unit-norm normalized. Therefore, you only get a stadard correlation.

To get something very similar (if not the same) as dCHIP:

myRMA.correlation <- as.dist(1 - cor(myRMA.matrix))
myRMA.hcorrelation <- hclust(myRMA.correlation, method = "centroid")
plot(myRMA.hcorrelation, hang = -1)

-- Main.mcolosim - 27 Feb 2003

Topic attachments
I Attachment Action Size Date Who Comment
elseR corPlot.R manage 1.7 K 2003-08-27 - 17:37 UnknownUser  
Topic revision: r13 - 2007-06-08 - 10:52:38 - KarelSF
 
Copyright © 2009 by the contributing authors of Brandeis University. All material on this collaboration platform is the property of the contributing authors.