
2012/10/09

Making Color Ramps in Matlab

When visualizing an array of data in a heatmap, a good color map makes a world of difference.

Thanks to my work in 'omics (i.e. transcriptomics - microarrays and RNASeq) I've looked at a lot of heatmaps over the past couple of years, and generated quite a few to boot.  Back in my Matlab-heavy grad school days, I was generally happy with the default 'jet' color scheme (which, given its double rainbow-esque aesthetics, would make some individuals on this planet overly emotional).  Suffice it to say, I was a bit wary of straying far from the available maps (the others I used semi-regularly were "bone", "gray", and "hot").

Today I needed to create a nice color ramp in a GUI tool I've developed in Matlab for a dataset that spanned [-Inf, Inf].  Ideally, it should have three color stops:
  • a "cool" color for extreme negative values
  • a neutral color for 0
  • a "hot" color for extreme positive values
The most "viewable" ramp of this sort (i.e. one that the color blind and the non color blind can equally enjoy) would be:
  • blue
  • black
  • yellow

If I were generating this ramp in R it would be quite trivial with the colorRampPalette() function:
bky.ramp = colorRampPalette(c('blue', 'black', 'yellow'))

The above line would create a function bky.ramp() that you could use to specify a ramping palette of arbitrary length for a heatmap() (or any other plotting function):
heatmap(X, col=bky.ramp(256))


Doing this in Matlab is similar, but a tad more obscure.  If you look at the help for the colormap() function it says:
A colormap is an m-by-3 matrix of real numbers between 0.0 and 1.0. Each row is an RGB vector that defines one color. The kth row of the colormap defines the kth color, where map(k,:) = [r(k) g(k) b(k)] specifies the intensity of red, green, and blue.
colormap(map) sets the colormap to the matrix map. If any values in map are outside the interval [0 1], you receive the error Colormap must have values in [0,1].
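So, for example, a quick two-color map (just two rows, no ramping yet) would be:
colormap([0 0 0; 1 1 0])   % black, then yellow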
 I know that the colors I need are:
  • blue = [0 0 1]
  • black = [0 0 0]
  • yellow = [1 1 0]
but how do I ramp between them?  Well for that you need interp1():
 interp1 1-D interpolation (table lookup)
    YI = interp1(X,Y,XI) interpolates to find YI, the values of the
    underlying function Y at the points in the array XI. X must be a
    vector of length N.
    If Y is a vector, then it must also have length N, and YI is the
    same size as XI.  If Y is an array of size [N,D1,D2,...,Dk], then
    the interpolation is performed for each D1-by-D2-by-...-Dk value
    in Y(i,:,:,...,:).
    If XI is a vector of length M, then YI has size [M,D1,D2,...,Dk].
    If XI is an array of size [M1,M2,...,Mj], then YI is of size
    [M1,M2,...,Mj,D1,D2,...,Dk].
In its simplest invocation, it does linear interpolation between supplied points in Y over points XI. How is this used to create a BKY color ramp with 256 levels?  Like so:
blue = [0 0 1]; black = [0 0 0]; yellow = [1 1 0];
bkyramp = interp1([blue; black; yellow], linspace(1,3,256));

If you're the type that likes to encapsulate things in reusable functions (which I am), you end up with something like this:
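function map = bkyramp(n)
% BKYRAMP  Blue-black-yellow color ramp with n levels (default 256).
% (A sketch of the idea; the function name and the default are arbitrary choices.)
if nargin < 1
    n = 256;
end
stops = [0 0 1;    % blue
         0 0 0;    % black
         1 1 0];   % yellow
% linearly interpolate between the three stops to get n colors
map = interp1(stops, linspace(1, size(stops, 1), n));
end

With that on your path, applying the ramp to a figure is just colormap(bkyramp(256)).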

2012/05/03

RegEx: Named Capture in R

I consider myself a decent RegEx user.  References to famous quotes about RegEx aside, I find it intuitive, like its speed, and appreciate that it makes my code simpler (more so than the alternative, anyhow). Thus, I use RegEx where I can in the growing grab bag of languages I consider myself proficient in:
  • *nix command line / shell scripts
  • Javascript
  • PHP
  • Matlab
  • Python
  • R
Now we arrive at the point of disappointment - R.  You see, more often than not, I use 'named capture' to extract parts from a RegEx match.  It's way easier than keeping array indices straight (especially after the code has collected a couple cobwebs).  Unlike its counterparts above (i.e. Matlab and Python), R does not implement named capture all that intuitively.  In fact, named capture is a new feature in R's generic RegEx functions (regexpr, gregexpr) as of version 2.14.0 (released late 2011) and hasn't changed in 2.15 (released 2012-03-30).

To get a sense of R's named capture inadequacy, here's a simple scenario ...

The Problem:

You are given a list of files with names like:
  • chA_0001
  • chA_0002
  • chA_0003
  • chB_0001
  • chB_0002
  • chB_0003
Your task is to identify the channel (either 'A' or 'B') and the file ID (0001, 0002, ..., etc.) for each file.

The regular expression with named capture to do this is quite simple:
ch(?<ch>[A-Z])\_(?<id>[0-9]{4})

which, given the list of file names, should return some structure with a property:value pairs of the sort:
  • ch : A, A, A, B, B, B
  • id : 0001, 0002, 0003, 0001, 0002, 0003

The Solutions:

Here's some Matlab code that basically does this in one line:
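% a sketch along these lines - assuming the file names are already in a cell array
src = {'chA_0001','chA_0002','chA_0003','chB_0001','chB_0002','chB_0003'};
% 'names' returns, for each string, a struct with one field per named group
tok = regexp(src, 'ch(?<ch>[A-Z])\_(?<id>[0-9]{4})', 'names');
tok = [tok{:}];   % collapse the cell array into a 1x6 struct array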


which would result in console output along these lines:
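>> {tok.ch}

ans = 

    'A'    'A'    'A'    'B'    'B'    'B'

>> {tok.id}

ans = 

    '0001'    '0002'    '0003'    '0001'    '0002'    '0003'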

Now here's the equivalent R code:
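# a one-liner-ish wrapper, assembled from the pieces explained in the steps below
re.capture <- function(pattern, string, ...) {
    rex <- list(src    = string,
                result = regexpr(pattern, string, perl = TRUE),
                names  = list())
    # pull each named group out of the match attributes with substr()
    for (.name in attr(rex$result, 'capture.names')) {
        rex$names[[.name]] <- substr(rex$src,
                                     attr(rex$result, 'capture.start')[, .name],
                                     attr(rex$result, 'capture.start')[, .name]
                                     + attr(rex$result, 'capture.length')[, .name]
                                     - 1)
    }
    rex
}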

There is a lot of work here! To help explain what's going on, the corresponding console output is broken out in the steps below.

Here's what's happening:
  1. regexpr(..., perl=T) is used to create a regular expression result with named capture, which is placed in the $result item of the output list.
    $result
    [1] 1 1 1 1 1 1
    attr(,"match.length")
    [1] 8 8 8 8 8 8
    attr(,"useBytes")
    [1] TRUE
    attr(,"capture.start")
         ch id
    [1,]  3  5
    [2,]  3  5
    [3,]  3  5
    [4,]  3  5
    [5,]  3  5
    [6,]  3  5
    attr(,"capture.length")
         ch id
    [1,]  1  4
    [2,]  1  4
    [3,]  1  4
    [4,]  1  4
    [5,]  1  4
    [6,]  1  4
    attr(,"capture.names")
    [1] "ch" "id"
    
    This result is pretty unusable since all of the important captured information is buried in attribute settings.
  2. To do anything with the output from regexpr(), the result from #1 has to have its attributes probed using attr() (via a for loop) to get:
    • captured group names
    • start locations within the strings of the captured groups
    • length of the captured groups (oddly/depressingly, end positions are not returned)
    The combination of the above is used by substr() to extract the actual match strings from the input list:
    rex$names[[.name]] = substr(rex$src,
                                attr(rex$result, 'capture.start')[,.name],
                                attr(rex$result, 'capture.start')[,.name]
                                + attr(rex$result, 'capture.length')[,.name]
                                - 1)
    
  3. The above steps are encapsulated into a much easier-to-use function re.capture() that allows for one-line-ish extraction:
    > src
    [1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
    > pat
    [1] "ch(?[A-Z])\\_(?[0-9]{4})"
    > re.capture(pat, src)$names$ch
    [1] "A" "A" "A" "B" "B" "B"
    > re.capture(pat, src)$names$id
    [1] "0001" "0002" "0003" "0001" "0002" "0003"
    

Summary

All told, it takes three functions and a for loop to get a user-friendly named capture result! While I was able to make a one-liner function out of the ordeal, it's a shame that someone on the R development team couldn't build this into the return values for regexpr() and gregexpr(). Granted, I'm not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?

2012/04/13

One app, three languages

This past week at work I had the opportunity to code the same algorithm in each of the three scientific programming/scripting languages I'm familiar with:
  • Matlab
  • Python
  • R
The list above is the order in which the (re)-coding was done, and serves as the beginning of an answer as to why I had to (and wanted to) do such repetitive work.

Before getting into the details, first the problem: grab multiple optimized DNA sequences from an MS Excel workbook and format them as a FASTA text file for use with a webapp for rare codon analysis.  Prior to seeking my help, users were manually copying the sequences (located in one cell across multiple sheets) into an MS Word document. This was fine for 2-5 sequences, but got seriously tedious and error-prone for anything >10.
Lastly, this is all done in Windows (for added craziness).

Round 1: Matlab (aka the 500lb gorilla)

Out of the (very expensive) box, Matlab has MS Excel read/write capabilities via the functions:
xlsfinfo
xlsread
xlswrite

Adding the (also expensive) Bioinformatics Toolbox gives FASTA file I/O via:
fastaread
fastawrite

Code:
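Something along these lines (the workbook name, and which cell holds each sequence, are placeholders):

% one sequence per worksheet -> one FASTA file
[~, sheets] = xlsfinfo('sequences.xlsx');          % list the worksheet names
for k = 1:numel(sheets)
    [~, txt] = xlsread('sequences.xlsx', sheets{k});
    seq(k).Header   = sheets{k};                   % sheet name doubles as the FASTA header
    seq(k).Sequence = txt{1};                      % the optimized sequence lives in one cell
end
fastawrite('sequences.fasta', seq);                % needs the Bioinformatics Toolbox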


Using the Matlab Compiler (again, if you've paid for it) this distills nicely into a tidy command line executable.

The problems began (as usual) immediately after deployment.

First, in order to run any compiled Matlab code, users need to install the weighty (~400MB) Matlab Component Runtime (MCR), which in corporate IT-lockdown land is its own form of enjoyment.

Second, and a horrendous PITA if you ask me, the version of the MCR users need depends on the version of Matlab the code was compiled in.  Worse, there is no backward compatibility.  In this case, users needed version 7.16 of the runtime to correspond with my Matlab 2011b.  However, the program that generated the sequences in the first place (also a Matlab compile job) was made with Matlab 2009a.

It was late in the afternoon, the IT guys were gone, and I didn't want to have to deal with any craziness of conflicting runtimes.

Sorry Matlab, you suck.

Round 2: Python (Parseltongue, anyone?)

There's a lot of hubbub about how NumPy/SciPy + matplotlib in distributions like Python(X,Y) can totally replace Matlab - FOR FREE!  I've yet to really delve into that module stack.  For the task at hand, bare Python has all that's needed with a few modules (again, ALL FREE):

MS Excel spelunking:
xlrd
xlwt

FASTA file acrobatics (and much much more):
BioPython

Code:
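The same idea, sketched with placeholders for the file name and the cell holding each sequence:

# one sequence per worksheet -> one FASTA file
import xlrd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

book = xlrd.open_workbook('sequences.xls')      # xlrd wants the older .xls format
records = []
for sheet in book.sheets():
    seq_str = sheet.cell_value(0, 0)            # the optimized sequence lives in one cell
    records.append(SeqRecord(Seq(seq_str), id=sheet.name, description=''))
SeqIO.write(records, 'sequences.fasta', 'fasta')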


As an important note, the Python code completes almost instantaneously whereas the Matlab code took at least 5 seconds (about 1 sec per worksheet in the source .xlsx file). I don't know why Matlab takes so long to get/put data from/into an Excel workbook, but I sure hope the MathWorks engineers are working on it. Sure, this is slightly unfair since Python is byte-compiled when run, but on a modern PC with 1GHz of multi-core processing power and 3GB of RAM, I expect performance, darn-it.
As a small slight to the Python camp, the xlrd/xlwt modules are only compatible with the older .xls (Microsoft Excel 97-2000) format files and not the newer XML based .xlsx files. So it does require one extra step ... par for the course.
Compiling to a console executable is made easy with Py2Exe.
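The whole setup script fits in a few lines (the script name here is hypothetical):

# setup.py - build with: python setup.py py2exe
from distutils.core import setup
import py2exe

setup(console=['xls2fasta.py'])    # drops xls2fasta.exe into ./dist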

Deploying is a snap - zip and email everything in the ./dist folder of wherever you Py2Exe'd your source code.

Of course, getting users that were originally in happy point and click land to work at the C: prompt kinda stopped this awesome train of free and open source progress dead in its tracks.

Round 3: R (statistics will help you find that buried treasure, er p-value)

R directly works with MS Excel files? Yup:
xlsx

FASTA files? Bioinformatics and statistics go hand-in-hand, so this is pretty much a given:
seqinr

Code:
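Once more with placeholders for the file name and the sequence cell, this time in R:

# one sequence per worksheet -> one FASTA file
library(xlsx)
library(seqinr)

wb     <- loadWorkbook('sequences.xlsx')
sheets <- names(getSheets(wb))                       # worksheet names
seqs   <- sapply(sheets, function(s) {
    df <- read.xlsx('sequences.xlsx', sheetName = s, header = FALSE)
    as.character(df[1, 1])                           # the optimized sequence lives in one cell
})
write.fasta(lapply(seqs, s2c), names = sheets, file.out = 'sequences.fasta')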


As far as speed goes, the R code bested Matlab and was on par with Python. Pretty interesting and consistent with the benchmarks posted by the Julia team.
A small word of warning: the xlsx package uses the Apache POI Java library in the background and does run into significant memory cloggery (at least on my workstation) when working with sheets heavily laden with rows and/or columns.
Compiling, well, I'm not completely sure it exists yet (although RCC looks interesting). Of course, who needs to compile when you can just drop this on your webserver behind rApache and a simple webform? There, user command-line aversion solved.

Wrap-up

So what did this exercise in redundancy teach me? Well, thanks to the plethora of open-source tools, there is more than one way to skin a software-deployment cat. It also shows how platforms that started out as niche, like Python and R, have come to parity with juggernauts like Matlab. Last, it has me satisfied (for now) in my programming poly-glottery.