I've been managing my photos in Google's Picasa since 2009. I personally think it's a good application with all the social connectivity I need (Facebook, Google+). I expected that Google, being a company built on information and search, would have made the search bar in Picasa as robust as their web search bar. Unfortunately, not so.
If you use any of Google's web products, you should be familiar with searching by dates using keys like...
date:
between:
before:
after:
Well, these don't work in Picasa. Worse, all there is for filtering by date is a slider bar that limits the range from some time in the past to the current date. This is not at all useful if you are only looking for photos taken in, say, June 2012.
Yes, Picasa naturally organizes photos into date-named folders. However, this convenience is lost when importing instant uploads from Google+, which just drops a pile of photos/videos into one "Instant Uploads" folder.
What the Picasa search bar does do is index all text fields/tags in your photos. If EXIF data is saved with your photos, there is a field called "Camera Date" which holds the date (according to the camera) that the photo was taken. This field is formatted as:
yyyy:mm:dd hh:mm:ss
(where hours are in 24-hr format)
So, if you want to search for June 2012 photos, type:
2012:06
into the search bar. While this works to grab photos from a specific month, there still isn't a way to find photos from a specific date range. Hopefully, this will be a new feature in an upcoming release.
Today I went to the RStudio site to post a feature request. I noticed up in the top navigation a link called "Shiny". I like things that are shiny. It turns out Shiny is a new package the RStudio team has developed for easily making reactive web applications from R code.
One of the tools I built at my current job is a web application that uses R as the server-side analysis engine with A LOT of HTML/CSS and AJAX tomfoolery in the browser to make things responsive and pretty.
From the looks of it, the shiny new Shiny package will significantly reduce UI development time. In fact, it appears to make web UI development more like UI development for desktop Python apps using [insert your favorite toolkit here] (note: so far I mostly have experience with wx).
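To give a flavor of the reactive style, here's a minimal sketch of my own devising (not from any real app): a slider that re-draws a histogram whenever it moves.

library(shiny)

ui <- fluidPage(
  sliderInput('n', 'Number of points:', min = 10, max = 1000, value = 100),
  plotOutput('hist')
)
server <- function(input, output) {
  # re-runs automatically whenever input$n changes
  output$hist <- renderPlot(hist(rnorm(input$n)))
}
shinyApp(ui, server)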
Ever since I started using R (back in 2007), I have often lamented the lack of a way to create easy-to-use GUIs that could encapsulate complex analysis scripts for non-R users. In MATLAB, this is done using GUIDE or low-level uicontrol() functions. In 2009, RGG looked promising, and I also found SciViews-R, but I never really had/invested the time to look deeper into either of these tools. Besides, doing things from the comfort of a web browser is all the rage these days (and certainly makes deploying app updates easier).
A friend of mine is finally joining the smartphone-using masses and asked the general Facebook public what type they should get: iPhone or Android (or other). Here's what I posted in response.
I'm an Android user and will upgrade to another Android when the time comes. I have plenty of iPhone-toting friends, and yes, the iPhone is a decent piece of hardware. But it is a mobile "experience" that is carefully constructed and controlled by one company - Apple. Therein lies why I won't get an iPhone.
Apple has a knack for keeping people locked to their products. Get an iPhone and you have to get apps from the Apple App Store and music/books from Apple's iTunes Store. The iPhone 5 uses a proprietary, and expensive, charging/data sync cable that only Apple sells (and is currently not licensing to third parties). Upgrading iOS has a tendency to remove apps made by companies Apple doesn't like (e.g. Google, as is the case with iOS 6 replacing Google Maps with the poorly executed Apple Maps). I just don't like that sort of micromanaging - especially by a company that's only after my money.
I won't say that Google/Android is better for everyone. I just know it's better for how I want to use my phone. I like how flexible and customizable it is. I like that it is usable on anything from a free upgrade phone to a premium one that would cost me $300.
Ultimately, by the numbers, iPhones and (premium) Android phones are equivalent. On both you can check your email, post on Facebook, check the weather/traffic, take photos, and occasionally make a phone call. It really depends on who bothers you less as they look over your shoulder - Apple or Google.
When visualizing an array of data in a heatmap, a good color map makes a world of difference.
Thanks to my work in 'omics (i.e. transcriptomics - microarrays and RNASeq) I've looked at a lot of heatmaps over the past couple of years, and generated quite a few to boot. Back in my Matlab-heavy grad school days, I was generally happy with the default 'jet' color scheme (which, given its double-rainbow-esque aesthetics, would make some individuals on this planet overly emotional). Suffice it to say, I was a bit wary of straying far from the available maps (the others I used semi-regularly were "bone", "gray", and "hot").
Today I needed to create a nice color ramp in a GUI tool I've developed in Matlab for a dataset that spanned [-Inf, Inf]. Ideally, it should have three color stops:
a "cool" color for extreme negative values
a neutral color for 0
a "hot" color for extreme positive values
The most "viewable" ramp of this sort (e.g. one that the color non-blinds and color blinds can equally enjoy) would be:
blue
black
yellow
If I were generating this ramp in R it would be quite trivial with the colorRampPalette() function:
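bky.ramp <- colorRampPalette(c('blue', 'black', 'yellow'))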
The above line would create a function bky.ramp() that you could use to specify a ramping palette of arbitrary length for a heatmap() (or any other plotting function):
heatmap(X, col=bky.ramp(256))
Doing this in Matlab is similar, but a tad more obscure. If you look at the help for the colormap() function it says:
A colormap is an m-by-3 matrix of real numbers between 0.0 and 1.0. Each row is an RGB vector that defines one color. The kth row of the colormap defines the kth color, where map(k,:) = [r(k) g(k) b(k)] specifies the intensity of red, green, and blue.
colormap(map) sets the colormap to the matrix map. If any values in map are outside the interval [0 1], you receive the error "Colormap must have values in [0,1]."
I know that the colors I need are:
blue = [0 0 1]
black = [0 0 0]
yellow = [1 1 0]
but how do I ramp between them? Well for that you need interp1():
interp1 1-D interpolation (table lookup)
YI = interp1(X,Y,XI) interpolates to find YI, the values of the
underlying function Y at the points in the array XI. X must be a
vector of length N.
If Y is a vector, then it must also have length N, and YI is the
same size as XI. If Y is an array of size [N,D1,D2,...,Dk], then
the interpolation is performed for each D1-by-D2-by-...-Dk value
in Y(i,:,:,...,:).
If XI is a vector of length M, then YI has size [M,D1,D2,...,Dk].
If XI is an array of size [M1,M2,...,Mj], then YI is of size
[M1,M2,...,Mj,D1,D2,...,Dk].
In its simplest invocation, it does linear interpolation between supplied points in Y over points XI. How is this used to create a BKY color ramp with 256 levels? Like so:
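% a sketch of the idea: interpolate between the three color stops
bky = [0 0 1; 0 0 0; 1 1 0];                         % blue, black, yellow
map = interp1([0; 0.5; 1], bky, linspace(0, 1, 256));
colormap(map);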
If you're the type that likes to encapsulate things in reusable functions (which I am), you end up with something like this:
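function map = bky(m)
% BKY  blue-black-yellow colormap with m levels
% (a sketch; the function name and default-size behavior are my choices)
if nargin < 1
    m = size(get(gcf, 'Colormap'), 1);   % default to the current colormap's length
end
stops = [0 0 1; 0 0 0; 1 1 0];           % blue, black, yellow
map   = interp1([0; 0.5; 1], stops, linspace(0, 1, m));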
Submitting papers for publication is a painful process in many, many, many ways. One of the most common modes of torture is having to reformat your manuscript from one set of guidelines to another. Here I feature one shiny bit of ludicrousness that really makes me wonder where journal editors' priorities are.
Math Equations and DOCX
If your manuscript is or will be in DOCX and contains equations, you must follow the instructions below to make sure that your equations are editable when the file enters production.
If you have not yet composed your article, you can ensure that the equations in your DOCX file remain editable in DOC by enabling “Compatibility Mode” before you begin. To do this, open a new document and save as Word 97-2003 (*.doc). Several features of Word 2007/10 will now be inactive, including the built-in equation editing tool. You can insert equations in one of the two ways listed below.
If you have already composed your article as DOCX and used its built-in equation editing tool, your equations will become images when the file is saved down to DOC. To resolve this problem, re-key your equations in one of the two following ways.
Use MathType to create the equation. MathType is the recommended method for creating equations.
Go to Insert > Object > Microsoft Equation 3.0 and create the equation.
If, when saving your final document, you see a message saying “Equations will be converted to images,” your equations are no longer editable and PLoS will not be able to accept your file.
Seriously folks. It's 2012. Let *.doc and Microsoft Equation 3.0 die already. While you're at it, let's figure out a universal formatting guideline to submit with. All of science will thank you.
Here at work I've been in the business of developing webapps using R as the backend computational framework. The list of parts to get this running is pretty lightweight, just:
Apache
rApache
R
I'm not going to cover how to set these things up here; there is pretty good documentation around the web and on rApache's site. Instead, I'm going to talk about a hair-pulling setback I encountered early on.
Problem
R scripts run behind rApache cannot load rJava without throwing an HTTP 500 error
Details
Specifically, if you look at the error_log file you see something like the following:
Loading required package: rJava
Error : .onLoad failed in loadNamespace() for 'rJava', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/usr/local/lib64/R/library/rJava/libs/rJava.so':
libjvm.so: cannot open shared object file: No such file or directory
Error: package 'rJava' could not be loaded
Running the same R script from
a user login session ... no problem.
behind PHP (via a system() call) ... no problem.
Suffice it to say, this had me really really stumped. Stumped enough to give up temporarily and settle with calling R code that needed rJava via a PHP-to-shell intermediary. Of course, that got confusing and unscalable quite quickly, forcing me to find a real solution.
So I started digging and found one unanswered post on the rApache Google Group relating to this problem dating back to 2010 (it's answered now, with my solution as detailed below). Not helpful.
More digging produced this post, which pointed me in the direction of the LD_LIBRARY_PATH variable, which apparently you shouldn't mess with directly unless you want a lot of R pain.
Using the following one line test script:
cat(Sys.getenv()['LD_LIBRARY_PATH'], '\n')
I quickly determined that rApache was NOT setting this variable, or anything else defined in R/etc/ldpaths, before creating an instance of R.
According to the folks that work on RStudio, this variable needs to be set before R starts in order for rJava to initialize correctly - i.e. to be able to find libjvm.so.
So how do you do this in an Apache process? I know that using a SetEnv directive in httpd.conf is a dead end. Thankfully, folks at the Ubuntu forums found a way.
Solution
Here's my modification of the Ubuntu forum solution.
Step 1:
As root, create a new file in /etc/ld.so.conf.d/ (the name is up to you; rJava.conf, say) containing a single line: the path of the directory that libjvm.so lives in on your server (i.e. its direct parent directory).
Step 2:
As root, run:
/sbin/ldconfig
Step 3:
Restart Apache
Wrap-up
After all this rigamarole, it appears that I can load packages that depend on rJava from within rApache - i.e. lines like
library(rJava)
no longer complain and I'm not getting any more HTTP 500 errors as a result, which makes me happy for the moment. How long this happiness lasts depends. R scripts within rApache still don't see an LD_LIBRARY_PATH variable, but at least the parent Apache process knows where to find libjvm.so.
In my ongoing quest to webappify various R scripts I discovered that rApache cannot load any R packages that depend on rJava. For several of the scripts that I've written that grab data out of MS Excel files, and therein use the xlsx package, this is a serious brick wall.
In my current workaround, I've resorted to using a shell script to do the xls(x) to .RData conversion. Then I stumbled upon the gdata package. Buried deep, deep, deep within its documentation is a function called
read.xls()
that relies on Perl rather than Java to do the heavy lifting of crawling both of Microsoft's proprietary formats - the binary .xls and the XML-based .xlsx.
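In principle, usage is then a one-liner (the file name here is hypothetical):
library(gdata)
# works for both .xls and .xlsx, provided Perl is installed
seqs <- read.xls('sequences.xlsx', sheet = 1)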
Testing is currently underway and a comparative write-up is planned.
I consider myself a decent RegEx user. References to famous quotes about RegEx aside, I find it intuitive, like its speed and that it makes my code simple (more so than the alternative anyhow). Thus, I use RegEx where I can in the growing grab bag of languages I consider myself proficient in:
*nix command line / shell scripts
Javascript
PHP
Matlab
Python
R
Now we arrive at the point of disappointment - R. You see, more often than not, I use 'named capture' to extract parts from a RegEx match. It's way easier than keeping array indices straight (especially after the code has collected a couple of cobwebs). Unlike its counterparts above (i.e. Matlab and Python), R does not implement named capture all that intuitively. In fact, named capture is a new feature in R's generic RegEx functions (regexpr, gregexpr) as of version 2.14.0 (released late 2011) and hasn't changed in 2.15 (released 2012-03-30).
To get a sense of R's named capture inadequacy, here's a simple scenario ...
The Problem:
You are given a list of files with names like:
chA_0001
chA_0002
chA_0003
chB_0001
chB_0002
chB_0003
Your task is to identify the channel (either 'A' or 'B') and the file ID (0001, 0002, ..., etc.).
The regular expression with named capture to do this is quite simple:
ch(?&lt;ch&gt;[A-Z])\_(?&lt;id&gt;[0-9]{4})
which, given the list of file names, should return some structure with property:value pairs of the sort:
ch : A, A, A, B, B, B
id : 0001, 0002, 0003, 0001, 0002, 0003
The Solutions:
Here's some Matlab code that basically does this in one line:
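% a sketch of that one-liner; src holds the file names listed above
src = {'chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003'};
tok = regexp(src, 'ch(?&lt;ch&gt;[A-Z])_(?&lt;id&gt;[0-9]{4})', 'names');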
which returns (in tok) one struct per file name, each with fields ch and id holding the captured values - e.g. tok{1}.ch is 'A' and tok{1}.id is '0001'.
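Now the same extraction in R, using the capture attributes that regexpr() attaches to its match (a sketch; perl=TRUE is required for named groups):

src <- c('chA_0001', 'chA_0002', 'chA_0003', 'chB_0001', 'chB_0002', 'chB_0003')
pat <- 'ch(?&lt;ch&gt;[A-Z])\\_(?&lt;id&gt;[0-9]{4})'
m <- regexpr(pat, src, perl=TRUE)
starts <- attr(m, 'capture.start')    # matrix: one row per string, one column per group
lens   <- attr(m, 'capture.length')
substring(src, starts[, 'ch'], starts[, 'ch'] + lens[, 'ch'] - 1)   # "A" "A" "A" "B" "B" "B"
substring(src, starts[, 'id'], starts[, 'id'] + lens[, 'id'] - 1)   # "0001" "0002" "0003" ...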
There is a lot of work here! The match object m only carries positions; the capture.start, capture.length, and capture.names attributes hold all the bookkeeping, and each named field has to be cut back out of the source strings with substring().
The above steps can be encapsulated into a much easier to use function, re.capture(), that allows for one-line-ish extraction.
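A minimal sketch of such a function, built on the same attribute plumbing:

re.capture <- function(pattern, string, ...) {
  m <- regexpr(pattern, string, perl = TRUE, ...)
  starts <- attr(m, 'capture.start')
  lens   <- attr(m, 'capture.length')
  caps <- lapply(seq_len(ncol(starts)), function(i)
    substring(string, starts[, i], starts[, i] + lens[, i] - 1))
  names(caps) <- attr(m, 'capture.names')
  list(result = m, names = caps)
}

In action: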
> src
[1] "chA_0001" "chA_0002" "chA_0003" "chB_0001" "chB_0002" "chB_0003"
> pat
[1] "ch(?[A-Z])\\_(?[0-9]{4})"
> re.capture(pat, src)$names$ch
[1] "A" "A" "A" "B" "B" "B"
> re.capture(pat, src)$names$id
[1] "0001" "0002" "0003" "0001" "0002" "0003"
Summary
All told, it takes three functions and a for loop to get a user-friendly named capture result! While I was able to make a one-liner function out of the ordeal, it's a shame that someone on the R development team couldn't build this into the return values of regexpr() and gregexpr(). Granted, I'm not the first to wish for something better. Perhaps this is something to look forward to in R 2.16?
This past weekend, I spent a couple of days in Seattle reviewing my life as a postdoc there.
Part of that involved getting as many details as possible on how my former lab used to process microarrays so that I could bring that experience to my new job. Basically, I want to bring what is currently a MS Excel and basic RMA affair into a more modern Systems Biology "Big Data" light.
Among the takeaways was the hard-to-implement but obvious advice of Environment Mapping - cataloging ALL genotypes and environment conditions for each microarray sample - applied to the lot of existing experiments, pronto.
As I've volunteered to present these concepts in about two weeks' time, next week is going to be fun.
This past week at work I had the opportunity to code the same algorithm using each of the three scientific programming/scripting languages I'm familiar with:
Matlab
Python
R
The list above is the order in which the (re)coding was done and serves as the beginning of an answer as to why I had|wanted to do such repetitive work.
Before getting into the details, first the problem: grab multiple optimized DNA sequences from a MS Excel workbook and format them as a FASTA text file for use with a webapp for rare codon analysis. Prior to seeking my help, users were manually copying the sequences (located in one cell across multiple sheets) into a MS Word document. This was fine for 2-5 sequences, but got seriously tedious and error-prone for anything >10.
Lastly, this is all done in Windows (for added craziness).
Round 1: Matlab (aka the 500lb gorilla)
Out of the (very expensive) box it has MS Excel read/write capabilities via the functions
xlsfinfo
xlsread
xlswrite
Adding the (also expensive) Bioinformatics Toolbox gives FASTA file io via
fastaread
fastawrite
Code:
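Something along these lines (the workbook name, sheet layout, and cell address are my assumptions, not the original script):

% pull one sequence per worksheet and append it to a FASTA file
xlsfile = 'sequences.xlsx';
[~, sheets] = xlsfinfo(xlsfile);                     % names of all worksheets
for i = 1:numel(sheets)
    [~, txt] = xlsread(xlsfile, sheets{i}, 'B2');    % the cell holding the sequence
    fastawrite('sequences.fasta', sheets{i}, txt{1});
end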
Using the Matlab Compiler (again, if you've paid for it) this distills nicely into a tidy command line executable.
The problems began (as usual) immediately after deployment.
First, in order to run any compiled Matlab code, users need to install the weighty (~400MB) Matlab Component Runtime (MCR), which in corporate IT-lockdown land is its own form of enjoyment.
Second, and a horrendous PITA if you ask me, the version of the MCR users need depends on the version of Matlab the code was compiled in. Worse, there is no backward compatibility. In this case, users needed version 7.16 of the runtime to correspond with my Matlab 2011b. However, the program that generated the sequences in the first place (also a Matlab compile job) was made with Matlab 2009a.
It was late in the afternoon, the IT guys were gone, and I didn't want to have to deal with any craziness of conflicting runtimes.
Sorry Matlab, you suck.
Round 2: Python (Parseltongue, anyone?)
There's a lot of hubbub about how NumPy/SciPy + matplotlib in distributions like Python(X,Y) can totally replace Matlab - FOR FREE! I've yet to really delve into that module stack. For the task at hand, bare Python has all that's needed with a few modules (again, ALL FREE):
MS Excel file acrobatics: xlrd
FASTA file acrobatics (and much much more): BioPython
Code:
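A sketch (file names and the cell address are assumptions; note the older .xls input - more on that below):

# xls2fasta.py - one sequence per worksheet, out to FASTA
import xlrd
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

book = xlrd.open_workbook('sequences.xls')
records = [SeqRecord(Seq(sheet.cell_value(1, 1)),    # the cell holding the sequence
                     id=sheet.name, description='')
           for sheet in book.sheets()]
SeqIO.write(records, 'sequences.fasta', 'fasta')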
As an important note, the Python code completes almost instantaneously whereas the Matlab code took at least 5 seconds (about 1 sec per worksheet in the source .xlsx file). I don't know why Matlab takes so long to get/put data from/into an Excel workbook, but I sure hope the MathWorks engineers are working on it. Sure, this comparison is slightly unfair since Python is byte-compiled when run, but on a modern PC with 1GHz of multi-core processing power and 3GB of RAM, I expect performance, darn it.
As a small slight to the Python camp, the xlrd/xlwt modules are only compatible with the older .xls (Microsoft Excel 97-2003) format and not the newer XML-based .xlsx files. So it does require one extra step ... par for the course.
Compiling to an console executable is made easy with Py2Exe.
Deploying is a snap - zip and email everything in the ./dist folder of where you Py2Exe'd your source code.
Of course, getting users that were originally in happy point and click land to work at the C: prompt kinda stopped this awesome train of free and open source progress dead in its tracks.
Round 3: R (statistics will help you find that buried treasure, er, p-value)
MS Excel file acrobatics: xlsx
FASTA file acrobatics: seqinr (Bioinformatics and statistics go hand-in-hand, so this one is pretty much a given)
Code:
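A sketch (again, the file name, sheet layout, and cell address are my assumptions):

library(xlsx)     # Excel io (Apache POI under the hood)
library(seqinr)   # FASTA io

xlsfile <- 'sequences.xlsx'
sheets  <- names(getSheets(loadWorkbook(xlsfile)))
seqs    <- lapply(seq_along(sheets), function(i)
  as.character(read.xlsx(xlsfile, sheetIndex = i, rowIndex = 2,
                         colIndex = 2, header = FALSE)[1, 1]))
write.fasta(seqs, names = sheets, file.out = 'sequences.fasta')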
As far as speed goes, the R code bested Matlab and was on par with Python. Pretty interesting and consistent with the benchmarks posted by the Julia team.
A small word of warning, the xlsx package uses the Apache POI java library in the background and does run into significant memory cloggery (at least on my workstation) when working with sheets heavily laden with rows and/or columns.
Compiling, well, I'm not completely sure it exists yet (although RCC looks interesting). Of course, who needs to compile if you can just drop this on your webserver behind rApache and a simple webform. There, user command-line aversion solved.
Wrap-up
So what did this exercise in redundancy teach me? Well, thanks to the plethora of open-source tools, there is more than one way to skin a software-deployment cat. It also shows how completely once-niche platforms like Python and R have come to parity with juggernauts like Matlab. Last, it has me satisfied (for now) in my programming polyglottery.