Visualizing Pairwise Queries in R

August 2, 2014   


If you do a lot of biological research and want to know whether an association exists between every pairwise combination of two sets of terms (e.g. two gene lists), you can use PubMed search result counts as a proxy for relative association.

In the example below, I show the results from organisms x diseases to give a rough estimate of how much each disease is studied in a given organism. Of course, this should all be taken with a (big) grain of salt because these organisms and diseases have many synonyms or related terms (e.g. M. musculus is often referred to as mouse in the literature). Additionally, the result count reflects whether the terms were found together in the title and abstract only; in many cases the body of the text is not searched.
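As a rough illustration of the idea (not the script used for the post), the pairwise queries could be assembled along these lines in Python. The function names are hypothetical; the sketch assumes the NCBI E-utilities esearch endpoint with rettype=count and the [Title/Abstract] field tag, which matches the title-and-abstract caveat above.

```python
from itertools import product
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# NCBI E-utilities search endpoint; rettype=count returns only the hit count.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(term_a, term_b):
    """Require both terms to appear in the title or abstract."""
    return f'"{term_a}"[Title/Abstract] AND "{term_b}"[Title/Abstract]'


def fetch_count_pubmed(query):
    """Ask PubMed how many records match the query (needs network access)."""
    params = urlencode({"db": "pubmed", "term": query, "rettype": "count"})
    with urlopen(f"{ESEARCH_URL}?{params}") as response:
        return int(ET.fromstring(response.read()).findtext("Count"))


def pairwise_counts(rows, cols, fetch_count=fetch_count_pubmed):
    """Return {(row, col): count} for every row x col combination.

    fetch_count is any callable mapping a query string to a hit count,
    so a stub can stand in for PubMed while testing.
    """
    return {(a, b): fetch_count(build_query(a, b)) for a, b in product(rows, cols)}
```

Calling pairwise_counts(organisms, diseases) would give one count per cell of the organisms x diseases grid, ready to be drawn as a heatmap. For long term lists it is worth pausing between requests, since NCBI limits how quickly unauthenticated clients may query.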

Bioinformatics  gist

A Short Tour around Lake Michigan

July 12, 2014   

I’ve given bicycle touring a try. Originally I wanted to bike around Lake Michigan, but it turns out to be over 1,400 miles. So I compromised on a three-day trip around a good chunk of it, making use of the ferry from Muskegon, MI to Milwaukee, WI. This was my first time, so I also decided to stay in hotels. Next time I intend to camp. I learned a few valuable lessons along the way!

  • Pack less stuff! – I had way too much. In fact, I wound up breaking two spokes on the second day.
  • Shorten the days – Having never gone more than 40 miles in a single day, I decided to go 108 on the first day. Yeah. I probably should have gone more like 60-70 each day. By the time I got to my destination each day I was too tired to do anything. Part of the experience is seeing new places.
  • Get a proper touring bike – I didn’t use a touring bike because I don’t have one (yet). I used a Trek 7.2. My wrists hurt a lot for parts of the trip. Next time I’ll get a proper touring bike with the appropriate handlebars.
Trip around Lake Michigan: Day 1 · Day 2 · Ferry · Day 3



Outside Muskegon

After biking 8 hours - I found I could eat whatever I wanted.

Boardwalk in Muskegon, MI

Middle of Lake Michigan - into the fog

Lakefront trail in Milwaukee

The trail outside Racine, WI

Biking  cycling

How to plot all of your Runkeeper Data

May 30, 2014   
Runs in Iowa City
Running and Biking in Chicago

If you use Runkeeper and pay for a yearly subscription (Runkeeper Elite), you can export your data and plot all of your activities simultaneously using R. I’ve written a script for doing so (special thanks to FlowingData, which has a tutorial that helped with a few key parts of this).

The script does a few unique things.

  • Runkeeper exports data in GPX format. If you ever pause an activity within Runkeeper or briefly lose GPS reception, the GPS path will get split into multiple paths within the same file. The script will retain all paths and plot them separately.
  • This script will merge in the type of activities so you can plot different types of activities by color.
  • Finally, cluster analysis is used to segregate different locations when plotting. If you are like me and have moved around a bit – this is necessary as plotting distant locations on the same map (e.g. Chicago and Boston) is not feasible and results in distant locations being plotted as single points.
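
To make the first point concrete, here is a minimal Python sketch (the post's own script is in R and is not reproduced here) that reads a GPX file and keeps every track segment as its own path instead of joining them. It assumes GPX 1.1 files with the standard topografix namespace; read_segments and SAMPLE are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the export uses the GPX 1.1 namespace.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def read_segments(gpx_text):
    """Return one list of (lat, lon) tuples per <trkseg>, never merging them."""
    root = ET.fromstring(gpx_text)
    segments = []
    for seg in root.iterfind(".//gpx:trkseg", GPX_NS):
        points = [
            (float(pt.get("lat")), float(pt.get("lon")))
            for pt in seg.iterfind("gpx:trkpt", GPX_NS)
        ]
        if points:  # skip empty segments left behind by a pause
            segments.append(points)
    return segments


# A tiny made-up activity that was paused once, giving two segments.
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><name>Run</name>
    <trkseg>
      <trkpt lat="41.88" lon="-87.62"/>
      <trkpt lat="41.89" lon="-87.61"/>
    </trkseg>
    <trkseg>
      <trkpt lat="41.90" lon="-87.60"/>
    </trkseg>
  </trk>
</gpx>"""

segments = read_segments(SAMPLE)
```

Drawing each entry of segments as a separate line avoids the straight "jump" that appears when the points before and after a pause are connected.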


  1. Export your Runkeeper data. The option is available to subscribers only, under the settings menu.

Exporting Runkeeper Data

Exporting can be done from within the settings menu

  2. Place the script below within a folder containing your Runkeeper data. Set the num_locations variable to the number of places you have lived/run. The script uses this value to separate your activities into that many distinct locations automatically.
  3. Install the necessary R packages. You can run the following code within R to do so.
<pre>install.packages("fpc")
install.packages("plyr")
install.packages("dplyr")
install.packages("mapproj")</pre>
  4. Run the script below from within RStudio, or on Unix-based machines using Rscript plot_runkeeper.R. If you are using RStudio, be sure to set the working directory using setwd().
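
The role of num_locations in the steps above can be illustrated with a small clustering sketch in Python. This is not the method used by the R script (which relies on the fpc package); it swaps in a plain k-means with a deterministic farthest-point start, and it treats latitude and longitude as flat coordinates, which is only a rough approximation. The kmeans function and the sample coordinates are made up for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on (lat, lon) pairs; returns one cluster label per point."""
    # Deterministic seeding: take the first point, then repeatedly add the
    # point that is farthest from every center chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points
        ]
        # Move each center to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


# Starting points of five made-up activities in three cities.
starts = [
    (41.88, -87.62), (41.90, -87.63),   # Chicago
    (41.66, -91.53), (41.65, -91.55),   # Iowa City
    (42.36, -71.06),                    # Boston
]
num_locations = 3
labels = kmeans(starts, num_locations)
```

Activities that share a label can then be drawn on their own map, so that distant cities do not collapse into single points.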
Programming  R  biking gist running

Where I Run and Bike in Chicago

May 25, 2014   


Using Runkeeper and with the help of a tutorial at FlowingData, I was able to plot all of the running and biking I’ve been doing in Chicago since moving here two years ago. The blue is running and the black is biking.

Chicago  bike chicago

Double Checking FASTQs

May 24, 2014   

When you have performed a sequencing project, quality control is one of the first things you will need to do. Unfortunately, sample mix-ups and other issues can and do happen. Systematic biases can also occur by machine and lane.

This script will extract basic information from a set of FASTQs and output it to a summary file (fastq_summary.txt). This will work with demultiplexed FASTQs generated by Illumina machines whose read headers appear in the following format:

@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1

  • @HWI-EAS209_0006_FC706VJ – Machine name
  • 5 – lane
  • 58 – tile within flowcell lane
  • 5894 – x coordinate of cluster within tile
  • 21141 – y coordinate of cluster within tile
  • #ATCACG – index
  • /1 – member of pair (/1 or /2)

The script below will extract the machine name, lane, index, and pair.
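
As a rough idea of what that extraction involves, here is a minimal Python sketch. It is not the author's script: it assumes the fields listed above are laid out as @machine:lane:tile:x:y#index/pair, that every record spans exactly four lines, and the names HEADER_RE, parse_header and summarize are hypothetical.

```python
import re
from collections import Counter

# Assumed header layout: @machine:lane:tile:x:y#index/pair
HEADER_RE = re.compile(
    r"^@(?P<machine>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_header(line):
    """Return (machine, lane, index, pair) from one FASTQ header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized FASTQ header: {line!r}")
    return (
        match.group("machine"),
        int(match.group("lane")),
        match.group("index"),
        int(match.group("pair")),
    )


def summarize(lines):
    """Count reads per (machine, lane, index, pair).

    Assumes four lines per record, so headers sit at lines 0, 4, 8, ...
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:
            counts[parse_header(line)] += 1
    return counts
```

Running summarize over an open FASTQ file gives one row per machine, lane, index and pair combination. A file that yields more than one row is a hint that samples or lanes may have been mixed.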

Bash  Bioinformatics  Programming  Python  fastq gist python