Use Google Docs to identify gene-disease associations in Pubmed

March 3, 2014   

Google Docs spreadsheets can import XML with the importXML function. Using NCBI's esearch service, you can query PubMed for the number of articles matching a keyword. Put the following formula in A2, and a keyword in B2:

=importXML("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=" & ENCODEURL(B2), "(//Count)[1]")

This becomes more valuable when you have a gene list: you can query PubMed for each gene combined with a second keyword, such as a disease.

For example, suppose you are studying cleft lip and palate and are left with a set of genes identified from a gene expression analysis. Now you want to see whether any of those genes have published findings related to cleft lip and palate.

You can use the & operator to concatenate two keywords (gene & " " & disease). With the gene in C2 and the disease in D2, put the following in B2:

= C2 & " " & D2

The result would look something like this.

PubMed result counts are in column A.
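The same lookup can be scripted outside a spreadsheet. Below is a minimal Python sketch, not part of the original post, that builds the esearch URL for a gene and disease pair and reads the first Count element, mirroring the (//Count)[1] XPath. The function names are hypothetical, and it uses only the standard library.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(gene, disease):
    """Build the esearch URL for a 'gene disease' query (like C2 & " " & D2)."""
    term = f"{gene} {disease}"
    return ESEARCH + "?" + urllib.parse.urlencode({"db": "pubmed", "term": term})


def parse_count(xml_text):
    """Return the first <Count> in document order, as the (//Count)[1] XPath does."""
    root = ET.fromstring(xml_text)
    return int(root.find(".//Count").text)


def pubmed_count(gene, disease):
    """Fetch the hit count for one gene/disease pair (needs network access)."""
    with urllib.request.urlopen(esearch_url(gene, disease)) as resp:
        return parse_count(resp.read().decode("utf-8"))
```

Calling pubmed_count("IRF6", "cleft lip and palate") would return the number of matching articles; the gene name here is only an illustration. When looping over a long gene list, it is probably wise to pause between requests so NCBI does not throttle you.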

Bioinformatics  Tips  google_doc pubmed


An R Function for Opening a dataframe in Excel (Mac Only)

February 18, 2014   

Coming from Stata, I have found the data frame viewer in RStudio to be inadequate. I am just looking for simple sorting, rearranging, and filtering, so I wrote this simple function for opening a data frame in Excel. The file is saved to a temporary folder with a random set of letters in its name (to avoid annoying dialogs in Excel warning you about opening a file again).

This may be worth sticking in your .Rprofile so it is always available.

It would be great to see a cross-platform solution, possibly browser-based, that allows simple rearranging, sorting, and filtering of columns on the fly. If anyone is interested in collaborating on such a solution or knows of one, please let me know.

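excel <- function(df) {
  # Name the file after the expression passed in, plus 5 random letters so each call gets a new file
  suffix <- paste0(sample(letters, 5), collapse = "")
  name <- make.names(deparse(substitute(df))[1])
  f <- file.path(tempdir(), paste0(name, ".", suffix, ".csv"))
  write.csv(df, f)
  # Quote the path so spaces or special characters do not break the shell command
  system(sprintf("open -a 'Microsoft Excel' %s", shQuote(f)))
}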

To use, just type:

excel(df)

…and Microsoft Excel will open with the dataframe (or filtered dataframe).

Programming  R  Tips  excel R


Alfred Workflow for Creating a Data Analysis Project

January 25, 2014   

I got this idea from my brother: keep any data analysis or bioinformatics projects I work on organized by sticking to a standard template. I wrote an Alfred Workflow for generating the template. There are a couple of key features:

Directory Structure


  • Markdown (md) extension – used for the readme because it's simple and so that the directory is ready for GitHub if desired.
  • Data folder – this directory is used for storing raw data and the scripts that clean and prepare data for analysis.
  • analysis – this directory contains the scripts for producing statistics and visualizing data.
  • report – any publications or presentations that come out of the project can be stored in the report folder.
  • run.sh – a two-line script that runs prepare_data.sh and analysis.sh. This allows you to reproduce the entirety of your work all at once and verify your results.

What are your thoughts? How could this be improved?

Usage

Navigate to the directory where you would like to create the project template, open Alfred, and type:

project [a name for your project]
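For readers without Alfred, a similar layout could be scaffolded with a few lines of Python. This is only a sketch of the idea, not the workflow itself: the folder names (data, analysis, report), the README.md contents, and the locations of prepare_data.sh and analysis.sh are assumptions.

```python
from pathlib import Path


def create_project(name, parent="."):
    """Create a project skeleton; every folder and file name here is an assumed layout."""
    root = Path(parent) / name
    for sub in ("data", "analysis", "report"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Markdown readme so the directory is ready for GitHub if desired
    (root / "README.md").write_text(f"# {name}\n")
    # Two-line runner; the script paths are assumptions about where each script lives
    (root / "run.sh").write_text("sh data/prepare_data.sh\nsh analysis/analysis.sh\n")
    return root
```

Running create_project("my_project") in the current directory creates the skeleton, after which sh run.sh would rerun the whole analysis once the two scripts exist.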


Download

Alfred  Bioinformatics  Tips  Utilities 


Downloading and storing bioinformatic databases locally

January 20, 2014   

If you need to annotate biological data, there are plenty of resources online (UCSC Genome Browser, BioMart), and plenty of programmatic tools to interact with these databases as well. But if you are going to be annotating a large dataset (like ChIP-Seq or RNA-Seq data), you will probably not want to rely on web-based services because (a) it is inefficient and (b) you may get throttled or banned.

If you use Python, it's easy to download and store data in an SQLite database. This allows you to query the database using SQL and quickly and efficiently annotate large datasets.
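As a rough illustration of the approach, here is a minimal sketch that uses Python's built-in sqlite3 module rather than sqlalchemy. The table layout, column names, and sample rows are made up for the example and are not the HapMap file format.

```python
import csv
import io
import sqlite3

# Hypothetical tab-delimited extract; these columns are assumptions, not the HapMap layout.
SAMPLE = (
    "rsid\tpopulation\tallele\tfreq\n"
    "rs0001\tCEU\tA\t0.25\n"
    "rs0001\tYRI\tA\t0.60\n"
    "rs0002\tCEU\tG\t0.10\n"
)


def build_db(handle, conn):
    """Load tab-delimited rows into a SQLite table and index the lookup column."""
    conn.execute(
        "CREATE TABLE allele_freq (rsid TEXT, population TEXT, allele TEXT, freq REAL)"
    )
    reader = csv.DictReader(handle, delimiter="\t")
    conn.executemany(
        "INSERT INTO allele_freq VALUES (:rsid, :population, :allele, :freq)", reader
    )
    # The index is what keeps lookups fast once the table has millions of rows
    conn.execute("CREATE INDEX idx_rsid ON allele_freq (rsid)")
    conn.commit()


def lookup(conn, rsid):
    """Return (population, allele, freq) rows for one SNP identifier."""
    cur = conn.execute(
        "SELECT population, allele, freq FROM allele_freq WHERE rsid = ? ORDER BY population",
        (rsid,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # use a file path instead to keep the database on disk
build_db(io.StringIO(SAMPLE), conn)
rows = lookup(conn, "rs0001")
```

For a real download you would pass an open file handle instead of the in-memory sample and connect to a file on disk so the database persists between sessions.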

That is what I have done for HapMap allele frequency data (2010-08_phaseII+III). It allows me to retrieve allele frequency data from 26,278,275 rows across 11 populations instantly. The database itself is 3.22 GB. A zipped version (~1 GB) is available here.


You will need sqlalchemy for this script to work. Install using pip install sqlalchemy.

Bioinformatics  Programming  gist python


Use Google to Find Lecture Notes

November 10, 2013   

This may seem obvious, but I've discovered a wonderful trick. If you ever need to review a science topic quickly or are trying to learn something new, try searching Google like this:

(topic) + Lecture filetype:pdf

You’ll find that tons of professors post their lecture notes online. Also try using filetype:ppt or leave filetype off (as some professors host websites with lecture notes).

Google results found when searching ‘Microarray + lecture filetype:pdf’
Tips  Science Tips