Guitar Printouts

August 17, 2016    Guitar

I put these guitar-related printouts (a chord diagram sheet and a fretboard diagram sheet) together years ago:


August 4, 2016    Alfred

Search Quiver from Alfred! Quiver-Alfred quickly constructs a database of your notes for fast and easy querying.



Type qset to set your quiver library location. Quiver-Alfred constructs a database of your notes to make querying as fast as possible. The database should refresh once every hour and should only take a few seconds to create.

Type q to use!

You can search tags by hitting q #.

Browse Notes within notebook:

Full Text Search using SQLite:

memoise - Caching in the cloud

July 27, 2016    R R Package  gist

Memoisation is a technique wherein the results of function calls are cached based on their inputs. For example, the following function calculates Fibonacci numbers in R.
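The code listing is missing from this copy of the post; the inefficient recursive definition described below is presumably close to this sketch:

```r
# Naive recursive Fibonacci: recomputes the same subproblems
# over and over, so it is slow for even moderate n.
fib <- function(n) {
  if (n < 2) return(n)
  fib(n - 1) + fib(n - 2)
}

fib(10)  # 55
```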

Note that this is a rather inefficient way of calculating Fibonacci numbers. However, it is a useful example for understanding memoisation. The following code uses Hadley Wickham's package memoise.
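The memoise listing is also missing from this copy; typical usage is a one-line wrap (a sketch, not the author's exact code):

```r
library(memoise)

fib <- function(n) {
  if (n < 2) return(n)
  fib(n - 1) + fib(n - 2)
}

# memoise() returns a wrapped function that caches results
# keyed on its arguments.
fib_mem <- memoise(fib)

fib_mem(25)  # computed and cached
fib_mem(25)  # returned from the cache immediately
```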

In the above example, the memoise() function generates a memoised function, which will automatically cache results. If the function is run again with the same parameters, it will return the cached result. Implementing memoisation can significantly speed up analysis when functions that take time to run are repeatedly called.

What if you are running similar analyses within a cluster environment? The ability to cache results in a centralized datastore could increase the speed of analysis across all machines. Alternatively, perhaps you work on different computers at work and at home. Forgetting to save/load intermediate files may require long-running functions to be run again. Further, managing and retaining intermediate files can be cumbersome and annoying. Again, caching the results of memoised functions in a central location (e.g. cloud-based storage) can speed up analytical pipelines across machines.

Recently I’ve put some work into developing additional caches for the memoise package, available here. This version can be used to cache items locally or remotely in a variety of environments. Supported backends include:

  • R environment (cache_local)
  • Google Datastore (cache_datastore)
  • Amazon S3 (cache_aws_s3)
  • File system (cache_filesystem; allows Dropbox or Google Drive to be used for caching)
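Since cache_filesystem stores entries as plain files, pointing it at a folder managed by a syncing service shares the cache across machines. A minimal sketch (slow_sum is a stand-in for an expensive function, and the cache path here uses a temp directory for illustration; in practice you would point it at a synced folder such as one inside Dropbox):

```r
library(memoise)

# A stand-in for an expensive computation.
slow_sum <- function(x) {
  Sys.sleep(1)
  sum(x)
}

# Cache entries are written as files under this folder. Using a
# synced directory instead shares them across machines.
fs <- cache_filesystem(file.path(tempdir(), "rcache"))
slow_sum_mem <- memoise(slow_sum, cache = fs)

slow_sum_mem(1:10)  # slow the first time
slow_sum_mem(1:10)  # read back from the file cache
```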

There are a few caveats to consider when using this version of memoise. The external cache options take additional time to retrieve cached items. That tradeoff is worthwhile in cluster environments, where syncing files across instances/nodes can be difficult. When working across home and work machines, however, locally synced files are preferable.




Google Datastore

Amazon S3
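The code listings for these two backends are missing from this copy. Based on the constructor names listed above, usage presumably follows the same pattern as the other caches; the project and bucket names below are placeholders, and the exact constructor arguments are assumptions rather than verified signatures:

```r
library(memoise)

expensive <- function(n) {
  Sys.sleep(5)
  n * 2
}

# Placeholder project/bucket names; the single-argument constructor
# calls are assumptions based on the backend names above.
ds <- cache_datastore("my-gcp-project")
s3 <- cache_aws_s3("my-cache-bucket")

expensive_ds <- memoise(expensive, cache = ds)
expensive_s3 <- memoise(expensive, cache = s3)
```

Both wrapped functions cache results remotely, so any machine with the appropriate credentials can reuse them.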

Automatically construct / infer / sense bigquery schema

December 30, 2015    Programming  bigquery

BigQuery is a phenomenal tool for analyzing large datasets. It enables you to upload large datasets and perform sophisticated SQL queries on millions of rows in seconds. Moreover, it can be integrated with R using bigrquery, which lets you interact with BigQuery using some of the functions in dplyr.
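As a sketch of the R side, a query can be issued with bigrquery's query_exec (the project id is a placeholder; you would need your own GCP billing project and credentials):

```r
library(bigrquery)

# Legacy SQL against a public sample table; "my-project" is a
# placeholder for your own billing project id.
sql <- "SELECT year, COUNT(*) AS births
        FROM [publicdata:samples.natality]
        GROUP BY year
        ORDER BY year"

births <- query_exec(sql, project = "my-project")
head(births)
```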

It is easy to upload datasets to BigQuery, although you are required to specify a schema. If a dataset has many columns, writing the schema by hand can be a pain, so I wrote a script to automate the process. The script determines the variable types from the first 500 rows of a tab-delimited dataset. To get started, download the Python script below.


Save the gist as a script and run it as follows:

python <file>

The script supports plain text and gzipped files (both of which BigQuery can load).

Output Example


Note that the RECORD and TIMESTAMP field types are not supported.

Parallelize bcftools functions

November 21, 2015    Genetics Programming

bcftools is a great tool for working with variant call files. In general, it is very fast. However, I have found that merging VCF files (using bcftools merge) and performing concordance checks (using bcftools gtcheck) can be a bit slow. That is why I wrote two functions that take advantage of GNU Parallel to parallelize them.


The function vcf_chromosomes extracts chromosome names from a VCF file using bcftools. Parallelization occurs across chromosomes.


parallel_bcftools_merge is run very similarly to bcftools merge. The only difference is that you have to pipe its output into bcftools view to write the desired output format. For example:

parallel_bcftools_merge -m all `ls *list_of_bcffiles` | bcftools view -O z > merged_vcf.vcf.gz

The parallel_bcftools_merge function will generate a temporary vcf for every chromosome. You can use all flags except for -O with this function.


parallel_bcftools_gtcheck should not be used with --all-sites or --plot. I recommend using this function with -H and -G 1 to calculate the absolute number of differences in terms of homozygous calls between samples. This function also requires datamash (on OS X, install it with brew install datamash).

The output file is slightly different from what bcftools normally outputs. In general, I use this function specifically to calculate concordance between individual fastq runs, like this:

parallel_bcftools_gtcheck -H -G 1 union_samples.vcf.gz > concordance.tsv

This parallelized version generates concordances for each chromosome and then merges the results together using datamash. Output looks like this:

sample_i sample_j discordance number_of_sites concordance
BGI2-RET1-ED3049 BGI1-RET1-ED3049 927 2344043 0.999605
BGI1-RET1-CB4856 BGI1-RET1-CB4852 144484 2171694 0.933469
BGI1-RET1-CX11315 BGI1-RET1-CB4852 106964 2721950 0.960703
BGI1-RET1-CX11315 BGI1-RET1-CB4856 137200 2059983 0.933398
BGI1-RET1-DL238 BGI1-RET1-CB4852 148217 2097343 0.929331
BGI1-RET1-DL238 BGI1-RET1-CB4856 124132 1803664 0.931178
BGI1-RET1-DL238 BGI1-RET1-CX11315 146580 1996802 0.926593