Parallelize bcftools functions

November 21, 2015   

bcftools is a great for working with variant call files. In general, it is very fast. However, I have found that the process of merging VCF files (using bcftools merge) and performing concordance checking (using bcftools gtcheck) can be a little bit slow. That is why I wrote two functions that take advantage of GNU Parallel to parallelize them.

Usage

The function vcf_chromosomes extracts chromosomes names from a VCF file using bcftools. Parallelization occurs across chromosomes.

parallel_bcftools_merge

parallel_bcftools_merge is run very similar to bcftools merge. The only difference is that you have to pipe it into bcftools to change it to the appropriate output. For example:

parallel_bcftools_merge -m all `ls *list_of_bcffiles` | bcftools view -O z > merged_vcf.vcf.gz

The parallel_bcftools_merge function will generate a temporary vcf for every chromosome. You can use all flags except for -O with this function.

parallel_bcftools_gtcheck

parallel_bcftools_gtcheck should not be used with --all-sites, or --plot. I recommend using this function with -H and -G 1 to calculate the absolute number of differences in terms of homozygous calls between samples. Also, this function requires datamash (on OSX, install with brew install datamash)

The output file is slightly different than what bcftools normally outputs. In general, I use this function specifically to calculate conocordance between individual fastq runs - like this:

parallel_bcftools_gtchek -H -G 1 union_samples.vcf.gz > concordance.tsv

This parallelized version generates concordances for each chromosome and then merges the results together using datamash. Output looks like this:

sample_i sample_j discordance number_of_sites concordance
BGI2-RET1-ED3049 BGI1-RET1-ED3049 927 2344043 0.999605
BGI1-RET1-CB4856 BGI1-RET1-CB4852 144484 2171694 0.933469
BGI1-RET1-CX11315 BGI1-RET1-CB4852 106964 2721950 0.960703
BGI1-RET1-CX11315 BGI1-RET1-CB4856 137200 2059983 0.933398
BGI1-RET1-DL238 BGI1-RET1-CB4852 148217 2097343 0.929331
BGI1-RET1-DL238 BGI1-RET1-CB4856 124132 1803664 0.931178
BGI1-RET1-DL238 BGI1-RET1-CX11315 146580 1996802 0.926593
Genetics  Programming 


An Alfred workflow for generating markdown tables from your clipboard

October 30, 2015   

Generate markdown tables from clipboard content.

Download

Usage

Copy a csv or tsv. The script will attempt to intelligently guess the format. For example, if you copy the table below:

carat   cut color   clarity depth   table   price   x   y   z
0.23    Ideal   E   SI2 61.5    55  326 3.95    3.98    2.43
0.21    Premium E   SI1 59.8    61  326 3.89    3.84    2.31
0.23    Good    E   VS1 56.9    65  327 4.05    4.07    2.31
0.29    Premium I   VS2 62.4    58  334 4.2 4.23    2.63
0.31    Good    J   SI2 63.3    58  335 4.34    4.35    2.75

Then type tbl in alfred and you will see the following:

tbl screen

You can create a table with or without a header. It will be copied to your clipboard as this:

|   carat | cut     | color   | clarity   | depth   |   table |   price |      x |    y |    z |
|--------:|:--------|:--------|:----------|:--------|--------:|--------:|-------:|-----:|-----:|
|    0.23 | Ideal   | E       | SI2       | 61.5    |    55   |     326 |   3.95 | 3.98 | 2.43 |
|    0.21 | Premium | E       | SI1       | 59.8    |    61   |     326 |   3.89 | 3.84 | 2.31 |
|    0.23 | Good    | E       | VS1       | 56.9    |    65   |     327 |   4.05 | 4.07 | 2.31 |
|    0.29 | Premium | I       | VS2       | 62.4    |    58   |     334 |   4.2  | 4.23 | 2.63 |
|    0.31 | Good    | J       | SI2       | 63.3    |    58   |     335 |   4.34 | 4.35 | 2.75 |

And it will render like this:

carat cut color clarity depth table price x y z
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
Alfred  Utilities 


Fetch Citations in Google Sheets using pubmed() function

October 29, 2015   

If you need to fetch pubmed citations in aggregate it can be convenient to do so using pubmed identifiers. I’ve created a pubmed() function that can be added to a google sheet and used to fetch formatted html citations from pubmed. For example, entering the following into a cell:

=pubmed(23149456)

Will return an html-formatted citation:

<p><strong>The heritability of metabolic profiles in newborn twins.</strong><br />Alul FY, Cook DE, Shchelochkov OA, Fleener LG, Berberich SL, Murray JC, Ryckman KK, <br />(2013 Mar) <em>Heredity</em> 110 (3) 253-8</p>

This citations formats nicely as:

The heritability of metabolic profiles in newborn twins.
Alul FY, Cook DE, Shchelochkov OA, Fleener LG, Berberich SL, Murray JC, Ryckman KK,
(2013 Mar) Heredity 110 (3) 253-8

Setup

To implement the function, you’ll need to copy and paste the function below into the script editor and save it as a new project. Then it will become available within your google sheet. The script editor is available through the Tools > Script Editor

Programming  Google Sheets


An Alfred Workflow for wormbase

July 9, 2015   

I have created an Alfred workflow for looking up gene information in wormbase. You can search by wormbase ID (e.g. WBGene00006759) You can use it to search for genes. Returned results will include:

  • Gene identifiers
  • Location
  • Caenorhabditis orthologs
  • Publications

Download the latest version

Usage

Search for Genes
search

Get Gene Information
Get Gene Info

Alfred  Utilities  Wormbase


An Alfred Workflow for Codebox

June 25, 2015   

Codebox is a great program for storing and accessing snippets. It offers a quickbar menu item, but I thought Alfred might offer more functionality. So I wrote a workflow for it.

Download Codebox-Alfred workflow

Important!

The workflow works fairly well, but there are a few caveats. You should not do the following with your codebox libraries:

  • Don’t put spaces into tag, list, or folder names. Use an underscore instead.
  • Don’t nest folders/lists with the same name.

Usage

Set the codebox source using cb_src

set source

Invoke the workflow by typing ff

search directory

Browse tags with ff #<search>
search tags

Search all Snippets: ff <search>

search all

Alfred  Programming