If you need to work with data spread across a bunch of worksheets within an excel workbook, but you don’t want to do so in Microsoft Excel – here is a python script for extracting each individual workbook as a csv and exporting them all to a folder.
Previously, I’ve shown that you can use a mysql database browser (e.g. Sequel Pro) to access and browse the UCSC Genome Browser MySQL database.
If you have a small dataset that you would like to annotate, you can write SQL statements to fetch data. Below I show how you can use python to fetch genome coordinates by specifying gene and genome build.
Note: The UCSC browser mysql resource will throttle you if you make too many queries. If you need to annotate large datasets, all of the data is freely available for download here.
Understanding how data is formatted, and can be used.
If you are on a mac – Sequel Pro is a fantastic tool for browsing.
To browse the UCSC genome browser database, download Sequal Pro and enter in the following connection information into the login screen:
You should be presented with a screen that looks like this:
Databases are represented as Connection/Resource > Database > Tables > Rows. Sequal pro lets you connect to the UCSC Browser MySQL server. Each database represents a different genome. Remember that genomes change and improve over time – so there are multiple builds of each genome (e.g. hg18 and hg19) represented by separate databases.
Within each database are tables. These are the same tables referred to on the UCSC website – and you can find out more about what the data represents on the browser website by clicking ‘describe table schema’ for the table of interest as shown below.
Unfortunately – one notable database biopython has trouble working with is the SNP database. This is due to the Bio.Entrez parser being unable to handle the XML returned from this database. One solution is to use a built in Python XML parser, but I thought I’d try to come up with an easier solution.
To solve this problem – I wrote a function for retrieving SNP data, and parsing it into an array. Feel free to build on this, and use it as you wish. Suggestions welcome!