A week ago I received my results from 23andme.com. Aside from the obvious points of interest (health risks, heritage, Neanderthal composition, and so on), I was also interested in getting my own data in raw format. While 23andme does provide a way to download your “raw” data, it is not truly raw: you cannot access the image data from the genotyping microarray that they used. What they do provide is formatted as follows:

# rsid  chromosome  position    genotype
rs4477212   1   82154   TT
rs3094315   1   752566  TC
rs3131972   1   752721  AA
rs12124819  1   776546  AC
rs11240777  1   798959  GA
rs6681049   1   800007  CC

Rows that begin with a ‘#’ are header rows, and there may be any number of them. 23andme puts some metadata here, such as which reference build the coordinates are based on. This is an interesting topic, as the build being used has just recently changed from hg18 to hg19. If you downloaded your raw data before August 9, 2012, you have hg18 coordinates; if you downloaded it after that date, you have hg19. However, someone forgot to update the header to reflect this, so it still reads “build36” (the NCBI name for hg18) even when the coordinates are hg19.

The rsid column holds the reference SNP identifier (rsID) from dbSNP, which is unique to each SNP. These identifiers were more useful before the completion of the Human Genome Project, when there was no coordinate system capable of resolving the locations of these various SNPs. Now it is possible to address them like you might address a house, with the state or city being analogous to the chromosome and the street address being analogous to the “position”. The position is the number of bases from the beginning of the chromosome at which the SNP is located.

The final column is the genotype at the listed address. There are two bases listed because humans have two copies of each chromosome.
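
To make the layout concrete, here is a minimal Python sketch of a reader for this format. It is only an illustration, not anything 23andme ships; the names Genotype and parse_raw are my own, and I keep the chromosome as a string on the assumption that non-numeric chromosomes such as X, Y and MT also appear in the file.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Genotype(NamedTuple):
    """One data row of the 23andme raw file (hypothetical helper type)."""
    rsid: str
    chromosome: str  # kept as text: assumed to include X, Y, MT
    position: int
    genotype: str


def parse_raw(handle: TextIO) -> Iterator[Genotype]:
    """Yield one Genotype per data row, skipping '#' header rows."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # header or blank line
        # Columns are whitespace separated: rsid, chromosome, position, genotype
        rsid, chromosome, position, genotype = line.split()
        yield Genotype(rsid, chromosome, int(position), genotype)


# Two rows from the sample above, fed in from memory instead of a file.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs4477212\t1\t82154\tTT\n"
    "rs3094315\t1\t752566\tTC\n"
)
records = list(parse_raw(sample))
for record in records:
    print(record.rsid, record.chromosome, record.position, record.genotype)
```

For a real download you would pass an open file handle for the unzipped text file instead of the in-memory sample.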

VCF Format

So this leaves us with a list of addresses. This is well and good, but many bioinformatics applications use a different, though not all that different, format called the “Variant Call Format” (VCF). One such application is snpEff, the SNP effect predictor, a tool for predicting the biological effects of mutations (bases that differ from the reference bases).
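
The main difference is that a VCF row describes a genotype relative to the reference base, which the 23andme file does not contain. The Python sketch below shows roughly how one row could be re-expressed, assuming the reference base has already been looked up elsewhere. The function name vcf_line and the reference bases in the example are made up for illustration, and this is not necessarily how any particular converter does it. Genotypes containing anything other than A, C, G or T (no-calls, for example) are simply rejected.

```python
# Minimal VCF header: a file-format line plus the column names for one sample.
VCF_HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
)


def vcf_line(rsid: str, chromosome: str, position: int,
             genotype: str, ref: str) -> str:
    """Build one tab-separated VCF data line (hypothetical helper).

    `ref` is the reference base at this position. It is NOT in the
    23andme file and must come from the reference genome.
    """
    if any(base not in "ACGT" for base in genotype):
        raise ValueError(f"unsupported genotype {genotype!r} at {rsid}")

    # Collect the distinct non-reference bases, in order of appearance.
    alts = []
    for base in genotype:
        if base != ref and base not in alts:
            alts.append(base)

    # GT uses 0 for the reference base and 1, 2, ... for the ALT entries.
    indexes = sorted(0 if b == ref else alts.index(b) + 1 for b in genotype)
    gt = "/".join(str(i) for i in indexes)
    alt = ",".join(alts) if alts else "."

    # QUAL, FILTER and INFO are unknown here, so they are left as '.'.
    return "\t".join(
        [chromosome, str(position), rsid, ref, alt, ".", ".", ".", "GT", gt]
    )


# The reference base "T" is invented for this example.
print(VCF_HEADER + vcf_line("rs3094315", "1", 752566, "TC", ref="T"))
```

A genotype that matches the reference on both copies comes out with ALT set to "." and a GT of 0/0, and a site where both bases differ from the reference gets two ALT entries and a GT of 1/2.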

To facilitate the use of the various and sundry tools that read VCF, I have made a tool for converting the 23andme raw format to VCF: the 23andme2vcf converter. To use it, follow these steps:

git clone https://github.com/arrogantrobot/23andme2vcf.git
cd 23andme2vcf
perl 23andme2vcf.pl /path/to/23andme/raw/data.zip /desired/path/to/output.vcf

If you do not use git, you may download the tarball from GitHub, unpack it, and run the third of the commands above.