So, I recently got to work with vcf file format for the first time. It is been quite a mess to start with, but after all the helps I can gather around here (Thank you very much, YK & HS !!!) I finally came down to a solution.
You probably have heard about “plink”, “plink/seq”, and “vcftools”. Among the three programs, “plink” has been around the longest, and one of the reason may be because it is versatile, has good documentation, and quite easy to use. Although plink/seq was inspired by plink, its documentation is still during development (I tried to be optimistic here).
For this short tutorial, I am going to use PLINK/BED file as a medium for file conversion, since it is the fastest file to manipulate and the most efficient to use in terms of disc space and memory requirement (from my experience).
“So, my question is how do I convert my GWAS data to vcf format, with a specific reference allele that I want to use i.e. not the default ‘major allele’ as a reference as we typically do”
So, I now have PLINK/Binary format file call csOmni25 and a file containing reference allele csOmni25.refAllele
As a short note: csOmni25.refAllele looks like this
-------- csOmni25.refAllele ---------------- rs12345 A rs12958 G rs10596 C rs18607 T ... ... ---------end csOmni25.refAllele--------------
- First, we will convert PLINK/Binary format file so that A1 [reference allele] correspond to the reference allele that we want
NOTE: when you create the reference allele file, make sure that all reference alleles are in UPPER CASE.
- Second, we will import plink/bed to plink/seq and write out vcf format file
Here’s the script that I use. It’s that simple. And for your reference, it took me a week to figure this out, and tested it.
#!/bin/sh ##-- SCRIPT PARAMETER TO MODIFY--## PLINKFILE=csOmni25 REF_ALLELE_FILE=csOmni25.refAllele NEWPLINKFILE=csOmni25Ref PLINKSEQ_PROJECT=csGWAS ## ------END SCRIPT PARAMETER------ ## #1. convert plink/binary to have the specify reference allele plink --noweb --bfile $PLINKFILE --reference-allele $REF_ALLELE_FILE --make-bed --out $NEWPLINKFILE #2. create plink/seq project pseq $PLINKSEQ_PROJECT new-project #3. load plink file into plink/seq pseq $PLINKSEQ_PROJECT load-plink --file $NEWPLINKFILE --id $NEWPLINKFILE #4. write out vcf file, as of today 4/6/2012 using vcftools version 0.1.8, although the documentation says that you can write out a compressed vcf format using --format BGZF option, vcftools doesn't recognize what this option is. So, I invented my own solution pseq $PLINKSEQ_PROJECT write-vcf | gzip > $NEWPLINKFILE.vcf.gz
At the end, this will create a compressed vcf file “csOmni25Ref.vcf.gz” with the specified reference alleles.