Convert plink format to vcf

So, I recently got to work with the vcf file format for the first time. It has been quite a mess to start with, but with all the help I could gather around here (thank you very much, YK & HS!!!), I finally arrived at a solution.

You have probably heard of “plink”, “plink/seq”, and “vcftools”. Of the three programs, “plink” has been around the longest, perhaps because it is versatile, well documented, and quite easy to use. Although plink/seq was inspired by plink, its documentation is still under development (I am trying to be optimistic here).

For this short tutorial, I am going to use the PLINK binary (BED) format as the medium for file conversion, since in my experience it is the fastest format to manipulate and the most efficient in terms of disk space and memory.

“So, my question is: how do I convert my GWAS data to vcf format with a specific reference allele of my choosing, i.e. not the default ‘major allele’ that we typically use as the reference?”

So, I now have a PLINK binary file set called csOmni25 and a file containing the reference alleles, csOmni25.refAllele.

As a short note, csOmni25.refAllele looks like this:

-------- csOmni25.refAllele ----------------
rs12345 A
rs12958 G
rs10596 C
rs18607 T
---------end csOmni25.refAllele--------------

  1. First, we will convert the PLINK binary file so that A1 [reference allele] corresponds to the reference allele that we want.
    NOTE: when you create the reference allele file, make sure that all reference alleles are in UPPER CASE.
  2. Second, we will import the plink/bed file into plink/seq and write out a vcf file.
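
The upper-case requirement in the note above is easy to enforce before running plink. Here is a minimal sketch, assuming a two-column, whitespace-separated file (SNP ID, then allele) with single-nucleotide alleles only. The demo file names are made up for illustration and are not part of the original script.

```shell
# Demo input: a hypothetical reference allele file with mixed-case alleles.
printf 'rs12345 a\nrs12958 G\nrs10596 c\nrs18607 T\n' > demo.refAllele

# Upper-case the allele column (column 2); leave the SNP ID untouched.
awk '{ print $1, toupper($2) }' demo.refAllele > demo.upper.refAllele

# Count lines whose allele is not exactly one of A, C, G, T (should be 0).
bad=$(awk '$2 !~ /^[ACGT]$/' demo.upper.refAllele | wc -l | tr -d ' ')
echo "non-ACGT alleles: $bad"
```

If the count is not zero, the flagged lines are worth a look before going further, since indel codes or missing alleles would need separate handling.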

Here’s the script that I use. It’s that simple. And for your reference, it took me a week to figure this out and test it.







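## ------SCRIPT PARAMETER------ ##

# Values inferred from the file names used in this post; adjust to your data.
PLINKFILE=csOmni25
REF_ALLELE_FILE=csOmni25.refAllele
NEWPLINKFILE=csOmni25Ref
# The project name is a placeholder (an assumption); any name should work.
PLINKSEQ_PROJECT=csOmni25Ref

## ------END SCRIPT PARAMETER------ ##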

#1. convert plink/binary to have the specified reference allele

plink --noweb --bfile $PLINKFILE --reference-allele $REF_ALLELE_FILE --make-bed --out $NEWPLINKFILE

#2. create plink/seq project

pseq $PLINKSEQ_PROJECT new-project

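#3. load plink file into plink/seq
# (load-plink with --file and --id, as in the PLINK/Seq documentation; verify against your installed version)

pseq $PLINKSEQ_PROJECT load-plink --file $NEWPLINKFILE --id $NEWPLINKFILE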
#4. write out vcf file
# As of today (4/6/2012), using vcftools version 0.1.8: the documentation says
# that you can write out a compressed vcf file using the --format BGZF option,
# but vcftools doesn't recognize this option. So, I came up with my own
# workaround and pipe the output through gzip.

pseq $PLINKSEQ_PROJECT write-vcf | gzip > $NEWPLINKFILE.vcf.gz

At the end, this will create a compressed vcf file “csOmni25Ref.vcf.gz” with the specified reference alleles.
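
A quick sanity check on the result is to compare the REF column of the vcf against the reference allele file. The sketch below is only an illustration on made-up demo data (both file names and their contents are hypothetical). It assumes the vcf ID column holds the same SNP IDs as the reference allele file.

```shell
# Demo data (hypothetical): a tiny reference allele file and a gzipped vcf
# in which the second record deliberately has the wrong REF allele.
printf 'rs12345 A\nrs12958 G\n' > demo.refAllele
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t1000\trs12345\tA\tG\t.\t.\t.\n1\t2000\trs12958\tT\tG\t.\t.\t.\n' | gzip > demo.vcf.gz

# First file: remember the wanted allele for each SNP ID.
# Second input (stdin, "-"): skip header lines, then report every record
# whose REF (column 4) differs from the wanted allele for its ID (column 3).
gzip -dc demo.vcf.gz |
  awk 'NR == FNR { want[$1] = $2; next }
       /^#/ { next }
       ($3 in want) && $4 != want[$3] { print $3, "expected", want[$3], "found", $4 }' \
      demo.refAllele - > demo.mismatch.txt

cat demo.mismatch.txt
```

An empty mismatch file means every listed SNP ended up with the requested reference allele. To run the same check on real data, swap in the actual reference allele file and the vcf.gz written by the script.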


12 thoughts on “Convert plink format to vcf”

  1. Hi,
    I just wanted to ask how you created the REF_ALLELE_FILE, as you mentioned that the major allele is not necessarily the reference allele.
    Thank you

    1. I believe that there’s already a reference allele column in a vcf file. So, it is pretty straightforward to use a command like cut or awk to extract the column needed to generate the reference allele file.

      1. There is indeed a reference allele column in a vcf file, but you don’t have a vcf file yet to extract the reference allele column from.

  2. Hi Farad, you’re right. In general, the reference allele can be any allele that you want to use. When starting with bi-allelic data from PLINK, I generally use “allele 1” in the bim file. If you have an external population, which might not have the same minor allele, and you want to keep the reference allele consistent with it, you can use the reference allele file for this purpose.

    1. Thank you for your response, Hypotheses. So, you mean this “reference allele” file does not necessarily contain the reference alleles from REFSEQ, and I should choose A2 (the major allele) as the reference allele when working on a single dataset?

  3. Farad, the choice is arbitrary. When doing the analysis, we tend to think that the majority of the population with no disease have the major allele, so we use it as the reference for comparing the risk of having the minor allele versus the major allele. Hope that gives you some answer.


  4. The choice of reference allele is arbitrary, but when comparing datasets, it must be consistent. So there may be a good reason to make a reference allele file. It’s a little cumbersome, but if you have your 1-based positions and your reference sequence fasta, you can make a bed file from them (use awk) and then use the bedtools subprogram getfasta with the -tab option, and you have your reference alleles with coordinates. Normally I ditch the rs numbers and change each snp identifier into a chr:pos string (use awk again). Good luck!

  5. Sorry for extending the problem of preparing the reference allele file, but it is still confusing me. I have a dataset in plink format (with imputed snps). How do I create a ref.allele file for it (so that I can later convert it to vcf)? Is the ref.allele file prepared based on the 1000 Genomes Project phase 1 reference panel? Could you please help?

  6. I’d recommend that you take a look at plink version 1.9. It might have a tool to help you with the task that you want to do. These are mostly command line tools, so you just need to type the specific commands and give the required input for the programs to run.
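
The reference-fasta approach from comment 4 starts by turning 1-based positions into bed intervals. Here is a minimal sketch of that first step, assuming a standard six-column .bim file (chromosome, SNP ID, genetic distance, 1-based position, allele 1, allele 2). The demo file names and the reference fasta name are placeholders, and the bedtools call is shown only as a comment because it needs a real reference sequence.

```shell
# Demo input: a hypothetical two-SNP .bim file.
printf '1\trs12345\t0\t1000\tA\tG\n2\trs12958\t0\t2500\tG\tT\n' > demo.bim

# bed is 0-based and half-open: a SNP at 1-based position p spans [p-1, p).
# Column 4 carries a chr:pos identifier in place of the rs number.
awk 'BEGIN { OFS = "\t" } { print $1, $4 - 1, $4, $1 ":" $4 }' demo.bim > demo.bed

cat demo.bed

# With bedtools installed and a reference fasta whose sequence names match
# column 1, the reference bases could then be pulled with something like:
#   bedtools getfasta -fi reference.fa -bed demo.bed -tab
```

The chromosome names in the bim file and in the fasta have to agree (for example “1” versus “chr1”), otherwise getfasta will not find the sequences.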
