As the doctor gone rogue

April 22, 2011

Split Whole-Genome PLINK Binary Files to Small Chunks

Filed under: bash, data management, genetics, plink, Uncategorized — hypotheses @ 1:03 pm

I prefer to store my GWAS data in PLINK binary format. The PLINK binary file format gives us many advantages. With tons of data, binary file format is a very efficient way to save disk space. To give you some number, you can store the data of 2000 people from Illumina Human 1M chip with ~1 million marker in 1 CD (~700Mb). I prefer to keep all the genetic markers in the same file. This is convenient when you want to extract a list of SNPs from several chromosome. You can just go to this one file and extract them out.
However, keeping everything in one single file is not always a good solution to all the problems. For most of the genome-wide association analysis, we can speed up the analysis using parallelization. The simplest way to do this is to split your data into several smaller chunks and submit each part for analysis on several computers (preferably computing cluster). Although this is so-called "poor man" parallelization, it gets the job done in no time. 
For this purpose, from a single PLINK binary file format, I wrote a script to split this file by chromosome (in by_chr directory), and for each chromosome into a smaller chunk of your choice (I used 5000 for the example below).

#!/bin/bash
## generate PLINK script to split whole genome PLINK BED TO smaller chunks
set -e
#set -o verbose
split_by_chromosome=0
prefix=$1
chunk=$2
postfix=$3
if [ ! `type -p plink` ];then
echo "Cannot find PLINK. Make sure PLINK is in your \$PATH";
exit 1
fi
if [ "${prefix}" == "" ]; then
echo "Usage: ./splitPlink.sh [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh splitPlink.sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"

echo "Exit no input file"
exit 1
fi
if [ "${chunk}" == "" ]; then
echo "Usage: ./splitPlink.sh [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh splitPlink.sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"
echo "Exit no chunk size"
exit 1
fi
if [ "${postfix}" == "" ]; then
echo "Usage: ./splitPlink.sh [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh splitPlink.sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"
echo "Need prefix of output"
exit 1
fi
echo "File ${prefix} by ${chunk} SNPs chunk"
echo "Spliting PLINK ${prefix} into ${chunk} file size"
if [ ! -e ${prefix}.bim ];then
echo "File ${prefix}.bim not found"
exit 1
fi
chrs=`awk '{print $1}' ${prefix}.bim | sort -nur`
echo "Chromosome in Map File: ${chrs}" | tr "\n" " "
echo ""

## 0. Create script directory if not exist
if [ ! -d by_chr ]; then mkdir -p by_chr; fi
if [ ! -d script ]; then mkdir -p script; fi

# 1. Split by chromosome first
for chr in ${chrs}; do
cat < script/plinkSplit.chr${chr}.sh
#!/bin/sh
# extracting chromosome $chr from ${prefix}
plink --noweb --bfile ${prefix} --chr ${chr} --make-bed --out by_chr/${postfix}_chr${chr}
EOF
if [ $split_by_chromosome -eq 1 ]; then
echo "Processing map of chromosome $chr"
grep -w "^$chr" ${prefix}.bim > by_chr/${postfix}_chr${chr}.bim
fi
sh script/plinkSplit.chr${chr}.sh
done

## 2. Split chromosome to a chunk of ${chunk} SNPs
for chr in $chrs;
do
if [ ! -d c${chr} ]; then mkdir -p c${chr}; fi
mapfile=by_chr/${postfix}_chr${chr}.bim
## wait till finish extracting bim by chromosome to submit the following jobs
## default wait_duration = 15 minutes (change value below for longer wait time
wait_duration=5 ## Wait for 1 hour before stopping
count=1
while [ "$count" -le "${wait_duration}" ];
do
if [ ! -e by_chr/${postfix}_chr${chr}.bed ]; then
echo -n -e "\r                                                   \r$count. Waiting for file by_chr/${postfix}_chr${chr}.bed"
let "count += 1"
sleep 60
else
let "count = wait_duration + 1"
n_end=`wc -l by_chr/${postfix}_chr${chr}.bim | awk '{print $1}'`
echo "Chromosome $chr has $n_end SNPs"
snp_start=1
let "snp_end = snp_start + ${chunk} -1"
if [ $snp_end -gt ${n_end} ];then snp_end=${n_end};fi
pc=1
echo "Extracting chromosome $chr from $snp_start to ${snp_end}"
while [ "$snp_end" -le "${n_end}" ];
do
first_snp=`tail -n +${snp_start} ${mapfile} | head -n 1| awk '{print $2}'`
last_snp=`tail -n +${snp_end} ${mapfile} | head -n 1 | awk '{print $2}'`
echo "First SNP $first_snp Last SNP $last_snp"
cat <<EOF > script/plinkSplit.chr${chr}.pc${pc}.sh
#!/bin/sh
echo "Extracting chromosome $chr from $snp_start to ${snp_end}"
plink --noweb --bfile by_chr/${postfix}_chr${chr} \
--from $first_snp --to $last_snp --make-bed \
--out c${chr}/${postfix}_chr${chr}_p${pc}
exit 0
EOF
sh script/plinkSplit.chr${chr}.pc${pc}.sh
let "pc += 1"
let "snp_start += ${chunk}"
let "snp_end += ${chunk}-1"
if [ $snp_end -gt ${n_end} ];then let "snp_end= n_end + 1" ;fi
done ## while loop all snps by chromosome
fi ## if checking whether chromosome has been extracted or not
done ## while loop waiting for split by chromosome
## Terminate if PLINK has not finished extracting file
if [ ! -e by_chr/${postfix}_chr${chr}.bed ]; then
echo "Timeover: PLINK failed to \extract by_chr/${postfix}_chr${chr}.bed"
exit 1
fi
done ## for loop all chromosomes
exit 0

So, this script will first split your file into a plink binary format by chromosome (in by_chr). Then, it will split each chromosome into a smaller file. The files by_chr may not be very useful for parallelization, but believe me, there're times that you will get to use them.

Blog at WordPress.com.