As the doctor gone rogue

February 20, 2014

A note on getting start with samtools on Mac OsX 10.8.5 Mountain Lion

Filed under: NGS — Tags: , , , , — hypotheses @ 9:50 pm

samtools is a handy tool for sequence alignment and mapping (

For more information, please refer to the original article here:

  • Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

To get it to work on mac, I’ve decided to compile it.

First, get the sourcecode from

Then, decompress the files in ~/bin

Compile it as usual

make install

You might need to have administrative right to write to system folder as well.

Next, we'll need to add a path to ~/.bash_profile

Include this in ~/.bash_profile to point to SAMTOOLS_HOME

export SAMTOOLS_HOME=~/bin/samtools-0.1.19
export PATH=$SAMTOOLS_HOME/bcftools/:$PATH

Now, samtools should be ready for you to use.

July 26, 2013

How to wget with proxy authentication?

Filed under: bash, NGS — Tags: , , , — hypotheses @ 2:25 am

Once again, I have a problem with proxy server authentication through my university network. Trying to install the new KGGSeq software to do next-generation sequencing data analysis.

As a quick fix, with cygwin, here is what I did.

1. Need to tell bash that  that we are using a proxy server

## Add these to ~/.bashrc for my bash start up shell

## Add these to ~/.bashrc for my bash start up shell

export http_proxy=$proxy

2. Need to tell wget what username and password to use with the proxy server.

As an example to download KGGSeq through cygwin, here’s what I did.

wget --proxy-user "bhoom" --proxy-password "bhoom_password"

Wget – ArchWiki.

I’m still not quite sure why they still use it. There seems to be several other enterprise authentication system, but all other systems are probably pricy? But does price justify all the other troubles we all have with slow connection for every website, problems running many bioinformatics software that cannot connect through proxy-server, etc?

June 16, 2013

Ion Torrent PGM vs PacBio vs MiSeq

Filed under: genetics, NGS — hypotheses @ 1:05 am

“It’s a lot cheaper to buy PGM compared to other sequencing platform. So, should we buy it?”

A common concern regarding this question is whether the sequencing quality is alright? This is one of the first concern Ion Torrent seems to have experience since they first launched their first sequencer.

Quail, et al took a look at three platform in their paper. Although they have only looked at microbial genome with variable GC/AT content. They showed that there still seems to be a problem with Ion Torrent PGM platform when they sequence Plasmodium genome. Moreover, the false positive rate of base calling from Ion Torrent platform is still higher.

This data may not be applicable to human genome sequencing, but it deserves a closer look in my opinion.

BMC Genomics | Full text | A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers.

July 19, 2012

Modifier Genes to Protect From Alzheimer’s Disease, Serious Infection in Cystic Fibrosis: Topol on Genomics

Filed under: genetics — hypotheses @ 4:42 pm

Modifier Genes to Protect From Alzheimer’s Disease, Serious Infection in Cystic Fibrosis: Topol on Genomics.

Dr. Topol talked about APP gene and its preservative effect on cognitive function, increase longevity, and protect against Alzheimer disease. You can find the full article here:


The bad news is this APP-A673T variant only exists in 0.5% of the studied Scandinavian population.

July 2, 2012

UCSC genome browser human gene location?

Filed under: data management, genetics — hypotheses @ 4:16 pm

To download UCSC RefSeq Gene info with HUGO gene name, start and stop codon or transcription start-end site, locate the refflat.txt.gz file on the annotation database
– Hg18:
– Hg19:
The reFlat.txt.gz contains the following columns as described in the reFlat.sql file
– 1) Hugo gene name
– 2) chromosome
– 3) strand (+/-)
– 4) Transcription start position
– 5) Transcription end position
– 6) Coding region start
– 7) Coding region end
– 8) Number of exons
– 9) Exon start positions
– 10) Exon end positions

These info should come in handy when you have to map the location of genetic marker to genes. I post the script I wrote to do this mapping latter if there is anyone interested.

March 11, 2012

Calculating heritability using SOLAR

Filed under: bash, data management, genetics, SOLAR — Tags: , — hypotheses @ 12:12 pm

As a not to myself, I occasionally have to do this occasionally. This time, simply preparing file for calculation of heritability in SOLAR. This is based on a variance-component analysis, comparing two likelihoods from two model fitting polygenic model and sporadic model.

I’m going to post the script here with all the comments about how to prepare files needed for SOLAR to do the job.

#Date: 03/11/2012
#By: Bhoom Suktitipat, MD, PhD
#Goal: Prepare file with phenotype for solar heritability analysis
#Current location: ~/Documents/GeneSTAR/CAC/solar

# ------ EXAMPLE OF cacforsolar.csv ------- #
# Family,ID,Mother,Father,Sex,cacbres
# 255,25598,,,f,
# 255,25599,,,m,
# 255,2550001,25598,25599,m,
# 255,2550101,25598,25599,f,
# ...

## -------------------------- ##
## -- MISSING DATA: can be blank
## 1) Preparing pedigree file
##    pedfile require these columns
##    id, fa, mo, sex, famid
##    - need to have both parents in the data, or none
##    - sex can be coded as m/f, M/F, 1/2
##    - famid field is optional; only required when id is not unique
echo "famid,id,fa,mo,sex" > cacbres.ped
tail -n +2 $INPUTFILE | awk -F"," '{print $1","$2","$4","$3","$5}' >> cacbres.ped
## 2) Preparing phenotype file
##    phenotype file require these columns
##    id, PHE1, PHE2, PHE3, famid, ...
##    -- famid is optional as in pedfile.
echo "famid,id,cacbres" > cacbres.phe
tail -n +2 $INPUTFILE | awk -F"," '{print $1","$2","$6}' >> cacbres.phe
## 3) Marker file (optional)
##    Field: id, genotype1, genotype2, ...
##    -- MISSING GENOTPE: 0/251, -/251, /251
##    -- Allele delimiter: AB; A B; A/B all accepted
## 4) Map file (optional) --## space-delimited fields!!!
##    -- FIRST LINE: chromosome number, and optional name of the map
##       Ex: 20 Haldane
##       - chromosome name can be string e.g. 20q14
##    -- MARKER LINE: marker name, marker location in (cM)
##       - must be ascending order
##       Ex: D20S101 0.0
## 5) SOLAR SCRIPT for heritability analysis
## Modeling example:
cat <<EOF > cacbres.solarscript
load pedigree cacbres.ped
load phenotypes cacbres.phe
trait cacbres
solar <cacbres.solarscript > cacbres.solarout

January 19, 2012

Biological Sequence Analysis (1)

Filed under: genetics, miscellaneous — Tags: , — hypotheses @ 1:47 pm

NHGRI started a series of lectures on Current Topics in Genome Analysis 2012 two weeks ago. For more info you can find out at  Youtube videos are also available for you to watch. This week’s lecture is about “Biological Sequence Analysis” by Andy Baxevanis.   My notes of the talk are summarized here.  The main topic of the talk involves biological sequence alignment and alignment tools and algorithms, including BLAST.  This is a pretty good lecture if you have been away from BLAST for a while and a good introduction for people who are new to genetics.

As a rule of thumb, and a general idea of what you should remember. When you are doing local sequence alignment, you will have to encounter with several matrix of scoring the sequence similarity.

–          Several alignments scoring matrix exists, e.g. PAM46, BLOSUM62. The number following the scoring matrix name is how the two sequence similarity should be “at most”. To look for more distantly related sequence, use the scoring matrix with lower number.

–          Gap: local alignment should allow at least 1 in every 20 basepair.

–          The return results from BLAST are those results that passed the scoring threshold. This doesn’t imply significant level. Some of these results, however, are considered statistically significant.

–          To assess the biological significance, “Karlin-Altschul Equation”, a normalized probability, as a function of # of letters in the query, # of letters in the database, and the size of search space. This “E-value” represents the number of false positive, and you want this to be as low as possible.

  • Look for E < 10E-6 for nucleotide BLAST
  • Look for E < 10E-3 for protein BLAST

–          As a reference for human genome RefSeq is a good starting place for BLAST.  RefSeq provides a single reference sequence for each molecule of the central dogma (DNA, mRNA, protein).  The database is non-redundant, updated to reflect the current knowledge of sequence data and biology, and is being curated.

–          Options to consider changing

  • Expected threshold: change the E-value as suggested above.
  • Matrix: change this to reflect how similar of the sequence you want to find.
  • Filter: Always filter out region with low complexity, e.g. homopolymeric region. These regions can confound the significant level of the results. (more false positive)

–          Identities: For protein based search, look for at least 25% identity. For nucleotide, look for sequence with at least 75% identity!

–         BLAT is the tool for finding location of an unknown sequence, or gene, e.g. exon, intron, promoter or unknown region in the genome.  BLAT: Blast Like Alignment Tool, much faster than BLAST, can find exact match of sequence down to L=33.  When looking for sequence fragments or unknown genes, BLAT is a good tool to start looking for location of these sequences in the genome. BLAT is available on UCSC Genome Browser.

October 25, 2011

Assigning SNPs to Genes and Gene Regions — the ALIGATOR way

Filed under: genetics, pathway analysis — hypotheses @ 6:05 pm

By principal various people doing this differently based on a different sets of (known/consensus) gene list. In the paper by Holmans, et al 2009 describing the “ALIGATOR” program to perform Gene ontology Analysis of GWAS,

1. “seq-gene” containing the list of genes can be downloaded from

2. Pseudo genes containing “pseudo” in the “feature_id” are excluded

3. Extract records with “feature_type” : “gene”, “group_label” : “”reference”, and “tax_id” : “9606”.

4. Retain these fields: chromosome, chr_start, chr_stop, and feature_id (NCBI gene ID)

5. SNP ID and chromosomal location are compared to the file from (4) ==> This is the hard part to be continued.


April 22, 2011

Split Whole-Genome PLINK Binary Files to Small Chunks

Filed under: bash, data management, genetics, plink, Uncategorized — hypotheses @ 1:03 pm

I prefer to store my GWAS data in PLINK binary format. The PLINK binary file format gives us many advantages. With tons of data, binary file format is a very efficient way to save disk space. To give you some number, you can store the data of 2000 people from Illumina Human 1M chip with ~1 million marker in 1 CD (~700Mb). I prefer to keep all the genetic markers in the same file. This is convenient when you want to extract a list of SNPs from several chromosome. You can just go to this one file and extract them out.
However, keeping everything in one single file is not always a good solution to all the problems. For most of the genome-wide association analysis, we can speed up the analysis using parallelization. The simplest way to do this is to split your data into several smaller chunks and submit each part for analysis on several computers (preferably computing cluster). Although this is so-called "poor man" parallelization, it gets the job done in no time. 
For this purpose, from a single PLINK binary file format, I wrote a script to split this file by chromosome (in by_chr directory), and for each chromosome into a smaller chunk of your choice (I used 5000 for the example below).

## generate PLINK script to split whole genome PLINK BED TO smaller chunks
set -e
#set -o verbose
if [ ! `type -p plink` ];then
echo "Cannot find PLINK. Make sure PLINK is in your \$PATH";
exit 1
if [ "${prefix}" == "" ]; then
echo "Usage: ./ [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"

echo "Exit no input file"
exit 1
if [ "${chunk}" == "" ]; then
echo "Usage: ./ [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"
echo "Exit no chunk size"
exit 1
if [ "${postfix}" == "" ]; then
echo "Usage: ./ [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"
echo "Need prefix of output"
exit 1
echo "File ${prefix} by ${chunk} SNPs chunk"
echo "Spliting PLINK ${prefix} into ${chunk} file size"
if [ ! -e ${prefix}.bim ];then
echo "File ${prefix}.bim not found"
exit 1
chrs=`awk '{print $1}' ${prefix}.bim | sort -nur`
echo "Chromosome in Map File: ${chrs}" | tr "\n" " "
echo ""

## 0. Create script directory if not exist
if [ ! -d by_chr ]; then mkdir -p by_chr; fi
if [ ! -d script ]; then mkdir -p script; fi

# 1. Split by chromosome first
for chr in ${chrs}; do
cat < script/plinkSplit.chr${chr}.sh
# extracting chromosome $chr from ${prefix}
plink --noweb --bfile ${prefix} --chr ${chr} --make-bed --out by_chr/${postfix}_chr${chr}
if [ $split_by_chromosome -eq 1 ]; then
echo "Processing map of chromosome $chr"
grep -w "^$chr" ${prefix}.bim > by_chr/${postfix}_chr${chr}.bim
sh script/plinkSplit.chr${chr}.sh

## 2. Split chromosome to a chunk of ${chunk} SNPs
for chr in $chrs;
if [ ! -d c${chr} ]; then mkdir -p c${chr}; fi
## wait till finish extracting bim by chromosome to submit the following jobs
## default wait_duration = 15 minutes (change value below for longer wait time
wait_duration=5 ## Wait for 1 hour before stopping
while [ "$count" -le "${wait_duration}" ];
if [ ! -e by_chr/${postfix}_chr${chr}.bed ]; then
echo -n -e "\r                                                   \r$count. Waiting for file by_chr/${postfix}_chr${chr}.bed"
let "count += 1"
sleep 60
let "count = wait_duration + 1"
n_end=`wc -l by_chr/${postfix}_chr${chr}.bim | awk '{print $1}'`
echo "Chromosome $chr has $n_end SNPs"
let "snp_end = snp_start + ${chunk} -1"
if [ $snp_end -gt ${n_end} ];then snp_end=${n_end};fi
echo "Extracting chromosome $chr from $snp_start to ${snp_end}"
while [ "$snp_end" -le "${n_end}" ];
first_snp=`tail -n +${snp_start} ${mapfile} | head -n 1| awk '{print $2}'`
last_snp=`tail -n +${snp_end} ${mapfile} | head -n 1 | awk '{print $2}'`
echo "First SNP $first_snp Last SNP $last_snp"
cat <<EOF > script/plinkSplit.chr${chr}.pc${pc}.sh
echo "Extracting chromosome $chr from $snp_start to ${snp_end}"
plink --noweb --bfile by_chr/${postfix}_chr${chr} \
--from $first_snp --to $last_snp --make-bed \
--out c${chr}/${postfix}_chr${chr}_p${pc}
exit 0
sh script/plinkSplit.chr${chr}.pc${pc}.sh
let "pc += 1"
let "snp_start += ${chunk}"
let "snp_end += ${chunk}-1"
if [ $snp_end -gt ${n_end} ];then let "snp_end= n_end + 1" ;fi
done ## while loop all snps by chromosome
fi ## if checking whether chromosome has been extracted or not
done ## while loop waiting for split by chromosome
## Terminate if PLINK has not finished extracting file
if [ ! -e by_chr/${postfix}_chr${chr}.bed ]; then
echo "Timeover: PLINK failed to \extract by_chr/${postfix}_chr${chr}.bed"
exit 1
done ## for loop all chromosomes
exit 0

So, this script will first split your file into a plink binary format by chromosome (in by_chr). Then, it will split each chromosome into a smaller file. The files by_chr may not be very useful for parallelization, but believe me, there're times that you will get to use them.

May 17, 2010

Alternate color manhattan plot

Filed under: genetics, R — hypotheses @ 8:31 pm

I am trying to make my own manhattan plot the way I want for a long time, and this one seems the be the shortest version that I come up with. Here’s the feature of my function.
1. Should be transportable. Only requires “CHR”, “BP”, “P” to run at a minimum.
2. You can choose to draw your own reference lines, using — cutoffs
3. You can define how tall your graph will be for comparison, –using maxy
4. It is based on basic plot function. So, you can add title to your plot later on. Add more color points, using points [need to convert your coordinate to the same way first].
The down sides.
1. I use the order on chromosome, instead of chromosomal position. So, it is not good to use this code if you only have one gene, and want to the signal location on the chromosome.
2. The signals are overlapping between chromosome in the low p-value region. However, this does not obscure your signal in the top. Improvement can be added.
3. Not all chromosomes are labelled on the X-axis. This is probably the most annoying things that need to be improved.
4. It doesn’t have much advantage in terms of speed. It still took 83 seconds to plot 860,000 signals. That’s roughly 10,000 points per second, though.
Besides all these it still seems pretty good.
manhattan <- function(data,color=c("black","grey"),maxy,cutoffs= c(5e-8)) {
if (! sum(c("CHR","BP","P") %in% names(data)) == 3) stop("Data must have CHR BP P column")
data <- data[order(data$CHR,data$BP),]
data <- data[!$P),]
nsnps <- nrow(data)
data$COOR <- 1:nsnps
ticks <- tapply(data$COOR,data$CHR,median)
data$nlogp <- -log10(data$P)
chr <- sort(unique(data$CHR))
if (missing(maxy)) {
	} else {
	xaxt="n",type="n",xlab="chromosome",ylab="-log10(p)")	}
axis(1,at=ticks,labels=chr) <- ifelse(data$CHR%%2,1,0)
for (coi in 1:length(cutoffs)) {
Older Posts »

Create a free website or blog at