Category Archives: genetics

A note on getting start with samtools on Mac OsX 10.8.5 Mountain Lion

samtools is a handy tool for sequence alignment and mapping (

For more information, please refer to the original article here:

  • Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

To get it to work on mac, I’ve decided to compile it.

First, get the sourcecode from

Then, decompress the files in ~/bin

Compile it as usual

make install

You might need to have administrative right to write to system folder as well.

Next, we'll need to add a path to ~/.bash_profile

Include this in ~/.bash_profile to point to SAMTOOLS_HOME

export SAMTOOLS_HOME=~/bin/samtools-0.1.19
export PATH=$SAMTOOLS_HOME/bcftools/:$PATH

Now, samtools should be ready for you to use.


How to wget with proxy authentication?

Once again, I have a problem with proxy server authentication through my university network. Trying to install the new KGGSeq software to do next-generation sequencing data analysis.

As a quick fix, with cygwin, here is what I did.

1. Need to tell bash that  that we are using a proxy server

## Add these to ~/.bashrc for my bash start up shell

## Add these to ~/.bashrc for my bash start up shell

export http_proxy=$proxy

2. Need to tell wget what username and password to use with the proxy server.

As an example to download KGGSeq through cygwin, here’s what I did.

wget --proxy-user "bhoom" --proxy-password "bhoom_password"

Wget – ArchWiki.

I’m still not quite sure why they still use it. There seems to be several other enterprise authentication system, but all other systems are probably pricy? But does price justify all the other troubles we all have with slow connection for every website, problems running many bioinformatics software that cannot connect through proxy-server, etc?

Ion Torrent PGM vs PacBio vs MiSeq

“It’s a lot cheaper to buy PGM compared to other sequencing platform. So, should we buy it?”

A common concern regarding this question is whether the sequencing quality is alright? This is one of the first concern Ion Torrent seems to have experience since they first launched their first sequencer.

Quail, et al took a look at three platform in their paper. Although they have only looked at microbial genome with variable GC/AT content. They showed that there still seems to be a problem with Ion Torrent PGM platform when they sequence Plasmodium genome. Moreover, the false positive rate of base calling from Ion Torrent platform is still higher.

This data may not be applicable to human genome sequencing, but it deserves a closer look in my opinion.

BMC Genomics | Full text | A tale of three next generation sequencing platforms: comparison of Ion torrent, pacific biosciences and illumina MiSeq sequencers.

Modifier Genes to Protect From Alzheimer’s Disease, Serious Infection in Cystic Fibrosis: Topol on Genomics

Modifier Genes to Protect From Alzheimer’s Disease, Serious Infection in Cystic Fibrosis: Topol on Genomics.

Dr. Topol talked about APP gene and its preservative effect on cognitive function, increase longevity, and protect against Alzheimer disease. You can find the full article here:


The bad news is this APP-A673T variant only exists in 0.5% of the studied Scandinavian population.

UCSC genome browser human gene location?

To download UCSC RefSeq Gene info with HUGO gene name, start and stop codon or transcription start-end site, locate the refflat.txt.gz file on the annotation database
– Hg18:
– Hg19:
The reFlat.txt.gz contains the following columns as described in the reFlat.sql file
– 1) Hugo gene name
– 2) chromosome
– 3) strand (+/-)
– 4) Transcription start position
– 5) Transcription end position
– 6) Coding region start
– 7) Coding region end
– 8) Number of exons
– 9) Exon start positions
– 10) Exon end positions

These info should come in handy when you have to map the location of genetic marker to genes. I post the script I wrote to do this mapping latter if there is anyone interested.

Calculating heritability using SOLAR

As a not to myself, I occasionally have to do this occasionally. This time, simply preparing file for calculation of heritability in SOLAR. This is based on a variance-component analysis, comparing two likelihoods from two model fitting polygenic model and sporadic model.

I’m going to post the script here with all the comments about how to prepare files needed for SOLAR to do the job.

#Date: 03/11/2012
#By: Bhoom Suktitipat, MD, PhD
#Goal: Prepare file with phenotype for solar heritability analysis
#Current location: ~/Documents/GeneSTAR/CAC/solar

# ------ EXAMPLE OF cacforsolar.csv ------- #
# Family,ID,Mother,Father,Sex,cacbres
# 255,25598,,,f,
# 255,25599,,,m,
# 255,2550001,25598,25599,m,
# 255,2550101,25598,25599,f,
# ...

## -------------------------- ##
## -- MISSING DATA: can be blank
## 1) Preparing pedigree file
##    pedfile require these columns
##    id, fa, mo, sex, famid
##    - need to have both parents in the data, or none
##    - sex can be coded as m/f, M/F, 1/2
##    - famid field is optional; only required when id is not unique
echo "famid,id,fa,mo,sex" > cacbres.ped
tail -n +2 $INPUTFILE | awk -F"," '{print $1","$2","$4","$3","$5}' >> cacbres.ped
## 2) Preparing phenotype file
##    phenotype file require these columns
##    id, PHE1, PHE2, PHE3, famid, ...
##    -- famid is optional as in pedfile.
echo "famid,id,cacbres" > cacbres.phe
tail -n +2 $INPUTFILE | awk -F"," '{print $1","$2","$6}' >> cacbres.phe
## 3) Marker file (optional)
##    Field: id, genotype1, genotype2, ...
##    -- MISSING GENOTPE: 0/251, -/251, /251
##    -- Allele delimiter: AB; A B; A/B all accepted
## 4) Map file (optional) --## space-delimited fields!!!
##    -- FIRST LINE: chromosome number, and optional name of the map
##       Ex: 20 Haldane
##       - chromosome name can be string e.g. 20q14
##    -- MARKER LINE: marker name, marker location in (cM)
##       - must be ascending order
##       Ex: D20S101 0.0
## 5) SOLAR SCRIPT for heritability analysis
## Modeling example:
cat <<EOF > cacbres.solarscript
load pedigree cacbres.ped
load phenotypes cacbres.phe
trait cacbres
solar <cacbres.solarscript > cacbres.solarout

Biological Sequence Analysis (1)

NHGRI started a series of lectures on Current Topics in Genome Analysis 2012 two weeks ago. For more info you can find out at  Youtube videos are also available for you to watch. This week’s lecture is about “Biological Sequence Analysis” by Andy Baxevanis.   My notes of the talk are summarized here.  The main topic of the talk involves biological sequence alignment and alignment tools and algorithms, including BLAST.  This is a pretty good lecture if you have been away from BLAST for a while and a good introduction for people who are new to genetics.

As a rule of thumb, and a general idea of what you should remember. When you are doing local sequence alignment, you will have to encounter with several matrix of scoring the sequence similarity.

–          Several alignments scoring matrix exists, e.g. PAM46, BLOSUM62. The number following the scoring matrix name is how the two sequence similarity should be “at most”. To look for more distantly related sequence, use the scoring matrix with lower number.

–          Gap: local alignment should allow at least 1 in every 20 basepair.

–          The return results from BLAST are those results that passed the scoring threshold. This doesn’t imply significant level. Some of these results, however, are considered statistically significant.

–          To assess the biological significance, “Karlin-Altschul Equation”, a normalized probability, as a function of # of letters in the query, # of letters in the database, and the size of search space. This “E-value” represents the number of false positive, and you want this to be as low as possible.

  • Look for E < 10E-6 for nucleotide BLAST
  • Look for E < 10E-3 for protein BLAST

–          As a reference for human genome RefSeq is a good starting place for BLAST.  RefSeq provides a single reference sequence for each molecule of the central dogma (DNA, mRNA, protein).  The database is non-redundant, updated to reflect the current knowledge of sequence data and biology, and is being curated.

–          Options to consider changing

  • Expected threshold: change the E-value as suggested above.
  • Matrix: change this to reflect how similar of the sequence you want to find.
  • Filter: Always filter out region with low complexity, e.g. homopolymeric region. These regions can confound the significant level of the results. (more false positive)

–          Identities: For protein based search, look for at least 25% identity. For nucleotide, look for sequence with at least 75% identity!

–         BLAT is the tool for finding location of an unknown sequence, or gene, e.g. exon, intron, promoter or unknown region in the genome.  BLAT: Blast Like Alignment Tool, much faster than BLAST, can find exact match of sequence down to L=33.  When looking for sequence fragments or unknown genes, BLAT is a good tool to start looking for location of these sequences in the genome. BLAT is available on UCSC Genome Browser.