As the doctor gone rogue

November 27, 2015

Appending multiple vcf

Filed under: Uncategorized — Tags: , , — hypotheses @ 12:10 am

If you have multiple vcf files split by chromosome from the same samples, this is the case when performing joint variant calls of multiple samples in GATK. At the end of the day, if you want to have a single vcf file from this project, CatVariants tool (a command line tool in GATK) is pretty fast. Although I think this might be just the case of simple cat of multiple files except the vcf header, this tools still come in handy especially when you already have GATK installed. (more…)

October 9, 2015

apt-get error with … for package ‘linux-headers-3.2.0-60’ is missing final newline

Filed under: Uncategorized — Tags: , , , — hypotheses @ 7:49 am

I recently clone a virtual machine image and setting up locally. When trying to install a new package through <code>apt-get</code> in Ubuntu. I ran into a strange problem of “missing final newline” in linux-headers.

Trying to google for solution, I found a page in 2004 mentioning this problem. It suggested that a file in /var/lib/dpkg/info/smbf.list is the cause of the error. So, I tried to locate this file but without any success.

I notice that there is “linux-headers-3.2.0-60.list” and a few other files with similar names in this folder. So, well, why not trying to delete them.

It seems like deleting these files fix the problem and allow apt-get to install the package without any problems. So far the system seems normal. We haven’t allowed any additional access through the server, only deleting a few files that don’t look essential for the OS to run.

Further remove other apt-get unused file through apt-get clean seems to solve the problem.

August 16, 2015

Some benefits of clinical exome sequencing

Filed under: Uncategorized — Tags: — hypotheses @ 6:49 am

To date, clinical exome sequencing yielded around 25% diagnostic rate of unknown genetic diseases.

The number came from two recently published studies cited in Berg. JAMA 312(18), 2014.

The major concern in making the diagnosis remained the prior probability for each disease, which contributes directly to the false positive or false negative diagnosis of each specific condition. Also, genome-annotation remains a major challenge to all the scientists playing the roles in interpreting the results of these genome sequencing, whether it’s the whole genome or just an an exome.

The general questions that most patients would want to know include

  1. the diagnosis
  2. some explanations how the disease happened
  3. Anyone else in the family might be affected?
  4. Any treatment?
  5. Any future

Read more on the two cited references:

May 22, 2014

Wiki software for Mac

Filed under: Uncategorized — Tags: , , — hypotheses @ 11:57 pm

I’ve updated Maverick (OsX 10.8) to OsX server recently. It came with an easy interface to configure. The best part for me seems to be the wiki feature. However, this comes with a cache-22. For some reasons that I cannot figure out, and I found that similar problem has been reported for users in Thailand running OsX server.

When Wiki feature is enabled, I get an error saying that ERROR:  time zone displacement out of range This is quite dreadful! The only way to solve the problem seems to be fresh installation which is out of the question for me. Without “Wiki feature”, the server and web service would be running without any problem. So, that’s one option for me, to not enabling the wiki feature. 

My alternatives for using the built-in Wiki feature is, therefore, looking for other wiki platforms available for Mac. From Wikipedia,, the list is narrowed down to MoinMoin and MojoMojo, which is available on Mac. MoinMoin was written in Python, and the other one in Perl (recently changed the name to Catalyst software for some reason).

MoinMoin does not require database support, either Posgres, MySql, etc, which seems to be the cause of the dreaded Timezone displacement errors. Therefore, I’m going to go forward with MoinMoin installation and give it a try. I really want to try Confluence; however, they do not support it on Mac. So, Confluence is out of the picture right now. 


May 20, 2012

Obtaining R Object name

Filed under: Uncategorized — Tags: , — hypotheses @ 9:07 am

While programming in R, occasionally I wish I can just get the name of an object that I am working with and print out the name, or use the name as a name for the current plot that I am creating.

This is more convenient when creating a function that your user might not have an idea yet what the title should be. By default, the title name in most plot functions in R is created using “deparse(substitute(x))” function

mydata <- cbind(c(1:10),c(1:10))

Name the plot after the name of the data frame used to create this plot.

May 3, 2012

Getting help finding sourcecode in R

Filed under: Uncategorized — Tags: , — hypotheses @ 3:41 pm

(From:  page 43)

 It’s always a good idea to look under the hood and see how things work! Sometimes, that’s the only way to make sure that other people’s codes or programs work the way you really expect it to be, especially the free ones.


  1. The simplest case is typeing the function’s name without “()” following the function’s name

> matrix
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
    if (is.object(data) || !is.atomic(data))
        data <- as.vector(data)
    .Internal(matrix(data, nrow, ncol, byrow, dimnames, missing(nrow),
<environment: namespace:base>

However, the comments are not included,  and it might still be better to really look at the sourcecode.
You can find the original sources after unpakcing the source package, in the directory “PackageName/R/”.
For R’s base pakcages, the R code is in$R_HOME/src/library/PakageName/R/.

  1. For codes hidden in a namespace,  type getAnywhere(“FunctionName”) to find out

> plot.factor
Error: object ‘plot.factor’ not found
> getAnywhere(plot.factor)
A single object matching ‘plot.factor’ was found
It was found in the following places
  registered S3 method for plot from namespace graphics
with value

function (x, y, legend.text = NULL, …)


<environment: namespace:graphics>

It is possible

to ask for available methods with methods(print).

The function of interest is the S3 method print.lm()

> methods(print)
  [1] print.acf*                              
  [2] print.anova                             
  [3] print.lm

A method hidden in a namespace

can be accessed (and therefore printed) directly

using the ::: operator as in stats:::print.lm.

For S4 related sources, it is advisable to look at the package’s source files directly!

  1. For “COMPILED” code sources, these are always more problematic. Why you type the name of these functions, what you will see are something like these: .C(), .Call(), .Fortran(), .External(), or .Internal(), or .Primitive().

– The first step is to look up the entry point in file ‘$R_HOME/src/main/names.c’, if the calling R function is either .Primitive() or .Internal().
You will find something like this
{“rnorm”,        do_random2,        8,        11,        3,        {PP_FUNCALL, PREC_FN,        0}},
This tells you to find “do_random2” in the source files.
– Then, grep ‘do_random2″ *.* in the source directory should point you to the correct file to look at. In this case, the source is in “random.c”


Now it’s time to learn “c” in order to understand more about these sources. At least, you will have to know the program structure well enough to be able to read other people’s codes.

April 20, 2012

Understanding Random Number Generator in R

Filed under: Uncategorized — Tags: , — hypotheses @ 5:04 pm

“Mersenne-Twister”is default method for generating random number. The brief description is


“From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 – 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set.”


At any given moment, “the current seed” is stored in “.Random.seed”. For a short example below, here is what the .Random.seed looks like.

> runif(5)
[1] 0.5214241 0.6072482 0.8581209 0.1057113 0.4943451
> length(.Random.seed)
[1] 626

# It is 626 long.

> head(.Random.seed)
  [1]         403           5   524275616  -891164866   839495241  2076756731  -695076150

# The first value is the type of random number generator, in this case “Mersenne-Twister”.

# The second value is the “current position” in the set

# The other 624 numbers stored in .Random.seed are what they call “seed”.

# So, if you simulated another 5 random numbers,

> runif(5)
[1] 0.9766740 0.9481603 0.1786681 0.4026092 0.7110552

# The current position changed from 5 (above) to 10 (see below). Notice that other 624 numbers in the set remained the same.

> head(.Random.seed)
[1] 403 10 619611805 -1461824745 -1054662018 -1340796360

# Once you simulate 624th random number, the index of current position will be 624.



> runif(1)
[1] 0.8074707
> head(.Random.seed)
[1] 403 624 619611805 -1461824745 -1054662018 -1340796360

# When you simulate one more random number, a new seed will then be used.

> runif(1)
[1] 0.6062263
> head(.Random.seed)
[1] 403 1 -1016206940 813281659 -1786346428 -363208600

# So, any given moment, you can save the current seed that will be used for the next simulation by


> my.seed <- .Random.seed
> runif(10)
 [1] 0.8797226 0.2143323 0.6826555 0.4366940 0.4330555 0.5745667 0.7406138 0.4646236 0.8014502 0.9818474

# This saved the current seed to be used for the next simulation in “my.seed”

# You can then restore the current seed and reproduce the above simulated numbers by

> .Random.seed <- t(my.seed)
> runif(10)
[1] 0.8797226 0.2143323 0.6826555 0.4366940 0.4330555 0.5745667 0.7406138 0.4646236 0.8014502 0.9818474


# See that the runif(10) here gave the same numbers as the one above.

April 18, 2012

New hope for patients with type I diabetes

Filed under: Uncategorized — hypotheses @ 6:17 pm

If I read this and understand correctly, patients with type I diabetes will have chance to be cured by stem cell/gene therapy soon.

Developmental biology: Re-evaluating gut insulin instinct : Article : Nature Reviews Genetics.

According to this article

  • Talchai, C. et al. Generation of functional insulin-producing cells in the gut by Foxo1 ablation. Nature Genet. 7 Feb 2012 (doi: 10.1038/ng.2215) Article

Talchai, et al were seeking to understand the role of Foxo1 ablation in mice. They found that by knocking out Foxo1, progenitor cells in the gut (which are the same cell lineage as the insulin producing beta cells in pancreas) will develop into insulin producing cells.

A lot more work to be done, but this same idea is very interesting and can probably be applied to other diseases as well. By knowing how to modify a common progenitor cells into the right target cells, any diseases could probably be treated.


November 21, 2011

Converting Column Data to Table (2)

Filed under: Uncategorized — hypotheses @ 6:03 pm

Last time I wrote about converting column to table in Excel. Today, I have to do this again, and went back to read what I wrote. One of the new problems that I have today is that I have multiple columns in Table A. But, I want to convert this long table into Table B.

The problem is essentially similar to having a data in Long Format (one individual has multiple line for each variable).

The formula I posted last time was essentially following this format.


This still applies. From the example I used last time,

For example, your original data is from $B$2:$B$100, containing item 1 to 9 (i.e. p =9, of 11 repeated measurements). You want to cut row 11-19 and put it right next to B2:B10. Instead, put this =OFFSET($B$2,9*(COLUMN()-COLUMN($B$2))+(ROW()-ROW($B$2)),0,1,1) in cell C2, then copy it to C2:L2, then copy it cover C10:L10. This will give you a 9 row by 11 column table.

If you don’t want to put the new data right next to the old data, but want the first cell of the new table to be “H1”. What you have to do is the change the reference for calculating the distance from the reference cell. In this case, I change $B$2 [the previous first cell of the new table] to $H$1. And, it is as simple as that.



April 22, 2011

Split Whole-Genome PLINK Binary Files to Small Chunks

Filed under: bash, data management, genetics, plink, Uncategorized — hypotheses @ 1:03 pm

I prefer to store my GWAS data in PLINK binary format. The PLINK binary file format gives us many advantages. With tons of data, binary file format is a very efficient way to save disk space. To give you some number, you can store the data of 2000 people from Illumina Human 1M chip with ~1 million marker in 1 CD (~700Mb). I prefer to keep all the genetic markers in the same file. This is convenient when you want to extract a list of SNPs from several chromosome. You can just go to this one file and extract them out.
However, keeping everything in one single file is not always a good solution to all the problems. For most of the genome-wide association analysis, we can speed up the analysis using parallelization. The simplest way to do this is to split your data into several smaller chunks and submit each part for analysis on several computers (preferably computing cluster). Although this is so-called "poor man" parallelization, it gets the job done in no time. 
For this purpose, from a single PLINK binary file format, I wrote a script to split this file by chromosome (in by_chr directory), and for each chromosome into a smaller chunk of your choice (I used 5000 for the example below).

## generate PLINK script to split whole genome PLINK BED TO smaller chunks
set -e
#set -o verbose
if [ ! `type -p plink` ];then
echo "Cannot find PLINK. Make sure PLINK is in your \$PATH";
exit 1
if [ "${prefix}" == "" ]; then
echo "Usage: ./ [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"

echo "Exit no input file"
exit 1
if [ "${chunk}" == "" ]; then
echo "Usage: ./ [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"
echo "Exit no chunk size"
exit 1
if [ "${postfix}" == "" ]; then
echo "Usage: ./ [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: $sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"
echo "Need prefix of output"
exit 1
echo "File ${prefix} by ${chunk} SNPs chunk"
echo "Spliting PLINK ${prefix} into ${chunk} file size"
if [ ! -e ${prefix}.bim ];then
echo "File ${prefix}.bim not found"
exit 1
chrs=`awk '{print $1}' ${prefix}.bim | sort -nur`
echo "Chromosome in Map File: ${chrs}" | tr "\n" " "
echo ""

## 0. Create script directory if not exist
if [ ! -d by_chr ]; then mkdir -p by_chr; fi
if [ ! -d script ]; then mkdir -p script; fi

# 1. Split by chromosome first
for chr in ${chrs}; do
cat < script/plinkSplit.chr${chr}.sh
# extracting chromosome $chr from ${prefix}
plink --noweb --bfile ${prefix} --chr ${chr} --make-bed --out by_chr/${postfix}_chr${chr}
if [ $split_by_chromosome -eq 1 ]; then
echo "Processing map of chromosome $chr"
grep -w "^$chr" ${prefix}.bim > by_chr/${postfix}_chr${chr}.bim
sh script/plinkSplit.chr${chr}.sh

## 2. Split chromosome to a chunk of ${chunk} SNPs
for chr in $chrs;
if [ ! -d c${chr} ]; then mkdir -p c${chr}; fi
## wait till finish extracting bim by chromosome to submit the following jobs
## default wait_duration = 15 minutes (change value below for longer wait time
wait_duration=5 ## Wait for 1 hour before stopping
while [ "$count" -le "${wait_duration}" ];
if [ ! -e by_chr/${postfix}_chr${chr}.bed ]; then
echo -n -e "\r                                                   \r$count. Waiting for file by_chr/${postfix}_chr${chr}.bed"
let "count += 1"
sleep 60
let "count = wait_duration + 1"
n_end=`wc -l by_chr/${postfix}_chr${chr}.bim | awk '{print $1}'`
echo "Chromosome $chr has $n_end SNPs"
let "snp_end = snp_start + ${chunk} -1"
if [ $snp_end -gt ${n_end} ];then snp_end=${n_end};fi
echo "Extracting chromosome $chr from $snp_start to ${snp_end}"
while [ "$snp_end" -le "${n_end}" ];
first_snp=`tail -n +${snp_start} ${mapfile} | head -n 1| awk '{print $2}'`
last_snp=`tail -n +${snp_end} ${mapfile} | head -n 1 | awk '{print $2}'`
echo "First SNP $first_snp Last SNP $last_snp"
cat <<EOF > script/plinkSplit.chr${chr}.pc${pc}.sh
echo "Extracting chromosome $chr from $snp_start to ${snp_end}"
plink --noweb --bfile by_chr/${postfix}_chr${chr} \
--from $first_snp --to $last_snp --make-bed \
--out c${chr}/${postfix}_chr${chr}_p${pc}
exit 0
sh script/plinkSplit.chr${chr}.pc${pc}.sh
let "pc += 1"
let "snp_start += ${chunk}"
let "snp_end += ${chunk}-1"
if [ $snp_end -gt ${n_end} ];then let "snp_end= n_end + 1" ;fi
done ## while loop all snps by chromosome
fi ## if checking whether chromosome has been extracted or not
done ## while loop waiting for split by chromosome
## Terminate if PLINK has not finished extracting file
if [ ! -e by_chr/${postfix}_chr${chr}.bed ]; then
echo "Timeover: PLINK failed to \extract by_chr/${postfix}_chr${chr}.bed"
exit 1
done ## for loop all chromosomes
exit 0

So, this script will first split your file into a plink binary format by chromosome (in by_chr). Then, it will split each chromosome into a smaller file. The files by_chr may not be very useful for parallelization, but believe me, there're times that you will get to use them.

Older Posts »

Blog at