Appending multiple vcf

If you have multiple vcf files split by chromosome from the same samples, this is the case when performing joint variant calls of multiple samples in GATK. At the end of the day, if you want to have a single vcf file from this project, CatVariants tool (a command line tool in GATK) is pretty fast. Although I think this might be just the case of simple cat of multiple files except the vcf header, this tools still come in handy especially when you already have GATK installed. Continue reading “Appending multiple vcf”


apt-get error with … for package ‘linux-headers-3.2.0-60’ is missing final newline

I recently clone a virtual machine image and setting up locally. When trying to install a new package through <code>apt-get</code> in Ubuntu. I ran into a strange problem of “missing final newline” in linux-headers.

Trying to google for solution, I found a page in 2004 mentioning this problem. It suggested that a file in /var/lib/dpkg/info/smbf.list is the cause of the error. So, I tried to locate this file but without any success.

I notice that there is “linux-headers-3.2.0-60.list” and a few other files with similar names in this folder. So, well, why not trying to delete them.

It seems like deleting these files fix the problem and allow apt-get to install the package without any problems. So far the system seems normal. We haven’t allowed any additional access through the server, only deleting a few files that don’t look essential for the OS to run.

Further remove other apt-get unused file through apt-get clean seems to solve the problem.

Some benefits of clinical exome sequencing

To date, clinical exome sequencing yielded around 25% diagnostic rate of unknown genetic diseases.

The number came from two recently published studies cited in Berg. JAMA 312(18), 2014.

The major concern in making the diagnosis remained the prior probability for each disease, which contributes directly to the false positive or false negative diagnosis of each specific condition. Also, genome-annotation remains a major challenge to all the scientists playing the roles in interpreting the results of these genome sequencing, whether it’s the whole genome or just an an exome.

The general questions that most patients would want to know include

  1. the diagnosis
  2. some explanations how the disease happened
  3. Anyone else in the family might be affected?
  4. Any treatment?
  5. Any future

Read more on the two cited references:

Wiki software for Mac

I’ve updated Maverick (OsX 10.8) to OsX server recently. It came with an easy interface to configure. The best part for me seems to be the wiki feature. However, this comes with a cache-22. For some reasons that I cannot figure out, and I found that similar problem has been reported for users in Thailand running OsX server.

When Wiki feature is enabled, I get an error saying that ERROR:  time zone displacement out of range This is quite dreadful! The only way to solve the problem seems to be fresh installation which is out of the question for me. Without “Wiki feature”, the server and web service would be running without any problem. So, that’s one option for me, to not enabling the wiki feature. 

My alternatives for using the built-in Wiki feature is, therefore, looking for other wiki platforms available for Mac. From Wikipedia,, the list is narrowed down to MoinMoin and MojoMojo, which is available on Mac. MoinMoin was written in Python, and the other one in Perl (recently changed the name to Catalyst software for some reason).

MoinMoin does not require database support, either Posgres, MySql, etc, which seems to be the cause of the dreaded Timezone displacement errors. Therefore, I’m going to go forward with MoinMoin installation and give it a try. I really want to try Confluence; however, they do not support it on Mac. So, Confluence is out of the picture right now. 


Obtaining R Object name

While programming in R, occasionally I wish I can just get the name of an object that I am working with and print out the name, or use the name as a name for the current plot that I am creating.

This is more convenient when creating a function that your user might not have an idea yet what the title should be. By default, the title name in most plot functions in R is created using “deparse(substitute(x))” function

mydata <- cbind(c(1:10),c(1:10))
Name the plot after the name of the data frame used to create this plot.

Getting help finding sourcecode in R

(From:  page 43)

 It’s always a good idea to look under the hood and see how things work! Sometimes, that’s the only way to make sure that other people’s codes or programs work the way you really expect it to be, especially the free ones.


  1. The simplest case is typeing the function’s name without “()” following the function’s name

> matrix
function (data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
    if (is.object(data) || !is.atomic(data))
        data <- as.vector(data)
    .Internal(matrix(data, nrow, ncol, byrow, dimnames, missing(nrow),
<environment: namespace:base>

However, the comments are not included,  and it might still be better to really look at the sourcecode.
You can find the original sources after unpakcing the source package, in the directory “PackageName/R/”.
For R’s base pakcages, the R code is in$R_HOME/src/library/PakageName/R/.

  1. For codes hidden in a namespace,  type getAnywhere(“FunctionName”) to find out

> plot.factor
Error: object ‘plot.factor’ not found
> getAnywhere(plot.factor)
A single object matching ‘plot.factor’ was found
It was found in the following places
  registered S3 method for plot from namespace graphics
with value

function (x, y, legend.text = NULL, …)


<environment: namespace:graphics>

It is possible

to ask for available methods with methods(print).

The function of interest is the S3 method print.lm()

> methods(print)
  [1] print.acf*                              
  [2] print.anova                             
  [3] print.lm

A method hidden in a namespace

can be accessed (and therefore printed) directly

using the ::: operator as in stats:::print.lm.

For S4 related sources, it is advisable to look at the package’s source files directly!

  1. For “COMPILED” code sources, these are always more problematic. Why you type the name of these functions, what you will see are something like these: .C(), .Call(), .Fortran(), .External(), or .Internal(), or .Primitive().

– The first step is to look up the entry point in file ‘$R_HOME/src/main/names.c’, if the calling R function is either .Primitive() or .Internal().
You will find something like this
{“rnorm”,        do_random2,        8,        11,        3,        {PP_FUNCALL, PREC_FN,        0}},
This tells you to find “do_random2” in the source files.
– Then, grep ‘do_random2″ *.* in the source directory should point you to the correct file to look at. In this case, the source is in “random.c”


Now it’s time to learn “c” in order to understand more about these sources. At least, you will have to know the program structure well enough to be able to read other people’s codes.

Understanding Random Number Generator in R

“Mersenne-Twister”is default method for generating random number. The brief description is


“From Matsumoto and Nishimura (1998). A twisted GFSR with period 2^19937 – 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set.”


At any given moment, “the current seed” is stored in “.Random.seed”. For a short example below, here is what the .Random.seed looks like.

> runif(5)
[1] 0.5214241 0.6072482 0.8581209 0.1057113 0.4943451
> length(.Random.seed)
[1] 626

# It is 626 long.

> head(.Random.seed)
  [1]         403           5   524275616  -891164866   839495241  2076756731  -695076150

# The first value is the type of random number generator, in this case “Mersenne-Twister”.

# The second value is the “current position” in the set

# The other 624 numbers stored in .Random.seed are what they call “seed”.

# So, if you simulated another 5 random numbers,

> runif(5)
[1] 0.9766740 0.9481603 0.1786681 0.4026092 0.7110552

# The current position changed from 5 (above) to 10 (see below). Notice that other 624 numbers in the set remained the same.

> head(.Random.seed)
[1] 403 10 619611805 -1461824745 -1054662018 -1340796360

# Once you simulate 624th random number, the index of current position will be 624.



> runif(1)
[1] 0.8074707
> head(.Random.seed)
[1] 403 624 619611805 -1461824745 -1054662018 -1340796360

# When you simulate one more random number, a new seed will then be used.

> runif(1)
[1] 0.6062263
> head(.Random.seed)
[1] 403 1 -1016206940 813281659 -1786346428 -363208600

# So, any given moment, you can save the current seed that will be used for the next simulation by


> my.seed <- .Random.seed
> runif(10)
 [1] 0.8797226 0.2143323 0.6826555 0.4366940 0.4330555 0.5745667 0.7406138 0.4646236 0.8014502 0.9818474

# This saved the current seed to be used for the next simulation in “my.seed”

# You can then restore the current seed and reproduce the above simulated numbers by

> .Random.seed <- t(my.seed)
> runif(10)
[1] 0.8797226 0.2143323 0.6826555 0.4366940 0.4330555 0.5745667 0.7406138 0.4646236 0.8014502 0.9818474


# See that the runif(10) here gave the same numbers as the one above.