As the doctor gone rogue

January 31, 2017

Import data from Excel to R using gdata library

Filed under: data management, R — hypotheses @ 9:00 pm

There are several ways to import data into R.

The standard way, at least historically, is to read from a text file using the read.table() function.

For Excel, the most popular spreadsheet software in the world, several libraries can be used to import data from .xls files, for example RODBC. In the past, the problem was with .xlsx files, which were not yet supported.

Recently, I discovered that gdata can now be used to import .xlsx files as well. This bypasses the step where I normally had to save the Excel file as a text file and do a regular text import.

Here’s how:

library(gdata)
data <- read.xls(xls="myData.xlsx",sheet=1,header=TRUE, as.is=TRUE)

November 1, 2015

Transpose Table Sideway

Filed under: data management, R — hypotheses @ 2:51 am

I’ve come across a problem needing to transpose a wide table into a long format. I’m not talking about longitudinal data quite yet (the kind where one individual gets multiple measurements over time).

The problem here is a lot simpler than manipulating longitudinal data, which you can do with

library(reshape)

in R. See:

?melt
?cast

Recently, the

library(data.table)

has come to my rescue. With the fread function, reading in a large data frame (or data table) has become much faster. So, based on a simple fread and write.table, here comes the transpose function. You can get the transposeR.r script from my github Genetics Library (which has just recently been updated).

Rscript transposeR.r data_1.txt data_2.txt

You can also use a wildcard.

Rscript transposeR.r data_?.txt

I mostly tested it on a Mac; if your Windows machine doesn’t play well with the ls command, the script might not work with a multiple-file wildcard.
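For a flavor of what the script does, the core row-column transpose can be sketched in plain awk (a simplified, hypothetical stand-in for transposeR.r; it assumes whitespace-delimited input small enough to hold in memory):

```shell
# Transpose rows and columns with plain awk.
printf 'id s1 s2\na 1 2\nb 3 4\n' |
awk '
{ for (i = 1; i <= NF; i++) cell[i, NR] = $i
  nf = (NF > nf ? NF : nf) }          # remember the widest row
END {
  for (i = 1; i <= nf; i++) {         # old columns become new rows
    row = cell[i, 1]
    for (j = 2; j <= NR; j++) row = row " " cell[i, j]
    print row
  }
}'
# prints:
# id a b
# s1 1 3
# s2 2 4
```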

July 2, 2012

UCSC genome browser human gene location?

Filed under: data management, genetics — hypotheses @ 4:16 pm

To download the UCSC RefSeq gene info with HUGO gene names, start and stop codons, and transcription start-end sites, locate the refFlat.txt.gz file in the annotation database:
– Hg18: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refFlat.txt.gz
– Hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz
The refFlat.txt.gz file contains the following columns, as described in the refFlat.sql file:
– 1) HUGO gene name
– 2) RefSeq transcript accession
– 3) Chromosome
– 4) Strand (+/-)
– 5) Transcription start position
– 6) Transcription end position
– 7) Coding region start
– 8) Coding region end
– 9) Number of exons
– 10) Exon start positions
– 11) Exon end positions

This info should come in handy when you have to map the locations of genetic markers to genes. I will post the script I wrote to do this mapping later if anyone is interested.
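As a quick sketch of that mapping step, awk can pull the gene name, chromosome, and transcription start/end out of a refFlat-style file. Note that in the actual refFlat.txt the second column is the RefSeq transcript accession, so chromosome is field 3 and the transcription start/end are fields 5 and 6. The sample rows and gene names below are made up for illustration:

```shell
# Extract name, chromosome, txStart, txEnd (fields 1, 3, 5, 6) for one gene.
printf 'GENE1\tNM_000001\tchr1\t+\t1000\t5000\t1200\t4800\t3\t1000,2000,4000,\t1500,2500,5000,\nGENE2\tNM_000002\tchr2\t-\t9000\t12000\t9100\t11900\t2\t9000,11000,\t9500,12000,\n' |
awk -F'\t' -v g="GENE2" '$1 == g { print $1, $3, $5, $6 }'
# prints: GENE2 chr2 9000 12000
```

On the real file, replace the printf with zcat refFlat.txt.gz.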

April 6, 2012

Convert plink format to vcf

Filed under: data management, plink, plinkseq, vcf, vcftools — hypotheses @ 12:20 pm

So, I recently got to work with the VCF file format for the first time. It has been quite a mess to start with, but after all the help I could gather (thank you very much, YK & HS!!!), I finally came to a solution.

You have probably heard about “plink”, “plink/seq”, and “vcftools”. Among the three programs, “plink” has been around the longest, and one of the reasons may be that it is versatile, has good documentation, and is quite easy to use. Although plink/seq was inspired by plink, its documentation is still under development (I am trying to be optimistic here).

For this short tutorial, I am going to use the PLINK/BED file as the medium for file conversion, since it is the fastest to manipulate and the most efficient in terms of disk space and memory requirements (in my experience).

“So, my question is: how do I convert my GWAS data to VCF format with a specific reference allele that I want to use, i.e. not the default ‘major allele’ reference as we typically do?”

So, I now have a PLINK binary file set called csOmni25 and a file containing reference alleles, csOmni25.refAllele.

As a short note, csOmni25.refAllele looks like this:


-------- csOmni25.refAllele ----------------

rs12345 A

rs12958 G

rs10596 C

rs18607 T

...

...

---------end csOmni25.refAllele--------------

  1. First, we will convert the PLINK binary file so that A1 [the reference allele] corresponds to the reference allele that we want.
    NOTE: when you create the reference allele file, make sure that all reference alleles are in UPPER CASE.
  2. Second, we will import the PLINK/BED file into plink/seq and write out a VCF file.

Here’s the script that I use. It’s that simple. And for your reference, it took me a week to figure this out and test it.


#!/bin/sh

##-- SCRIPT PARAMETER TO MODIFY--##

PLINKFILE=csOmni25

REF_ALLELE_FILE=csOmni25.refAllele

NEWPLINKFILE=csOmni25Ref

PLINKSEQ_PROJECT=csGWAS

## ------END SCRIPT PARAMETER------ ##

#1. convert plink/binary to have the specify reference allele

plink --noweb --bfile $PLINKFILE --reference-allele $REF_ALLELE_FILE --make-bed --out $NEWPLINKFILE

#2. create plink/seq project

pseq $PLINKSEQ_PROJECT new-project

#3. load plink file into plink/seq

pseq $PLINKSEQ_PROJECT load-plink --file $NEWPLINKFILE --id $NEWPLINKFILE

#4. write out vcf file
#   As of today (4/6/2012), using vcftools version 0.1.8: although the
#   documentation says you can write out a compressed vcf format using
#   the --format BGZF option, vcftools doesn't recognize the option.
#   So, I invented my own solution:

pseq $PLINKSEQ_PROJECT write-vcf | gzip > $NEWPLINKFILE.vcf.gz

In the end, this will create a compressed VCF file “csOmni25Ref.vcf.gz” with the specified reference alleles.

April 5, 2012

R FAQ: How can I format a string containing a date into an R “Date” object?

Filed under: data management, R — hypotheses @ 3:39 pm

This sounds like a problem that statisticians occasionally have to deal with. It is quite simple: just convert the string to a date or time, if you remember which command to use. If your date variable has a value like 2007-05-31 for May 31, 2007, you can simply use the example below.


as.Date(DATE_VARIABLE,format='%Y-%m-%d')

For more examples on dates, I refer you to the UCLA R-help page:

R FAQ: How can I format a string containing a date into R “Date” object??.

For a complete list of date-time format conversions, see:


?strptime
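As an aside, the same conversion codes work outside R as well; GNU date (assuming GNU coreutils) can parse a date string and print it back in %Y-%m-%d form:

```shell
# Parse "May 31, 2007" and reformat it as ISO %Y-%m-%d (GNU date only).
date -d 'May 31, 2007' '+%Y-%m-%d'
# prints: 2007-05-31
```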

March 11, 2012

Calculating heritability using SOLAR

Filed under: bash, data management, genetics, SOLAR — hypotheses @ 12:12 pm

As a note to myself, since I occasionally have to do this: this time, I am simply preparing files for the calculation of heritability in SOLAR. This is based on a variance-components analysis, comparing the likelihoods from fitting two models: a polygenic model and a sporadic model.

I’m going to post the script here with all the comments about how to prepare files needed for SOLAR to do the job.


#!/bin/sh
#Date: 03/11/2012
#By: Bhoom Suktitipat, MD, PhD
#Goal: Prepare file with phenotype for solar heritability analysis
#Current location: ~/Documents/GeneSTAR/CAC/solar
INPUTFILE=cacforsolar.csv

# ------ EXAMPLE OF cacforsolar.csv ------- #
# Family,ID,Mother,Father,Sex,cacbres
# 255,25598,,,f,
# 255,25599,,,m,
# 255,2550001,25598,25599,m,
# 255,2550101,25598,25599,f,
# ...

## -------------------------- ##
## -- FILE FORMAT OUTPUT: csv
## -- MISSING DATA: can be blank
## 1) Preparing pedigree file
##    pedfile require these columns
##    id, fa, mo, sex, famid
##    - need to have both parents in the data, or none
##    - sex can be coded as m/f, M/F, 1/2
##    - famid field is optional; only required when id is not unique
echo "famid,id,fa,mo,sex" > cacbres.ped
tail -n +2 $INPUTFILE | awk -F"," '{print $1","$2","$4","$3","$5}' >> cacbres.ped
## 2) Preparing phenotype file
##    phenotype file require these columns
##    id, PHE1, PHE2, PHE3, famid, ...
##    -- famid is optional as in pedfile.
echo "famid,id,cacbres" > cacbres.phe
tail -n +2 $INPUTFILE | awk -F"," '{print $1","$2","$6}' >> cacbres.phe
## 3) Marker file (optional)
##    Field: id, genotype1, genotype2, ...
##    -- MISSING GENOTYPE: 0/251, -/251, /251
##    -- Allele delimiter: AB; A B; A/B all accepted
## 4) Map file (optional) --## space-delimited fields!!!
##    -- FIRST LINE: chromosome number, and optional name of the map
##       Ex: 20 Haldane
##       - chromosome name can be string e.g. 20q14
##    -- MARKER LINE: marker name, marker location in (cM)
##       - must be ascending order
##       Ex: D20S101 0.0
## 5) SOLAR SCRIPT for heritability analysis
## Modeling example: http://bioweb2.pasteur.fr/docs/solar/06.chapter.html
cat <<EOF > cacbres.solarscript
load pedigree cacbres.ped
load phenotypes cacbres.phe
trait cacbres
polygenic
EOF
solar <cacbres.solarscript > cacbres.solarout

September 29, 2011

Print from here to there with “awk”

Filed under: bash, data management, R — hypotheses @ 12:16 pm

This does sound like a common thing to do. You have a lengthy text file and you only want to get some part of it. For example, I have a file that contains a structure like this:

HEADER
BODY
++++++++++++++++++++++++++++
CONTENT I WANT TO GET

END OF FILE

Here, the part I want to grab is between the line with “++++++++++++++” and the blank line.


awk '/\+\+/,/^$/' INFILE

With this small awk trick, you ask awk to print everything from the +++ line through the blank line to your terminal.

Now, you just have to remove the +++++++++ line and the blank line. I do this with the “Stream EDitor”, i.e. sed. So the complete command becomes something like this…


awk '/\+\+/,/^$/' INFILE | sed '/\+\+/d;/^$/d'
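Here is the whole pipeline run end-to-end on input shaped like the example above:

```shell
# Keep only the content between the ++++ marker line and the blank line.
printf 'HEADER\nBODY\n++++\nCONTENT I WANT TO GET\n\nEND OF FILE\n' |
awk '/\+\+/,/^$/' | sed '/\+\+/d;/^$/d'
# prints: CONTENT I WANT TO GET
```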

This can really be applied to extract parts of files with tags, such as XML files. However, it is probably not a very efficient way to parse an XML file, manually handling one tag at a time. In R, you can do this more efficiently using RSXML [http://www.omegahat.org/RSXML/]. And if you are interacting with a website, you can easily combine it with RCurl [http://www.omegahat.org/RCurl/].

August 17, 2011

Convert Column to Table in Excel

Filed under: data management, excel — hypotheses @ 1:47 pm

Copy and paste is pretty simple and easy to do, but it is sometimes more error-prone than writing a function to do it (assuming the function does what it is supposed to do).

I recently extracted a column of data that I wanted to convert into a table (matrix) for comparison across replicates. In STATA or R, this task is called reshape. Since I had exported the data into our favorite spreadsheet software, I wanted to do this within Excel. A bit of googling landed me on this website http://www.cpearson.com/excel/ColumnToTable.aspx; the article is too long and the description too short, but at least it pointed me to “OFFSET”.

What you need is this formula

=OFFSET(REF_CELL, p*COL_AWAY_FROM_REFERENCE+ROW_AWAY_FROM_REFERENCE,0,1,1)

1. Reference cell: REF_CELL

2. How many rows your target cell is from 1): p*COL_AWAY_FROM_REFERENCE+ROW_AWAY_FROM_REFERENCE

3. How many columns your target cell is from 1): 0

4. The size of the target region starting from the target cell: height,width = 1,1

In this case, you want to reshape a column of dimension Nx1 into a p-row by n-column table (the first p rows of the original column become the first column of the final table).

By pasting this formula into the column next to the original data, you will get a second column that starts with row p+1 of the original data column.

For example, say your original data is in $B$2:$B$100, containing items 1 to 9 (i.e. p = 9, of 11 repeated measurements). You want to cut rows 11-19 and put them right next to B2:B10. Instead, put =OFFSET($B$2,9*(COLUMN()-COLUMN($B$2))+(ROW()-ROW($B$2)),0,1,1) in cell C2, copy it across C2:L2, then copy those down to cover C10:L10. This will give you a 9-row by 11-column table.
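Outside Excel, the same column-to-table reshape can be sketched in awk (a hypothetical equivalent of the OFFSET trick, shown here with p = 3 and twelve values):

```shell
# Reshape a single column into a p-row table: row i holds items i, i+p, i+2p, ...
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 |
awk -v p=3 '
{ v[NR] = $0 }                        # read the whole column
END {
  n = NR / p                          # number of resulting columns
  for (i = 1; i <= p; i++) {
    row = v[i]
    for (j = 1; j < n; j++) row = row " " v[i + j * p]
    print row
  }
}'
# prints:
# 1 4 7 10
# 2 5 8 11
# 3 6 9 12
```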

April 22, 2011

Split Whole-Genome PLINK Binary Files to Small Chunks

Filed under: bash, data management, genetics, plink, Uncategorized — hypotheses @ 1:03 pm

I prefer to store my GWAS data in PLINK binary format. The PLINK binary file format gives us many advantages. With tons of data, a binary file format is a very efficient way to save disk space. To give you some numbers, you can store the data of 2000 people from the Illumina Human 1M chip, with ~1 million markers, on 1 CD (~700MB). I also prefer to keep all the genetic markers in the same file. This is convenient when you want to extract a list of SNPs from several chromosomes: you can just go to this one file and extract them.
However, keeping everything in one single file is not always a good solution to every problem. For most genome-wide association analyses, we can speed up the analysis using parallelization. The simplest way to do this is to split your data into several smaller chunks and submit each part for analysis on several computers (preferably a computing cluster). Although this is so-called "poor man's" parallelization, it gets the job done in no time.
For this purpose, I wrote a script that takes a single PLINK binary file, splits it by chromosome (into a by_chr directory), and then splits each chromosome into smaller chunks of your chosen size (I used 5000 for the example below).

#!/bin/bash
## generate PLINK script to split whole genome PLINK BED TO smaller chunks
set -e
#set -o verbose
split_by_chromosome=0
prefix=$1
chunk=$2
postfix=$3
if [ -z "$(type -p plink)" ];then
echo "Cannot find PLINK. Make sure PLINK is in your \$PATH";
exit 1
fi
usage() {
echo "Usage: ./splitPlink.sh [prefix_plink_input] [chunk_size] [prefix_plink_output]"
echo "Example: sh splitPlink.sh /research/imp_data/shared/washuDataReformatted/blackR21 10000 blackSR21"
}
if [ "${prefix}" == "" ]; then
usage
echo "Exit: no input file"
exit 1
fi
if [ "${chunk}" == "" ]; then
usage
echo "Exit: no chunk size"
exit 1
fi
if [ "${postfix}" == "" ]; then
usage
echo "Exit: need prefix of output"
exit 1
fi
echo "File ${prefix} by ${chunk} SNPs chunk"
echo "Splitting PLINK ${prefix} into files of ${chunk} SNPs"
if [ ! -e ${prefix}.bim ];then
echo "File ${prefix}.bim not found"
exit 1
fi
chrs=`awk '{print $1}' ${prefix}.bim | sort -nur`
echo "Chromosomes in Map File: ${chrs}" | tr "\n" " "
echo ""

## 0. Create script directory if not exist
if [ ! -d by_chr ]; then mkdir -p by_chr; fi
if [ ! -d script ]; then mkdir -p script; fi

# 1. Split by chromosome first
for chr in ${chrs}; do
cat <<EOF > script/plinkSplit.chr${chr}.sh
#!/bin/sh
# extracting chromosome $chr from ${prefix}
plink --noweb --bfile ${prefix} --chr ${chr} --make-bed --out by_chr/${postfix}_chr${chr}
EOF
if [ $split_by_chromosome -eq 1 ]; then
echo "Processing map of chromosome $chr"
grep -w "^$chr" ${prefix}.bim > by_chr/${postfix}_chr${chr}.bim
fi
sh script/plinkSplit.chr${chr}.sh
done

## 2. Split chromosome to a chunk of ${chunk} SNPs
for chr in $chrs;
do
if [ ! -d c${chr} ]; then mkdir -p c${chr}; fi
mapfile=by_chr/${postfix}_chr${chr}.bim
## wait until the split-by-chromosome .bed file exists before submitting the following jobs
## default wait_duration = 5 (number of 1-minute waits; increase the value below for a longer wait)
wait_duration=5
count=1
while [ "$count" -le "${wait_duration}" ];
do
if [ ! -e by_chr/${postfix}_chr${chr}.bed ]; then
echo -n -e "\r                                                   \r$count. Waiting for file by_chr/${postfix}_chr${chr}.bed"
let "count += 1"
sleep 60
else
let "count = wait_duration + 1"
n_end=`wc -l by_chr/${postfix}_chr${chr}.bim | awk '{print $1}'`
echo "Chromosome $chr has $n_end SNPs"
snp_start=1
let "snp_end = snp_start + ${chunk} -1"
if [ $snp_end -gt ${n_end} ];then snp_end=${n_end};fi
pc=1
echo "Extracting chromosome $chr from $snp_start to ${snp_end}"
while [ "$snp_start" -le "${n_end}" ];
do
first_snp=`tail -n +${snp_start} ${mapfile} | head -n 1| awk '{print $2}'`
last_snp=`tail -n +${snp_end} ${mapfile} | head -n 1 | awk '{print $2}'`
echo "First SNP $first_snp Last SNP $last_snp"
cat <<EOF > script/plinkSplit.chr${chr}.pc${pc}.sh
#!/bin/sh
echo "Extracting chromosome $chr from $snp_start to ${snp_end}"
plink --noweb --bfile by_chr/${postfix}_chr${chr} \
--from $first_snp --to $last_snp --make-bed \
--out c${chr}/${postfix}_chr${chr}_p${pc}
exit 0
EOF
sh script/plinkSplit.chr${chr}.pc${pc}.sh
let "pc += 1"
let "snp_start += ${chunk}"
let "snp_end += ${chunk}"
if [ $snp_end -gt ${n_end} ];then snp_end=${n_end};fi ## clamp the last chunk to the final SNP
done ## while loop over all SNP chunks in this chromosome
fi ## if checking whether chromosome has been extracted or not
done ## while loop waiting for split by chromosome
## Terminate if PLINK has not finished extracting file
if [ ! -e by_chr/${postfix}_chr${chr}.bed ]; then
echo "Timeover: PLINK failed to extract by_chr/${postfix}_chr${chr}.bed"
exit 1
fi
done ## for loop all chromosomes
exit 0

So, this script will first split your file into PLINK binary files by chromosome (in by_chr). Then, it will split each chromosome into smaller files. The by_chr files may not be very useful for parallelization themselves, but believe me, there are times when you will get to use them.

March 2, 2011

Using cut to extract specific columns from a fixed-width format file

Filed under: bash, data management — hypotheses @ 12:07 pm

Occasionally, I have to deal with a text file in a fixed-width format. In Linux, you can extract specific columns easily using cut:

cut -b 1-10,15-20 < infile

This will give you columns 1-10 and 15-20 of your “infile”.

An additional option that is nice when you want to get rid of a few columns and keep the rest is --complement (although from what I have heard, some systems might not have it implemented).

cut --complement -b 11,14,32,43-47,58-62,73-77,88-92,103-108 < infile

Another use of cut is to extract columns from any delimited file, with a delimiter such as “,” or a space.

Add the option -d"_your_delim_" and use -f (fields) instead of -b (bytes) in the examples above, and you can extract from your infile.csv or infile.txt as well.
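A quick demonstration of both modes, on made-up input:

```shell
# Fixed width: bytes 1-10 hold an ID, bytes 15-20 a value.
printf 'ABCDEFGHIJxxxx123456\n' | cut -b 1-10,15-20
# prints: ABCDEFGHIJ123456

# Delimited: take the 2nd comma-separated field.
printf 'a,b,c\n' | cut -d',' -f2
# prints: b
```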

 
