Import data from Excel to R using gdata library


There are several ways to import data into R.

The standard way, what it used to be, is from a text file using read.table() function.

For excel files, the most famous spreadsheet software on the world, several libraries can be used to import data from .xls file, for example

 RODBC 

. In the past, the problem was with the xlsx file, which was not supported yet.

Recently, I discovered that

 gdata 

can be used to import xlsx file now. So, this bypass the step that I normally have to save the excel file to text file and do the regular file import.

Here’s how:

library(gdata)
data <- read.xls(xls="myData.xlsx",sheet=1,header=TRUE, as.is=TRUE)
Advertisements

Transpose Table Sideway


I’ve come across a problem needing to transpose to wide table into a long format. I’m not talking about the longitudinal data quite yet, the one where you have one individual getting multiple measurements over time.

The question then is get a lot simpler than having to manipulate longitudinal data, which you can do with

library(reshape)

in R. See:

 

?melt
?cast

 

 

Recently, the

library(data.table)

has come into my rescue. With fread function reading in large data frame (or data table) has become much faster. Therefore, base on the simple fread and write.table. here comes the transpose function. You can get the script from my short script transposeR.r github Genetics Library (which has just recently been updated).

Rscript transposeR.r data_1.txt data_2.txt

You can also use wildcard.

Rscript transposeR.r data_?.txt

I mostly tested it on mac, if your windows machine doesn’t play with ls command then, the script might not work with multiple file wildcard.

UCSC genome browser human gene location?


To download UCSC RefSeq Gene info with HUGO gene name, start and stop codon or transcription start-end site, locate the refflat.txt.gz file on the annotation database
– Hg18: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refFlat.txt.gz
– Hg19: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz
The reFlat.txt.gz contains the following columns as described in the reFlat.sql file
– 1) Hugo gene name
– 2) chromosome
– 3) strand (+/-)
– 4) Transcription start position
– 5) Transcription end position
– 6) Coding region start
– 7) Coding region end
– 8) Number of exons
– 9) Exon start positions
– 10) Exon end positions

These info should come in handy when you have to map the location of genetic marker to genes. I post the script I wrote to do this mapping latter if there is anyone interested.

Convert plink format to vcf


So, I recently got to work with vcf file format for the first time. It is been quite a mess to start with, but after all the helps I can gather around here (Thank you very much, YK & HS !!!) I finally came down to a solution.

You probably have heard about “plink”, “plink/seq”, and “vcftools”. Among the three programs, “plink” has been around the longest, and one of the reason may be because it is versatile, has good documentation, and quite easy to use. Although plink/seq was inspired by plink, its documentation is still during development (I tried to be optimistic here).

For this short tutorial, I am going to use PLINK/BED file as a medium for file conversion, since it is the fastest file to manipulate and the most efficient to use in terms of disc space and memory requirement (from my experience).

“So, my question is how do I convert my GWAS data to vcf format, with a specific reference allele that I want to use i.e. not the default ‘major allele’ as a reference as we typically do”

So, I now have PLINK/Binary format file call csOmni25 and a file containing reference allele csOmni25.refAllele

As a short note: csOmni25.refAllele looks like this


-------- csOmni25.refAllele ----------------

rs12345 A

rs12958 G

rs10596 C

rs18607 T

...

...

---------end csOmni25.refAllele--------------

  1. First, we will convert PLINK/Binary format file so that A1 [reference allele] correspond to the reference allele that we want
    NOTE: when you create the reference allele file, make sure that all reference alleles are in UPPER CASE.
  2. Second, we will import plink/bed to plink/seq and write out vcf format file

Here’s the script that I use. It’s that simple. And for your reference, it took me a week to figure this out, and tested it.


#!/bin/sh

##-- SCRIPT PARAMETER TO MODIFY--##

PLINKFILE=csOmni25

REF_ALLELE_FILE=csOmni25.refAllele

NEWPLINKFILE=csOmni25Ref

PLINKSEQ_PROJECT=csGWAS

## ------END SCRIPT PARAMETER------ ##

#1. convert plink/binary to have the specify reference allele

plink --noweb --bfile $PLINKFILE --reference-allele $REF_ALLELE_FILE --make-bed --out $NEWPLINKFILE

#2. create plink/seq project

pseq $PLINKSEQ_PROJECT new-project

#3. load plink file into plink/seq

pseq $PLINKSEQ_PROJECT load-plink --file $NEWPLINKFILE --id $NEWPLINKFILE

#4. write out vcf file, as of today 4/6/2012  using vcftools version 0.1.8, although the documentation says that you can write out a compressed vcf format using --format BGZF option, vcftools doesn't recognize what this option is. So, I invented my own solution

pseq $PLINKSEQ_PROJECT write-vcf | gzip > $NEWPLINKFILE.vcf.gz

At the end, this will create a compressed vcf file “csOmni25Ref.vcf.gz” with the specified reference alleles.

R FAQ: How can I format a string containing a date into R “Date” object??


This sounds like a problem that statisticians occasionally have to deal with. It is quite simple, just convert the string to date or time if you remember what command to use. If your date variable has a value like this 2007-05-31 for May 31, 2007. You can simply use the example below.


as.Date(DATE_VARIABLE,format='%Y-%m-%d')

For more example on date, I refer you to UCLA R-help page:

R FAQ: How can I format a string containing a date into R “Date” object??.

For a complete list of date-time formatting conversion see this


?strptime

Calculating heritability using SOLAR


As a not to myself, I occasionally have to do this occasionally. This time, simply preparing file for calculation of heritability in SOLAR. This is based on a variance-component analysis, comparing two likelihoods from two model fitting polygenic model and sporadic model.

I’m going to post the script here with all the comments about how to prepare files needed for SOLAR to do the job.


#!/bin/sh
#Date: 03/11/2012
#By: Bhoom Suktitipat, MD, PhD
#Goal: Prepare file with phenotype for solar heritability analysis
#Current location: ~/Documents/GeneSTAR/CAC/solar
INPUTFILE=cacforsolar.csv

# ------ EXAMPLE OF cacforsolar.csv ------- #
# Family,ID,Mother,Father,Sex,cacbres
# 255,25598,,,f,
# 255,25599,,,m,
# 255,2550001,25598,25599,m,
# 255,2550101,25598,25599,f,
# ...

## -------------------------- ##
## -- FILE FORMAT OUTPUT: csv
## -- MISSING DATA: can be blank
## 1) Preparing pedigree file
##    pedfile require these columns
##    id, fa, mo, sex, famid
##    - need to have both parents in the data, or none
##    - sex can be coded as m/f, M/F, 1/2
##    - famid field is optional; only required when id is not unique
echo "famid,id,fa,mo,sex" > cacbres.ped
tail -n +2 $INPUTFILE | awk -F"," '{print $1","$2","$4","$3","$5}' >> cacbres.ped
## 2) Preparing phenotype file
##    phenotype file require these columns
##    id, PHE1, PHE2, PHE3, famid, ...
##    -- famid is optional as in pedfile.
echo "famid,id,cacbres" > cacbres.phe
tail -n +2 $INPUTFILE | awk -F"," '{print $1","$2","$6}' >> cacbres.phe
## 3) Marker file (optional)
##    Field: id, genotype1, genotype2, ...
##    -- MISSING GENOTPE: 0/251, -/251, /251
##    -- Allele delimiter: AB; A B; A/B all accepted
## 4) Map file (optional) --## space-delimited fields!!!
##    -- FIRST LINE: chromosome number, and optional name of the map
##       Ex: 20 Haldane
##       - chromosome name can be string e.g. 20q14
##    -- MARKER LINE: marker name, marker location in (cM)
##       - must be ascending order
##       Ex: D20S101 0.0
## 5) SOLAR SCRIPT for heritability analysis
## Modeling example: http://bioweb2.pasteur.fr/docs/solar/06.chapter.html
cat <<EOF > cacbres.solarscript
load pedigree cacbres.ped
load phenotypes cacbres.phe
trait cacbres
polygenic
EOF
solar <cacbres.solarscript > cacbres.solarout

Print from here to there with “awk”


This does sound like a common thing to do.  You have a length text file that you only want to get some part of it. For example, I have a file that contain a structure like this

HEADER
BODY
++++++++++++++++++++++++++++
CONTENT I WANT TO GET

END OF FILE

Here, the part I want to grab is between the line with “++++++++++++++” and the blank line.


awk '/\+\+/,/^$/' INFILE

With this small awk trick, you request that awk  print the +++ line to the blank line to your terminal.

Now, you just have to remove the +++++++++ and the blank line. I do this with “Stream EDitor” i.e. sed. So the complete lines become something like this…


awk '/\+\+/,/^$/' INFILE | sed '/\+\+/d;/^$/d'

This can really be applied to extract some part of file with tags such as “XML” file. However, it is probably the a very efficient way to parse XML file manually one tag at a time. In R, you can do this more efficiently, using RSXML [http://www.omegahat.org/RSXML/]. And, if you are interacting with a website, you can easily combining it with RCurl [http://www.omegahat.org/RCurl/]