Biological Sequence Analysis (1)

NHGRI started its 2012 lecture series, Current Topics in Genome Analysis, two weeks ago. More information is available on the series website, and the lectures can also be watched on YouTube. This week’s lecture is “Biological Sequence Analysis” by Andy Baxevanis, and my notes on the talk are summarized here. The talk covers biological sequence alignment and the tools and algorithms behind it, including BLAST. It is a good refresher if you have been away from BLAST for a while, and a good introduction for people who are new to genetics.

Here are the main points and rules of thumb to remember. When you do local sequence alignment, you will encounter several matrices for scoring sequence similarity.

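–          Several alignment scoring matrices exist, e.g. PAM250 and BLOSUM62. For the BLOSUM series, the number following the name is the percent identity at which the sequences used to build the matrix were clustered, so a lower-numbered matrix is the better choice for finding more distantly related sequences. For the PAM series the direction is reversed: the number measures evolutionary distance, so a higher-numbered matrix suits more distant relationships.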

–          Gaps: as a rule of thumb, a local alignment should contain about one gap for every 20 residues.
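To make the interplay between substitution scores and gap penalties concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. It is not what BLAST does internally (BLAST uses a faster heuristic), and the `match`, `mismatch` and `gap` values are arbitrary assumptions chosen for illustration; a real protein search would look each pair of residues up in a matrix such as BLOSUM62 and would usually use separate gap-open and gap-extend penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a simple match/mismatch score and a linear gap penalty.
    All three scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The second sequence is missing one T, so the best alignment needs one gap.
    print(smith_waterman_score("ACGTACGT", "ACGACGT"))
```

Making the gap penalty more negative pushes the algorithm toward shorter, ungapped alignments; making it less negative lets gaps in more freely, which is the trade-off the rule of thumb above is about.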

–          The results returned by BLAST are the hits that passed the scoring threshold. Passing the threshold does not by itself imply statistical significance; only some of the returned hits will be statistically significant.

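–          Statistical significance is assessed with the “Karlin-Altschul equation”, which gives a normalized measure as a function of the alignment score, the number of letters in the query, and the number of letters in the database (the product of the two lengths is the size of the search space). The resulting “E-value” is the number of hits with at least that score expected by chance alone, i.e. the expected number of false positives, and you want it to be as low as possible.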

  • Look for E < 10^-6 for nucleotide BLAST
  • Look for E < 10^-3 for protein BLAST
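The Karlin-Altschul equation is usually written E = K·m·n·e^(−λS), where S is the raw alignment score, m and n are the query and database lengths, and K and λ depend on the scoring system. The sketch below simply evaluates that formula. The default `lam` and `k` are values commonly quoted for gapped BLOSUM62 searches and are used here only as illustrative assumptions; the function name and the example lengths are made up, and BLAST itself also applies length corrections that are left out.

```python
import math


def expect_value(score, query_len, db_len, lam=0.267, k=0.041):
    """Evaluate E = K * m * n * exp(-lambda * S).

    lam and k are illustrative assumptions, not values read from BLAST.
    """
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Hypothetical search: a 300-residue query against 50 million residues.
    for s in (60, 100, 140):
        print(s, expect_value(s, 300, 50_000_000))
```

Two things are worth noticing: E falls off exponentially as the score rises, and it scales linearly with the size of the search space, so the same alignment gets a worse E-value against a larger database.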

–          For the human genome, RefSeq is a good starting database for BLAST. RefSeq provides a single reference sequence for each molecule of the central dogma (DNA, mRNA, protein). The database is non-redundant, curated, and updated to reflect current knowledge of sequence data and biology.

–          Options to consider changing

  • Expect threshold: set the E-value cutoff as suggested above.
  • Matrix: choose one that reflects how similar the sequences you want to find should be.
  • Filter: always filter out low-complexity regions, e.g. homopolymeric runs. These regions inflate the apparent significance of the results and produce more false positives.
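To show what filtering a homopolymeric run means in practice, here is a deliberately simplified sketch. BLAST's own filters (DUST for nucleotides, SEG for proteins) score local compositional complexity and are considerably more sophisticated; the function name, the `min_run` cutoff of 8 and the choice of `N` as the mask character are all assumptions made for this example.

```python
import re


def mask_homopolymers(seq, min_run=8, mask_char="N"):
    """Replace every run of min_run or more identical letters with mask_char.

    A toy stand-in for low-complexity filtering, not the DUST or SEG algorithm.
    """
    # (.) captures one letter; \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


if __name__ == "__main__":
    print(mask_homopolymers("ACGT" + "A" * 10 + "CGT"))
```

Masked positions cannot seed or extend a hit, so a long poly-A tail no longer matches every other poly-A tail in the database.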

–          Identities: for protein searches, look for at least 25% identity; for nucleotide searches, look for at least 75% identity.
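For reference, percent identity can be computed from a pair of aligned sequences as sketched below. This assumes the convention BLAST uses in its "Identities" line, where the denominator is the full alignment length including gap columns; the function name and the `-` gap character are choices made for this example.

```python
def percent_identity(aligned_a, aligned_b, gap_char="-"):
    """Percent of alignment columns where both sequences have the same letter.

    Both inputs must already be aligned (same length, gaps as gap_char).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap_char
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # 7 identical columns out of 9 (one gap column, one mismatch).
    print(round(percent_identity("ACGT-ACGT", "ACGTTACGA"), 1))
```

By the thresholds above, this toy nucleotide alignment (about 78% identity) would just clear the 75% bar.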

–         BLAT (BLAST-Like Alignment Tool) is the tool for locating an unknown sequence or gene in the genome, e.g. an exon, intron, promoter or uncharacterized region. It is much faster than BLAST and can find perfect matches as short as 33 bases. When you are working with sequence fragments or unknown genes, BLAT is a good first tool for finding where they sit in the genome. BLAT is available on the UCSC Genome Browser.