
GT

PHYLOGENETIC PROFILING


State of the problem 

http://bioinfo.cacs.louisiana.edu/profile/ 

A tool for building the phylogenetic profile of a gene against a set of archaeal, bacterial and eukaryotic genomes

Phylogenetic profiling creates a binary vector for a gene, with one entry per genome: 1 denotes the presence of the gene in that genome, and 0 its absence.
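A minimal sketch of that definition in Python. The genome names and the `build_profile` helper are illustrative assumptions and are not taken from the tool itself.

```python
from typing import Iterable, List, Sequence


def build_profile(genomes: Sequence[str], genomes_with_gene: Iterable[str]) -> List[int]:
    """Return 1 for each genome that contains the gene, 0 otherwise.

    The order of `genomes` fixes the order of the vector, so profiles of
    different genes can be compared position by position.
    """
    present = set(genomes_with_gene)
    return [1 if genome in present else 0 for genome in genomes]


# Hypothetical reference set: one archaeon, one bacterium, one eukaryote.
GENOMES = ["Methanocaldococcus jannaschii", "Escherichia coli", "Homo sapiens"]

profile = build_profile(GENOMES, {"Escherichia coli", "Homo sapiens"})
print(profile)  # [0, 1, 1]
```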


Proposed extension 

Rebuild the phylogenetic profiles using more genomes than were previously available

Given a protein sequence, show its profile “on the fly” and predict its function


Implementation

I focused on the second extension: processing a sequence “on the fly”

Using the NCBI BLASTp online tool, I aligned the sequence against the other genes in the system

I crawled the results page and kept only the good hits, those with query cover above 70%

The matched genes are identified by accession number or gi; from either id I looked up the taxonomy number

The taxonomy number represents the genome

Profiling: 1 for the taxonomies where a matching gene was found, 0 for all the others

I saved all the taxonomies and their mapping to a domain {EUK, ARCH, BACT} in a database

The new profile will not be saved in the database
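The steps on this slide can be sketched roughly as follows, assuming the BLAST hits have already been parsed into records. The `Hit` type, the table layout, the function names and the example hits are all hypothetical; the taxonomy numbers are only illustrative. As in the slides, the profile is computed in memory and not written back to the database.

```python
import sqlite3
from typing import Dict, Iterable, NamedTuple, Set, Tuple

MIN_QUERY_COVER = 70.0  # threshold from the slides: keep hits with cover > 70%


class Hit(NamedTuple):
    accession: str      # accession number or gi of the matched gene
    taxid: int          # taxonomy number standing in for the genome
    query_cover: float  # percent of the query covered by the alignment


def store_taxonomies(conn: sqlite3.Connection, rows: Iterable[Tuple[int, str]]) -> None:
    """Save the taxonomy number -> domain mapping (EUK, ARCH or BACT)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS taxonomy ("
        "taxid INTEGER PRIMARY KEY, domain TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO taxonomy VALUES (?, ?)", rows)
    conn.commit()


def build_profile(conn: sqlite3.Connection, hits: Iterable[Hit]) -> Dict[int, int]:
    """1 for every stored taxonomy with a good hit, 0 for all the others."""
    matched = {h.taxid for h in hits if h.query_cover > MIN_QUERY_COVER}
    rows = conn.execute("SELECT taxid FROM taxonomy ORDER BY taxid")
    return {taxid: int(taxid in matched) for (taxid,) in rows}


def domains_matched(conn: sqlite3.Connection, profile: Dict[int, int]) -> Set[str]:
    """Domains that contain at least one taxonomy marked 1 in the profile."""
    rows = conn.execute("SELECT taxid, domain FROM taxonomy")
    return {domain for taxid, domain in rows if profile.get(taxid) == 1}


conn = sqlite3.connect(":memory:")
store_taxonomies(conn, [(9606, "EUK"), (4932, "EUK"), (562, "BACT"), (2190, "ARCH")])

# Made-up hits for illustration; the third one fails the query cover filter.
hits = [
    Hit("hypothetical_1", 9606, 95.0),
    Hit("hypothetical_2", 562, 71.5),
    Hit("hypothetical_3", 2190, 40.0),
]

profile = build_profile(conn, hits)
print(profile)                                 # {562: 1, 2190: 0, 4932: 0, 9606: 1}
print(sorted(domains_matched(conn, profile)))  # ['BACT', 'EUK']
```

Keying the profile by taxonomy number keeps it aligned with the stored table, so adding genomes later only means adding rows.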


Demo with a Giardia gene as input


RESULT

This Giardia gene matches genes from all three domains


To be continued 

Improve speed by running BLAST locally instead of using the online tool

Functional prediction: the function might be assumed identical to that of the best-matching gene

Rebuild the previous phylogenetic profiles using more genomes and local BLAST
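One possible shape for the local BLAST and best-hit steps, as a sketch only. It assumes BLAST+ is installed and that the local protein database carries taxonomy information, so that the `staxids` column is filled in. The helper names and the sample rows are hypothetical, and "best" is taken here to mean the highest bit score among hits that pass the 70% query cover filter, which is one reading of the plan above.

```python
from typing import List, NamedTuple, Optional

# Tabular output with a custom column list (BLAST+ -outfmt 6).
OUTFMT = "6 sacc staxids qcovs bitscore stitle"
MIN_QUERY_COVER = 70.0


def blastp_command(query_fasta: str, db: str) -> List[str]:
    """Command line for a local blastp run; both paths are supplied by the caller."""
    return ["blastp", "-query", query_fasta, "-db", db, "-outfmt", OUTFMT]


class Hit(NamedTuple):
    accession: str
    taxids: List[int]
    query_cover: float
    bitscore: float
    title: str


def parse_hits(tabular: str) -> List[Hit]:
    """Parse tab-separated BLAST output produced with OUTFMT."""
    hits = []
    for line in tabular.splitlines():
        fields = line.split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        accession, taxids, cover, bits, title = fields
        # staxids can hold several ids separated by ';', or 'N/A' if unknown.
        ids = [int(t) for t in taxids.split(";") if t.isdigit()]
        hits.append(Hit(accession, ids, float(cover), float(bits), title))
    return hits


def predict_function(hits: List[Hit]) -> Optional[str]:
    """Assume the query shares the function of its best-scoring good hit."""
    good = [h for h in hits if h.query_cover > MIN_QUERY_COVER]
    if not good:
        return None
    return max(good, key=lambda h: h.bitscore).title


# Made-up output rows for illustration only.
SAMPLE = (
    "HYP_0001\t9606\t98\t250.0\thypothetical kinase A\n"
    "HYP_0002\t562;83333\t85\t310.5\thypothetical kinase B\n"
    "HYP_0003\t2190\t40\t400.0\thypothetical short match\n"
)

hits = parse_hits(SAMPLE)
print(predict_function(hits))  # hypothetical kinase B
```

In a real run, the text passed to `parse_hits` would come from something like `subprocess.run(blastp_command("query.fasta", "mydb"), capture_output=True, text=True, check=True).stdout`, where both arguments are placeholders.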


Questions?

