


L.L Gatlin

Do you HYPHY with (Data) Monkey!!

Perl one-liners for Bioinformaticians

Public Service Ad sponsored by IQLBioinformatics


December 2015


Topics

Editorial

Tools
Roary: Analysis of Prokaryote Pan Genome on a large-scale
Do you HYPHY with (Data) Monkey !!
Disulphide Connectivity in Protein Tertiary Structure Prediction
T-Coffee: A tool that combines both local and global alignments

Programming
Perl one-liners for bioinformaticians
TIN: R package to analyze Transcriptome Instability

Meta Analysis
Venice Criteria: Overview

Systems Biology
Two Components System: Potential Drug Target in Mycobacterium tuberculosis

Genomics
Mycobacteriophages & their potentials as source against Mycobacterial active molecules


EDITORIAL
EXECUTIVE EDITOR: FOZAIL AHMAD
FOUNDING EDITOR: MUNIBA FAIZA
SECTION EDITORS: ALTAF ABDUL KALAM, MANISH KUMAR MISHRA, SANJAY KUMAR, PRAKASH JHA, NABAJIT DAS

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send e-mail requests to [...]. Please include contact details in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from [...] at $5 per issue. Back issues in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE: +91. 991 1942-428 / 852 7572-667
MAIL: Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS
To contact any Bioinformatics Review staff member, simply format the address as [...]

PUBLICATION INFORMATION
Volume 1, Number 1. Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) Trust (registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under license by SEWA Trust. Published in India.


Pursuing PhD in Sciences?

All the low-hanging fruits in the sciences have been plucked, and it is time to review the education system. It is about time to instil certain important and fundamental questions in young minds so that they can take an informed decision regarding their career.

Dr. Prashant Pant Editor-in-Chief

Very recently, an article by Mark Taylor titled “Reform the PhD system or close it down” appeared in Nature. It emphasised that most doctoral programmes simply keep producing PhDs, who are then poorly absorbed by universities, institutions and the corporate world because of deficiencies in the system and/or in the degree itself. The article also talks about the medieval nature of most doctoral programmes, which has made them irrelevant and unsustainable as growing numbers of PhDs are churned out by universities all over the world.

Two questions come to mind. First, why did this happen? Second, where did things go wrong, and who is to blame? The probable answer to the first question lies in the opening statement of this editorial: this was imminent. The second question is more intriguing and needs discussion. Most doctoral programmes are designed to train students to perform research and analyses through a stereotyped mechanism. That is not wrong in itself, but it makes scholars look at a PhD as a lucrative way to get “Doctor” prefixed to their name without putting much pressure on their grey matter.

PhDs are not about filling pages under five chapters after a couple of years. PhDs are not about stereotyped work done by scholar after scholar in a laboratory to fulfil the mentor’s desire to become a self-declared expert on a topic. PhDs are (and should be) about questions, and genuine ones at that. They are (and should be) about beautiful experimental designs that attack the question from all corners and try to answer it. They are about thinking what, why and how something happens, and how that piece of information can be taken further to serve another research question.

Our education system will wake up one day and introduce revolutionary modifications, crashing the dreams of many PhD aspirants. We should not wait for that day; rather, we should prepare our young minds to start thinking. So the question you should be asking yourself during your graduation or post-graduation is: “Do you have it in you to do research and earn a PhD degree?” If you don’t ask, you are going to repent heavily.

If you question everything around you and can work relentlessly to try and answer a question, then PhDs are for you. If you can connect radically different aspects and weave them together into simpler forms, then PhDs are for you. Science can take you to any part of the world, good or bad; if you are ready for that, PhDs are for you. If you are ready to explore again and again, PhDs are for you. If you can give up other things in life for the sake of the questions, PhDs are for you. One should always remember that it is a (doctoral) degree in philosophy and not in sciences, and therefore the question (that you ask) and the meaning/interpretation of the answer are more important than plain science. So, start thinking!!


Roary: Analysis of Prokaryote Pan Genome on a large-scale

Muniba Faiza
Image Credit: Google Images

“A new method to generate the pan genome of a set of related prokaryotic isolates and named the tool as ‘Roary’.”

The microbial pan genome is the union of all the genes found in the genomes of interest. This term was first used by Medini in 2005. Since then, the amount of microbial genome data has increased enormously, so the construction of the pan genome of a species is required to study processes such as selection and evolution. But constructing a pan genome from the real data available is very difficult and may not be accurate, because of fragmented assemblies, poor annotation and contamination, and because microbial organisms can rapidly acquire genes from other organisms. Therefore, Andrew J. Page et al. have developed a new method to generate the pan genome of a set of related prokaryotic isolates and named the tool ‘Roary’. It deals with thousands of isolates in a feasible time.

How Roary Works?

One annotated assembly per sample is given to Roary as input. The coding regions are extracted and converted into protein sequences, all partial sequences are removed, and the rest are pre-clustered using CD-HIT (a fast program for clustering and comparing sequences). This produces a reduced set of protein sequences. These reduced sequences are compared all-against-all with BLASTP at a user-defined percentage sequence identity (default 95%). Then, using conserved gene neighbourhood information, homologous groups are split into true orthologs. Finally, a graph is constructed showing the relationships of the clusters based on their order of occurrence in the input sequences.

That is how the orthologous genes of prokaryotes can be identified and microbial evolution studied. The analysis is done on a large scale, covering a large data set of prokaryotic genomes. Other tools were developed before Roary for the same purpose, namely PanOCT and PGAP, but Roary is faster and the most feasible tool among them for large data sets.
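The idea of a pan genome as a union of genes can be illustrated with a small sketch. This is not how Roary is implemented; it only assumes that each isolate has already been reduced to a set of gene cluster labels. The isolate names and gene labels below are hypothetical.

```python
# Toy illustration only: isolate names and gene cluster labels are hypothetical.
isolates = {
    "isolate_1": {"dnaA", "gyrB", "recA", "blaTEM"},
    "isolate_2": {"dnaA", "gyrB", "recA"},
    "isolate_3": {"dnaA", "gyrB", "tetA"},
}


def pan_genome(gene_sets):
    """Union of all gene clusters seen in at least one isolate."""
    return set().union(*gene_sets.values())


def core_genome(gene_sets):
    """Gene clusters present in every isolate."""
    return set.intersection(*gene_sets.values())


pan = pan_genome(isolates)
core = core_genome(isolates)
accessory = pan - core  # genes found in some, but not all, isolates

print("pan genome:", sorted(pan))
print("core genome:", sorted(core))
print("accessory genes:", sorted(accessory))
```

In a real analysis the hard part is deciding which genes from different isolates belong to the same cluster, which is the step the clustering and BLASTP comparison described above address.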

Bioinformatics Review | 7


Perl one-liners for bioinformaticians

Muniba Faiza
Image Credit: Google Images

“Perl one-liners are extremely short Perl scripts written in the form of a string of commands that fits onto one line. Perl one-liners can be very useful in ad-hoc processing or parsing of files and streams from a plethora of sources.”

Perl one-liners are extremely short Perl scripts written in the form of a string of commands that fits onto one line. That would amount to a bit less than [...] purposes. Here’s the obligatory “Hello World!” one-liner in Perl and its output:

$ perl -e 'print "Hello World!\n";'
Hello World!

Try it! (Of course, Perl must be installed on your computer for the “perl” command to work.)

The most common and useful way to use such one-liners is as stream processors on the command line, connected by pipes to other programs in a Linux command-line environment. To process the stream one would commonly use the substitution operator (s/string1/string2/). Let us use “echo” to generate an empty input to act upon and “-p” to tell Perl to print the $_ variable (the entire line) at the end:

$ echo | perl -pe 's/$_/Hello World!\n/;'
Hello World!

Notice that Perl iterates over all lines of the input (first create a file test with 3 empty lines):

$ cat test | perl -pe 's/$_/Hello World!\n/;'
Hello World!
Hello World!
Hello World!

Finally, let us introduce the “-i” switch to make Perl do the changes directly on a supplied file:

$ perl -pi -e 's/$_/Hello World!\n/;' test2

This will result in the contents of test2 getting overwritten, with “Hello World!” now present on every line! Needless to say, the “-i” switch can be quite dangerous because of its ability to completely overwrite files.

Suppose you have a file where you would like to number the lines directly in the file. This is a no-brainer with Perl one-liners! Just replace the beginning of each line with its number:

$ cat test2 | perl -pe '$i++; s/^/$i: /;'
1: Hello World!
2: Hello World!
3: Hello World!

The “^” symbol denotes the beginning of the line in Perl regular expressions. Notice that the one-liner actually contains two lines of code separated by a semicolon (;).

Bioinformaticians often process FASTA files with nucleotide or amino-acid sequences. Suppose you have a FASTA file you would like to convert to a format where every sequence occupies only one line, so that you can apply “grep” to look for a specific k-mer in the sequence (say TATATAA for the TATA box). This can be easily done by removing every end-of-line symbol on non-header lines (input.fasta below stands for your own FASTA file):

$ cat input.fasta | perl -pe 's/^([^>]+)\n/$1/; s/^>/\n>/ if $. > 1; END{print "\n"}' | grep -B1 TATATAA

The “$1” is a special Perl variable created in regular expressions whenever you enclose something in parentheses. Here we do that with entire lines that do not begin with a “>” character (“^” in brackets, as in “[^>]+”, means NOT “>”; in this case we choose non-header lines). The second substitution puts every header except the first back at the start of a new line; without it, each header would end up glued to the end of the preceding sequence.

Perl one-liners can be very useful in ad-hoc processing or parsing of files and streams from a plethora of sources. Additional examples of clever Perl one-liners can be found here or here.



Two Components System: Potential Drug Target in Mycobacterium tuberculosis

Fozail Ahmad
Image Credit: Google Images

“To set the stage of infection, to establish itself in the host’s defending environment, to cause the pathogenicity by overcoming the immune system and to escape out from any assailable host attack, this TB causing pathogen has developed a well-embodied system known as two-component system (TCS).”


The genomic complexity of Mycobacterium tuberculosis (Mt) and the unknown functions of many of its proteins/genes have triggered in-depth study of the entire genome to explore the factors that influence Mt’s behaviour at the molecular level. To set the stage for infection, to establish itself in the host’s defensive environment, to cause pathogenicity by overcoming the immune system and to escape any host attack, this TB-causing pathogen has developed a well-organised system known as the two-component system (TCS). It consists of two proteins, universally designated the sensor protein and the response regulator protein.

The basic function of these proteins is to sense environmental signals and respond accordingly. After interacting with a suitable stimulating ligand, the sensor protein, a histidine kinase, binds and hydrolyses ATP, catalysing the autophosphorylation of a conserved histidine residue and producing a high-energy phosphoryl group. The phosphate is then transferred to a conserved aspartic acid residue on the associated receiver protein, known as the response regulator, generating a high-energy acyl phosphate. Once the phosphotransfer reaction has taken place, the response regulator is activated and can carry out its specific function. In most cases, the activated response regulator modulates transcription by binding DNA at a specific site in the promoter region of its target genes.

The net effect is a change in global gene expression that helps the pathogen respond to the initial signal sensed by the histidine kinase. There are eleven such TCSs in the pathogen. The primary task of such a system is to control the expression of specific genes at specific times in response to environmental conditions, thereby contributing to the growth of the pathogen inside the host. Since each TCS has a distinct function, together they orchestrate most of the gene regulatory processes. Of the eleven, only eight TCSs have been studied comprehensively, leaving the others to be explored by further genomic analysis of Mt.

Interdisciplinary relevance: The systematic understanding of biological phenomena and the demonstration of such microscopic processes have required a number of sophisticated experimental procedures in order to develop deterministic or stochastic approaches capable of describing real molecular systems. Biological modelling and simulation are among the methodologies that use wet-lab data to annotate biochemical processes and to understand how the real biological mechanism works.

Systems biology opens a new way to analyse the raw data generated through wet-lab experimentation, characterizing and evaluating it by mathematical modelling, simulation and network analysis in order to draw conclusions about a biological problem. Because of their critical contribution to bacterial pathogenicity, two-component systems have provided us with new concepts for understanding molecular mechanisms that are yet to be explored. What we know about their behaviour and activation is still limited as far as the exact regulatory mechanism is concerned.

Applying mathematical models and simulation to this regulatory behaviour would test the global association of TCSs with genome-wide expression and show how this pathogen becomes so virulent. Another important question is at what level of gene activation pathogenicity becomes rampant, leaving the host unprotected. A thorough exploration of all two-component systems would bring molecular biology, chemistry, mathematics and network biology together to unfold the gene regulatory scenario of Mycobacterium tuberculosis.



Disulphide Connectivity in Protein Tertiary Structure Prediction

Muniba Faiza
Image Credit: Google Images

“The disulphide bonds formed between non-adjacent Cysteine residues are identified that would be cross-linked from other possible residues.”

Approaches to protein structure prediction have multiplied and have been successful in most cases, but the problem is still a big challenge. To handle this, protein structure prediction is divided into separate sub-problems, each of which gives information about the whole system (i.e., the protein structure). One of these sub-problems is disulphide connectivity. Here, the disulphide bonds formed between non-adjacent cysteine residues are identified, that is, which cysteines are cross-linked out of all the possible pairs.

Since disulphide bridges/bonds play an important role in the folding process, stability and function of a protein, predicting disulphide connectivity can help in predicting protein structure. Disulphide connectivity can be studied in two steps: first, disulphide bonding state prediction, and second, disulphide connectivity prediction (DCP). The first step classifies each cysteine according to its state, as either bonded to another cysteine or free. DCP identifies the pairs of cysteines that are bonded to each other in a protein sequence. To perform these tasks, various predictors are available, mainly based on Neural Networks (NN), Support Vector Machines (SVMs) and other predictive methods.

An Artificial Neural Network (ANN) is a computing system of interconnected elements to which external inputs are applied; the information is processed through the dynamic responses of the system. The ANN provides a likelihood of forming a disulphide bond for each cysteine pair. Algorithms such as Gabow’s matching algorithm are then applied to these likelihoods to obtain the connectivity pattern. SVMs are machine learning tools that are used in the same way to predict connectivity from the primary sequence of a protein; this approach uses the Edmonds-Gabow algorithm and position-specific scoring matrices (PSSMs).

After performing these operations, the accuracy of the predicted connectivity patterns is validated with two parameters, Rb and Qb. Rb is the ratio of the number of correctly predicted bonds to the total number of disulphide bonds (Nb) in the test proteins. Qb is the ratio of the number of proteins whose connectivity patterns are correctly predicted (Nprot) to the total number of proteins (Nt) in the test set.
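The pairing step and the two accuracy measures can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any published predictor: the pair likelihoods are made-up numbers standing in for NN or SVM output, the exhaustive search stands in for the Edmonds-Gabow matching algorithm (it is only practical for a handful of cysteines), and the function names `best_pairing` and `rb_qb` are hypothetical.

```python
def best_pairing(cysteines, likelihood):
    """Exhaustively find the pairing of an even number of cysteines
    with the highest total bond likelihood (toy stand-in for matching)."""
    if not cysteines:
        return 0.0, []
    first, rest = cysteines[0], cysteines[1:]
    best_score, best_pairs = float("-inf"), []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        score, pairs = best_pairing(remaining, likelihood)
        score += likelihood[frozenset((first, partner))]
        if score > best_score:
            best_score, best_pairs = score, [(first, partner)] + pairs
    return best_score, best_pairs


def rb_qb(predicted, observed):
    """Rb = correctly predicted bonds / Nb; Qb = Nprot / Nt."""
    correct_bonds = sum(len(predicted[p] & observed[p]) for p in observed)
    total_bonds = sum(len(bonds) for bonds in observed.values())  # Nb
    correct_proteins = sum(1 for p in observed if predicted[p] == observed[p])  # Nprot
    return correct_bonds / total_bonds, correct_proteins / len(observed)


# Hypothetical likelihoods for a protein with cysteines at positions 5, 12, 30, 44.
pair_likelihood = {
    frozenset((5, 12)): 0.9, frozenset((5, 30)): 0.2, frozenset((5, 44)): 0.1,
    frozenset((12, 30)): 0.3, frozenset((12, 44)): 0.2, frozenset((30, 44)): 0.8,
}
score, pairs = best_pairing([5, 12, 30, 44], pair_likelihood)
print(pairs)

# Hypothetical test set of two proteins (bonds stored as frozensets of positions).
observed = {
    "protA": {frozenset((5, 12)), frozenset((30, 44))},
    "protB": {frozenset((3, 20)), frozenset((8, 15))},
}
predicted = {
    "protA": {frozenset(p) for p in pairs},
    "protB": {frozenset((3, 8)), frozenset((15, 20))},
}
Rb, Qb = rb_qb(predicted, observed)
print(Rb, Qb)
```

Here protA is predicted correctly and protB is not, so two of the four bonds and one of the two proteins count as correct.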



TIN: R package to analyze Transcriptome Instability

Muniba Faiza
Image Credit: Google Images

“TIN is a new R package which enables to analyze TIN from the expression data. TIN is a software package of R modules that uses a framework to analyze expression level data.”

Alternative splicing plays an essential role in the proper functioning of eukaryotic cells. It acts as a regulatory mechanism for gene expression, and any disruption of this mechanism may lead to human disease. Alternative splicing of pre-mRNA is a major source of genetic variation in human beings, and disruption of the splicing process may cause diseases such as cancer. Cancer-associated variation may occur at different levels of gene regulation, particularly during the processing of pre-mRNA into mature mRNA. A better understanding of these mechanisms may therefore provide insights into the causes and development of disease.

TIN is a new R package for analyzing transcriptome instability (TIN) from expression data. It is a software package of R modules that uses a framework to analyze expression level data.

WORKFLOW: TIN takes raw expression data (cell intensity files, CEL files) as input and applies the FIRMA method (a method for the detection of alternative splicing) to estimate the expression levels of the transcriptome and the alternative splicing patterns between samples. FIRMA gives a score to each exon-sample combination, based on the deviation of the probes from the expected gene expression level. The FIRMA score is thus the relative ratio between the exon expression level and the corresponding gene expression level. A strong positive score implies that the differential exon is included, and a negative score implies that the exon is skipped.

Alternative splicing is mediated by several splicing factors and proteins, which remove introns from the pre-mRNA and then join the exons together. TIN therefore tests the association between splicing factor expression levels and the amount of abnormal exon usage among the samples. For this, the correlation between the amounts of abnormal exon usage and the splicing factor expression levels is calculated across all samples. If the two are strongly correlated, it indicates that the aberrant amounts of exon expression may be due to splicing factor expression.

After that, the same correlation is computed for random gene sets. If the correlation for the random sets is poor by comparison, it gives an indication that the abnormal exon usage can be attributed to the expression levels of the splicing factor genes. In this way, by analyzing gene expression levels and alternative splicing patterns, a developing disease can be monitored or predicted at a very early stage.
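The comparison with random gene sets can be sketched as follows. This is a simplified illustration and not the TIN package's own procedure: it assumes the amount of aberrant exon usage has already been counted per sample, averages expression over a gene set, and uses a plain Pearson correlation. All gene names and numbers are made up, and `compare_with_random_sets` is a hypothetical helper.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_expression(expr, genes):
    """Per-sample mean expression over the given genes."""
    n_samples = len(next(iter(expr.values())))
    return [sum(expr[g][s] for g in genes) / len(genes) for s in range(n_samples)]


def compare_with_random_sets(expr, aberrant_counts, factor_genes, n_random=1000, seed=0):
    """Return the observed correlation and the fraction of random gene sets
    whose absolute correlation is at least as strong."""
    rng = random.Random(seed)
    observed = pearson(mean_expression(expr, factor_genes), aberrant_counts)
    background = [g for g in expr if g not in factor_genes]
    hits = 0
    for _ in range(n_random):
        gene_set = rng.sample(background, len(factor_genes))
        r = pearson(mean_expression(expr, gene_set), aberrant_counts)
        if abs(r) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_random + 1)


# Hypothetical data: six samples, two splicing factor genes, twenty background genes.
rng = random.Random(1)
aberrant_exons = [10, 25, 40, 55, 70, 90]  # aberrant exon usage per sample
expression = {
    "SF_1": [9.0, 8.1, 7.2, 6.0, 5.1, 4.0],
    "SF_2": [8.5, 7.9, 7.0, 6.2, 5.0, 4.4],
}
for i in range(20):
    expression[f"gene_{i}"] = [rng.uniform(4.0, 9.0) for _ in aberrant_exons]

r_obs, p_emp = compare_with_random_sets(expression, aberrant_exons, ["SF_1", "SF_2"])
print(f"observed r = {r_obs:.2f}, empirical p = {p_emp:.3f}")
```

A small empirical p-value means that few random gene sets track aberrant exon usage as closely as the splicing factor genes do.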

For further reading, click here. Note: An exhaustive list of references for this article is available with the author and is available on personal request, for more details write m

Fig.1 Workflow of TIN



Venice Criteria: Overview

Manish Kumar Mishra
Image Credit: Google Images

“Venice criteria can be understood as a set of three scores which are used to grade the evidence produced by the study.”

The plethora of research literature available to modern-day biologists provides the luxury of conducting a unique procedure: meta-analysis, an analysis of “data of data”. Genome Wide Association Studies (GWAS) help the researcher narrow down to a specific biomolecule to target in a curative or analytical procedure for a particular trait. To make a meta-analysis realistic and closer to the truth, one needs to scrutinize every individual study against some benchmark; this is where the Venice criteria come in handy.

Venice criteria can be understood as a set of three scores which are used to grade the evidence produced by the study. Each of these three scores can attain a maximum grade of ‘A’, followed by ‘B’ and ‘C’, based on how meticulous the study was. The first score is given for “Amount”, the second for “Replication”, and the final score for “Protection from bias”. Elaborating on each of these three grading criteria means working with numerical quantities, and the details follow.

Amount
A: Large-scale evidence: more than 1000 subjects, case:control = 1:1, in the least common genetic group of interest.
B: 100-1000 subjects in the least common genetic group of interest.
C: Little evidence: fewer than 100 subjects in the least common genetic group of interest.

Replication
A: Extensively replicated study supported by at least one well-conducted meta-analysis.
B: Well-conducted meta-analysis with some methodological limitations, or the studies are inconsistent.
C: The analyte lacks association or an independently replicated study, or has a flawed meta-analysis and no between-study consistency.

Protection from bias
Biases creep into studies from researchers’ preconceived notions and affect the compilation of data and the declaration of results. As with the previous two criteria, a study must also be scrutinized for biases that may have crept in.
A: Biases are minimized; they can still affect the magnitude, but probably not the presence, of the association.
B: A considerable amount of information on the generation of evidence is missing, but the bias does not clearly undermine the association.
C: Evidence for bias is so heavy that it may affect the existence of any association.

Thus the grades may be scored as follows:
AAA: strong evidence
AAB, ABA, ABB, BAA, BBA, BBB, BAB: moderate evidence
All other scores are treated as poor, unreliable evidence.
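The grading scheme above can be written down as a short function. This is a minimal sketch that assumes the three letter grades have already been assigned by a reviewer; the only grade derived automatically is the one for “Amount”, using the subject counts given above. The function names are hypothetical.

```python
def amount_grade(subjects_in_least_common_group):
    """Grade the 'Amount' criterion from the number of subjects
    in the least common genetic group of interest."""
    if subjects_in_least_common_group > 1000:
        return "A"
    if subjects_in_least_common_group >= 100:
        return "B"
    return "C"


def overall_evidence(amount, replication, bias):
    """Combine the three Venice grades into an overall evidence level."""
    grades = amount + replication + bias
    if len(grades) != 3 or any(g not in "ABC" for g in grades):
        raise ValueError("each grade must be 'A', 'B' or 'C'")
    if grades == "AAA":
        return "strong"
    if "C" in grades:
        return "poor"
    return "moderate"  # any mix of A and B other than AAA


print(overall_evidence(amount_grade(1500), "A", "A"))  # strong
print(overall_evidence(amount_grade(400), "A", "B"))   # moderate
print(overall_evidence(amount_grade(60), "A", "A"))    # poor
```

Any single ‘C’ is enough to make the overall evidence poor, which is why the moderate category contains only combinations of ‘A’ and ‘B’.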



Mycobacteriophages and their potential as source against Mycobacterial active biomolecules

Sanjay Kumar
Image Credit: Google Images

“There is a notable absence of mycobacteriophages from the family Podoviridae (containing short stubby tails), arising the question whether long tails are needed to traverse the relatively thick mycobacterial cell envelope.”


We are all aware of the epidemic threat posed by Mycobacterium tuberculosis and other related species. In this article we show how nature provides a solution against it. As we know, a bacteriophage (bacterio = bacteria, phage = eater) infects several bacterial species. A mycobacteriophage, more specifically, is a member of a group of bacteriophages that infect mycobacterial species as their hosts, e.g., Mycobacterium smegmatis and Mycobacterium tuberculosis, the causative agent of tuberculosis. The rising incidence of tuberculosis, the emergence of multi-drug resistance in Mycobacterium tuberculosis and the slow progress in finding new drugs make mycobacteriophages potential candidates for use as diagnostic and therapeutic tools against TB.

All the characterized mycobacteriophages are double-stranded DNA (dsDNA) tailed phages belonging to the order Caudovirales. Most are of the family Siphoviridae, characterized by long, flexible, non-contractile tails, whereas phages of the family Myoviridae have contractile tails. There is a notable absence of mycobacteriophages from the family Podoviridae (which have short stubby tails), raising the question of whether long tails are needed to traverse the relatively thick mycobacterial cell envelope. dsDNA tailed phages are either temperate, forming stable lysogens with turbid plaques, or lytic, forming clear plaques in which the host cells are killed. Mycobacteriophages can also be studied through the morphology of their plaques, which vary in size and shape. Plaque morphology also depends on the burst size, which is the number of phage particles released on lysis of the infected bacterium.

GENOMETRICS OF 70 SEQUENCED MYCOBACTERIOPHAGES

Since the mycobacterial cell wall consists of a mycolic acid-rich mycobacterial outer membrane attached to an arabinogalactan layer that is in turn linked to the peptidoglycan, it poses a significant challenge to the phages. This challenge is met by a set of proteins: Lysin B proteins, which cleave the linkage of mycolic acids to the arabinogalactan layer; holins, which regulate lysis timing; and endolysins (Lysin As), which hydrolyze peptidoglycan.

Phages affect their hosts with a holin-endolysin system that is essential for programmed lysis. Endolysin is found to be associated with a protein component of the phage tail involved in facilitating penetration of the murein during injection of the genome into the host. Holins are small membrane proteins that form holes in the membrane through which the endolysin can pass. Holins control the length of the infective cycle for lytic phages so as to achieve lysis at an optimal time.

Endolysins can be a source of potential antibacterials, and thus a replacement for antibiotics, because of their specificity (they target only a few strains of bacteria, whereas antibiotics have a more wide-ranging effect), the low probability of Mycobacterium developing resistance to them, and their novel mode of action. Bioinformatics can assist this field of research by finding other such proteins, or by helping to design alternatives with similar pharmacophore (physical and chemical) properties. We can counter various disease threats by using the natural options provided to us and remain healthy on this planet. The only point to be remembered is: NATURE CAN SATISFY OUR NEEDS, BUT IT CANNOT SUSTAIN OUR GREED. AS A HEALTHY BODY CONSISTS OF A HEALTHY MIND, THE SAME WAY A CONSERVED PLANET CONSERVES ITS SPECIES TOO.

Hatfull, Graham F. “Mycobacteriophages: genes and genomes.” Annual review of microbiology 64 (2010): 331-



Do you HYPHY with (Data) Monkey !!

Prashant Pant
Image Credit: Google Images

“Datamonkey is a web interface which uses HyPhy batch files to execute most of its tools and packages for the computational analyses.”


HyPhy, an acronym for Hypothesis Testing Using Phylogenies, was written and designed by Kosakovsky Pond and co-workers to provide likelihood-based analyses of molecular evolutionary data sets and to help detect differential rates of variability within coding sequence data sets. It is freely available, has a graphical user interface and can be used by anyone, with or without much programming exposure. It was earlier presumed that substitution rates were uniform over an alignment of homologous DNA/protein sequences, but many workers studying the molecular evolutionary processes that influence rates and patterns of evolution have refuted this presumption with a large body of data. This is especially true for rapidly evolving gene families and for viral genomes.

Natural selection acts on different domains/regions/sites, which may be under positive, negative or neutral selection pressure. Positive selection is indicated by an excess of nonsynonymous substitutions in a protein-coding sequence, which alter protein structure and function and so influence the fitness of the organism. Negative selection is indicated by an excess of synonymous substitutions, which leave the amino acid sequence, and hence protein structure and function, unchanged. Neutral evolution is said to take place when nonsynonymous substitutions do not affect protein structure and function, so that they accumulate at about the same rate as synonymous ones.

The rates of synonymous and nonsynonymous substitutions are denoted dS and dN respectively. In the case of neutral evolution, dS and dN are observed to be in equilibrium. Accordingly, the ratio ω = β/α (also referred to as dN/dS) has become a standard measure of selective pressure. The overall ω for a sequence alignment is referred to as the global ω. A global ω of approximately 1 signifies neutral evolution, a value below 1 suggests negative selection, and a value above 1 implies positive selection.

To start the analyses, all one needs is a suitable codon substitution model as detected by the MODELTEST program (available online), a NEXUS-formatted sequence alignment file (it must be a codon data file) and a Maximum Likelihood tree of the data.
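The interpretation of ω can be summed up in a few lines of code. This is only a reading aid, not part of HyPhy or Datamonkey, which estimate these rates with likelihood-based models and statistical tests. The `tolerance` used to decide what counts as “approximately 1” is an arbitrary assumption for illustration, and the function names are hypothetical.

```python
def omega(dN, dS):
    """Ratio of nonsynonymous to synonymous substitution rates."""
    if dS <= 0:
        raise ValueError("dS must be positive to compute dN/dS")
    return dN / dS


def classify_selection(w, tolerance=0.05):
    """Interpret omega; 'tolerance' is an illustrative cut-off, not a standard."""
    if abs(w - 1.0) <= tolerance:
        return "neutral evolution"
    return "positive selection" if w > 1.0 else "negative selection"


# Hypothetical rate estimates for three alignments.
for dN, dS in [(0.02, 0.10), (0.10, 0.10), (0.30, 0.10)]:
    w = omega(dN, dS)
    print(f"dN={dN}, dS={dS}: {classify_selection(w)}")
```

In practice a point estimate of ω alone is not enough; whether it differs significantly from 1 has to be tested under an explicit evolutionary model.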


Datamonkey is a web interface which uses HyPhy batch files to execute most of its tools and packages for the computational analyses. This web interface can be used for estimating dS and dN over an alignment of coding sequences and for identifying codons and lineages under selection. It also provides “state of the art” tests of codon-based models to infer signatures of positive Darwinian selection by comparing the rates of synonymous (dS) and nonsynonymous (dN) mutations, even in the presence of recombination. It reports ω (= dN/dS) using a variety of evolutionary models. Apart from this, Datamonkey also offers a number of packages such as GARD, SLAC, REL, FEL, EVOBLAST etc. These will be discussed in the next issue. Keep reading!!

A comprehensive list of references for this article is available from the author upon request.



T-Coffee: A tool that combines both local and global alignments

Muniba Faiza
Image Credit: Google Images

“T-Coffee is a multiple sequence alignment tool which stands for Tree-based Consistency Objective Function for alignment Evaluation. It is a simultaneous alignment which combines the best properties of local and global alignment and for this it also uses the Smith-Waterman algorithm.”


T-Coffee is a multiple sequence alignment tool whose name stands for Tree-based Consistency Objective Function for alignment Evaluation. It is a simultaneous alignment method which combines the best properties of local and global alignment, and for this it also uses the Smith-Waterman algorithm. T-Coffee is an advancement over other multiple alignment tools such as ClustalW, MUSCLE (discussed in an earlier article), etc.

It has two main features. First, it builds the multiple alignment from various data sources, which are collected in a library of pairwise alignments (global + local). Second, it uses an optimization method to find the multiple alignment that best fits the input library.

Fig.1 Layout of the T-Coffee strategy; the main steps required to compute a multiple sequence alignment using the T-Coffee method. Square blocks designate procedures while rounded blocks indicate data structures.


How T-Coffee works?

1. Generate primary library of alignments: The library consists of a set of pairwise alignments of all of the sequences to be aligned (here the alignment source is local). It may also include two or more different alignments of the same pair of sequences. The global alignments are then made using ClustalW.

2. Derive primary library weights: The most reliable residue pairs are identified in this step using a weighting scheme, in which a weight is assigned to each pair of aligned residues in the library. Sequence identity is the criterion used to measure accuracy; it is a reasonable indicator for sequences with more than 30% identity. For each set of sequences, two libraries are constructed along with their weights, one using ClustalW and the other using Lalign (a program of the FASTA package).

3. Combine libraries: In this step, all duplicated pairs are merged into a single entry whose weight is equal to the sum of the two weights; otherwise a new entry is created for the pair being considered.

4. Extend library: A triplet approach based on the intermediate-sequence method is used. For example, given four sequences A, B, C and D, the alignment of A and B is also checked through C and through D.

5. Progressive alignment strategy: A distance matrix is constructed from the pairwise alignments between all the sequences and used to build a guide tree with the Neighbor Joining (NJ) method. The two closest sequences on the tree are aligned first, the resulting pair is checked for gaps, and then the next closest two sequences are aligned. This continues until all the sequences have been aligned.
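Steps 3 and 4 above can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions, not T-Coffee's own code: a library is assumed to map a pair of residues to a weight, a residue is written as a (sequence name, position) tuple, and library extension adds, for every intermediate residue, the smaller of the two weights linking it to the pair. The function names and the toy weights are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def key(r1, r2):
    """Order-independent dictionary key for a pair of residues."""
    return tuple(sorted((r1, r2)))


def combine_libraries(*libraries):
    """Step 3: merge libraries, summing the weights of duplicated pairs."""
    combined = defaultdict(float)
    for library in libraries:
        for (r1, r2), weight in library.items():
            combined[key(r1, r2)] += weight
    return dict(combined)


def extend_library(library):
    """Step 4: add support from every residue aligned to both members of a pair."""
    neighbours = defaultdict(dict)
    extended = {}
    for (r1, r2), weight in library.items():
        neighbours[r1][r2] = weight
        neighbours[r2][r1] = weight
        extended[key(r1, r2)] = weight
    for linked in neighbours.values():
        for a, b in combinations(sorted(linked), 2):
            if a[0] == b[0]:
                continue  # two residues of the same sequence are never aligned
            k = key(a, b)
            extended[k] = extended.get(k, 0.0) + min(linked[a], linked[b])
    return extended


# Hypothetical weights for one residue from each of the sequences A, B and C.
A1, B1, C1 = ("A", 1), ("B", 1), ("C", 1)
global_library = {key(A1, B1): 88, key(A1, C1): 40}
local_library = {key(A1, C1): 37, key(B1, C1): 100}

primary = combine_libraries(global_library, local_library)
extended = extend_library(primary)
print(primary[key(A1, C1)])   # summed weight of the duplicated pair
print(extended[key(A1, B1)])  # primary weight plus support through C
```

In this toy case the pair A1-B1 starts with a weight of 88 and gains min(77, 100) = 77 through C, which is how consistent residue pairs come to dominate the extended library.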

Fig.2 The library extension. (a) Progressive alignment. Four sequences have been designed. The tree indicates the order in which the sequences are aligned when using a progressive method such as ClustalW. The resulting alignment is shown, with the word CAT misaligned. (b) Primary library. Each pair of sequences is aligned using ClustalW. In these alignments, each pair of aligned residues is associated with a weight equal to the average identity among matched residues within the complete alignment (mismatches are indicated in bold type). (c) Library extension for a pair of sequences. The three possible alignments of sequence A and B are shown (A and B, A and B through C, A and B through D). These alignments are combined, as explained in the text, to produce the position-specific library. This library is resolved by dynamic programming to give the correct alignment. The thickness of the lines indicates the strength of the weight.


December issue of Bioinformatics Review. Available via

