BIOINFORMATICS REVIEW - JANUARY 2016 ISSUE by Bioinformatics Review

JANUARY 2016 VOL 2 ISSUE 1

“The greatest leap in bioinformatics is to predict secondary structure of protein”

MUSCLE v/s T-COFFEE : An overview and different aspects

- Charles Wins

Genetic Algorithm: Explanation and Perl Code


Contents

January 2016


Topics

Editorial

Programming
HTSeq : A Python framework to analyze high throughput sequencing data

Algorithms
Genetic Algorithm: Explanation and Perl Code

Tools
MUSCLE v/s T-COFFEE : An overview and different aspects

CADD
Active learning in drug-target interactions

EDITORIAL

CHIEF EDITOR
Dr. PRASHANT PANT

EXECUTIVE EDITOR
FOZAIL AHMAD

FOUNDING EDITOR
MUNIBA FAIZA

SECTION EDITORS
TARIQ ABDULLAH ALTAF ABDUL KALAM MANISH KUMAR MISHRA SANJAY KUMAR PRAKASH JHA NABAJIT DAS

REPRINTS AND PERMISSIONS
You must have permission before reproducing any material from Bioinformatics Review. Send E-mail requests to info@bioinformaticsreview.com. Please include contact detail in your message.

BACK ISSUE
Bioinformatics Review back issues can be downloaded in digital format from bioinformaticsreview.com at $5 per issue. Back issue in print format cost $2 for India delivery and $11 for international delivery, subject to availability. Pre-payment is required.

CONTACT
PHONE: +91. 991 1942-428 / 852 7572-667
MAIL: Editorial: 101 FF Main Road Zakir Nagar, Okhla New Delhi IN 110025

STAFF ADDRESS
To contact any of the Bioinformatics Review staff member, simply format the address as firstname@bioinformaticsreview.com

PUBLICATION INFORMATION
Volume 1, Number 1, Bioinformatics Review™ is published monthly for one year (12 issues) by Social and Educational Welfare Association (SEWA) trust (Registered under Trust Act 1882). Copyright 2015 Sewa Trust. All rights reserved. Bioinformatics Review is a trademark of Idea Quotient Labs and used under licence by SEWA trust. Published in India.

EDITORIAL: Welcoming BiR in its 2nd year

Bioinformatics, being one of the best fields in terms of future prospects, lacks one thing: a news source. Although many journals publish a large amount of quality research on a variety of topics such as genome analysis, algorithms and sequence analysis, this work hardly gets any notice in the popular press.

Dr. Prashant Pant

Editor


One reason behind this rather disturbing trend is that very few people can successfully read a research paper and turn it into a news story. In addition, the bioinformatics community has not yet been introduced to research reporting. These factors are common to every relatively new (and rising) discipline such as bioinformatics. Although there are a number of science reporting websites and portals, very few accept entries from their audience, which is expected to have expertise in one field or another.

Bioinformatics Review has been conceptualized to address all these concerns. We will provide an insight into bioinformatics, both as an industry and as a research discipline. We will post new developments and the latest research in bioinformatics. We will also accept entries from our audience and, if possible, award them. To create an ecosystem of bioinformatics research reporting, we will engage all kinds of people involved in bioinformatics: students, professors, instructors and industries. We will also provide a free job listing service for anyone who can benefit from it.

Letters and responses: info@bioinformaticsreview.com

BIOINFORMATICS PROGRAMMING

HTSeq : A Python framework to analyze high throughput sequencing data

Muniba Faiza

Image Credit: Google Images

“HTSeq is a Python library which easily develops the scripts required to fulfill a particular task on the HT data.”

High throughput sequencing (HTS) is widely used because it saves a lot of time and provides good results, but it produces a huge amount of data that is difficult to manage, and the tasks and operations performed on that data are difficult as well. To ease this, a Python framework known as “HTSeq” has been introduced by Simon Anders and team members. HTSeq is a Python library that makes it easy to develop the scripts required to fulfill a particular task on HT data. Basically, HTSeq reads various file formats and breaks them down into recognized records for further analysis. It also provides classes for genomic coordinates, sequences, sequencing reads, alignments, gene model information, etc.

Two stand-alone applications have also been developed along with HTSeq, namely htseq-qa for read quality assessment and htseq-count for preprocessing RNA-Seq alignments for differential expression analysis.

HTSeq can read various formats such as FASTA, FASTQ (short reads) and SAM/BAM (short-read alignments). Wherever appropriate, different parsers yield the same type of record object. For example, the record class SequenceWithQualities is used whenever a sequencing read with base-call qualities needs to be represented; it is therefore yielded by the FastqParser class and is also present as a field in the SAM_Alignment objects yielded by SAM_Reader or BAM_Reader parser objects (Fig. 1). There are also specific classes to represent genomic positions and genomic intervals. In order to achieve good performance, various parts of HTSeq are written in Cython, a tool that translates Python code augmented with type information into C.
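To make the idea of a parser that yields a common record type more concrete, here is a minimal plain-Python sketch. It does not use HTSeq and is not HTSeq's own code: the record class `ReadWithQualities` and the generator `parse_fastq` are hypothetical names invented for this illustration, and the sketch assumes four-line FASTQ records with Phred+33 quality encoding.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class ReadWithQualities:
    """Hypothetical stand-in for a record such as SequenceWithQualities."""
    name: str
    seq: str
    qual: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str]) -> Iterator[ReadWithQualities]:
    """Yield one record per four-line FASTQ entry (assumes Phred+33)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        # A truncated file yields empty strings here and fails the checks below.
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual_str = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual_str):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield ReadWithQualities(
            name=header[1:],
            seq=seq,
            qual=[ord(c) - 33 for c in qual_str],  # ASCII code minus 33
        )


example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
records = list(parse_fastq(example))
print(records[0].name, records[0].seq, records[0].qual)
```

Because every parser in such a design hands back the same kind of record, downstream code (for example a quality filter) can be written once and reused regardless of the input format.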

Bioinformatics Review | 6

Fig. 1. (a) The SAM_Alignment class as an example of an HTSeq data record: subsets of the content are bundled in object-valued fields, using classes (here SequenceWithQualities and GenomicInterval) that are also used in other data records to provide a common view on diverse data types. (b) The cigar field in a SAM_Alignment object presents the detailed structure of a read alignment as a list of CigarOperation objects. This allows for convenient downstream processing of complicated alignment structures, such as the one given by the cigar string on top and illustrated in the middle. Five CigarOperation objects, with slots for the columns of the table (bottom), provide the data from the cigar string, along with the inferred coordinates of the affected regions in the read (‘query’) and the reference.

The SAM_Alignment class also deals with gapped alignments, multiple alignments and paired-end data. HTSeq provides a function, pair_SAM_alignments_with_buffer, to pair up the alignment records by keeping a buffer of reads whose mate has not yet been found, which facilitates processing data at the level of sequenced fragments rather than reads.

HTSeq also facilitates the storage of genome-position-dependent data: each base pair position on the genome can be assigned a value, which can then be stored and later retrieved by specifying the same position.

The script htseq-qa is a simple tool for initial quality assessment of sequencing runs. It produces plots that summarize the nucleotide composition at each position in the read and the base-call qualities.

As mentioned earlier, htseq-count is a tool for RNA-Seq data analysis. For each gene, it counts how many aligned reads overlap the gene's exons. Since it is designed specifically for differential expression analysis, only reads mapping unambiguously to a single gene are counted; reads aligned to multiple positions or overlapping more than one gene are discarded. In the case of paired-end data, htseq-count counts fragments rather than reads, because the two paired ends originating from the same fragment provide evidence for only one cDNA fragment and should hence be counted only once.

In this way, HTSeq offers a comprehensive solution to facilitate a wide range of programming tasks in HTS data analysis.

Note: An exhaustive list of references for this article is available with the author on personal request; for more details write to muniba@bioinformaticsreview.com
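The counting rule described above for htseq-count (count a read only when it overlaps the exons of exactly one gene) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the tool's implementation: the exon table `EXONS`, the helper names and the read coordinates are all made up, reads are reduced to a single half-open interval on one chromosome, and the bucket names for uncounted reads are chosen for this sketch.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome

# Hypothetical exon annotation: gene name -> list of exon intervals.
EXONS: Dict[str, List[Interval]] = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(380, 500)],
}


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def genes_hit(read: Interval) -> Set[str]:
    """Names of all genes with at least one exon overlapping the read."""
    return {g for g, exons in EXONS.items()
            if any(overlaps(read, e) for e in exons)}


def count_reads(reads: List[Interval]) -> Counter:
    """Count reads per gene, keeping only unambiguous assignments."""
    counts: Counter = Counter()
    for read in reads:
        hit = genes_hit(read)
        if len(hit) == 1:
            counts[next(iter(hit))] += 1   # unambiguous: counted for that gene
        elif not hit:
            counts["__no_feature"] += 1    # overlaps no annotated exon
        else:
            counts["__ambiguous"] += 1     # overlaps more than one gene
    return counts


reads = [(120, 170), (350, 375), (370, 420), (450, 480), (600, 650)]
counts = count_reads(reads)
print(dict(counts))
```

For paired-end data the same logic would be applied once per fragment, that is, to the two mates taken together, rather than once per read.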


ALGORITHMS

Genetic Algorithm: Explanation and Perl Code

Tariq Abdullah

Image Credit: Stock Photos

“Genetic Algorithm was developed by John Holland. It uses the concepts of Natural Selection and Genetic Inheritance and tries to mimic the biological evolution. It falls under the category of algorithms known as Evolutionary Algorithms.”

When it comes to bioinformatics algorithms, genetic algorithms top the list of the most used and talked about. Understanding the genetic algorithm is important not only because it can help reduce the computational time taken to get a result but also because it is inspired by how nature works.

In this article, you will learn how a genetic algorithm works and the basic concept behind it, and we will also write a program to illustrate the concepts. You can skip the explanation if you already know the basic concepts of the genetic algorithm.

Genetic Algorithm was developed by John Holland. It uses the concepts of Natural Selection and Genetic Inheritance and tries to mimic biological evolution. It falls under the category of algorithms known as Evolutionary Algorithms. It can be used to find solutions to hard problems where we don’t know much about the search space.

Let us understand how a genetic algorithm works. For this, let us consider a cancer-associated gene expression matrix. This matrix contains all the known genes found in human beings and their levels of expression. For a given problem, the genetic algorithm works by maintaining a set of candidate solutions and then applying three operators to them, Selection, Recombination and Mutation, which are collectively known as stochastic operators.

Selection: In nature, if an organism is adapted to the environment, its population will grow relative to its quality of adaptation. This is referred to as selection. It means that if a solution meets the conditional constraints, it is replicated at a rate proportional to its relative quality.

Recombination: In nature, two similar chromosomes of the surviving individual exchange genes during sexual reproduction in a process known as Crossing Over. In GA we decompose two distinct solutions and randomly mix their parts to form novel solutions.

Mutation: Random changes in an existing chromosome may lead to a fitter individual. This concept is used to randomly perturb a candidate solution.

Have a look at the Genetic Algorithm outlined below to understand it clearly.

1. produce an initial population of individuals
2. evaluate the fitness of all individuals
3. while termination condition not met do
4. select fitter individuals for reproduction
5. recombine between individuals
6. mutate individuals
7. evaluate the fitness of the modified individuals
8. generate a new population
9. End while

The program

We are going to implement the Genetic Algorithm and write a program in Perl for it. Although it is not directly applicable to a real-life problem, it should be sufficient to familiarize you with the Genetic Algorithm.

Suppose that you had a set of gene expression data. The data is for all 25000 genes in the human genome and you want to find out which five values among all 25000 values give the highest sum.

For the purpose of this program we will require four subroutines:

Generate: It generates a chromosome containing 5 values (the number is specified in the variable $GeneNumberConstraint) selected at random positions of the data.

Mutate: It mutates a chromosome at a random position with a random value less than the one specified in $HighestMutationValue.

Survival Check: It checks whether the newly formed chromosome is viable, i.e. whether its value meets a minimum specification (checking for fitness).

Recombine: It forms new combinations from existing chromosomes by crossing them over with each other.

The Code

If you wish, you can download the Perl code on GitHub: https://github.com/bioinformaticsreview/geneticalgorithm

So here is the final code implementing Genetic Algorithm in Perl:

use strict;
use warnings;

# Expression values; the goal is to find five values with the highest sum.
my @GeneExpressionData = (1,3,8,5,2,4,46,6,78,7,9,9,0,1,1,1,5,59,9,97,7,6,5,45,4,3,23,2,22,2,2,4,5,5,6,54);
my @SolutionSpace = ();          # surviving chromosomes, stored as array references
my $InitialThreshold = 10;       # kept from the original listing; not used below
my $CurrentHighest = 0;          # best sum found so far
my $HighestMutationValue = 110;  # mutated genes get a random value below this
my $genes = scalar @GeneExpressionData;
my @chromosome = ();             # the candidate currently being worked on
my $sum = 0;
my $GeneNumberConstraint = 5;    # number of values per chromosome
my $steps = 10;

print "The Total Genes are: $genes\n";

for (my $p = 0; $p <= $steps; $p++) {
    generate();
    SurvivalCheck();
    mutate();
    SurvivalCheck();
    recombine();
    SurvivalCheck();
}
print "\n\n Genetic Algorithm Result \n\n\n\t\tHighest Detected: $CurrentHighest in $steps Steps\n\n";

# Replace one randomly chosen gene of the current chromosome with a random value.
sub mutate {
    my $randpos = int(rand($GeneNumberConstraint));
    my $n = int(rand($HighestMutationValue));
    $chromosome[$randpos] = $n;
    print "\n Mutation Took Place in Chromosome @chromosome ";
}

# Cross over two chromosomes picked at random from the solution space.
sub recombine {
    print "\nRecombining\n\n";
    return unless @SolutionSpace;   # nothing has survived yet
    my @chromosome1 = @{ $SolutionSpace[int rand(@SolutionSpace)] };
    my @chromosome2 = @{ $SolutionSpace[int rand(@SolutionSpace)] };
    print "Random Sequence Chromosome from Solution Space: @chromosome1 and @chromosome2";
    for (my $i = 0; $i <= $GeneNumberConstraint / 2; $i++) {
        my $pos1 = int(rand($GeneNumberConstraint));
        my $pos2 = int(rand($GeneNumberConstraint));
        my $swap = $chromosome1[$pos1];
        $chromosome1[$pos1] = $chromosome2[$pos2];
        $chromosome2[$pos2] = $swap;
    }
    @chromosome = @chromosome1;
    print "The Recombination led to @chromosome";
}

# Keep the chromosome only if its sum beats the best found so far (fitness check).
sub SurvivalCheck {
    $sum = 0;
    foreach my $val (@chromosome) {
        $sum += $val;
    }
    if ($sum > $CurrentHighest) {
        $CurrentHighest = $sum;
        push @SolutionSpace, [@chromosome];
        print "\nIndividual is alive! \nCurrent Highest Expression: $CurrentHighest";
        return 1;
    }
    else {
        print "\nSpecies Didn't Survive! \n";
        return 0;
    }
}

# Build a chromosome from values drawn at random positions of the data.
sub generate {
    @chromosome = ();
    for (my $i = 1; $i <= $GeneNumberConstraint; $i++) {
        my $n = int(rand($genes));
        push @chromosome, $GeneExpressionData[$n];
    }
    print "\n\n\nGenerated Chromosome: @chromosome \n";
}

That's all! Feel free to comment and discuss if you have any confusion.
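For comparison, the generate / select / recombine / mutate cycle of a genetic algorithm can also be sketched in Python. This is an illustrative sketch under stated assumptions rather than a port of the Perl program: the function names, the population size, the number of generations and the mutation rate are all choices made for this example. A chromosome here is a list of five distinct positions in the data, selection simply keeps the fitter half of the population, and mutation swaps in another position from the data, so every candidate remains a valid choice of five values.

```python
import random
from typing import List

DATA = [1, 3, 8, 5, 2, 4, 46, 6, 78, 7, 9, 9, 0, 1, 1, 1, 5, 59, 9, 97,
        7, 6, 5, 45, 4, 3, 23, 2, 22, 2, 2, 4, 5, 5, 6, 54]
K = 5  # number of genes (positions) per chromosome

Chromosome = List[int]  # K distinct indices into DATA


def fitness(ch: Chromosome) -> int:
    """Sum of the expression values selected by the chromosome."""
    return sum(DATA[i] for i in ch)


def generate(rng: random.Random) -> Chromosome:
    """Random chromosome: K distinct positions."""
    return rng.sample(range(len(DATA)), K)


def recombine(a: Chromosome, b: Chromosome, rng: random.Random) -> Chromosome:
    """Child draws its K positions from the union of both parents."""
    pool = sorted(set(a) | set(b))
    return rng.sample(pool, K)


def mutate(ch: Chromosome, rng: random.Random, rate: float = 0.2) -> Chromosome:
    """With probability `rate`, replace one position by an unused one."""
    ch = list(ch)
    if rng.random() < rate:
        unused = [i for i in range(len(DATA)) if i not in ch]
        ch[rng.randrange(K)] = rng.choice(unused)
    return ch


def run_ga(generations: int = 200, pop_size: int = 30, seed: int = 0) -> Chromosome:
    rng = random.Random(seed)
    population = [generate(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Recombination and mutation refill the other half.
        children = [mutate(recombine(*rng.sample(parents, 2), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)


best = run_ga()
print(sorted(DATA[i] for i in best), fitness(best))
```

Because the fitter half is carried over unchanged in every generation, the best sum found can never decrease from one generation to the next; for this data set the best possible result is 97 + 78 + 59 + 54 + 46 = 334.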


TOOLS

MUSCLE v/s T-COFFEE : An overview and different aspects

Muniba Faiza

Image Credit: Google Images

“MUSCLE and T-COFFEE both are multiple sequence alignment tools and also help to study the evolutionary relationships among the species.”

As I have discussed the multiple sequence alignment (MSA) tools MUSCLE and T-COFFEE in my earlier articles, in this article we will discuss different aspects of these tools and which one is preferred over the other. MUSCLE and T-COFFEE are both multiple sequence alignment tools and also help to study the evolutionary relationships among species. The algorithms involved in both tools, which I have already explained, are comparable.

During alignment, MUSCLE uses the UPGMA tree construction method, which assumes that mutation occurs at a constant rate. This is one point that sets it apart from other tools. On the positive side, MUSCLE is known for its speed and accuracy on each of the four benchmark test sets (BAliBASE, SABmark, SMART and PREFAB). It is much faster than other MSA tools. MUSCLE also uses progressive alignment, which is iteratively refined as long as the SP (sum-of-pairs) score improves (explained in the “Basic concept of MSA” article).

T-COFFEE is an improvement over MUSCLE in the sense that it combines both global and local alignments, which provides better results, and it also qualifies on the four benchmark tests. The second thing that makes it better than other tools is that it uses an optimization method which provides the multiple alignment that best fits the input library. T-COFFEE also uses a progressive alignment strategy similar to MUSCLE, but unlike MUSCLE, T-COFFEE uses the Neighbor Joining tree construction method during alignment, which corrects the assumption of the UPGMA method: it does not assume that mutation occurs at a constant rate.

Let us take the protein sequences of the ‘Keratin’ protein of a few species, align them using both tools and construct the respective phylogenetic trees. In this example, I have taken FASTA sequences of: Homo sapiens (GI: 7717238), Paralichthys olivaceus (GI: 10716084), Pseudomonas viridiflava (GI: 934022154) and Pseudomonas aeruginosa (GI: 856785229). The results are as follows:

As we have seen, the two trees are slightly different. In the tree constructed by MUSCLE, the sequence of Paralichthys olivaceus is placed below that of Homo sapiens, whereas in the tree constructed by T-COFFEE it is placed above. The same is the case with the other two species. This is how MUSCLE and T-COFFEE differ from each other.
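The SP (sum-of-pairs) score mentioned above, which MUSCLE uses to decide whether a refined alignment is better, can be illustrated with a small Python sketch. The scoring values below (match +1, mismatch -1, residue against gap -2, gap against gap 0) are assumptions chosen for the example; real tools use substitution matrices and more elaborate gap penalties, and the function names are made up for this sketch.

```python
from itertools import combinations
from typing import List

MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative scores, assumed for this sketch


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned characters."""
    if a == "-" and b == "-":
        return 0          # gap against gap is ignored
    if a == "-" or b == "-":
        return GAP        # residue against gap
    return MATCH if a == b else MISMATCH


def sp_score(alignment: List[str]) -> int:
    """Sum the pair scores over every column and every pair of sequences."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


# Two alternative alignments of the same three sequences (ACGT, AGCT, ACGCT).
good = ["ACG-T",
        "A-GCT",
        "ACGCT"]
bad = ["ACGT-",
       "AGCT-",
       "ACGCT"]

print(sp_score(good), sp_score(bad))
```

An iterative refinement step would keep a modified alignment only if its SP score is higher than that of the alignment it started from, which is why `good` would be preferred over `bad` here.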


Fig 1. Tree constructed using MUSCLE.

Fig 2. Tree constructed using T-COFFEE.

T-COFFEE is preferred over MUSCLE for aligning both closely and distantly related species, but MUSCLE is more suitable for aligning distantly related species since it uses global alignment only, whereas T-COFFEE uses both global and local alignment.

Note: An exhaustive list of references for this article is available with the author on personal request; for more details write to muniba@bioinformaticsreview.com.


CADD

Active learning in drug-target interactions

Muniba Faiza

Image Credit: Google Images

“Active learning is a powerful tool for drug discovery and development where it reduces the tedious process of performing a number of experiments which are required to produce significant high-confidence predictions.”

Active learning is a kind of machine learning. Basically, in active learning, a learning algorithm chooses which experiments to perform in order to produce a desired output. Active learning is a powerful tool for drug discovery and development, where it reduces the tedious process of performing the many experiments required to produce significant high-confidence predictions. However, in practice it is difficult to decide when to stop the experimentation process. Therefore, applying a reliable stopping criterion to the algorithm reduces both the time and the cost of the experimentation process. The basis of active learning is having good predictive models to guide experimentation.

Active learning iteratively builds a model for drug-target interactions. Instead of relying on a large, manually assembled training data set, the active learning procedure increases the training set stepwise. Thus, time and experimental cost are reduced and are spent only on improving the model rather than on verifying a specific model that may not even give the desired outcome or suit the specifications under consideration.

How does active learning work?

Active learning is an iterative process and is completed in four steps:

1. Initialization
2. Model
3. Active learning algorithm
4. Accuracy measure of the predicted output

The active learning strategy starts with an initialization step in which an interaction matrix for drugs and targets is formed. From this matrix, a subset of known labels is provided, along with the drug and target kernels Kd and Kt, respectively.


The model predicts the drug-target interactions. Based on the obtained predictions, the active learning algorithm is applied to find new experiments (labels) which will improve the model according to the requirements. Here, batchwise learning is applied: a fixed number of experiments is queried in each training round, thereby increasing the number of labelled experiments.

Each training round corresponds to a specific time point, measured by the number of experiments performed. For each time point, the accuracy of the model is estimated using various methods. The process is stopped under certain conditions, for example when a given budget for performing experiments is reached or the predicted accuracy of the model is high enough.
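The four steps above (initialization, model, active learning algorithm, accuracy measure) and the two stopping conditions can be put together in a small simulated example. Everything in this sketch is an assumption made for illustration: drugs and targets are described by single random numbers standing in for the kernels Kd and Kt, the "experiment" is a made-up rule, the model is a simple threshold, and the query strategy picks the pairs closest to the current decision boundary. In a real campaign the accuracy would have to be estimated (for example by cross-validation on the labelled pairs), whereas here it is computed against the simulated ground truth.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (drug index, target index)

rng = random.Random(1)
# Hypothetical one-dimensional descriptors standing in for the kernels Kd and Kt.
drugs = [rng.random() for _ in range(8)]
targets = [rng.random() for _ in range(6)]
pool: List[Pair] = [(i, j) for i in range(len(drugs)) for j in range(len(targets))]


def feature(pair: Pair) -> float:
    """Single number describing a drug-target pair (assumed for the demo)."""
    return drugs[pair[0]] + targets[pair[1]]


def run_experiment(pair: Pair) -> int:
    """Simulated experiment (1 = interaction); the rule is invented."""
    return int(feature(pair) > 1.0)


def fit_threshold(labels: Dict[Pair, int]) -> float:
    """Model: a decision threshold between labelled negatives and positives."""
    neg = [feature(p) for p, y in labels.items() if y == 0]
    pos = [feature(p) for p, y in labels.items() if y == 1]
    if neg and pos:
        return (max(neg) + min(pos)) / 2
    return max(neg) if neg else min(pos)  # only one class seen so far


def active_learning(batch_size: int = 4, budget: int = 24,
                    target_accuracy: float = 1.0) -> List[Tuple[int, float]]:
    # 1. Initialization: a small random subset of known labels.
    labels = {p: run_experiment(p) for p in rng.sample(pool, batch_size)}
    history = []
    while True:
        # 2. Model: fit on everything labelled so far.
        threshold = fit_threshold(labels)
        # 4. Accuracy at this time point (simulation: compared with ground truth).
        correct = sum(int(feature(p) > threshold) == run_experiment(p) for p in pool)
        accuracy = correct / len(pool)
        history.append((len(labels), accuracy))
        unlabeled = [p for p in pool if p not in labels]
        # Stopping criteria: accuracy high enough, budget reached, or nothing left.
        if accuracy >= target_accuracy or len(labels) >= budget or not unlabeled:
            return history
        # 3. Active learning: query the batch the model is least sure about.
        unlabeled.sort(key=lambda p: abs(feature(p) - threshold))
        for p in unlabeled[:batch_size]:
            labels[p] = run_experiment(p)


history = active_learning()
for n_labels, accuracy in history:
    print(f"{n_labels:2d} experiments -> accuracy {accuracy:.2f}")
```

Each entry of `history` is one time point: the number of experiments performed so far and the accuracy of the model at that point, which is the information a stopping rule needs.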

This is the basic idea of active learning as applied to drug-target interaction prediction. It saves a lot of the time and cost involved in performing experiments in vitro.

Note: An exhaustive list of references for this article is available with the author on personal request; for more details write to muniba@bioinformaticsreview.com
