
Proc. of Int. Conf. on Control, Communication and Power Engineering 2010

Machine Learning Approaches to Determine the “Drug-Likeness” of the Proteomic Targets

Varun Gopal.K, Jiyesh M.M, Sankaranarayanan.A, P.K Krishnan Namboori

Harilal.P, Sai Krishna.A Computational Chemistry Group Amrita School of Biotechnology Amrita Vishwa Vidyapeetham University, Amritapuri, Kollam – 690 525, INDIA harilal.navami@gmail.com , saikrisharjun@gmail.com

Computational Chemistry Group Computational Engineering and Networking Amrita Vishwa Vidyapeetham University, Ettimadai, Coimbatore-641 105 varungopal19@gmail.com , n_krishnan@cb.amrita.edu

Abstract—Compounds from discovery are often poor candidates for lead optimization or preclinical testing because screening efforts focus on target affinity while paying limited attention to ADME/Tox properties. Here, the ADME/Tox properties of selected drugs have been studied and a mathematical model has been developed using machine learning algorithms. The model predicts whether a given molecule is a proteomic drug or not, which is a preliminary step in drug design. A total of 630 proteomic drugs and an equal number of non-drugs were obtained from database sources, and 1103 descriptors were generated for both drugs and non-drugs. The resulting datasets were manually validated, and the descriptor load was reduced using principal component analysis (PCA). Statistical machine learning techniques, namely artificial neural networks (ANN) and support vector machines (SVM), were used to explore the data and study drug-likeness. SVM was found to be the best classifier, providing a classification accuracy of 93%.

This allows the identification of the chemical groups responsible for inducing a target biological effect in the organism. The technique was later extended to build mathematical models that relate a chemical structure to its biological function, known as quantitative structure-activity relationships (QSAR). A structure-activity relationship is defined as the relationship between the chemical structure of a compound and its pharmacological activity. Structure-activity relationships are further classified as quantitative structure-activity relationships (QSAR) and qualitative structure-activity relationships (qSAR). QSAR represents an attempt to correlate structural or property descriptors of compounds with activities [1]. These physicochemical descriptors, which include parameters accounting for hydrophobicity, topology, electronic properties and steric effects, are determined empirically or by computational methods. Activities used in QSAR include chemical measurements and biological assays. At present, QSAR is mainly applied in the field of drug design.

ADME/Tox is very important in drug design, owing to its increasing role in advancing high-quality candidate drugs. ADME/Tox describes how a compound interacts with the rest of the body to produce activity and toxicity (Fig. 1).

Keywords— Machine Learning, Drug-Likeness, Proteomic Targets, SVM

I. INTRODUCTION

Drug-likeness is a qualitative means of assessing whether a given molecule is a drug or not. It is defined as a complex balance of various molecular properties and structural features that determines whether a particular molecule is similar to known drugs. These properties, mainly hydrophobicity, electronic distribution, hydrogen bonding characteristics, molecular size and flexibility, and the presence of various pharmacophoric features, influence the behavior of a molecule in a living organism, including bioavailability, transport properties, affinity to proteins, reactivity, toxicity, metabolic stability and many others.

The success of a drug’s journey through the body is measured in the dimensions of absorption, distribution, metabolism, and elimination (ADME).

Structure-activity relationships (SAR) are the conventional practice of medicinal chemistry, which tries to alter the effect or activity of bioactive chemical compounds by changing their chemical structure [1]. Medicinal chemists use the techniques of chemical synthesis to introduce new chemical groups into a biochemical compound and test the modifications for their biological effects.

Figure 1. Pictorial representation of ADME

253 © 2009 ACEEE



The DrugBank database is a distinctive bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information, including sequence, structure, and pathway data [2, 3]. The database contains nearly 4,800 drug entries, including more than 1,350 FDA-approved small molecule drugs, 123 FDA-approved protein/peptide drugs, 71 nutraceuticals and more than 3,243 experimental drugs. In addition, more than 2,500 non-redundant protein drug target sequences are linked to these FDA-approved drug entries. Of these drug entries, 697 are proteomic drugs.

The SuperToxic database includes data from publicly available databases and the scientific literature, gathering a large number of toxic compounds. Currently, about 60,000 structures with related properties are stored in the database [4]. In addition, properties such as the number of hydrogen bond (H-bond) donors and acceptors, the molecular weight and the octanol–water partition coefficient logP, which permit the evaluation of Lipinski’s Rule of Five, can be found within the database.
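The Rule of Five evaluation mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the commonly quoted thresholds (molecular weight at most 500, logP at most 5, at most 5 H-bond donors, at most 10 H-bond acceptors, with one violation tolerated). The `Properties` class, the function names and the example values are hypothetical and are not taken from DrugBank or SuperToxic.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    """Hypothetical container for the four properties named in the text."""
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(p: Properties) -> int:
    """Count how many of the four Rule of Five criteria are broken."""
    checks = [
        p.molecular_weight > 500,
        p.log_p > 5,
        p.h_bond_donors > 5,
        p.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(p: Properties, max_violations: int = 1) -> bool:
    """A compound is usually accepted if it breaks at most one criterion."""
    return lipinski_violations(p) <= max_violations


# Made-up property values, for illustration only.
small_molecule = Properties(350.0, 2.1, 2, 5)
bulky_molecule = Properties(720.0, 6.3, 6, 12)

print(lipinski_violations(small_molecule), passes_rule_of_five(small_molecule))
print(lipinski_violations(bulky_molecule), passes_rule_of_five(bulky_molecule))
```

Such a check uses only four descriptors, whereas the study described here works with a much larger descriptor set.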

The complexity parameter C sets the trade-off between choosing a large-margin classifier and the amount by which misclassified samples are tolerated. A higher value of C means that more importance is attached to minimizing the amount of misclassification than to finding a wide-margin model [11]. In addition to the C parameter, each kernel may have a number of parameters associated with it.

An Artificial Neural Network (ANN) is an information processing model inspired by the way biological nervous systems process information. The key element of this paradigm is the novel structure of the information processing system [8]. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process [9].

Principal Component Analysis (PCA) is a linear transformation method that diagonalizes the covariance matrix of the input data via a variance maximization process [10]. PCA selects features by transforming a high-dimensional original feature space into a smaller number of uncorrelated features called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
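The PCA description above (diagonalizing the covariance matrix and keeping the directions of largest variance) can be illustrated with a short NumPy sketch. It runs on random toy data rather than on the descriptor set of this study, and the function name `pca` and the array shapes are assumptions made for illustration.

```python
import numpy as np


def pca(X: np.ndarray, n_components: int):
    """Project X (samples x features) onto its leading principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)       # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # directions of largest variance
    scores = X_centered @ components             # uncorrelated new features
    explained = eigvals[order] / eigvals.sum()   # fraction of variance kept
    return scores, components, explained


# Toy data: 200 samples, 5 features, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

scores, components, explained = pca(X, n_components=3)
print(scores.shape)
print(np.round(explained, 3))
```

The covariance matrix of `scores` is diagonal up to rounding error, which is the sense in which the principal components are uncorrelated.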

PreADMET is a web-based application for predicting ADME data and building a drug-likeness library using in silico methods [5, 6]. The molecular descriptors for the molecules were generated with it. A molecular descriptor is the result of a standardized numerical calculation based on a logical and mathematical interpretation of chemical information about a molecule, such as its chemical formula, molecular structure and interactions. Theoretical molecular descriptors fall into the following categories: constitutional, topological, lipophilic, geometrical, electronic, thermodynamic and other descriptors.

III. MATERIALS AND METHODS

DrugBank lists 697 proteomic drugs in total, of which 630 were collected. The remaining 67 drugs were eliminated because no MOL files were available for them. Some of the collected drugs are listed in Table I.

II. THEORY

The aim of machine learning is to build systems that can adapt to their environment and learn from experience. Support vector machines (SVM) and kernel methods have been established as powerful machine learning tools in various applications such as bioinformatics. SVM is capable of representing nonlinear relationships and producing models that generalize well to unseen data [7]. For binary classification, a linear SVM finds an optimal linear separator between the two classes of data. This optimal separator is the one that results in the widest margin of separation between the two classes, as a wide margin implies that the classifier is better able to classify unseen data (Fig. 2).
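The notion of a linear separator and its margin can be made concrete with a small sketch. For a separator w·x + b = 0 scaled so that the closest points satisfy |w·x + b| = 1, the margin width is 2/||w||. The four toy points and the hand-picked values of `w` and `b` below are assumptions for illustration and are not the output of a trained SVM.

```python
import numpy as np


def decision_values(X, w, b):
    """Signed score w.x + b for every row of X."""
    return X @ w + b


def predict(X, w, b):
    """Assign class +1 or -1 according to the side of the separator."""
    return np.where(decision_values(X, w, b) >= 0, 1, -1)


def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and -1."""
    return 2.0 / np.linalg.norm(w)


# Toy data: two points per class in two dimensions.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Hand-picked separator x1 + x2 = 2, scaled so the closest points score +1 and -1.
w = np.array([0.5, 0.5])
b = -1.0

print(decision_values(X, w, b))
print(predict(X, w, b))
print(margin_width(w))
```

Here the points (2, 2) and (0, 0) lie exactly on the margin hyperplanes and play the role of support vectors.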

TABLE I. LIST OF SOME DRUGS

Abarelix        Bosentan             Erythromycin    Tretinoin
Acamprosate     Erlotinib            Escitalopram    Triazolam
Acebutolol      Trichlormethiazide   Esmolol         Zaleplon
Acenocoumarol   Valdecoxib           Pivampicillin   Valrubicin
Acepromazine    Valganciclovir       Porfimer        Zimelidine
Bortezomib      Valproic Acid        Posaconazole    Ziprasidone

In the same way, 630 toxic chemicals were collected from the SuperToxic database. Some of them are listed in Table II.

TABLE II. LIST OF SOME TOXINS

4-Methoxybenzoic acid hydrazide
3-Methyl-4-furazancarbohydrazide
1,2,3,4,6,7,8,9-Octachlorodibenzo-p-dioxin
2-Methoxy-4-nitrophenol

Figure 2. Mapping of input vectors by SVM

To regulate overfitting, SVMs have a complexity (capacity) parameter, C, which determines the trade-off between a wide margin and the misclassification that is tolerated.
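The role of C can be seen in the standard textbook soft-margin objective, 0.5·||w||² + C·Σ max(0, 1 − y_i(w·x_i + b)), which is not printed in this paper and is given here as an assumption about the formulation being used. The sketch below evaluates that objective for a fixed, hand-picked separator on toy data with one point on the wrong side of the separator.

```python
import numpy as np


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 plus C times the summed hinge loss of all samples."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * float(w @ w) + C * float(hinge.sum())


# Toy data; the last point is labeled +1 but lies on the negative side.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0], [0.5, 0.5]])
y = np.array([1, 1, -1, -1, 1])

# Hand-picked separator (an assumption, not a trained model).
w = np.array([0.5, 0.5])
b = -1.0

for C in (1.0, 10.0):
    print(C, soft_margin_objective(w, b, X, y, C))
```

The margin term stays the same while the penalty for the misplaced point scales with C, so a larger C pushes the optimizer toward fitting that point at the expense of margin width.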


The first step in the analysis was to extract all possible descriptors using the PreADMET online application. The descriptors include molecular, constitutional, topological and geometrical descriptors as well as drug-likeness. Absorption, distribution, metabolism, elimination, toxicity and drug-likeness (Lipinski’s rule) properties were collected for the 630 drugs and 630 non-drugs. Altogether, 1103 descriptors were generated for both drugs and non-drugs. Some descriptors were found to contain only zeros and blanks; these were removed by manual validation, which reduced the number of descriptors to 715. The data set was then subjected to PCA, which suggested that only 432 of the 715 components were relevant for predicting drug-likeness.

All three data sets, namely the original data set, the manually validated data set and the final data set after PCA, were given to the SVM for classification. The prediction performance was assessed by a 5-fold cross-validation test, in which the data set of 1260 molecules was divided into five subsets of roughly equal size, so that the data were split into training and test sets in five different ways. After training the SVM on four subsets, its performance was tested on the fifth subset. This process was repeated five times so that each subset was used once as the test data.
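Two of the steps described above, discarding descriptors that hold only zeros and blanks and splitting 1260 samples into five folds, can be sketched as follows. The validation in this study was done manually; the automated filter below, the representation of blanks as NaN, and the function names are assumptions made for illustration.

```python
import numpy as np


def drop_uninformative_columns(X):
    """Drop descriptor columns that hold only zeros and blanks (NaN here)."""
    filled = np.nan_to_num(X, nan=0.0)        # treat blanks as zeros
    keep = np.any(filled != 0.0, axis=0)      # True where a column has signal
    return X[:, keep], keep


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


# Toy descriptor matrix: column 0 is all zeros, column 2 is all blanks.
X_toy = np.array([[0.0, 1.5, np.nan],
                  [0.0, 2.5, np.nan]])
X_kept, kept_mask = drop_uninformative_columns(X_toy)
print(kept_mask)

# Five train/test splits over 1260 samples (630 drugs plus 630 non-drugs).
splits = list(k_fold_indices(1260))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With 1260 samples, every fold holds 252 test molecules and the remaining 1008 are used for training.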

All three datasets were subjected to SVM. The prediction accuracy was 63% for the original dataset and 67% for the manually validated dataset. The prediction performance for the third dataset was validated by the 5-fold cross-validation test, and the total prediction accuracy for the final dataset was found to be 93%, obtained at a C value of 10.0. ANN gave accuracies of 80% and 86% for the manually validated dataset and the final dataset after PCA, respectively. The results show that SVM has the better prediction accuracy, at 93%.

V. CONCLUSION

Machine learning techniques help in predicting drug-likeness effectively. PCA reduces the descriptor load and identifies the relative importance of the descriptors. The prediction accuracy of ANN was found to be 86%, whereas SVM, a commonly used bioinformatics prediction tool, gives 93% prediction accuracy for the data set.


As the classification is binary, SVM light was used, and results were analyzed for all three data sets. Two datasets, the manually validated dataset and the final dataset after PCA, were subjected to ANN analysis using Matlab (2007a, The MathWorks), and the results for both were analyzed.
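SVM light reads training examples as plain-text lines of the form `<target> <index>:<value> ...`, with 1-based feature indices in increasing order and zero-valued features optionally omitted. The helper below is a hypothetical sketch of how a descriptor vector could be written in that format. The label convention (+1 for drugs, -1 for non-drugs) and the example values are assumptions, since the input files are not shown here.

```python
from typing import Sequence


def to_svmlight_line(label: int, features: Sequence[float]) -> str:
    """Format one example as '<target> <index>:<value> ...' (zeros omitted)."""
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + pairs)


# Made-up descriptor vectors, for illustration only.
print(to_svmlight_line(+1, [0.5, 0.0, 3.0]))
print(to_svmlight_line(-1, [0.0, 2.25]))
```

In SVM light the trade-off parameter C is passed to `svm_learn` through its `-c` option.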


IV. RESULTS AND DISCUSSION

Data were collected and descriptors were computed. The total number of molecules taken for the analysis was 1260, and the total number of descriptors generated was 1103. The dataset was validated manually, which reduced the descriptor load to 715. This reduced dataset was then subjected to PCA, which further reduced the descriptor load to 432.

TABLE III. FIVE-FOLD CROSS-VALIDATION TABLE OF DATASET AFTER PERFORMING PCA

C-value   1        2        3        4        5
1.0       70.00%   74.34%   72.80%   76.87%   76.35%
5.0       70.00%   74.41%   72.80%   82.04%   76.52%
10.0      70.05%   74.31%   72.80%   92.74%   86.15%
15.0      70.09%   74.64%   72.80%   86.11%   79.15%
20.0      70.14%   74.55%   72.84%   84.90%   76.15%
25.0      70.21%   74.37%   72.84%   82.22%   76.03%
30.0      70.33%   74.21%   72.84%   72.03%   76.03%
50.0      70.99%   74.17%   72.81%   72.42%   76.03%
75.0      70.99%   74.06%   72.80%   70.72%   76.03%
100.0     70.99%   74.06%   72.80%   70.34%   76.03%
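The search over C values in Table III can be reproduced with a short lookup. The sketch below copies the table entries and locates the largest accuracy. Reading columns 1 to 5 as the five cross-validation subsets is an assumption, since the table does not label them.

```python
# Accuracies (%) copied from Table III, keyed by C value.
# Columns 1-5 are assumed to be the five cross-validation subsets.
accuracy_by_c = {
    1.0: [70.00, 74.34, 72.80, 76.87, 76.35],
    5.0: [70.00, 74.41, 72.80, 82.04, 76.52],
    10.0: [70.05, 74.31, 72.80, 92.74, 86.15],
    15.0: [70.09, 74.64, 72.80, 86.11, 79.15],
    20.0: [70.14, 74.55, 72.84, 84.90, 76.15],
    25.0: [70.21, 74.37, 72.84, 82.22, 76.03],
    30.0: [70.33, 74.21, 72.84, 72.03, 76.03],
    50.0: [70.99, 74.17, 72.81, 72.42, 76.03],
    75.0: [70.99, 74.06, 72.80, 70.72, 76.03],
    100.0: [70.99, 74.06, 72.80, 70.34, 76.03],
}


def best_entry(table):
    """Return (C, column number, accuracy) of the largest table entry."""
    return max(
        ((c, col + 1, acc) for c, row in table.items() for col, acc in enumerate(row)),
        key=lambda entry: entry[2],
    )


best_c, best_column, best_accuracy = best_entry(accuracy_by_c)
print(best_c, best_column, best_accuracy)
```

The largest entry, 92.74% in column 4 at C = 10.0, matches the reported accuracy of 93% at a C value of 10.0.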

REFERENCES

[1] G. A. Patani, E. J. LaVoie, Bioisosterism: A Rational Approach in Drug Design, Chem. Rev., 96, pp. 3147-3176, April 1996.
[2] Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, Gautam B, Hassanali M, DrugBank: a knowledgebase for drugs, drug actions and drug targets, Nucleic Acids Res., 36(Database issue), pp. 901-6, Jan 2008.
[3] Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J, DrugBank: a comprehensive resource for in silico drug discovery and exploration, Nucleic Acids Res., 34(Database issue), pp. 668-72, Jan 2006.
[4] Schmidt, U.; Struck, S. et al., SuperToxic: a comprehensive database of toxic compounds, Nucleic Acids Research, Jan 2008.
[5] S.K. Lee, G.S. Chang, I.H. Lee, J.E. Chung, K.Y. Sung, K.T. No, The PreADME: PC-based program for batch prediction of ADME properties, EuroQSAR 2004, Istanbul, Turkey, 2004.
[6] S.K. Lee, I.H. Lee, H.J. Kim, G.S. Chang, J.E. Chung, K.T. No, The PreADME Approach: Web-based program for rapid prediction of physico-chemical, drug absorption and drug-like properties, EuroQSAR 2002 Designing Drugs and Crop Protectants: processes, problems and solutions, Blackwell Publishing, Massachusetts, USA, pp. 418-420, 2003.
[7] Ovidiu Ivanciuc, Applications of Support Vector Machines in Chemistry, In: Reviews in Computational Chemistry, Volume 23, pp. 291–400, 2007.
[8] Cybenko, G.V., Approximation by Superpositions of a Sigmoidal Function, Mathematics of Control, Signals and Systems, Vol. 2, pp. 303–314, 1989.
[9] Siegelmann, H.T. and Sontag, E.D., Analog computation via neural networks, Theoretical Computer Science, Vol. 131, No. 2, pp. 331–360, 1994.
[10] C. Ding and X. He, K-means Clustering via Principal Component Analysis, Proc. of Int'l Conf. Machine Learning (ICML 2004), pp. 225–232, July 2004.
[11] Thorsten Joachims, Learning to Classify Text Using Support Vector Machines, Dissertation, Kluwer, 2002.

