Proceedings of MLLS 2015

Proceedings of the 2nd Workshop on Machine Learning in Life Sciences (MLLS 2015)

Edited by: Dr. Bartosz Krawczyk, Wroclaw University of Technology, Poland e-mail: bartosz.krawczyk@pwr.edu.pl

Prof. Michał Woźniak, Wroclaw University of Technology, Poland e-mail: michal.wozniak@pwr.edu.pl

ENGINE Center, Wroclaw University of Technology, Wrocław, Poland, 2015


This work was supported by EC under FP7, Coordination and Support Action, Grant Agreement Number 316097, ENGINE - European Research Centre of Network Intelligence for Innovation Enhancement (http://engine.pwr.wroc.pl/).

Editorial layout and cover design Bartosz Krawczyk

© Copyright by ENGINE Center, Wroclaw University of Technology, Wrocław, 2015

ENGINE - European Research Centre of Network Intelligence for Innovation Enhancement
Wrocław University of Technology
Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland

ISBN: 978-83-943803-0-4


Table of contents

Preface
Predicting Gait Retraining Strategies for Knee Osteoarthritis
Probability Index of Metric Correspondence as a measure of visualization reliability
Increasing Weak Classifiers Diversity by Omics Networks
Active Learning of Compounds Activity - Towards Scientifically Sound Simulation of Drug Candidates Identification
Learning symbolic features for rule induction in computer aided diagnosis
Learning to rank chemical compounds based on their multiprotein activity using Random Forests
One-Class Rotation Forest for High-Dimensional Data Classification


PREFACE

Life sciences, ranging from medicine, biology, and genetics to biochemistry and pharmacology, have developed rapidly in recent years. The computerization of these domains has made it possible to gather and store enormous collections of data, and analyzing such vast amounts of information without any support is impossible for a human being. Machine learning and pattern recognition methods have therefore recently attracted the attention of a broad spectrum of experts from the life sciences domain.

In order to offer a scientific environment to discuss recent and emerging trends in this domain, we organised the second edition of the Workshop on Machine Learning in Life Sciences. This year it was co-located with the prestigious European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2015 and held in Porto, Portugal on 11th September 2015.

This workshop aimed to emphasize the importance of interdisciplinary collaboration between the life and computer sciences and to provide an international forum both for practitioners seeking new cutting-edge tools for solving their domain problems and for theoreticians seeking interesting and real-life applications for their novel algorithms. Special focus was put on novel machine learning technologies designed to tackle complex medical, biological, chemical, or environmental data, taking into consideration the specific background knowledge and interactions between the considered problems. Novel applications of machine learning and pattern recognition tools to contemporary life sciences problems, shedding light on their strengths and weaknesses, were discussed, as well as new methods for data visualization and for the accessible presentation of machine learning results to life scientists. We also aimed at gathering developments in the intelligent processing of non-stationary medical, biological, and chemical data and in proposals for the efficient fusion of information coming from multiple sources.


Together with the Group of Machine Learning Research at the Jagiellonian University in Cracow, represented by Wojciech Czarnecki, Igor Podolak, and Jacek Tabor, in cooperation with the Institute of Pharmacology, Polish Academy of Sciences, Cracow, represented by Andrzej Bojarski and Sabina Smusz, we prepared a competition whose objective was to predict the activity of selected chemical structures against a set of given proteins. This was a difficult multi-label learning task, with a large, novel dataset published for the purposes of this challenge. With great pleasure, the organizing committee of the challenge announces that the winner of the competition is the team consisting of:

Damian Leśniak, Jagiellonian University
Piotr Kruk, Jagiellonian University
Michał Kowalik, Jagiellonian University

which achieved a score of 68.85% on the test set, beating the second team by a large margin of over 3%. The results have been reproduced and confirmed. Additionally, we had the pleasure of enjoying an invited speech by Prof. Igor Podolak from the Jagiellonian University, Cracow, Poland, entitled "Cheminformatics and Machine Learning: tasks and problems".

After a rigorous peer-review process we accepted 7 high-quality papers, which are presented in this electronic proceedings collection. We take this opportunity to thank all contributors for submitting their papers to this workshop and to express our deepest gratitude to all of the reviewers who participated in the evaluation of submissions. Their joint efforts allowed us to make this Workshop a success. Finally, we are immensely grateful to the organizers of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2015 for granting permission to hold this workshop and for their continuous support of our efforts. This Workshop was supported by the EC under FP7, Coordination and Support Action, Grant Agreement Number 316097, ENGINE - European Research Centre of Network Intelligence for Innovation Enhancement (http://engine.pwr.wroc.pl/), which reimbursed the conference fee to the authors of accepted and presented papers.

We are looking forward to the next edition of the Workshop on Machine Learning in Life Sciences, to take place in 2016.

Bartosz Krawczyk Michał Woźniak


Predicting Gait Retraining Strategies for Knee Osteoarthritis

Benjamin Wittevrongel 1*, Irma Ravkic 2*, Wannes Meert 2, Jesse Davis 2, Tim Gerbrands 3,4 and Benedicte Vanwanseele 3,4

1 Laboratory for Neuro- and Psychophysiology, KU Leuven, Belgium
2 Department of Computer Science, KU Leuven, Leuven, Belgium
3 Department of Kinesiology, KU Leuven, Leuven, Belgium
4 Health Innovation and Technology, Fontys University of Applied Sciences

* These authors contributed equally to this work.

Abstract. Symptomatic knee osteoarthritis is one of the most common types of osteoarthritis and is one of the top ten causes of years of life lost due to disability. One common treatment strategy involves the prolonged use of anti-inflammatory medications. Recently, gait retraining has been proposed as a non-invasive and non-pharmacological treatment option for knee osteoarthritis. However, many possible gait retraining strategies exist and it is unknown a priori which strategy will work best for a given patient. Hence, it is often necessary to pursue a trial-and-error approach to find the best strategy. In this paper we investigate using two standard machine learning techniques, decision trees and rule sets, to build models based on features of a subject's normal gait to predict which strategy will work best. We were able to learn reasonably accurate models for this task. Furthermore, as the learned models are interpretable, a domain expert was able to gain several insights into the problem by inspecting them. This work shows that machine learning can be useful for predicting treatment strategies in rheumatology.

Keywords: knee osteoarthritis, gait retraining prediction, machine learning

1 Introduction

Symptomatic knee osteoarthritis (OA) is one of the most common types of OA and is one of the top ten causes of 'years of life lost due to disability' [7]. Approximately 44% of the human population will develop symptoms of this disease during their lifetime [11], and this percentage rises to 66% among people suffering from obesity [11]. Additionally, OA entails high socio-economic costs [2, 1], mainly due to visits to doctors, surgery, and medication, as well as indirect costs such as missed work time.

Gait retraining has recently been proposed as a non-invasive treatment strategy for knee OA because several studies have shown that a high external knee adduction moment (EKAM) during walking is closely related to the progression of knee OA, as the EKAM reflects the medio-lateral load distribution on the tibio-femoral joint [9, 13]. Gait retraining attempts to slow OA's progression by modifying a patient's gait kinematics to reduce the EKAM. Several gait modification techniques have been tested and shown to effectively reduce the EKAM, including medialising the knee during the stance phase (Medial Thrust) [3], leaning the trunk in the direction of the stance leg (Trunk Lean) [10], and increasing the toe-out angle [12], among others.

Determining the best gait retraining strategy in practice usually relies on intuition and experimentation, which can be imprecise and time-consuming. In contrast, Fregly et al. [4, 3] designed the Medial Thrust retraining strategy using a computational approach. They employed a dynamic optimization of a patient-specific, full-body gait model to predict the 3-D gait modifications that reduce both peaks of the external knee adduction torque. While this strategy was able to pinpoint exactly how a patient can achieve the greatest EKAM reduction, calculating full-body models necessitated an expensive laboratory motion capture camera system to capture the patient's gait, as well as an expensive software package. Additionally, the study was performed on only one subject.

In this paper, we tackle the task of selecting the appropriate gait retraining strategy as a machine learning problem. Specifically, we look at building a model based on characteristics of an individual's habitual gait. We examine data from both healthy and arthritic subjects that was collected as part of a study comparing which strategies are most effective in practice [5]. We obtained a model with an area under the ROC curve of 0.75 when training on healthy subjects and 0.92 when training on arthritic subjects. Additionally, we presented the learned models to a domain expert for interpretation. She determined a promising location for attaching a gyroscope to an individual in order to measure gait characteristics that are predictive of the best retraining strategy. This offers the potential to avoid using an expensive motion capture camera to select a retraining strategy.

2 Materials and Methods

This section describes the data set we used, the challenges we encountered processing the data, and our experimental methodology for predicting the best gait retraining strategy.

2.1 Data Collection

The data we use was collected as part of Tim Gerbrands' PhD thesis. The study includes 62 subjects aged between 18 and 65, of whom 28 were diagnosed with symptomatic medial tibiofemoral knee osteoarthritis; the rest were healthy individuals. Each subject is described by a number of gait features, and the expert determined the class label to be the gait retraining strategy that maximally reduced the external knee adduction moment. The subset of this data containing only healthy subjects is analysed in the work of Gerbrands et al. [5].


The data was collected as follows. In order to acclimate to the environment and equipment, each participant walked freely and barefoot on a 13-meter-long walkway. Next, each subject was asked to perform the three walking conditions described in Table 1, and an investigator provided visual examples for each condition. Each participant was instructed to implement the strategies to the greatest extent possible at a self-selected speed such that walking was still comfortable. A practice period of five minutes was allowed, during which the investigator provided verbal feedback. Each subject performed the normal walking condition first, and then the remaining conditions were presented in a randomized order. Five successful trials of each condition were captured. Between conditions, a subject was asked to walk comfortably for approximately three minutes in order to minimize interference from one condition to the next. For each subject, the most effective gait modification was determined to be the one that resulted in the largest positive reduction in the external knee adduction moment compared to the normal walking condition.
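For concreteness, the labelling rule can be sketched as follows (a hypothetical helper with our own names; the EKAM values in the usage line are illustrative only):

```python
def best_strategy(ekam_normal, ekam_by_strategy):
    """Return the gait modification with the largest reduction in the
    EKAM relative to normal walking (hypothetical helper)."""
    reductions = {name: ekam_normal - ekam
                  for name, ekam in ekam_by_strategy.items()}
    return max(reductions, key=reductions.get)

# Illustrative values only:
print(best_strategy(3.1, {"Medial Thrust": 2.4, "Trunk Lean": 2.7}))
# -> "Medial Thrust" (reduction of 0.7 vs. 0.4)
```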

Walking condition   Instruction
Normal Walking      Walk freely and comfortably as you would on the street.
Trunk Lean          Lean right with the torso as the right foot has floor contact.
Medial Thrust       Move the right knee inwards/medial during right-legged stance.

Table 1: Instructions given to the subjects for each walking condition.

During each trial, a variety of sensors was employed to measure various aspects of the subject's gait. Joint angles of the leg and torso were measured at a frequency of 100 Hz with a dual-camera wireless active 3D system (Charnwood Dynamics Ltd., Codamotion CX 1). The ground reaction force was measured at a frequency of 1000 Hz during one step per trial with a recessed force plate (Advanced Mechanical Technology, Inc., OR 6-7). The EKAM was calculated through inverse dynamics, in which the knee centre served as the origin of the shank-fixed axes. Each subject is then described by the eleven gait features listed in Table 2.

Gerbrands et al. [5] showed that the Medial Thrust and Trunk Lean strategies affect both the overall peak and the impulse, and also provide the greatest reduction in EKAM. The former reduces the EKAM by bringing the knee joint closer to the center axis of the body, and the latter by having the patient lean more towards the supporting leg. Of the 34 healthy subjects, 18 were assigned Medial Thrust as the class label and 16 Trunk Lean. Of the 28 arthritic subjects, four have Medial Thrust as the class label and 24 Trunk Lean.


Feature                          Description
Knee Flexion                     The maximal extent to which the subject bends the knee over one gait cycle.
Trunk Angle                      The angle of the upper body while supporting on the arthritic leg.
Knee Adduction                   Minimal deviation from perfect alignment of upper and lower leg over one gait cycle.
Knee Abduction                   Maximal deviation from perfect alignment of upper and lower leg over one gait cycle.
Tibia Angle                      The maximal deviation of the tibia from the perpendicular angle to the ground.
Toe Out Angle                    The extent to which the subject moves the toes outwards during gait.
Knee Adduction Moment Peak       Maximal absolute value of the EKAM over one gait cycle.
Knee Adduction Moment Impulse    Area under the EKAM curve over one gait cycle.
1st Peak vGRF                    The first peak in the vertical component of the ground reaction force.
2nd Peak vGRF                    The second peak in the vertical component of the ground reaction force.
Walking Speed                    The speed at which the subject walked during the trials.

Table 2: Gait features used in the classification task.

2.2 Data Challenges and Preprocessing

Before applying the machine learning approaches, we performed an iterative investigation of the data to ensure its consistency. First, we identified outlier values by looking at histograms and boxplots for each feature. We identified one toe-out angle value that the domain expert determined was an outlier, and we marked this value as missing. Next, based on the assumption that the same type of movement for both patient classes should have similar characteristics, we calculated the mean, minimum, and maximum value of each feature in the data. We observed that the value of the maximal trunk angle for some subjects was lower than their minimal trunk angle. Finally, several issues arose when comparing the data from healthy and arthritic subjects, which were collected at different times. One was that identical features were coded with different names, which we resolved by consulting with the domain expert. A subtler issue was that several features had different signs. This arose because computing them required choosing one leg as a reference: all healthy subjects used the same leg, whereas for the arthritic subjects the selected leg depended on which knee was arthritic. Discovering this required several iterations with the domain experts.

Nine out of 11 features in the data set had at least one and at most three values missing. For each feature, we replaced its missing values with the average of the feature's known values. Since machine learning techniques can be sensitive to the range of values that numeric attributes take on [15], we explored whether discretization could improve the results. In this paper we used k-equal-frequency binning, which divides the data into k groups such that each group contains approximately the same number of values. When handling missing values and discretization, we only considered the training data when computing the average value of a missing feature and when selecting the bin widths for discretization.
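As an illustration, here is a minimal sketch of these two preprocessing steps (our own helper names; it assumes the gait features sit in a NumPy array with NaN marking missing values):

```python
import numpy as np

def fit_preprocessing(X_train, k=5):
    """Learn per-feature means and k-equal-frequency bin edges from the
    training data only, so no information leaks from the test subject."""
    means = np.nanmean(X_train, axis=0)
    edges = []
    for j in range(X_train.shape[1]):
        col = X_train[:, j]
        col = col[~np.isnan(col)]
        # interior quantiles give k bins with roughly equal counts
        edges.append(np.quantile(col, np.linspace(0, 1, k + 1)[1:-1]))
    return means, edges

def apply_preprocessing(X, means, edges):
    """Impute missing values with the training means, then discretize."""
    X = np.where(np.isnan(X), means, X)
    return np.column_stack(
        [np.digitize(X[:, j], edges[j]) for j in range(X.shape[1])])
```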

2.3 Methodology

We use the following three versions of the data in our experiments:

1. Healthy. This subset of the data contains individuals who do not suffer from arthritis.
2. Arthritic. This subset of the data consists of individuals who have knee osteoarthritis.
3. Combined. This data set combines both the healthy and arthritic subjects.

For each of these data sets we use the classifiers and empirical evaluation specified below.

Classifiers. In the experiments, we compare the predictive power of two classification models: decision trees and rule sets [8]. Both learners produce models that are easily understandable, which enables domain experts to analyze the learned models without needing in-depth knowledge of machine learning. A decision tree consists of internal nodes, which represent tests performed on attributes, and leaf nodes, which decide the label of an instance. Decision trees classify instances by sorting them from the root of the tree to a leaf node, following the path established by successful internal node tests. A rule set classifier makes a prediction based on a set of IF-THEN rules. The IF part of a rule contains a number of tests on features. If the IF part is satisfied for some instance, the THEN part determines its label. The rules are applied and evaluated in order from first to last; if none of the IF-THEN rules is satisfied, a default label is assigned. We use the decision tree and rule set implementations provided by Weka [6]: the J48 [14, 15] tree learning algorithm and the PART [15] rule learner. We configured both algorithms to disallow pruning. While this decision risks overfitting, we hope to learn more informative models that provide insight into the underlying process. Furthermore, we fixed the minimum number of instances each leaf must contain to three.

Evaluation Methodology. We perform leave-one-patient-out cross-validation to estimate the predictive performance of the learned models because this evaluation methodology is typically used when there are very few examples. This means that we repeatedly learn models on all but one patient, and use these models to predict the best retraining strategy for the left-out patient. We evaluate our models by reporting their prediction accuracy. Because the accuracy is influenced by the skewed class distribution, we indicate which results outperform the baseline classifier that always predicts the most frequent class label in the training data. We also report the area under the ROC curve (AUC) for the best model for each data set.
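For illustration, the evaluation loop can be sketched as follows (a minimal stand-in using scikit-learn's decision tree, which is unpruned by default, rather than the Weka J48/PART learners used in the paper; X and y denote the feature matrix and strategy labels):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

def loocv_accuracy(X, y):
    """Leave-one-patient-out estimate of prediction accuracy."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = DecisionTreeClassifier(min_samples_leaf=3)  # >= 3 instances per leaf
        clf.fit(X[train_idx], y[train_idx])
        correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)
```

The majority-classifier baseline is obtained analogously by always predicting the most frequent training label.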

3 Experiments

The goal of the experiments is to explore and answer the following questions:

Q1. Can we learn accurate models for predicting the best gait retraining strategy?
Q2. Will the same model apply to both the healthy and arthritic populations?
Q3. Do the learned models provide insights for a domain expert?
Q4. Can the learned models provide guidance for placing an accelerometer or a gyroscope on an individual to enable the relevant measurements to be made outside of an expensive lab setup?

3.1 Results

Next, we present the experimental results for the healthy, arthritic, and combined data sets.

Healthy. Table 3 presents the results on the healthy subjects. This appears to be a hard prediction problem, as many of the learned models do not outperform the majority classifier. However, when discretizing into 5-equal-frequency bins we are able to learn reasonably accurate models.

Discretization                Decision Tree       Rule Set
                              Accuracy    AUC     Accuracy    AUC
None                          52.9%       0.38    44.1%       0.52
3-equal frequency binning     41.2%       0.38    44.1%       0.52
4-equal frequency binning     47.1%       0.54    47.1%       0.52
5-equal frequency binning     61.8%†      0.52    76.5%†      0.75

Table 3: The accuracies and AUCs for the models learned using only the non-arthritic subjects. The best result is in bold, and † denotes results that are better than the majority classifier, which has an accuracy of 52.9% for this data set.

Arthritic. Table 4 presents the results when training only on the arthritic subjects. Again, we see that this is a challenging problem: the only model that outperforms the majority classifier is a decision tree built on data discretized into 5-equal-frequency bins.


Discretization                Decision Tree       Rule Set
                              Accuracy    AUC     Accuracy    AUC
None                          85.7%       0.72    85.7%       0.72
3-equal frequency binning     78.6%       0.67    85.7%       0.68
4-equal frequency binning     75.0%       0.51    67.9%       0.46
5-equal frequency binning     92.9%†      0.92    85.7%       0.80

Table 4: The accuracies and AUCs for the models trained using only the arthritic subjects. The best result is in bold, and † denotes results that are better than the majority classifier, which has an accuracy of 85.7% for this data set.

Combined. Table 5 presents the results learned on the combined data set of healthy and arthritic subjects. None of the learned models outperforms the majority classifier. One explanation is that arthritic subjects have changed their habitual gait to cope with the pain and joint immobility associated with the disease. To further explore this hypothesis, we used the best model learned from the healthy subjects to predict the gait retraining strategy for each arthritic subject. This resulted in an accuracy of 75.0%, which is worse than the majority classifier. Similarly, we used the best model learned from the arthritic subjects to predict the gait retraining strategies for each healthy subject. Again, this performed worse than the majority classifier, with an accuracy of 47.1% and an AUC of 0.48. This provides additional evidence that there are differences in the habitual gait features of the healthy and arthritic subjects.

Discretization                Decision Tree       Rule Set
                              Accuracy    AUC     Accuracy    AUC
None                          50.0%       0.31    32.25%      0.32
3-equal frequency binning     58.1%       0.60    61.3%       0.61
4-equal frequency binning     51.6%       0.38    54.8%       0.56
5-equal frequency binning     45.2%       0.43    59.7%       0.59

Table 5: Accuracies and AUCs for the decision trees and rule sets learned on the data consisting of both healthy and arthritic subjects. The best result is in bold, and † denotes results that are better than the majority classifier, which has an accuracy of 64.5% for this data set.

3.2 Discovered Knowledge

We were also interested in whether the learned models for healthy and arthritic subjects could provide any domain insight. Using the best settings determined above, we learned a decision tree and a rule set from each data set, using all available examples for training. We presented the resulting models to an expert in knee biomechanics and gait retraining for interpretation.

Healthy. The best decision tree learned for the healthy subjects is shown in Figure 1. The expert identified the most interesting branch in the tree as the one with the narrow ranges of values for knee abduction (−4.0, −3.0] and adduction (0.8, 2.3] that results in a prediction of Trunk Lean. She theorized that if there is not a lot of movement in the knees during the normal gait, then the subjects are better off with the Trunk Lean strategy. The best rule set for healthy subjects is depicted in Figure 2. The subjects for whom the first rule applies already bring their knee inwards a lot during their natural gait, and thus already implement Medial Thrust to some extent. Thus, it makes sense that the Trunk Lean strategy is predicted for these subjects.

Fig. 1: Decision tree learned for the healthy subjects

Fig. 2: Rule set learned for the healthy subjects

Arthritic. The best decision tree for this data is shown in Figure 3. The decision tree identifies one range of the tibia angle that results in a Medial Thrust prediction; for the other ranges, Trunk Lean is predicted. The best rule set for arthritic subjects is depicted in Figure 4. The rules that make use of the tibia angle are similar to what appears in the decision tree. The expert said that the rules with the trunk angle feature are less interesting because it is hard to accurately measure one-degree differences with a portable sensor. The expert found it interesting that the decision tree and the rule set predict Trunk Lean when the tibia angle is larger than six degrees. The tibia angle can be easily measured by positioning a gyroscope on the lower leg, which would potentially allow making predictions outside of a lab setup that uses an expensive camera-based motion capture system.

Fig. 3: Decision tree learned for the arthritic subjects

Fig. 4: Rule set learned for the arthritic subjects

4 Conclusions and Future Work

This paper addresses the task of predicting the best gait retraining strategy for an individual with knee osteoarthritis. Several different gait retraining strategies exist, and it is hard to know a priori which strategy will be most suitable for a specific patient. We used machine learning to tackle this problem. We were able to learn reasonably accurate models when training on only healthy individuals or only arthritic individuals. We presented several learned models to a domain expert. She identified several intuitive patterns. Furthermore, she identified a possible location to place a portable on-body sensor that would allow the relevant data to be collected outside of a lab setting that requires expensive equipment. In the future, we hope to analyze more data collected from patients with knee osteoarthritis, and to explore additional features of an individual’s gait. We would also like to analyze data collected using inexpensive, portable sensors for this task.


References

1. Bitton, R.: The economic burden of osteoarthritis. The American Journal of Managed Care 15(8) (2009)
2. Fautrel, B., Hilliquin, P., Rozenberg, S., Allaert, F.A., Coste, P., Leclerc, A., Rossignol, M.: Impact of osteoarthritis: results of a nationwide survey of 10,000 patients consulting for OA. Joint Bone Spine 72, 235–240 (2005)
3. Fregly, B.J., Reinbolt, J.A., Rooney, K.L., Mitchell, K.H., Chmielewski, T.L.: Design of patient-specific gait modifications for knee osteoarthritis rehabilitation. IEEE Transactions on Biomedical Engineering 54, 1687–1695 (September 2007)
4. Fregly, B.J., Rooney, K.L., Reinbolt, J.A.: Predicted gait modifications to reduce the peak knee adduction torque (July 2005)
5. Gerbrands, T., Pisters, M., Vanwanseele, B.: Individual selection of gait retraining strategies is essential to optimally reduce medial knee load during gait. Clinical Biomechanics (2014)
6. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explorations 11 (2009)
7. Lopez, A.D., Murray, C.J.: The global burden of disease, 1990–2020. Nature Medicine 4(11), 1241–1243 (November 1998)
8. Mitchell, T.M.: Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edn. (1997)
9. Miyazaki, T., Wada, M., Kawahara, H., Sato, M., Baba, H., Shimada, S.: Dynamic load at baseline can predict radiographic disease progression in medial compartment knee osteoarthritis. Annals of the Rheumatic Diseases 61(7), 617–622 (2002)
10. Mündermann, A., Asay, J.L., Mündermann, L., Andriacchi, T.P.: Implications of increased medio-lateral trunk sway for ambulatory mechanics. Journal of Biomechanics 41(1), 165–170 (2008)
11. Murphy, L., Schwartz, T.A., Renner, J.B., Koch, G., Kalsbeek, W.D., Jordan, J.M., Helmick, C.G., Tudor, G., Dragomir, A., Luta, G.: Lifetime risk of symptomatic knee osteoarthritis. Arthritis & Rheumatism (Arthritis Care & Research) 59(9), 1207–1213 (September 2008)
12. van den Noort, J.C., Schaffers, I., Snijders, J., Harlaar, J.: The effectiveness of voluntary modifications of gait pattern to reduce the knee adduction moment. Human Movement Science (2013)
13. Pollo, F.E., Otis, J.C., Backus, S.I., Warren, R.F., Wickiewicz, T.L.: Reduction of medial compartment loads with valgus bracing of the osteoarthritic knee. The American Journal of Sports Medicine 30(3), 414–421 (2002)
14. Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA (1993)
15. Witten, I., Frank, E., Hall, M.: Data Mining: Practical Machine Learning Tools and Techniques, third edition. Morgan Kaufmann Publishers (2011)


Probability Index of Metric Correspondence as a measure of visualization reliability

Magdalena Wiercioch, Marek Śmieja, and Jacek Tabor

Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348 Kraków, Poland
{magdalena.wiercioch, marek.smieja, jacek.tabor}@ii.uj.edu.pl

Abstract. This paper proposes a metric to measure the quality of dimensionality reduction, called the Probability Index of Metric Correspondence (PIMC). PIMC quantifies how well a low-dimensional representation of high-dimensional input data reflects its original form. In other words, PIMC is an unsupervised technique which estimates the probability that the projection of the input data preserves the order of distances between every two pairs of elements. Moreover, we present an application of PIMC to alter the Treemap visualization method designed by B. Shneiderman. The introduced modification employs a greedy strategy in which the arrangement of objects in the plane is chosen based on the highest value of PIMC. The index was employed to assess existing visualization methods and the proposed modification of Treemap on several real-life datasets, including a set of high-dimensional chemical compounds. Experimental evaluation indicates that PIMC is a promising tool to quantify visualization reliability and that it can improve the performance of existing projection methods.

1 Introduction

Projection and visualization techniques for high-dimensional data play a crucial role in computer graphics, machine learning, and data analysis [15], [8]. Although numerous algorithms have been introduced, there is no unified methodology for assessing the obtained results; validity is usually established by perceptual research. In this contribution we propose the Probability Index of Metric Correspondence (PIMC), which allows for a numerical assessment of projections. As a corollary, we present its application to modifying the Treemap visualization method [22].

A growing demand for efficient visualization and validation techniques is motivated by real-life examples [16], [12]. In cheminformatics, an appropriate low-dimensional representation of chemical compounds enables searching for drugs acting on various diseases using only a computer (Computer-Aided Drug Design, CADD) [1]. Since the most popular representation of compounds contains as many as 4860 coordinates [11], the visualization of chemical spaces is of great importance.

The proposed index allows for an unsupervised assessment of a visualization, where labeled data is not required, and relies on comparing the distances between elements in the input and output spaces. From a practical standpoint, the construction of PIMC is motivated by the following observation: points that are close in the original space are expected to be close in the "reduced" space as well. In order to apply this rule in practice, let d_n be a distance in the n-dimensional space and let d_k refer to a distance between responses in the k-dimensional visualization space, where n > k. PIMC focuses on quantifying the probability that

if d_n(x, y) < d_n(w, z) then d_k(x, y) < d_k(w, z),

where x, y, w, z are dataset elements.


If the condition is satisfied for all objects, then the visualization is accurate: the metric structure of the input data is properly preserved. In the opposite case, when none of the elements comply with the rule, the projection yields an arrangement of points such that objects that were close to each other are now far apart, and vice versa. If the condition holds for half of the quadruples, the arrangement is random and the visualization is not reliable at all.

We demonstrate how to modify Shneiderman's Treemap visualization algorithm [22] to maximize the visualization reliability in terms of the PIMC value. Treemap is a space-filling method which maps a hierarchical structure of data onto a 2-dimensional space by splitting the plane into hierarchical regions. Our extension, which we call PIMC Treemap (PTM), relies on making, at each step, the division which maximizes PIMC. To achieve a simple and computationally efficient tool, a greedy approach has been employed.

To demonstrate the utility of PIMC, we examine five visualization techniques: Principal Component Analysis [10], Factor Analysis [6], Independent Component Analysis [2], Treemap [22], and PIMC Treemap, on several datasets, including well-known UCI examples as well as life science datasets of high-dimensional chemical compounds. Experimental results show that the proposed PIMC is a promising method for multivariate data visualization optimization and projection evaluation. The usefulness of PIMC for improving existing visualization algorithms is especially evident in the case of high-dimensional data of chemical compounds, where the PTM method outperforms the standard Treemap and achieves performance comparable to widely used standard techniques.

The paper is organized as follows. The next section gives a brief review of related visualization measures and methods. Section 3 introduces the PIMC measure, while Section 4 contains a description of Treemap and its proposed modification, PIMC Treemap. Experiments are included in Section 5. The conclusion is given in Section 6.

2 Related work

Generally, the task of building metrics which address the problem of evaluation in different fields has been widely discussed [23], [26]. Challenges connected with visualization assessment have been covered in previous works. Sanyal et al. [21] compare uncertainty visualization techniques for 1D and 2D datasets. Streit et al. provide a look at this area of research by conducting an analysis based on a probabilistic model [25]. This shows the great variety of measures to consider in an evaluation.

On the other hand, many authors have proposed various multidimensional data visualization techniques [20], [5]. A commonly used method is Principal Component Analysis (PCA) [10], which aims at exposing the covariance structure of a set of features. Another tool utilized in visualization is the Self-Organizing Map (SOM) [9]. It is a two-layer neural network which allows rearranging the data according to similarity; typically, a SOM can be viewed as a two-dimensional hexagonal grid. Multidimensional Scaling (MDS) is a popular non-linear technique that finds a set of vectors in a k-dimensional space such that the matrix of Euclidean distances among them corresponds as closely as possible to some function of the input dissimilarity matrix [3]. Another approach is Factor Analysis (FA) [14], which attempts to represent a set of observed variables in terms of a number of common factors plus a factor unique to each variable; in particular, it is used to reduce many variables to a more manageable number. Moreover, Independent Component Analysis (ICA) [7] is a method whose goal is to find a linear representation of non-Gaussian data such that the components are statistically independent, or as independent as possible. Such a representation seems to capture the essential structure of the data in many applications, including feature extraction. Finally, Shneiderman considers a method for drawing tree structures that makes maximal use of a specified rectangular area [22].

3 Visualization validity measurement

In this section we introduce a measure for visualization assessment called the Probability Index of Metric Correspondence (PIMC). First, let us establish the notation.


We assume that X is an n-dimensional input space and d_n(x, y) is a distance between x, y ∈ X. We consider a visualization mapping

π_k : X ∋ x → π_k(x) ∈ Y,

which projects the input data onto a k-dimensional space Y, where k < n. The distance between the projections π_k(x) and π_k(y) is denoted by d_k(x, y) := d_k(π_k(x), π_k(y)).

Generally, PIMC checks whether the relation of distances between every two pairs of points from the input space is preserved in the output space. Intuitively, points that are close in the original space should also be close in the visualization plane. The result is quantified as the probability of the event that the output space preserves the input distance order. The formal statement is given in the following definition:

Definition 1. The Probability Index of Metric Correspondence (PIMC) of a visualization mapping π_k on X is defined by

PIMC(X, π_k) := P({x, y, w, z ∈ X : sign(d_n(x, y) − d_n(w, z)) = sign(d_k(x, y) − d_k(w, z))}),   (1)

where P(·) is a probability function and sign(·) denotes the sign function. It is worth mentioning that formula (1) can be written less formally as

P(d_n(x, y) < d_n(w, z) ⟹ d_k(x, y) < d_k(w, z)).   (2)

Clearly, it is impossible to preserve all distances when mapping from a high-dimensional space into a low-dimensional one, i.e. to have d_n(x, y) = d_k(x, y). Therefore, PIMC focuses on preserving the order of the input distances, which is a less restrictive condition. PIMC attains its maximum of 1 in the case of ideal agreement, when the order of all distances is preserved. For a completely invalid placement of elements in the reduced space, PIMC gives the value 0: the projection forms the inverse distance relation, so objects that were at a large distance from one another are now close together, whereas nearby points end up far from each other. If PIMC is equal to 0.5, the arrangement of objects is totally random, which characterizes the worst visualization methods.

Remark 1. In practice, the crucial problem is to effectively compute or approximate PIMC. If the number of dataset elements is not high, the calculation can be performed by taking all pairs of tuples into account. However, when the cardinality of X is high or even infinite, an approximation has to be applied. Since PIMC is based on a probability, one can use a sample of elements of X to estimate this index. The more elements are considered, the higher the accuracy of the obtained PIMC.
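In this spirit, a minimal Monte Carlo sketch of such an estimate (assuming Euclidean distances in both spaces and row-aligned arrays X and Y; the function and parameter names are ours):

```python
import numpy as np

def estimate_pimc(X, Y, n_samples=100_000, rng=None):
    """Monte Carlo estimate of PIMC: the probability that the order of
    distances between two random pairs is the same in the input space X
    and in its low-dimensional projection Y (rows correspond)."""
    rng = np.random.default_rng(rng)
    n = len(X)
    agree, total = 0, 0
    for _ in range(n_samples):
        x, y, w, z = rng.integers(0, n, size=4)
        dn = np.linalg.norm(X[x] - X[y]) - np.linalg.norm(X[w] - X[z])
        dk = np.linalg.norm(Y[x] - Y[y]) - np.linalg.norm(Y[w] - Y[z])
        if dn == 0 or dk == 0:
            continue  # ties carry no order information
        agree += int(np.sign(dn) == np.sign(dk))
        total += 1
    return agree / total if total else float("nan")
```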

Table 1. Coordinates of objects in the 5-D space X.

Object   x1    x2    x3    x4    x5
A        4.8   0.5   3.1   2.6   1.3
B        3     2.2   3.9   2.8   3.5
C        1.3   5     4.1   3.8   0.6

One can also consider PIMC from a dual perspective, as a tool for evaluating the quality of a metric on a high-dimensional space. Generally, high values of PIMC indicate that the visualization method as well as the metric on X are chosen properly. Given two metrics on X, one might compare their performance by projecting the data with an arbitrary visualization technique twice, each time with a different metric. A metric with a higher PIMC score should better reflect the input space X. However, in this paper we do not follow this approach and assume that the metric on X is fixed. The following examples give the intuition behind PIMC.

Example 1. Let us consider a 5-D space X with three points whose coordinates are shown in Table 1. Consider three different transformations defined by:

– T1: (x1, x2, x3, x4, x5) → (x1, x2)
– T2: (x1, x2, x3, x4, x5) → (x2 − x3, x5)
– T3: (x1, x2, x3, x4, x5) → (x4, x3 + x5).

Table 2. Comparison of distances in the input and output space after applying the visualization mappings.

Transformation        |AC|    |BC|    |AB|
original distances    5.952   4.492   3.413
T1                    5.701   3.276   2.476
T2                    3.569   3.895   2.377
T3                    1.237   2.879   3.007

Table 2 presents the original distances between the objects and their projections. After applying transformation T1, we obtain PIMC(X, T1) = 1, since the metric structure of the input data has been preserved, i.e. |BC| < |AC|, |AB| < |AC| and |AB| < |BC|. The second mapping gives PIMC(X, T2) = 0.67 (|BC| ≮ |AC|). Finally, PIMC gives the value 0 for the last transformation T3, because none of the metric relations were preserved.

Example 2. Figure 1 presents the Peano Curve [18], which continuously maps the line segment [0, 1] onto the square [0, 1] × [0, 1]. Let us consider a projection π1 : [0, 1] × [0, 1] → [0, 1] defined by the inverse Peano Curve transformation. We have PIMC([0, 1] × [0, 1], π1) = 0.88, which is close to optimal.

Fig. 1. Visualization of [0, 1] × [0, 1] by a line segment [0, 1] defined by inverse Peano Curve transformation.

Example 3. In order to examine the influence of noise in the visualization mapping on the PIMC value, let the identity function id : [0, 1] × [0, 1] → [0, 1] × [0, 1] represent a trivial projection. Clearly, PIMC([0, 1] × [0, 1], id) = 1. We modify this transformation by adding normally distributed noise to every point of the image, i.e., let π2(x1, x2) = (x1 + r1, x2 + r2), where r_i ∼ N(0, σ), for σ > 0, i = 1, 2. The relation between σ and the corresponding PIMC([0, 1] × [0, 1], π2), presented in Figure 2, shows that the more noise is added, the smaller the value of PIMC, with PIMC stabilizing at 0.5, which means a random visualization.
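A short sketch reproducing this experiment, reusing the estimate_pimc helper sketched above (the sample size and noise levels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 2))                     # points in [0, 1] x [0, 1]
for sigma in (0.0, 0.01, 0.05, 0.1, 0.5, 2.0):
    Y = X + rng.normal(0.0, sigma, X.shape)  # pi_2: identity plus Gaussian noise
    print(f"sigma={sigma}: PIMC ~ {estimate_pimc(X, Y):.2f}")
```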



Fig. 2. The impact of noise on PIMC value.

4 PIMC Treemap

B. Shneiderman proposed the Treemap visualization method, which builds a hierarchy of regions. In this section we show how this method can be modified in order to maximize the visualization reliability. We start with a short description of Treemap and then present its enhanced form, which we call PIMC Treemap (PTM).

Treemap assumes that the input data is represented as a tree structure. Its general idea is to map this hierarchical structure into a 2-D space Y. Each tree node corresponds to a rectangular area in Y; in particular, the tree root corresponds to the entire Y. We traverse the tree in preorder (note that other visiting orders are also possible). When visiting a tree node, the corresponding rectangular region is divided into smaller rectangles according to the following rules:

– the number of partitions is determined by the number of child nodes in the tree,
– the splitting direction is connected with the current tree level, i.e. horizontally at odd levels and vertically at even ones,
– the size of each constructed region is proportional to the number of elements in the corresponding child node.

As a result, the visualization space is split into rectangles representing the input data. The recursive Treemap algorithm, based on preorder visiting, and the corresponding Split function are presented in the following pseudocode:

Treemap
Input:
  R ⊂ Y: visualization rectangle area
  node: pointer to a node in a tree structure of the data X
  level: level in the tree, indicating whether the cut is made vertically or horizontally
Method:
  if node is null then
    return
  end if
  (R1, R2) = Split(R, node, level)
  Treemap(R1, node->left, level + 1)
  Treemap(R2, node->right, level + 1)

Split
Input:
  R = [x1, x2] × [y1, y2] ⊂ Y: visualization rectangle area
  node: pointer to a node in a tree structure of the data X
  level: level in the tree, indicating whether the cut is made vertically or horizontally
Method:
  f = sizeof(node->left) / sizeof(node)
  if level is odd then {split the rectangle vertically}
    R1 = [x1, x1 + f · (x2 − x1)] × [y1, y2]
    R2 = [x1 + f · (x2 − x1), x2] × [y1, y2]
  else {split the rectangle horizontally}
    R1 = [x1, x2] × [y1, y1 + f · (y2 − y1)]
    R2 = [x1, x2] × [y1 + f · (y2 − y1), y2]
  end if
  return (R1, R2)
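For readers who prefer executable code, a compact Python rendering of the two procedures above (a sketch with our own data structure; the original implementation is in C#):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    size: int                      # number of data elements under this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def split(rect, node, level):
    """Cut rect = (x1, y1, x2, y2) in proportion to the left child's share:
    vertically at odd levels, horizontally at even ones."""
    x1, y1, x2, y2 = rect
    f = node.left.size / node.size
    if level % 2 == 1:             # vertical cut
        xc = x1 + f * (x2 - x1)
        return (x1, y1, xc, y2), (xc, y1, x2, y2)
    yc = y1 + f * (y2 - y1)        # horizontal cut
    return (x1, y1, x2, yc), (x1, yc, x2, y2)

def treemap(rect, node, level, out):
    """Preorder traversal assigning each leaf a rectangle whose area is
    proportional to its number of elements."""
    if node is None:
        return
    if node.left is None or node.right is None:
        out.append((rect, node))   # leaf: place its elements here
        return
    r1, r2 = split(rect, node, level)
    treemap(r1, node.left, level + 1, out)
    treemap(r2, node.right, level + 1, out)
```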

The proposed extension of Treemap introduces a modification in the Split function. Unlike Shneiderman's method, our algorithm (PIMC Treemap, PTM) aims at maximizing the value of PIMC by selecting the best variant of the rectangle division. The decision of whether to divide a rectangle horizontally or vertically is not fixed in advance; it depends on the value of PIMC calculated for each of the four possible arrangements of the objects, and the division which gives the highest PIMC value is selected. The procedure runs recursively, dividing the rectangle area proportionally to the number of elements included in the child nodes.

The main difficulty of the above modification is that the exact positions of the elements in the visualization space are not known when a given node is processed. More precisely, since the tree is traversed recursively from the top to the bottom, the locations of the elements represented by the leaves are not yet determined when the splitting criterion is calculated for a given node. In consequence, an approximate form of PIMC has to be used to select the optimal variant of a split. Our reasoning is based on the fact that the final arrangement of points in the visualization plane is expected to be close to uniform. Therefore, we assume that the elements are equally distributed over the area of the associated rectangle, and an object's position is picked randomly inside the rectangle that contains it. Moreover, note that for big data it takes too much time to verify all dataset elements; for that reason, we check the condition from Definition 1 for a fixed number of quadruples.

Figure 3 shows an example of the rectangle partition process. Suppose the rectangle is to be split in the ratio 1:2 (see Figure 3(a)). We then consider four variants of its partition: two vertically aligned (Figures 3(d) and 3(e)) and two horizontally aligned (Figures 3(b) and 3(c)). The decision which option is best in the current situation depends on the value of PIMC.


Fig. 3. The process of splitting. After a few iterations, the rectangle marked in grey (a) is going to be split into one of the four splitting configurations (b), (c), (d), (e).


Our modification of the Split function is as follows:

Split (PTM)
Input:
  R = [x1, x2] × [y1, y2] ⊂ Y: visualization rectangle area
  node: pointer to a node in a tree structure of the data X
Method:
  {(R1, R2), (R3, R4), (R5, R6), (R7, R8)} = possibleSplitting(R)   {see Figure 3}
  (P1, P2) = argmax { PIMC(X, Y with R = Ri ∪ Ri+1) : i = 1, 3, 5, 7 }
  return (P1, P2)
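A sketch of this selection step, with possibleSplitting enumerated explicitly; here pimc_of is an assumed callback that samples element positions uniformly inside each candidate rectangle and returns the approximate PIMC:

```python
def possible_splittings(rect, frac):
    """The four candidate cuts of Figure 3 for rect = (x1, y1, x2, y2),
    giving the first child a share `frac` of the area."""
    x1, y1, x2, y2 = rect
    xa, xb = x1 + frac * (x2 - x1), x2 - frac * (x2 - x1)
    ya, yb = y1 + frac * (y2 - y1), y2 - frac * (y2 - y1)
    return [
        ((x1, y1, xa, y2), (xa, y1, x2, y2)),  # vertical, first child left
        ((xb, y1, x2, y2), (x1, y1, xb, y2)),  # vertical, first child right
        ((x1, y1, x2, ya), (x1, ya, x2, y2)),  # horizontal, first child bottom
        ((x1, yb, x2, y2), (x1, y1, x2, yb)),  # horizontal, first child top
    ]

def ptm_split(rect, node, pimc_of):
    """Greedy PTM step: return the candidate pair of rectangles with the
    highest approximate PIMC score."""
    frac = node.left.size / node.size
    return max(possible_splittings(rect, frac), key=pimc_of)
```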

Remark 2. One can also consider an alternative evaluation of the PIMC splitting criterion to the one described above. In the approach above, we approximated the PIMC value after splitting by sampling the positions of the elements in the visualization space from the uniform distribution. Since the number of samples is high, one can instead approximate PIMC using the Central Limit Theorem. For this purpose, one has to find the probability distribution of a random vector indicating that the distance between one pair of rectangles in the reduced space is not greater than the distance between the second pair.

Both Treemap and PIMC Treemap run on data represented by a tree, which is a limitation. If the data does not have a natural tree structure, one has to apply, for instance, a hierarchical clustering algorithm [17] to form a binary tree for the data. In this case the visualization results depend strictly on the performance of the clustering.

5 Experiments

We evaluated PIMC on 24 examples retrieved from the UCI repository and 7 real-life datasets of chemical compounds. We considered the 3 visualization techniques mentioned in the related work section: PCA, FA and ICA. Moreover, we also compared the performance of Treemap (TM) with the proposed PIMC Treemap (PTM). The TM and PTM methods used a hierarchical clustering algorithm with the complete linkage function to obtain the tree structure of the data [24]. We used the scikit-learn [19] implementations of ICA and FA as well as the R version of PCA. The code of TM and PTM was written in C# and is publicly available at http://ww2.ii.uj.edu.pl/~wiercioc/pimc, together with the datasets of chemical compounds used in the experiments.

5.1 UCI datasets

First, we checked how the five previously mentioned algorithms behave when low-dimensional data is considered. The results of the two-dimensional projections of the UCI data sets are shown in Table 3. Visual inspection suggests that the PCA, FA and ICA projections provided comparable PIMC values. This was confirmed by the Kruskal-Wallis test [13]: at the 0.05 significance level there was no reason to reject the null hypothesis that these three algorithms have the same performance. Additionally, the Wilcoxon signed-rank test [28] showed that at the 0.05 level ICA yields better results than PTM. Such a result simply confirms that PTM is not intended for use when the data does not have a hierarchical structure. Furthermore, TM does not appear to be successful, since its index is close to 0.6 (which means an almost random projection). The above experiment reveals the weaknesses of TM, which can be partially overcome by the application of PIMC: the proposed PTM gave a significant improvement over TM, but was not able to provide as good results as the other visualization techniques. In our opinion, this rather poor visualization result is mainly caused by the hierarchical clustering performed at the initial stage of the algorithm.
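For reference, both tests can be run directly on the per-dataset PIMC scores with SciPy; the arrays below are the PCA, FA, ICA, and PTM columns copied from Table 3:

```python
from scipy import stats

# Per-dataset PIMC scores from Table 3 (24 UCI datasets).
pca = [0.89, 0.8, 0.98, 0.68, 0.72, 0.63, 0.93, 0.88, 0.9, 0.89, 0.98, 0.95,
       0.92, 0.99, 0.98, 0.89, 0.85, 0.81, 0.78, 0.83, 0.85, 0.87, 0.91, 0.82]
fa  = [0.89, 0.79, 0.96, 0.7, 0.74, 0.64, 0.93, 0.87, 0.91, 0.89, 0.99, 0.95,
       0.93, 0.99, 0.97, 0.91, 0.84, 0.79, 0.78, 0.83, 0.86, 0.89, 0.91, 0.84]
ica = [0.9, 0.79, 0.97, 0.69, 0.75, 0.65, 0.91, 0.88, 0.88, 0.9, 0.98, 0.94,
       0.92, 0.99, 0.99, 0.9, 0.84, 0.8, 0.78, 0.84, 0.88, 0.87, 0.93, 0.83]
ptm = [0.67, 0.6, 0.63, 0.7, 0.75, 0.68, 0.85, 0.67, 0.65, 0.65, 0.7, 0.68,
       0.74, 0.8, 0.82, 0.84, 0.81, 0.68, 0.7, 0.75, 0.8, 0.79, 0.85, 0.77]

# Kruskal-Wallis: are the PCA, FA and ICA score distributions the same?
h_stat, p_kw = stats.kruskal(pca, fa, ica)

# Wilcoxon signed-rank: paired comparison of ICA and PTM on the same datasets.
w_stat, p_w = stats.wilcoxon(ica, ptm)
print(f"Kruskal-Wallis p = {p_kw:.3f}, Wilcoxon p = {p_w:.3g}")
```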


Table 3. PIMC values after applying PCA, FA, ICA, TM and PTM to the UCI datasets.

dataset                   #instances  #attributes  PCA   FA    ICA   TM    PTM
Ecoli                     336         8            0.89  0.89  0.9   0.59  0.67
Yeast                     1484        8            0.8   0.79  0.79  0.56  0.6
Abalone                   4177        8            0.98  0.96  0.97  0.55  0.63
Balance-scale             625         4            0.68  0.7   0.69  0.57  0.7
Ionosphere                351         34           0.72  0.74  0.75  0.6   0.75
Breast-cancer             286         9            0.63  0.64  0.65  0.59  0.68
Iris                      150         4            0.93  0.93  0.91  0.65  0.85
Wine                      178         13           0.88  0.87  0.88  0.58  0.67
Glass                     214         10           0.9   0.91  0.88  0.56  0.65
Image Segmentation        2310        19           0.89  0.89  0.9   0.58  0.65
Haberman                  306         3            0.98  0.99  0.98  0.59  0.7
Zoo                       101         17           0.95  0.95  0.94  0.56  0.68
Statlog                   690         14           0.92  0.93  0.92  0.62  0.74
Seeds                     210         7            0.99  0.99  0.99  0.68  0.8
Parkinsons                197         23           0.98  0.97  0.99  0.66  0.82
Madelon                   4300        500          0.89  0.91  0.9   0.69  0.84
Hill-Valley               606         101          0.85  0.84  0.84  0.66  0.81
Arcene                    900         10000        0.81  0.79  0.8   0.58  0.68
Dorothea                  1950        100000       0.78  0.78  0.78  0.59  0.7
Housing                   506         14           0.83  0.83  0.84  0.58  0.75
Pima Indians Diabetes     768         8            0.85  0.86  0.88  0.64  0.8
Page Blocks               5473        10           0.87  0.89  0.87  0.63  0.79
Skin Segmentation         245057      4            0.91  0.91  0.93  0.63  0.85
Fertility                 100         10           0.82  0.84  0.83  0.63  0.77

5.2 Chemical compounds

The previous experiment focused on relatively low-dimensional data sets, while dimensionality reduction is especially needed for high-dimensional data. For this reason, we next focused on real-world examples, including various datasets of selected chemical compounds. Compounds are usually represented by fingerprints, i.e. binary strings in which a value of 1/0 at a given position indicates the presence/absence of a specified property. Since different properties of compounds can be taken into account, many fingerprint representations have been introduced. In the experiments we used eight different fingerprints, whose dimensions are reported in Table 4. In the space of chemical fingerprints, various notions of distance can be applied; following recent experimental results [24], the Buser metric and the complete linkage method were applied. The spatial relationship between compounds in the two-dimensional space was measured with the Euclidean metric.

In the first experiment we focused on a set of compounds acting on the 5-HT1A receptor, extracted from the ChEMBL database [4]. It is one of the proteins responsible for the regulation of the Central Nervous System. From the results summarized in Table 4 it can be noticed that ICA gave the highest PIMC values in most cases. On the other hand, PTM performs better for the Klekota Roth fingerprint, which is considered the most relevant representation by chemists. Furthermore, the best accuracy across all fingerprints was obtained for the Graph Only fingerprint with our method. This indicates that PTM might be useful in the analysis of multidimensional data.

Since the PIMC values reported in Table 4 only provide information about the reliability of the applied projections, one may be curious about the arrangement of compounds after the 2-D mapping. A demonstration of the PCA, FA, ICA and PTM projections for the Klekota Roth fingerprint is given in Figures 4(a), 4(b), 4(c) and 4(d). Note that PTM (as well as TM), contrary to the other methods, tries to fill the entire space with dataset elements. This is one of the reasons for the lower PIMC values of space-filling projections such as Treemap.


Table 4. PIMC values after applying PCA, FA, ICA, TM and PTM to the dataset of 5-HT1A receptor ligands.

fingerprint     #instances  #attributes  PCA   FA    ICA   TM    PTM
Klekota Roth    3696        4860         0.71  0.73  0.74  0.6   0.75
Estate          3696        79           0.69  0.71  0.7   0.55  0.61
Extended        3696        1024         0.67  0.68  0.67  0.56  0.61
Fingerprinter   3696        1024         0.69  0.7   0.7   0.55  0.62
Graph Only      3696        1024         0.71  0.71  0.73  0.6   0.77
MACCS           3696        166          0.72  0.74  0.76  0.58  0.64
PubChem         3696        881          0.73  0.72  0.73  0.56  0.67
Substructure    3696        307          0.72  0.72  0.74  0.55  0.62

The investigated space of compounds has been manually clustered by experts in the field into 26 chemical groups [27]. We checked the location of one of these distinguished classes, called Terminal Amides, in the obtained visualization spaces. According to visual inspection, the data belonging to Terminal Amides was divided into a few subgroups, and the elements within the subgroups were highly concentrated (see Figure 4).

Table 5. Overview of the considered datasets and PIMC values after applying PCA, FA, ICA, TM and PTM to the datasets of active and inactive compounds of six biological receptors.

receptor  role                                                        #actives  #inactives  PCA   FA    ICA   TM    PTM
M1        modulates a few physiological functions                     759       938         0.7   0.73  0.71  0.62  0.65
h1        has an impact on pathophysiological conditions              635       545         0.65  0.68  0.7   0.7   0.73
5-HT7     influences various neurological processes, e.g. aggression  704       339         0.78  0.79  0.8   0.69  0.72
5-HT2A    has an impact on the central nervous system                 1835      851         0.58  0.6   0.6   0.67  0.69
5-HT6     mediates both excitatory and inhibitory neurotransmission   1490      341         0.66  0.7   0.73  0.59  0.64
5-HT2C    has an impact on the central nervous system                 1210      926         0.69  0.73  0.7   0.76  0.78

Furthermore, we examined the visualization methods on six more datasets of chemical compounds, each including active and inactive compounds of the receptors described in Table 5. The PIMC performance of the investigated visualizations is also presented in Table 5. Observe that for 3 out of 6 receptors PTM achieved better scores than the other methods. Finally, using the Kruskal-Wallis test we verified at the 0.05 significance level that PCA, FA, ICA and PTM provide statistically identical outcomes.

5.3 Results

The following conclusions can be drawn from the results of our experiments:

– Visualization performed with PCA, FA and ICA gave the highest values of PIMC for the UCI datasets. However, since the dimensionality of such data was low, its visualization was not the most challenging task.
– PTM provided significantly better results than the standard TM method for all datasets. This suggests that an appropriate use of PIMC may also increase the performance of other visualization techniques.



Fig. 4. The results of the visualizations ((a) PCA, (b) FA, (c) ICA, (d) PTM) of chemical compounds acting on the 5-HT1A receptor. The illustrations also show the location of the Terminal Amides subclass in the visualization space.

– Figure 5 gives further insight into the diversity of the PIMC values. According to it, ICA seems to be the most reliable visualization method.
– PTM gave results comparable to the other investigated methods when we focused on the high-dimensional real-life datasets of chemical compounds. In particular, it was the most accurate for 3 out of 6 receptors. The Kruskal-Wallis test applied to the chemical samples has shown that the algorithm's performance is similar to that of the well-known techniques.

6 Conclusion

In this paper we have introduced a distance-preserving measure called the Probability Index of Metric Correspondence (PIMC) to evaluate visualization reliability. We have also proposed a modified version of the Treemap algorithm for preparing two-dimensional representations of data. We compared the results of five 2-D projection techniques by measuring their PIMC values on a number of synthetic and real-world datasets. According to the experimental results, PIMC seems to be a valuable measurement tool for quantifying the visualization effect. Furthermore, one of the most important advantages of our index is that it can be used to optimize many kinds of mapping algorithms.

Acknowledgement. This research was partially supported by the National Science Centre (Poland) grants no. 2014/13/N/ST6/01832 and 2014/13/B/ST6/01792.



Fig. 5. Illustration of the diversity of results for (a) the UCI datasets, (b) the 5-HT1A receptor ligands and (c) the active and inactive compounds.

We are grateful to the anonymous reviewers for their important comments and criticism of this paper. We would also like to thank Dawid Warszycki for useful discussions about the chemical aspects of fingerprints.

References
1. Awale, M., van Deursen, R., Reymond, J.L.: MQN-mapplet: visualization of chemical space with interactive maps of DrugBank, ChEMBL, PubChem, GDB-11, and GDB-13. Journal of Chemical Information and Modeling 53(2), 509–518 (2013)
2. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3), 287–314 (1994)
3. Cox, T.F., Cox, M.: Multidimensional Scaling, Second Edition. Chapman and Hall/CRC (2000)
4. Gaulton, A., Bellis, L.J., Bento, A.P., Chambers, J., Davies, M., Hersey, A., Light, Y., McGlinchey, S., Michalovich, D., Al-Lazikani, B., et al.: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research 40(D1), D1100–D1107 (2012)
5. Gershon, N.D.: Visualization of an imperfect world. IEEE Computer Graphics and Applications 18(4), 43–45 (1998)
6. Hand, D.J.: Analysis of Multivariate Social Science Data, Second Edition, by David J. Bartholomew, Fiona Steele, Irini Moustaki, Jane Galbraith. International Statistical Review 76(3), 456–456 (2008)
7. Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Networks 13(4-5), 411–430 (2000)
8. Jackowski, K., Krawczyk, B., Woźniak, M.: Improved adaptive splitting and selection: the hybrid training method of a classifier based on a feature space partitioning. International Journal of Neural Systems 24(3) (2014)
9. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA (1988)
10. Jolliffe, I.: Principal Component Analysis. Springer Verlag (1986)
11. Klekota, J., Roth, F.P.: Chemical substructures that enrich for biological activity. Bioinformatics 24(21), 2518–2525 (2008)
12. Krawczyk, B., Stefanowski, J., Woźniak, M.: Data stream classification and big data analytics. Neurocomputing 150, 238–239 (2015)
13. Kruskal, W.H., Wallis, W.A.: Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47(260), 583–621 (1952)
14. Lawley, D.N., Maxwell, A.E.: Factor analysis as a statistical method. Journal of the Royal Statistical Society, Series D (The Statistician) 12(3), 209–229 (1962)
15. Li, Y., Chen, L.: Big biological data: challenges and opportunities. Genomics, Proteomics & Bioinformatics 12(5), 187–189 (2014), Special Issue: Translational Omics
16. Mokbel, B., Lueks, W., Gisbrecht, A., Hammer, B.: Visualizing the quality of dimensionality reduction. Neurocomputing 112, 109–123 (2013)
17. Murtagh, F.: A survey of recent advances in hierarchical clustering algorithms. Computer Journal 26(4), 354–359 (1983)
18. Peano, G.: Sur une courbe, qui remplit toute une aire plane. Mathematische Annalen 36(1), 157–160 (1890)
19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
20. Pękalska, E., Duin, R.P.W.: The Dissimilarity Representation for Pattern Recognition: Foundations and Applications (Machine Perception and Artificial Intelligence). World Scientific Publishing Co., Inc., River Edge, NJ, USA (2005)
21. Sanyal, J., Zhang, S., Bhattacharya, G., Amburn, P., Moorhead, R.: A user study to compare four uncertainty visualization methods for 1D and 2D datasets. IEEE Transactions on Visualization and Computer Graphics 15(6), 1209–1218 (Nov 2009)
22. Shneiderman, B.: Tree visualization with tree-maps: a 2-D space-filling approach. ACM Transactions on Graphics 11, 92–99 (1991)
23. Sicilia, M.A., Rodríguez, D., García-Barriocanal, E., Sánchez-Alonso, S.: Empirical findings on ontology metrics. Expert Systems with Applications 39(8), 6706–6711 (Jun 2012)
24. Śmieja, M., Warszycki, D., Tabor, J., Bojarski, A.J.: Asymmetric clustering index in a case study of 5-HT1A receptor ligands. PLoS ONE 9(7), e102069 (2014), DOI:10.1371/journal.pone.0102069
25. Streit, A., Pham, B., Brown, R.: A spreadsheet approach to facilitate visualization of uncertainty in information. IEEE Transactions on Visualization and Computer Graphics 14(1), 61–72 (Jan 2008)
26. Tang, W., Tsai, F.S., Chen, L.: Blended metrics for novel sentence mining. Expert Systems with Applications 37(7), 5172–5177 (2010)
27. Warszycki, D., Mordalski, S., Kristiansen, K., Kafel, R., Sylte, I., Chilmonczyk, Z., Bojarski, A.J.: A linear combination of pharmacophore hypotheses as a new tool in search of new active compounds – an application for 5-HT1A receptor ligands. PLoS ONE 8(12) (2013)
28. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83 (Dec 1945)


Increasing Weak Classifiers Diversity by Omics Networks

Vladimír Kunc, Jiří Kléma, and Michael Anděl

Department of Computer Science, Czech Technical University in Prague, Technická 2, 166 27 Prague, Czech Republic
{kuncvlad,klema,andelmi2}@fel.cvut.cz
http://ida.felk.cvut.cz/

Abstract. The common problems in machine learning from omics data are the scarcity of samples, the high number of features and their complex interaction structure. Models built solely from measured data often suffer from overfitting. One possible method of dealing with overfitting is to use prior knowledge for regularization. This work analyzes the contribution of feature interaction networks to the regularization of ensemble classifiers, which represent another approach to overfitting reduction. We study how the utilization of feature interaction networks influences the diversity of weak classifiers and thus the accuracy of the resulting ensemble model. The network and its random walks are used to control the feature randomization during the construction of weak classifiers, which makes them more diverse than in the well-known random forest. We experiment with different types of weak classifiers (trees, logistic regression, naïve Bayes) and different random walk lengths and demonstrate that the diversity of weak classifiers grows with their increasing network locality.

Keywords: ensemble learning, random forest, prior knowledge, diversity, gene expression

1 Introduction

In recent years, the fields of genomics, proteomics and metabolomics (collectively omics) have been strongly influenced by progress in high-throughput technologies, which has led to a boost of generated data. These omics data are thoroughly analyzed to learn about the mechanisms of studied diseases, to predict disease onset and progression, or to set a proper treatment protocol. One of the common problems in learning from omics data is the scarcity of available samples, namely when compared with their large dimensionality. This inconvenient ratio, together with common noisiness, leads to overfitting, as the number of possible hypotheses immensely exceeds the number of training examples. In the field of machine learning, overfitting is commonly addressed by means of regularization, where a prior hypothesis is imposed during the learning process. Another approach to dealing with overfitting, namely the kind caused by noise,



is ensemble learning. Ensemble methods train multiple classifiers and use them to create an aggregated classifier for a single task. These ensembles usually outperform each of the base classifiers from which they are composed in most classification tasks [7,8]. The key assumption of ensemble models is that the underlying classifiers are diverse, i.e., that they make different errors, and thus together they can achieve higher predictive performance than could have been obtained from any of the constituent classifiers [7,23]. In this paper, we study how regularization by domain knowledge may contribute to the diversity of an ensemble classifier and consequently boost its performance. This paper follows the work of [2,1], where random forests (RFs) [5] get enriched with domain knowledge. The knowledge is here defined as prior known interactions between omics features. Since the interacting features are considered correlated, the base classifiers built on different sets of interacting features are assumed to be decorrelated and therefore jointly more accurate. The goal is to show how ensemble classifiers can increase their diversity thanks to prior knowledge and be further used for better predictions of the onset and progression of heterogeneous multifactorial diseases such as myelodysplastic syndrome, which serves as a case study in the experimental part.

2 Motivation and Related Work

The key feature of an ensemble is the diversity between its base classifiers. The ensemble provides higher accuracy only if the ensemble members disagree about some inputs [6,15,25]. So far, there have been several attempts to increase this diversity and potentially the ensemble's generalization ability. A straightforward approach to delivering ensemble diversity is manipulating the training samples. This method is applied to unstable classifiers such as neural networks (NNs) and decision trees (DTs) [20]; the best-known examples are bagging [4] and boosting [22]. Similarly, one may manipulate the feature set. The random subspace (RS) method (also called attribute bagging) creates random subspaces of the feature space, and each base classifier is trained using one of these subspaces [20,21,11]. This method is recommended especially when the dimension of the feature space is very high and most other classification methods suffer from the curse of dimensionality [11]. An orthogonal way to achieve diversity is to manipulate the algorithm that creates the base classifiers. The learning algorithm can change its parameters, for example the topology of an NN [12], the pruning factor of DTs [20] or the starting point and the way of traversal in the hypothesis search space [6]. According to Rokach [20], there are two methods of manipulating the space traversal: the random-based strategy and the collective performance strategy. The random-based strategy uses randomness to gain higher diversity; one of the most common examples is the random forest [5], in which a weak classifier does not select the best feature in each of its nodes but only the best feature from a certain feature subset. A different forest randomization strategy was used in [8], where, for each tree, “the 20 best candidate splits are computed, and then one of these is chosen uniformly at random”. Contrarily, the collective



performance based strategy creates the ensemble as a whole while trying to increase its accuracy by various means. The base classifiers might cooperate with each other in order to specialize, i.e., to be diverse from the others [20]. The most influential collective performance strategies form the family of penalty methods. They add a penalty term to the error function of an ensemble to encourage diversity among base classifiers [6,20,21]. Several penalty methods have been proposed and analysed in the literature, e.g., negative correlation learning [20,6] or root-quartic negative correlation learning [6]. Besides ensembles, another approach to addressing overfitting is regularization. It applies especially when the sample size n is much smaller than the number of measured features p, i.e., n ≪ p. As mentioned in Sect. 1, regularization means imposing a certain hypothesis during the learning process, even though this hypothesis need not be confirmed if there are not enough training examples in its favour. This is very suitable in the case of domain knowledge [1,19,16]. The hypothesis may be defined as a set of feature interactions which may or may not hold in a certain domain context. Nonetheless, the imposed hypothesis may also be defined uninformedly, merely by restraining the hypothesis space geometrically as in the case of margin classifiers, or purely as a restrained hypothesis space as in the case of DT pruning. As mentioned above, one of the ways to make base classifiers diverse is to manipulate their learning hypotheses. Here we can see how regularization meets ensemble learning: regularizing each of the base classifiers should make them diverse and potentially more powerful altogether. This was the motivation for our recent ensemble method [1], where the base trees are induced only from the genes lying close to each other in an omics, namely protein-interaction, network. In other words, as genes whose corresponding proteins bind or interact are assumed to be correlated in their mRNA expression, we assume that trees built on the interacting features shall be decorrelated in their predictions. Disagreement between the base predictions is fundamental for ensemble diversity, being the key quantity in its measurement [15,6]. In this paper we investigate the possibilities of gaining this diversity through domain knowledge about protein interactions.

3 Methods

In this chapter we briefly describe the measures commonly used for quantifying ensemble diversity. Next, we discuss our recently proposed ensemble method, which we here further generalize and investigate with respect to the diversity induced by domain knowledge. There are two main approaches to measuring diversity, pairwise and non-pairwise [15,6]. The non-pairwise measures mostly compare the output of the base classifiers with the averaged output of the whole ensemble or are based on the idea of entropy. The pairwise measures calculate the average of a particular measure over all possible pairings of ensemble members [6]. For this research, four diversity measures were chosen: entropy [15] and the Kohavi-Wolpert measure (KW) [15,13] represent non-pairwise measures, while the average Q statistic (Q_ave) [15,28] and the double-fault measure (DF) [15,10] fall into the pairwise measures.
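To make the pairwise measures concrete, the following minimal sketch computes the average Q statistic (Q_ave) and the double-fault measure from a matrix of per-sample correctness indicators. It follows the standard definitions surveyed in [15]; the function names and the zero-denominator convention are our own assumptions, not the authors' implementation.

import numpy as np
from itertools import combinations

def q_statistic(c1, c2):
    # c1, c2: boolean vectors, True where the classifier predicted correctly
    n11 = np.sum(c1 & c2)      # both correct
    n00 = np.sum(~c1 & ~c2)    # both wrong
    n10 = np.sum(c1 & ~c2)     # only the first correct
    n01 = np.sum(~c1 & c2)     # only the second correct
    denom = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0

def double_fault(c1, c2):
    # fraction of samples misclassified by both classifiers
    return np.mean(~c1 & ~c2)

def pairwise_diversity(correct):
    # correct: (n_classifiers, n_samples) boolean matrix
    pairs = list(combinations(range(correct.shape[0]), 2))
    q_ave = np.mean([q_statistic(correct[i], correct[j]) for i, j in pairs])
    df = np.mean([double_fault(correct[i], correct[j]) for i, j in pairs])
    return q_ave, df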


3.1 Network-constrained Forest (NCF)

The NCF algorithm [1] combines two approaches to solving the n ≪ p problem common in the omics field: it utilises prior knowledge for the creation of an ensemble of decision trees. Unlike RF, the NCF biases “the feature sampling process towards the genes and loci in general, which have been previously reported as candidates for causing the phenomenon being studied (. . . ) and consequently the omics features which directly or indirectly interact with those candidate genes” [1]. This sampling process is driven by a random walk on the biological interaction network integrating both mRNA and miRNA prior knowledge; the process starts from the candidate causal genes, called seeds. When candidate causal genes are unknown, seeds are randomly sampled from the entire set, and the probability of a gene being sampled as a seed is proportional to its out-degree in the network. Further implementation details and pseudo-code are available in [1,2]. The crucial assumption behind the NCF is that genes that are close in the biological feature network are also correlated in their expression; it is therefore suitable to create weak classifiers grouping these features, because doing so decorrelates the individual weak classifiers and thereby improves ensemble diversity. The biological background behind this method is discussed in [1], with the conclusion that the weak “DTs may vaguely correspond to the individual disease factors and their network-local manifestations” [1]. The individual trees are constructed using the features in the network neighbourhood of the particular seed gene chosen for the tree. The neighbourhood is represented by a distribution function from which the feature set is sampled. This distribution is defined by a random walk of length k from the seed gene: it is denser close to the seed gene, and genes farther than k steps in the network cannot be reached. Therefore, the NCF is parametrized by the walk length k, whose optimal value may differ between tasks as it strongly influences the feature sampling [1]. A heuristic for setting the parameter k, based on the incidence of underfitted trees, was proposed in [2]. The influence of k on the accuracy and diversity of weak learners and on the overall accuracy of the ensemble is further analyzed in Sect. 4.
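The feature sampling described above can be approximated by a simple Monte-Carlo procedure. The sketch below is only a rough illustration under stated assumptions (a plain adjacency-list network, walks of length at most k, terminal nodes ending the walk early); the actual NCF implementation in [1,2] may differ in its details.

import random

def sample_features(network, seed_gene, k, n_features, n_walks=10000):
    # network: dict mapping gene -> list of interacting genes (hypothetical)
    # Estimate the random-walk visit distribution around the seed gene,
    # then sample the feature set from it (with replacement).
    visits = {}
    for _ in range(n_walks):
        node = seed_gene
        for _ in range(random.randint(0, k)):   # walk of length at most k
            neighbours = network.get(node, [])
            if not neighbours:                   # e.g. a terminal miRNA node
                break
            node = random.choice(neighbours)
        visits[node] = visits.get(node, 0) + 1
    genes = list(visits)
    weights = [visits[g] for g in genes]
    return random.choices(genes, weights=weights, k=n_features)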

3.2 The Novel Method of Network-constrained Random Subspaces

In this section, we propose a generalization of the NCF algorithm called network-constrained random subspaces (NCRS), which applies the idea of biased sampling of the feature set to the general ensemble random subspace method (see Sect. 2). The idea of NCF is not strictly tied to ensembles of DTs and is easily extensible to ensembles of other weak learners. Even though DTs as weak classifiers of a forest have many advantages, such as direct interpretability and the possible use of such a forest for feature selection, other classifiers such as logistic regression (LR) or naïve Bayes (NB) might be used as well. Moreover, a simple modification of the algorithm turns RF and NCF into a tree-independent ensemble method: feature sampling performed independently in tree nodes is replaced by sampling performed once, before the construction of a whole weak learner.



The relationship between RF, RS, NCF and NCRS is depicted in Fig. 1: the RF and the NCF both sample the feature space in each node of each tree, whereas the RS and the NCRS sample the feature space only once for each weak learner. The sampling in the NCF and the NCRS is network-constrained, i.e., the sampling procedure generates samples using random walks over the interaction network, while the sampling procedure in the RF and the RS is purely random.

Fig. 1: The relationship between RF, RS, NCF and NCRS (RF: trees with per-node sampling; NCF: network-driven trees with per-node sampling; RS: per-learner sampling; NCRS: network-driven per-learner sampling).

4 Experiments

The experiments described below had several objectives. First of all, the goal was to verify the accuracy of the newly proposed NCRS method, namely the generalization from the DTs used in [2,1] towards LR and NB as weak classifiers. The second objective was to analyze the impact of different values of the parameter k, defining the length of a random walk, on the accuracy of both the whole ensemble and the individual weak classifiers. Moreover, [2,1] implies that the diversity of weak classifiers should be strongly influenced by the parameter k: in most cases, a longer walk should lead to smaller diversity among the weak classifiers in the ensemble, as they become less specialized. The parameter k was analysed for values similar to those used in [2,1] but also for more extreme values, e.g., a random walk of length 100. Another objective was to experimentally validate the convergence of both NCRS and NCF as k → ∞. The NCF does not converge to RF; rather, it converges to the stationary distribution of the random walk, π∞(v) = deg(v)/|I|, where I is the set of edges in the biological network [1]. However, the NCF converges to the stationary distribution only if no miRNA interactions are present, because such interactions are handled in a special way: when the walk encounters a miRNA node, it always ends there; details are again available in [1]. The convergence was not experimentally validated in [1].
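For illustration, the degree-proportional limit distribution π∞(v) = deg(v)/|I| can be computed directly from the network. The sketch below simply normalizes node degrees so that the probabilities sum to one; the adjacency-list structure is a hypothetical stand-in for the authors' data format.

def stationary_probabilities(network):
    # network: dict mapping node -> list of neighbours
    # pi_inf(v) is proportional to deg(v); normalize so probabilities sum to 1.
    degrees = {v: len(nbrs) for v, nbrs in network.items()}
    total = sum(degrees.values())
    return {v: d / total for v, d in degrees.items()}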

4.1 Domain and Data

Data related to myelodysplastic syndrome (MDS) were used for most of the experiments. These are the same data that were used in the original experiments with NCF [2,1]. The data were provided by a collaborating laboratory at the Institute of Hematology and Blood Transfusion in Prague and were obtained for the analysis of lenalidomide treatment of patients with myelodysplastic syndrome. The data consist of two datasets: mRNA with 16,666 attributes measuring the gene expression level and miRNA with 1,146 attributes measuring the



expression level of particular miRNAs [1]. The samples were obtained from bone marrow (BM) CD34+ progenitor cells and from peripheral blood (PB) CD14+ monocytes, either before the treatment (BE) or during the treatment (DU). Moreover, the data can be further categorized by the partial deletion of chromosome 5 (5q or non-5q). Using these categories, the data consisting of 75 samples were divided into 10 related datasets. Again, for coherence with the experiments in [2,1], the same prior knowledge in the form of gene networks and candidate causal genes was used. The prior knowledge is publicly available: in vitro validated miRNA-mRNA interactions are from TarBase 6.0 [24], in silico predicted interactions are from the miRWalk database [9], experimentally validated protein-protein interactions are from the Human Protein Reference Database [18], predicted protein-protein interactions are from [3] and MDS causal genes are from [27], according to [1].

4.2 Experimental Protocol

The NCRS ensemble classifier was implemented in Python 3 as a modification of both the original NCF [1] and the general bagging classifier from the machine learning library Scikit-learn [17], version 0.16.1. 10 times repeated stratified m-fold cross-validation was used for the MDS experiments, where m := min{10, c} and c is the number of samples in the smallest class. This setting of m maximizes the number of stratified folds in tasks with small sample sets and keeps the common number of folds for the remaining tasks. All ensembles were built from 1000 weak classifiers using the RS method, and each weak classifier accessed 100 features. Both parameters were set in advance with no tuning. The number of weak classifiers was strongly limited by the computational costs of both the learning period and calculating the pairwise diversity measures. The number of accessed features roughly followed the √p rule of thumb implemented in [17]. Matthews correlation coefficient (MCC) was chosen as a measure of classification quality insensitive to classes of different sizes. The MCC was calculated from the predictions for the whole dataset, not for individual folds, and then averaged over the repetitions, in contrast to [1], where the median was used instead of averaging. The random walk length k was set to k ∈ {1, 2, . . . , 15} for most experiments; for the rest, the set of k used is noted explicitly.
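For reference, the fold-count rule m := min{10, c} can be expressed in a few lines of scikit-learn code. This is only a sketch using a modern API: RepeatedStratifiedKFold did not exist in version 0.16.1 used by the authors, so the helper name and setup are our assumptions.

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def build_cv(y, n_repeats=10, max_folds=10, seed=0):
    # m = min(10, c): c is the size of the smallest class, so every fold
    # can be stratified even on the small MDS datasets (13-23 samples).
    _, counts = np.unique(y, return_counts=True)
    m = int(min(max_folds, counts.min()))
    return RepeatedStratifiedKFold(n_splits=m, n_repeats=n_repeats,
                                   random_state=seed)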

5 Results

The results are split into several parts: the comparison of NCRS with the unbiased RS method, the analysis of diversity, and the analysis of the convergence of NCRS.

NCRS and the Unbiased Random Subspace Method. In the original study [2], the NCF was compared to the random subspace forest of DTs; however, our NCRS generalization allows the use of different weak classifiers in the ensemble. For this part of the experiment, we used NCRS with DT (CART), LR and NB classifiers. In most tasks, the NCRS was better in terms of MCC for some



values of k than the unbiased RS with the same type of weak classifiers. For each dataset, there are three possible results: NCRS better for some values of the parameter k (win), NCRS with exactly the same performance as RS for some values of k and worse for the rest (tie), and NCRS worse than RS for all k (loss). Tab. 1 displays the results for the different types of weak classifiers in the NCRS compared with the unbiased RS. However, this comparison is optimistically biased, because the NCRS was considered the winner if it was better for any value of k; in a real scenario, the parameter k would be determined either by internal cross-validation or by the heuristic proposed in [2]. On the other hand, the NCRS was better in terms of MCC for every k ∈ {1, 2, . . . , 15} in many tasks; these k-independent results are displayed in the k-independent columns of Tab. 1, so the optimistic bias is not present in those experiments, as these columns only contain results that hold for any value of k ∈ {1, 2, . . . , 15}.

Table 1: Performance of three different types of weak classifiers in terms of wins, ties and losses; the k-independent columns count only outcomes consistent for every k ∈ {1, 2, . . . , 15}

                       k-dependent            k-independent
Classifier Type        wins  ties  losses     wins  ties  losses
Decision Tree          8     1     1          5     1     1
Logistic Regression    5     4     1          3     1     1
Naïve Bayes            7     1     2          6     1     2

The results displayed in Tab. 1 indicate whether the NCRS with a particular type of weak classifier is better than the RS with the same type of weak classifier; they do not compare the suitability of the weak classifiers for the task, as they do not show the absolute accuracy over the datasets. From this point of view, the NCRS with logistic regression performs best, as depicted in Tab. 2. However, the original NCRS with decision trees is also very close to the NCRS LR, both in rank and in average MCC.

Table 2: Comparison of the performance of different types of weak classifiers for both NCRS and RS ensembles. The MCC values are the maximum MCC over k ∈ {1, 2, . . . , 15} for the given classifier

Task            #samples  NCRS DT  NCRS LR  NCRS NB  RS DT  RS LR  RS NB
BMBE DU5q       16        0.76     0.36     0.38     0.34   0.46   0.10
BMH ABE5q       21        1.00     1.00     1.00     1.00   1.00   1.00
BMH ABEnon-5q   16        1.00     1.00     1.00     0.87   0.90   0.87
BMH ADU5q       15        0.72     0.79     0.81     0.75   0.66   0.71
BMnon-5q 5qBT   17        1.00     1.00     0.75     0.66   0.79   0.62
PBBE DU5q       22        0.57     0.79     0.33     0.57   0.62   0.14
PBH ABE5q       19        0.99     1.00     0.82     1.00   1.00   0.84
PBH ABEnon-5q   14        0.83     1.00     0.84     0.81   1.00   0.65
PBH ADU5q       23        1.00     0.93     0.56     0.92   1.00   0.66
PBnon-5q 5qBE   13        0.96     1.00     0.82     0.86   1.00   0.64
Average MCC               0.88     0.89     0.73     0.79   0.84   0.62
Average rank              2.85     2.20     3.85     4.00   2.70   5.40



However, these experiments were biased, because the best value of k was chosen based on the performance on the test set; a better approach would be to use internal cross-validation to determine the optimal value of k and then use this value on the test set. However, the datasets are very small, from 13 to 23 samples, and internal cross-validation would reduce the training or testing set even further. Even though it would still be possible, for example using leave-one-out cross-validation, the experiments would be computationally costly; moreover, k is meant to be set using the heuristic proposed in [2,1], so cross-validation would not simulate the real use of the method. The heuristic also cannot be used for the comparison, as it is tree-specific; modifying the heuristic for other learners is part of possible future work. Furthermore, the purpose of this experiment was to show that other weak classifiers are also suitable alternatives to DTs. The conclusion arising from this part of the experiments is clear: the NCRS method is suitable also for other types of weak classifiers than just the DT. The NCRS method outperformed the RS method in most tasks for any of the three tested types of weak classifiers. In terms of absolute performance, the NCRS with LR outperformed the other ensemble classifiers both in ranks and in average MCC.

Analysis of Diversity. The analysis of the relationship between the walk length k and the diversity among the classifiers in the ensemble is difficult, because there are two main characteristics that depend on the parameter k, diversity and weak classifier accuracy, and they cannot be analysed individually. For this reason, four different diversity measures were chosen to understand the dependency between diversity and accuracy in more depth. As proposed in [2,1], the diversity indeed seems to decrease with the length of the random walk k, as the weak classifiers become less and less specialized. On the other hand, the average MCC of the weak classifiers increases with the length k in most cases. Therefore, the overall MCC of the ensemble results from the balance between the diversity growth and the growth in weak classifier accuracy.

Fig. 2: The trade-off between the weak classifiers' diversity and their accuracy (AWMCC, MCC, entropy, Q_ave, KW and DF plotted against the walk length k). The graph represents the task BMH ABE5q classified using NCRS with logistic regression weak classifiers.

This trade-off is nicely shown in Fig. 2: the ensemble starts with diverse weak classifiers of lower accuracy for k = 1; then the diversity decreases while the weak classifiers' accuracy steeply increases, so the overall MCC of the ensemble reaches 1.0 and stays there while the diversity is still decreasing and the weak classifiers' accuracy is slowly increasing. However, even though the weak classifiers' accuracy increases, they tend more and more to make correlated errors; these errors have a bigger influence than the increasing accuracy, and the DF measure starts to decrease. At some point the accuracy of the weak classifiers begins to slowly decrease, but since the ensemble diversity is very low at this point, the ensemble MCC plummets; the decrease in MCC is not proportional to the decrease in the average MCC of the weak classifiers (AWMCC): roughly 1.5% for the AWMCC versus about 9% for the MCC. As a whole, the NCRS algorithm manages the diversity nicely: in most cases, it starts with specialized and diverse weak classifiers, and with an increasing value of the parameter k the diversity usually decreases while the average accuracy of the weak classifiers increases. Tuning the random walk length k may allow finding the optimal trade-off between the diversity and the AWMCC, resulting in a high MCC of the whole ensemble. Only in several cases does the NCRS end up with an unexpected distribution of weak classifiers, with a higher AWMCC than the overall MCC. This phenomenon requires further analysis; however, it occurs only for particular combinations of dataset and weak classifier type, and moreover it appears only for particular values of k.

Analysis of the Convergence of NCRS. As described in Sect. 4, the NCRS converges to the stationary distribution of a random walk for k → ∞ when no miRNA nodes are present in the network. The goal of this experiment was to empirically validate the convergence; therefore this experiment utilises only candidate causal genes and mRNA interactions as prior knowledge. The parameter k was chosen from {2, 4, 6, 8, 10, 15, 20, 30, 40, 60, 80, 100, 150, 200}. The NCRS algorithm was also modified to sample the features with probability π∞(v) = deg(v)/|I|, i.e., the probability of a feature being sampled is proportional to its degree in the biological network. Some results of this experiment are depicted in Fig. 3, where the dotted lines represent the values of the measures for the k-independent, degree-proportional sampling NCRS, while the full lines represent the k-dependent random-walk sampling NCRS. In contrast to the other plots, the scale of the y-axes is very important in the convergence analysis: seemingly non-converging lines might just reflect small fluctuations caused by the stochastic nature of the classifier, as, for example, in Fig. 3a, where the values seemingly do not converge for increasing values of k; however, the scales of the axes are very small, so the observed chaotic behavior is just a small fluctuation around the desired values. On the other hand, the convergence is ideally depicted in Fig. 3b, where all the measures nicely converge, for higher values of k, to the values obtained by the modified NCRS. The convergence manifests in the other tasks too, albeit not as nicely. It seems that in several tasks the values converge to different values, or that the convergence is somewhat biased for some reason. Besides bias, there are two other possible explanations for this phenomenon. Firstly, it might be just a fluctuation of the stochastic original NCRS.



Fig. 3: Empirical validation of the proposed convergence: (a) seemingly chaotic behavior; (b) the ideal convergence example (AWMCC, MCC, entropy, Q_ave, KW and DF plotted against k).

Secondly and more importantly, it might equally be caused by the stochastic nature of the modified NCRS. When the changes are due to the stochastic nature of the original NCRS, random fluctuations are expected, as we fit the classifiers for different values of k, and these fluctuations show up in the smoothness of the measured points; for the stochastic modified NCRS, however, only one value is obtained, and in spite of the 10 times repeated n-fold cross-validation, the obtained averaged values might still differ significantly from the hidden true expected values of the modified NCRS. This experiment strongly suggests that the proposed convergence of NCRS (NCF) holds, even though there are still several tasks that would need further analysis, as their values seem to converge to a slightly biased point.

6 Conclusion

Ensemble methods have been widely applied to problems exhibiting high dimensionality, small sample size and complex structure. Omics high-throughput data often have all of the above-mentioned characteristics, and ensemble methods represent a popular choice for their classification [26]. In this paper, we focus on omics data where the complex structure is partially known or assumed. In particular, we suppose that a feature interaction network exists and that the interactions imply feature dependencies. Note that the dependencies can hardly be identified reliably from the data itself because of the small sample size. By contrast, the interaction network can be composed from the relationships and regulation previously described in the literature. We build on our recently proposed NCF, which modifies the well-known random forest for domains with known interaction networks. We have proposed its further simple generalization, called the network-constrained random subspace method, which goes beyond the DTs used in the original NCF. NCRS was empirically validated using the same datasets as in the original study [2,1]. It was conclusively shown that NCRS is suitable for different types of weak classifiers. To exemplify, the NCRS with logistic regression weak classifiers proved to outperform the originally proposed NCF (NCRS DT) on most MDS datasets. Importantly, both naïve Bayes and logistic regression classifiers provide insight into the problem, as they allow feature importance to be analyzed easily.



Furthermore, the role of diversity in NCRS was studied using popular diversity measures. As the feature sampling process in NCRS is parametrized by the length of the random walk k, we have analysed its influence on the diversity and accuracy of the ensemble. We have empirically shown that the diversity usually decreases with increasing length k, as was hinted, but not tested, in [2,1]. The last experiments on the MDS datasets validated the convergence of NCRS for k → ∞ proposed in [1]. After that, we tested the behavior of NCRS on benchmark datasets from [14], which, in contrast to the MDS datasets, do not have miRNA data and candidate causal genes. The NCRS performed similarly to RS on most of these datasets, slightly outperformed the RS on several datasets, and was outperformed by the RS method on only one dataset. There is much room for future work. First, with the assumed increasing availability of omics data, the experiments will be replicated with more data. We expect increased statistical relevance and the possibility to analyse the influence of the size of the training set on the performance of NCRS compared to the unbiased RS. With more data, the prior knowledge is expected to become less important; however, sample sizes for which prior knowledge is superfluous are not realistic yet. Second, we plan to integrate other types of data and prior knowledge into NCRS (e.g., DNA methylation arrays). Datasets with a large scale of measurements (GE, miRNA, DNA methylation) are still rare and small-sized, but their importance will increase. Third, a modified heuristic for finding the optimal length of the random walk k that applies to ensembles of general weak classifiers, not just the NCF, should be proposed. Last but not least, biology is not the only domain where prior knowledge in the form of networks is available; other tasks could be, e.g., document topic prediction or click prediction.

Acknowledgments. This work was supported by the grants NT14539 and NT14377 of the Ministry of Health of the Czech Republic.

References
1. Anděl, M., Kléma, J., Krejčík, Z.: Network-constrained forest for regularized classification of omics data. Methods 83, 88–97 (Jul 2015)
2. Anděl, M., Kléma, J., Krejčík, Z.: Network-constrained forest for regularized omics data classification. In: 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 410–417 (Nov 2014)
3. Bossi, A., Lehner, B.: Tissue specificity and the human protein interaction network. Molecular Systems Biology 5 (Apr 2009)
4. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
5. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
6. Brown, G., Wyatt, J., Harris, R., Yao, X.: Diversity creation methods: a survey and categorisation. Information Fusion 6(1), 5–20 (Mar 2005)
7. Dietterich, T.G.: Ensemble methods in machine learning. Lecture Notes in Computer Science 1857, 1–15 (2000)
8. Dietterich, T.G.: An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40(2), 139–157 (Aug 2000)
9. Dweep, H., Sticht, C., Pandey, P., Gretz, N.: miRWalk – database: prediction of possible miRNA binding sites by “walking” the genes of three genomes. Journal of Biomedical Informatics 44(5), 839–847 (Oct 2011)
10. Giacinto, G., Roli, F.: Design of effective neural network ensembles for image classification purposes. Image and Vision Computing 19(9-10), 699–707 (Aug 2001)
11. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
12. Islam, M., Yao, X., Murase, K.: A constructive algorithm for training cooperative neural network ensembles. IEEE Transactions on Neural Networks 14(4), 820–834 (Jul 2003)
13. Kohavi, R., Wolpert, D.H.: Bias plus variance decomposition for zero-one loss functions. In: Proc. of the 13th Int. Conf. on Machine Learning, pp. 275–283. Morgan Kaufmann Publishers (1996)
14. Krejník, M., Kléma, J.: Empirical evidence of the applicability of functional clustering through gene expression classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 9(3), 788–798 (May 2012)
15. Kuncheva, L., Whitaker, C.: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51(2), 181–207 (May 2003)
16. Li, C., Li, H.: Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 24(9), 1175–1182 (Mar 2008)
17. Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
18. Prasad, T.S.K., Goel, R., Kandasamy, K., et al.: Human Protein Reference Database – 2009 update. Nucleic Acids Research 37(Database), D767–D772 (Jan 2009)
19. Rapaport, F., Zinovyev, A., Dutreix, M., et al.: Classification of microarray data using gene networks. BMC Bioinformatics 8(1), 35 (2007)
20. Rokach, L.: Taxonomy for characterizing ensemble methods in classification tasks: a review and annotated bibliography. Computational Statistics & Data Analysis 53(12), 4046–4072 (Oct 2009)
21. Rokach, L.: Ensemble-based classifiers. Artificial Intelligence Review 33(1-2), 1–39 (Feb 2010)
22. Schapire, R.E.: The boosting approach to machine learning: an overview. In: Nonlinear Estimation and Classification, pp. 149–171. Springer (2003)
23. Scherbart, A., Nattkemper, T.W.: The diversity of regression ensembles combining bagging and random subspace method. In: Advances in Neuro-Information Processing, pp. 911–918. Springer Science + Business Media (2009)
24. Vergoulis, T., Vlachos, I.S., Alexiou, P., et al.: TarBase 6.0: capturing the exponential growth of miRNA targets with experimental support. Nucleic Acids Research 40(D1), D222–D229 (Dec 2011)
25. Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Information Fusion 16, 3–17 (Mar 2014)
26. Yang, P., Yang, Y.H., Zhou, B.B., Zomaya, A.Y.: A review of ensemble methods in bioinformatics. Current Bioinformatics 5(4), 296–308 (Dec 2010)
27. Yu, W., Clyne, M., Khoury, M.J., Gwinn, M.: Phenopedia and Genopedia: disease-centered and gene-centered views of the evolving knowledge of human genetic associations. Bioinformatics 26(1), 145–146 (Oct 2009)
28. Yule, G.U.: On the association of attributes in statistics: with illustrations from the material of the Childhood Society, &c. Philosophical Transactions of the Royal Society A 194(252-261), 257–319 (Jan 1900)


Active Learning of Compounds Activity – Towards Scientifically Sound Simulation of Drug Candidates Identification

Wojciech Marian Czarnecki¹, Stanisław Jastrzębski¹, Igor Sieradzki¹, and Sabina Podlewska²

¹ Faculty of Mathematics and Computer Science, Jagiellonian University, Krakow, Poland
² Institute of Pharmacology, Polish Academy of Sciences, Krakow, Poland
wojciech.czarnecki@uj.edu.pl, stanislaw.jastrzebski@uj.edu.pl, igor.sieradzki@uj.edu.pl, smusz@if-pan.krakow.pl

Abstract. Virtual screening is one of the vital elements of the modern drug design process. It aims at the identification of potential drug candidates in large datasets of chemical compounds. Many machine learning (ML) methods have been proposed to improve the efficiency and accuracy of this procedure, with Support Vector Machines among the most popular. Most commonly, performance in this task is evaluated in an offline manner, where the model is tested after training on a randomly chosen subset of the data. This is in stark contrast to the practice of drug candidate selection, where a researcher iteratively chooses batches of compounds to test next. This paper proposes to frame the problem as an active learning process, in which we search for new drug candidates through exploration of the compound space simultaneously with exploitation of the current knowledge. We introduce a proof of concept for the simulation and evaluation of such a pipeline, together with novel solutions based on mixing clustering with a greedy k-batch active learning strategy.

Keywords: active learning, Tanimoto coefficient, compounds activity prediction, cheminformatics, clustering, virtual screening

1 Introduction

Cheminformatics is a rapidly growing field at the intersection of computer science and chemistry. Due to the rapid growth of the amount of experimental data, the need for efficient statistical methods for their deep and systematic analysis has emerged. Classification models such as Support Vector Machines are widely adopted [21, 25] for many problems in the field, in particular for tasks connected with the prediction of the biological activity of chemical compounds, on



which we focus in our research. The main contribution of this paper is to propose a realistic scenario for evaluating the performance of machine learning methods on the above problem. Active learning is a relatively young paradigm [19], finding its applications mainly in natural language processing [24] and image recognition [23]. Its aim is to minimize the cost of preparing labeled training sets for supervised machine learning models while preserving the resulting model's efficiency. Surprisingly, such an approach is not common in cheminformatics, where the process of labeling samples is extremely expensive due to the cost of biological experiments (buying/synthesizing chemical compounds and performing in vitro experiments). Even though there are examples of the application of active learning to the evaluation of compounds' biological activity [26], we argue that the considered setting is unrealistic and thus the obtained results are not reliable. In this paper we try to build a common language for the machine learning and cheminformatics research communities. The paper is structured as follows. First, we introduce some basic concepts and notation from the active learning paradigm. Then, we briefly describe the task of chemical compound activity prediction. In the next sections, we introduce the proposed experimental setting and the active learning strategies used, whereas the final parts include the experimental evaluation and conclusions.

2 Active Learning

The classic supervised machine learning setting assumes that one is given a training set by some sampling process completely independent of the training procedure. However, in real-life problems it is often the case that one has access to enormous amounts of unlabeled examples, and only obtaining labels is an expensive, time-consuming “sampling process”. In particular, one can guide this process through the selection of the samples to be labeled, in order to maximize model efficiency while at the same time minimizing the number of samples requiring labeling. One example of such a case is the huge amount of unlabeled text available on the Internet, which can be downloaded without any problem, but for which labeling of any type requires a time-consuming process of linguistic annotation. If one provides a closed loop between the ML model and the process of training set construction, an active learning method is obtained [19]. From a more theoretical point of view, an active learner is often defined in terms of a utility function u : X → R, such that u(x) denotes the value of knowing the label of x. Consequently, in each iteration one adds to the training set the point x maximizing u(x) over U, the set (often called the pool) of unlabeled samples. One important generalization of the above problem is the so-called k-batch scenario, where in each iteration the learner has to select a subset of k points instead of just a single one, which is done analogously through the definition of a utility function



over subsets, u : 2^X → R. Such approaches have been shown to dramatically reduce the number of labels required for the construction of strong predictive models [17, 18, 14]. In this paper, we focus on using such a method in the field of cheminformatics, in particular for the problem of chemical compound activity prediction.
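As a minimal illustration of the pool-based loop, the sketch below implements naive greedy k-batch selection, in which the k points with the highest utility are queried in each iteration. The function names, the margin-based utility and the oracle interface are our assumptions; as discussed in Sect. 5, such top-k ranking tends to select correlated batches.

import numpy as np

def active_learning(model, X_lab, y_lab, X_pool, oracle, k, n_iter):
    # oracle(X) returns the true labels, simulating in vitro testing.
    for _ in range(n_iter):
        model.fit(X_lab, y_lab)
        # Uncertainty utility: smaller margin = more informative sample.
        utility = -np.abs(model.decision_function(X_pool))
        batch = np.argsort(utility)[-k:]          # naive top-k selection
        X_lab = np.vstack([X_lab, X_pool[batch]])
        y_lab = np.concatenate([y_lab, oracle(X_pool[batch])])
        X_pool = np.delete(X_pool, batch, axis=0)
    return model.fit(X_lab, y_lab)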

3 Chemical compounds activity prediction

The increasing amount of data in the field of cheminformatics makes machine learning tools more and more popular. These methods are often used to predict whether a given chemical compound is active towards a given protein target. From the ML perspective, this can be interpreted as binary classification, where input samples are compounds (represented in an appropriate way) and labels denote whether a compound could be a drug candidate (positive) or not (negative), that is, whether it is able to bind with the target protein. There are two very important aspects of this problem. The first is connected with the way the data are collected, and the second with how the data are represented. We briefly investigate both of these issues. One of the fundamental assumptions of most ML methods is that data are generated iid from some underlying distribution. Unfortunately, for cheminformatic problems this is not the case. There are two main reasons leading to heavy violation of this assumption [10]. First, researchers look for possible drug candidates in the selected parts of the chemical space which are the most likely to contain such objects (potential drug candidates). In other words, they often investigate the neighbourhoods of known drugs, as well as exploit other expert/biological and chemical knowledge. Consequently, the space of input samples is extremely skewed and does not represent the actual distribution of compounds (nor of the active ones). The second problem comes from the positive result bias common in science: databases contain mostly records regarding active compounds (as such results can be published relatively easily), as well as inactive compounds which are highly similar to the active ones (so their inactivity is an interesting fact). Unfortunately, as a result, we lack an enormous amount of information regarding inactive compounds. Most ML approaches require data to be a subset of R^d. In other words, we need to embed chemical compounds, which are very complex structures, into such a space. Researchers have proposed multiple such transformations (fingerprints) [9, 22, 11, 6]. One popular family of such objects is constituted by binary fingerprints, consisting of a sequence of d predicates φ_i(·) (descriptors), which project compounds onto the vertices of the d-dimensional hypercube. For a given compound x ∈ X, such an embedding is given by

ϕ : X ∋ x ↦ [1_{φ_1(x)}, 1_{φ_2(x)}, . . . , 1_{φ_d(x)}]^T ∈ {0, 1}^d ⊂ R^d,

where 1_{φ_i(x)} equals 1 if φ_i(x) is true and 0 otherwise. See Fig. 1 for an example of several types of possible predicates.
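As a toy illustration of the embedding ϕ, the snippet below evaluates a list of boolean predicates on a compound. The string-based predicates are purely hypothetical stand-ins: real descriptors operate on molecular graphs rather than on raw SMILES strings.

def fingerprint(x, predicates):
    # phi: evaluate each predicate phi_i and emit the 0/1 indicator vector
    return [1 if phi(x) else 0 for phi in predicates]

# Hypothetical substructure predicates over a SMILES string:
predicates = [lambda s: 'N' in s,            # contains a nitrogen atom
              lambda s: 'S' in s,            # contains a sulfur atom
              lambda s: s.count('N') >= 3]   # at least three nitrogens
print(fingerprint('CC(=O)Nc1ccc(S)cc1', predicates))   # -> [1, 1, 0]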


Fig. 1. Sample fingerprint of a chemical molecule x_0: a sequence of substructure predicates φ_i(x) (e.g., the presence of a given substructure in x, or |N|_x ≥ 3) yields the binary fingerprint ϕ(x_0). |A|_x denotes the number of atoms/substructures A in x, so in particular A ⊂ x ⟺ |A|_x ≥ 1.

Due to the characteristics of the binary representation, one needs specific methods of measuring similarity between objects described in this way. In particular, in order to use Support Vector Machines (SVMs), one should use a kernel designed for binary sequences. One of the well-known methods, very successful in cheminformatic applications [1, 4], is the Jaccard coefficient and the corresponding Jaccard (or Tanimoto) kernel J, defined for two sets A and B as

J(A, B) = |A ∩ B| / |A ∪ B|,

which in an obvious way translates to an operation over binary vectors Ā and B̄:

J(Ā, B̄) = Σ_{i=1}^{d} min{Ā_i, B̄_i} / Σ_{i=1}^{d} max{Ā_i, B̄_i}.

However, there are more useful measures which also define valid kernels, one of which is the Sørensen coefficient S,

S(A, B) = |A ∩ B| / (|A| + |B|),

and analogously

S(Ā, B̄) = Σ_{i=1}^{d} min{Ā_i, B̄_i} / (Σ_{i=1}^{d} Ā_i + Σ_{i=1}^{d} B̄_i).

These two measures have been shown to perform very well in various tasks [5, 16], and both of them will be used in this paper in two ways: as measures of compound similarity and as SVM kernels.
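Both coefficients admit a compact vectorized implementation for binary fingerprint matrices, since for 0/1 vectors the sum of minima equals the dot product and the sum of maxima equals |a| + |b| minus the dot product. The following sketch (our own helper names, assuming non-empty fingerprints so the denominators are never zero) produces Gram matrices following the definitions above:

import numpy as np

def jaccard_kernel(A, B):
    # A: (n, d), B: (m, d) binary matrices
    dot = A @ B.T
    na = A.sum(axis=1)[:, None]
    nb = B.sum(axis=1)[None, :]
    return dot / (na + nb - dot)   # sum(min) / sum(max) for 0/1 vectors

def sorensen_kernel(A, B):
    dot = A @ B.T
    na = A.sum(axis=1)[:, None]
    nb = B.sum(axis=1)[None, :]
    return dot / (na + nb)

A Gram matrix built this way can be plugged into, e.g., sklearn.svm.SVC(kernel='precomputed'), trained on jaccard_kernel(X_train, X_train).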

4 Proposed experimental setting

Active learning has been proposed for this exact problem in the past [26]. The proposed approach is mathematically valid and is an important first step in applying active learning to the problem of drug discovery. However, in the authors' opinion, previous work did not capture the true nature of the virtual screening process. First of all, the existing approach deals with single-query active



learning, which is a completely unrealistic assumption. Compounds are never bought/synthesized and tested one by one – chemists buy or synthesize whole groups of compounds. The k-batch setting is crucial in order to truly simulate the procedure. Secondly, previous works assume that the samples are iid, and thus that one can use the whole set of known active/inactive compounds to model the true distribution of compounds. This is also false, as described in the previous sections, due to the high bias in the way compounds are tested. In particular, such experiments do not answer the fundamental question: does a given active learning strategy lead to the discovery of new, unknown drug candidates? We propose to model the problem using two important modifications to previous works:

1. one should use the k-batch active learning scenario,
2. one has to identify a specific group of compounds which can be used to estimate the ability to find new drug candidates.

The group mentioned above should consist of compounds which:

– form a group including active compounds (preferably a chemical group),
– are not present in the training set,
– are common enough to ensure a reliable estimation of generalization capabilities.

Let us first describe how one can find such a cluster (a code sketch of the selection procedure follows this paragraph). We performed hierarchical clustering of the data U using the Agglomerative Clustering algorithm with the maximum (complete) linkage criterion, with the Jaccard similarity measure used as the metric. A pair of clusters S, ℵ was selected as two disjoint subtrees meeting two criteria: the clusters S and ℵ comprise at least 40% and 10% of the original data, respectively, and the ratio of the average inter-cluster distance to the average distance between samples is the largest. In most cases this heuristic yields a sensible clustering, which was further confirmed by visualization, as can be seen in Fig. 2. However, it should be noted that for a more robust estimation of generalization power, manual clustering should be performed. In our case, it often happens that the clusters are noisy; for instance, S might contain a few samples close to ℵ, whereas manual clustering done by a chemist would not allow such a situation. Noisy clustering is mitigated in our case by performing an exhaustive number of experiments with multiple proteins and fingerprints. The simulation starts from a random sample from S, which simulates (represents) the current chemical knowledge about compound activity. During the active learning process one should monitor the efficiency on S, denoting the model's ability to correctly classify compounds similar to the known ones (local search for new drug candidates), as well as the efficiency on ℵ, denoting the model's ability to actually discover new drugs. One can further split each of these two parts into train and test parts; the train parts are available in the sample pool (their labels can be obtained during the active learning process), while the test parts are only used to estimate the generalization capabilities of the model.
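The sketch below approximates the described cluster-pair selection with SciPy's hierarchical clustering. The thresholds, helper names and exhaustive pair search over flat cuts are our assumptions; the original subtree-based selection may differ in detail.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def select_s_and_aleph(X, min_s=0.4, min_aleph=0.1, max_clusters=8):
    # Complete-linkage hierarchical clustering over Jaccard distances;
    # pick two disjoint clusters meeting the 40%/10% size criteria that
    # maximize avg inter-cluster distance over avg pairwise distance.
    dist = pdist(X.astype(bool), metric='jaccard')
    Z = linkage(dist, method='complete')
    D = squareform(dist)
    n = X.shape[0]
    best, best_ratio = None, -np.inf
    for t in range(2, max_clusters + 1):
        labels = fcluster(Z, t=t, criterion='maxclust')
        ids = np.unique(labels)
        for a in ids:
            for b in ids:
                if a >= b:
                    continue
                ia, ib = labels == a, labels == b
                big = max(ia.sum(), ib.sum())
                small = min(ia.sum(), ib.sum())
                if big < min_s * n or small < min_aleph * n:
                    continue
                ratio = D[np.ix_(ia, ib)].mean() / D.mean()
                if ratio > best_ratio:
                    best_ratio, best = ratio, (ia, ib)
    return best   # boolean masks for the (S, aleph) candidate pair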



Fig. 2. Visualization of the clustering: the ℵ cluster is denoted by purple dots, while the yellow ones show the remaining part of U. Semi-transparent objects denote unlabelled examples, and black diamonds are the samples selected by each strategy.

5 Proposed active learning strategy

There are dozens of very efficient, successful strategies for active learning in which one selects a single instance in each iteration. However, in the batch scenario, where one selects k points in each iteration, even the simplest approaches are computationally expensive [8] or even NP-hard [2]. For this reason, it is a common choice to use a simple, single-instance strategy to rank points and select the k most promising ones. Unfortunately, such an approach leads to the selection of highly correlated data, which can work even worse than passive learning [20]. This problem is somewhat similar to many others in ML, in particular the construction of an ensemble of learners [13]. In both of the above-mentioned cases, one needs to select a set of objects which provide some knowledge while at the same time ensuring diversification. In the context of active learning, it is common practice [2] to look for a set of samples maximizing¹

u_C(A) = (1 − C) · (1/|A|) · Σ_{a∈A} u(a) + C · (2/(|A|(|A| − 1))) · Σ_{a,b∈A×A} d(a, b),

where C is a parameter balancing the maximization of the utility u(·) against the inner-batch distances d(·, ·). Unfortunately, finding the solution of such a problem is known to be NP-hard. Thus researchers often use heuristic simplifications, a very popular one being the quasi-greedy solution [12, 2]. In such an approach one builds the query set A iteratively by first selecting the sample maximizing u(·) and then, in the ith iteration (thus |A| = i − 1), selecting a = arg max_{a∈U} u_C(A ∪ {a}). It is easy to notice that such an approach requires O(k²|U|) time and yields a very rough approximation of the true solution. One can improve the above method by introducing randomized restarts at the cost of additional computation: the idea is to select the first sample with probability proportional to its u(·) value and then run the quasi-greedy approach.

¹ In the original work the min distance was used instead of the mean.



After multiple such starts, one selects the batch yielding the maximum u_C(·) value. We propose to follow a different generalization path instead. In order to enforce internal diversification of the batch, we merge the quasi-greedy strategy with non-Euclidean clustering. The idea is to first split the dataset into M clusters, so that selecting mini-batches from each of them yields distant samples, and then to run the quasi-greedy approach inside each cluster, so that the internal distances within each mini-batch are also large. Algorithm 1 below shows the exact procedure.

Algorithm 1 Cluster-based Sørensen–Jaccard sampling
 1: procedure CSJM(U, k)
 2:   A ← {}
 3:   U₁, ..., U_M ← find M clusters using Sørensen(U)
 4:   for i = 1 to M do
 5:     Q ← select k/M samples by Quasi-greedy using Jaccard(U_i)
 6:     A ← A ∪ Q
 7:   end for
 8:   return A
 9: end procedure
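To make the quasi-greedy step inside each cluster concrete, below is a minimal Python sketch of growing a batch that trades off utility against inner-batch distances; the utility u, distance d, pool U, and trade-off C are placeholders to be supplied by the surrounding system (e.g. classifier uncertainty and Jaccard distance), not part of the original implementation.

```python
import numpy as np

def quasi_greedy_batch(U, u, d, k, C=0.5):
    """Greedily grow a batch A approximately maximizing
    (1 - C) * mean utility + C * mean pairwise distance."""
    utilities = np.array([u(x) for x in U])
    A = [int(np.argmax(utilities))]          # start from the most useful sample
    while len(A) < k:
        best_i, best_score = None, -np.inf
        for i in range(len(U)):
            if i in A:
                continue
            cand = A + [i]
            mean_u = utilities[cand].mean()
            dists = [d(U[a], U[b]) for a in cand for b in cand if a != b]
            score = (1 - C) * mean_u + C * float(np.mean(dists))
            if score > best_score:
                best_i, best_score = i, score
        A.append(best_i)
    return A
```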

There are many ways of performing clustering based on the Sørensen coefficient. One of them is to build a Sørensen kernel [16] and run a kernelized k-means algorithm [7]. Another approach [5], yielding similar results in a much shorter time, is to randomly select a subset of compounds {C_i}_{i=1}^{h}, span a new space through the projection ϕ(x) = [S(x, C_1), ..., S(x, C_h)]^T, and use a simple k-means (or any other clustering technique) in the projected space. In this paper, we follow the second path due to the simplicity and efficiency of such an approach; a sketch is given below.
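A sketch of the projection-based variant, under the assumption of binary fingerprint rows, with the Sørensen (Dice) coefficient as S; the number of anchor compounds h is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def sorensen(x, y):
    # Sørensen (Dice) coefficient for binary vectors
    inter = np.sum(x * y)
    return 2.0 * inter / (np.sum(x) + np.sum(y) + 1e-12)

def sorensen_clusters(X, M=2, h=50, seed=0):
    rng = np.random.RandomState(seed)
    anchors = X[rng.choice(len(X), size=h, replace=False)]
    # span a new space through phi(x) = [S(x, C_1), ..., S(x, C_h)]
    Phi = np.array([[sorensen(x, c) for c in anchors] for x in X])
    return KMeans(n_clusters=M, random_state=seed).fit_predict(Phi)
```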

6 Experiments

Let us briefly outline the experimental setting. We use datasets consisting of chemical compounds with experimentally confirmed activity/inactivity towards six different proteins, leading to six binary classification problems. We use the ExtFP, MACCSFP and PubchemFP [27] fingerprints to embed compounds in the {0, 1}^d space. As the main model we use an SVM with the Jaccard kernel, due to its known applicability in the domain. We analyze three different batch sizes (the number of compounds selected in each iteration), namely k = 20, 50, 100. The SVM is retrained at each iteration and its hyperparameter C is fitted using internal 5-fold cross-validation. All experiments are performed with repeated, randomized, stratified train/test splits in order to minimize the variance of the results. We investigate five selection strategies:



– passive learner, simply selecting samples at random,
– greedy uncertainty sampling, as a baseline method [19],
– rand greedy, described in the previous sections, as a stronger version of the quasi-greedy strategy [2],
– the proposed CSJ sampling, with M = 2 (just two clusters using the Sørensen coefficient),
– the probabilistic method of Chen and Krause [3], generalized to the non-linear scenario by performing a Jaccard-based non-linear projection [4] and fitting their linear approximator on top [3].

All of them are implemented in Python with the help of scikit-learn [15]. The source code of all the above approaches is available on github². As outlined in the previous sections, we investigate the behavior of the proposed methods on the test set of U, on the ℵ cluster, and on the unlabeled part of the samples from ℵ. We will now briefly discuss the results and emerging conclusions. Let us first investigate how the proposed methods deal with building a concept of activity in the whole compound space. Table 1 summarizes the average ranking (position, obtained after performing the whole experiment and ordering strategies according to the given criterion) for results measured on the test part of the whole U set. Two different results are analyzed: first, the final WAC³ of the model after the experiment, and second, the area under the WAC curve (which is equivalent to the mean WAC over the experiment – measuring how fast a given strategy leads to good results). These results show how good each strategy is

                final WAC score             AUC score
batch size     20    50    100   avg      20    50    100   avg
CSJ2 sampling  2.33  2.17  2.17  2.22    2.17  2.17  2.00  2.11
Rand Greedy    2.33  3.33  2.17  2.61    1.33  2.17  2.00  1.83
Chen Krause    2.50  2.33  3.50  2.78    4.00  3.00  3.00  3.33
Uncertainty    3.17  3.67  3.17  3.33    3.33  3.33  4.17  3.61
Passive        4.67  3.50  4.00  4.06    4.17  4.33  3.83  4.11

Table 1. Average ranking of final WAC score (on the left) and AUC score (on the right) for each strategy over all considered experiments on the test part of U for a given batch size.

in building a general concept of activity. Here one can notice that the rand greedy strategy obtains better AUC scores, meaning that it converges faster to a good model. On the other hand, CSJ is a close second and outperforms all methods when it comes to the final WAC score. One should also note that CSJ behaves much better once the batch size is big enough. The proposed strategy is much better at diversifying samples within a batch, so with bigger batches its strength is better captured. It is quite interesting that the strategy proposed by Chen and

² http://github.com/gmum/mlls2015/
³ WAC = (1/2) · TP/(TP+FN) + (1/2) · TN/(TN+FP)



Krause behaves worse than rand greedy. There might be multiple reasons for such behavior. First, this method requires fitting many hyperparameters, which might be performed suboptimally, as during an active learning scenario it is hard to fit multiple hyperparameters of the strategy. Second, the proposed delinearization is not fully consistent with the kernelized SVM; one should probably move the whole strategy to the kernel space, but this would drastically increase the computational complexity. Finally, their strategy does not include much diversification within the batches, which we argue is a crucial element for the considered problem. Let us now focus on the main element of the proposed scenario – evaluation on the ℵ cluster, measuring how good a particular strategy is at finding actually new drugs. It is worth stressing that using Sørensen clustering in CSJ is supposed to simulate the fact that we do not know the true measure of "diversity" of compounds. We use the Jaccard coefficient to build the ℵ cluster, so if we also used Jaccard for clustering, the obtained results would be less reliable (we did perform such experiments as well, and the obtained results were actually very similar to the ones reported here). At the same time, the Sørensen coefficient is quite similar to Jaccard's, which is supposed to model the real-life situation where we do have a measure which captures compound similarity well [1], but it is not exactly the one that describes the actual diversity. Table 2 shows results analogous to the previous table, but measured on the ℵ cluster. One can notice a significant

                final WAC score             AUC score
batch size     20    50    100   avg      20    50    100   avg
CSJ2 sampling  2.00  2.17  2.17  2.11    1.17  1.50  2.00  1.56
Rand Greedy    2.33  2.50  2.83  2.56    2.00  2.17  2.17  2.11
Chen Krause    3.33  3.67  4.50  3.83    4.33  4.00  2.83  3.72
Uncertainty    3.83  4.17  2.83  3.61    3.33  3.50  3.67  3.50
Passive        3.50  2.50  2.67  2.89    4.17  3.83  4.33  4.11

Table 2. Average ranking of final WAC score (on the left) and AUC score (on the right) for each strategy over all considered experiments on the test part of the ℵ cluster for a given batch size.

difference between the results obtained by CSJ and all competing approaches. This strongly suggests that the proposed approach is much better at exploring the input space. It is worth noting that, when it comes to the final WAC score, passive learning is better than both greedy uncertainty and the Chen and Krause method. The difference between rand greedy and passive is also barely significant, showing that their exploration is very limited. On the other hand, when it comes to the speed of convergence (measured as AUC), passive learning loses to all the competing methods, as can be seen in Figure 3. It thus seems that the exploration issues of most of the considered strategies appear in the "later" part of the experiment (they seem to discover the cluster and focus on it more than passive does, but they leave it too early; only CSJ consistently analyzes its samples).



Finally, we briefly analyze the strategies' ability to eliminate unlabeled samples from ℵ. High scores in such an experiment are important if we assume that there is a finite number of interesting drug candidates and that they are all available in the pool U. Then, "buying" the labels of such samples is equivalent to actually discovering all interesting drugs. The results in Table 3 are the final confirmation of

                final WAC score             AUC score
batch size     20    50    100   avg      20    50    100   avg
CSJ2 sampling  2.00  2.00  1.50  1.83    1.67  1.50  1.50  1.56
Rand Greedy    2.83  3.50  2.33  2.33    1.50  2.50  2.17  2.06
Chen Krause    3.17  3.50  3.50  3.39    4.33  4.00  3.33  3.89
Uncertainty    3.33  3.50  4.17  3.67    2.83  2.50  3.17  2.83
Passive        3.67  2.50  3.50  3.22    4.67  4.50  4.83  4.67

Table 3. Average ranking of final WAC score (on the left) and AUC score (on the right) for each strategy over all considered experiments on unlabeled elements of the ℵ cluster for a given batch size.

CSJ's ability to quickly explore the input space and consequently identify drugs from the ℵ cluster. Once again, most of the strategies led to final WAC results worse than (or comparable to) passive learning in this subtask.

Fig. 3. Results of model prediction on the ℵ cluster for the 5 tested querying strategies on a single protein with batch size set to 50. While eventually all strategies achieve a similar result, CSJ stays strong throughout the AL process.



7 Conclusions

There are two main contributions of this paper. First, we introduced and described an experimental setting for an active learning based drug candidate identification procedure. The proposed method is the first that does not make the unrealistic assumptions of previous research in the area, and it shows a proof of concept of the solution. However, in order to obtain a fully scientifically sound setting, one should replace automatic clustering with expert-based identification of the compound group (which might be very hard due to the very limited knowledge of active compounds in the whole input space). The second contribution is a simple k-batch active learning strategy, exploiting both the Sørensen and Jaccard coefficients, which achieves significantly better scores than competing approaches in the conducted experiments. It would be valuable to further investigate other methods of diversifying samples inside the batch while efficiently estimating their informativeness.

Acknowledgments. The work of the first two authors was partially supported by National Science Centre Poland grant no. 2013/09/N/ST6/03015.

References
1. Bajusz, D., Racz, A., Heberger, K.: Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics 7(1), 20 (2015), http://www.jcheminf.com/content/7/1/20
2. Brinker, K.: Incorporating diversity in active learning with support vector machines. In: ICML. vol. 3, pp. 59–66 (2003)
3. Chen, Y., Krause, A.: Near-optimal batch mode active learning and adaptive submodular optimization. In: ICML. pp. 160–168 (2013)
4. Czarnecki, W.M.: Weighted Tanimoto extreme learning machine with case study of drug discovery. IEEE Computational Intelligence Magazine (2015)
5. Czarnecki, W.M., Rataj, K.: Compounds activity prediction in large imbalanced datasets with substructural relations fingerprint and EEM (2015)
6. Ewing, T., Baber, J.C., Feher, M.: Novel 2D fingerprints for ligand-based virtual screening. Journal of Chemical Information and Modeling 46(6), 2423–2431 (2006)
7. García, M.L.L., García-Ródenas, R., Gómez, A.G.: K-means algorithms for functional data. Neurocomputing 151, 231–245 (2015)
8. Guo, Y., Schuurmans, D.: Discriminative batch mode active learning. In: Advances in Neural Information Processing Systems. pp. 593–600 (2008)
9. Hall, L.H., Kier, L.B.: Electrotopological state indices for atom types: A novel combination of electronic, topological, and valence state information. Journal of Chemical Information and Modeling 35(6), 1039–1045 (1995)
10. Jastrzebski, S., Czarnecki, W.M.: Analysis of compounds activity concept learned by SVM using robust Jaccard based low-dimensional embedding. Schedae Informaticae (2015)
11. Klekota, J., Roth, F.P.: Chemical substructures that enrich for biological activity. Bioinformatics 24(21), 2518–2525 (2008)



12. Kremer, J., Steenstrup Pedersen, K., Igel, C.: Active learning with support vector machines. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4(4), 313–326 (2014)
13. Kuncheva, L.I., Whitaker, C.J.: Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51(2), 181–207 (2003)
14. McCallum, A.K., Nigam, K.: Employing EM and pool-based active learning for text classification. In: ICML. pp. 359–367 (1998)
15. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research 12, 2825–2830 (2011)
16. Ralaivola, L., Swamidass, S.J., Saigo, H., Baldi, P.: Graph kernels for chemical informatics. Neural Networks 18(8), 1093–1110 (2005)
17. Roy, N., McCallum, A.: Toward optimal active learning through Monte Carlo estimation of error reduction. In: ICML, Williamstown. pp. 441–448 (2001)
18. Schohn, G., Cohn, D.: Less is more: Active learning with support vector machines. In: ICML. pp. 839–846 (2000)
19. Settles, B.: Active learning literature survey. University of Wisconsin, Madison 52(55-66), 11 (2010)
20. Settles, B.: Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 6(1), 1–114 (2012)
21. Smusz, S., Kurczab, R., Bojarski, A.J.: The influence of the inactives subset generation on the performance of machine learning methods. Journal of Cheminformatics 5, 17 (2013)
22. Steinbeck, C., Han, Y., Kuhn, S., Horlacher, O., Luttmann, E., Willighagen, E.: The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. Journal of Chemical Information and Computer Sciences 43(2), 493–500 (2003)
23. Tong, S., Chang, E.: Support vector machine active learning for image retrieval. In: Proceedings of the ninth ACM international conference on Multimedia. pp. 107–118. ACM (2001)
24. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research 2, 45–66 (2002)
25. Wang, M., Yang, X.G., Xue, Y.: Identifying hERG potassium channel inhibitors by machine learning methods. QSAR & Combinatorial Science 27(8), 1028–1035 (2008)
26. Warmuth, M.K., Liao, J., Rätsch, G., Mathieson, M., Putta, S., Lemmen, C.: Active learning with support vector machines in the drug discovery process. Journal of Chemical Information and Computer Sciences 43(2), 667–673 (2003)
27. Yap, C.W.: PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry 32(7), 1466–1474 (2011)


Learning symbolic features for rule induction in computer aided diagnosis

Sebastijan Dumančić, Antoine Adam and Hendrik Blockeel

Department of Computer Science, Katholieke Universiteit Leuven, 3001 Heverlee, Belgium
{sebastijan.dumancic, antoine.adam, hendrik.blockeel}@cs.kuleuven.be

Abstract. In computer aided medical diagnosis (CAD), the interpretability of learned models is an important concern. Unfortunately, the raw data used to train a model are often in sub-symbolic form (for instance, images), which makes the application of symbolic learning methods difficult. One way to alleviate this problem is to construct symbolic features that describe images, and learn to extract those features from raw images. The sub-symbolic part of the model is then limited to the lowest layer, making the model as a whole more interpretable. This paper presents a case study of how simple rule-based learners can be used to learn interpretable models from visual data by including a symbolic feature extraction step, in the domain of CAD. The symbolic representation is supported by the literature and learned in a supervised way by means of deep learning. It turns out that the learned models are as accurate as the black-box models that constitute the current state of the art.

Keywords: computer aided diagnostics, inductive logic programming, deep learning, symbolic feature learning

1 Introduction

Computational systems assisting humans in decision making have become very common lately, covering a wide range of applications. One notable example is recommender systems, which allow massive online retailers to help their customers browse a large amount of available items. Other examples include search engines that rank information by its relevance [1], while computer vision techniques are used in biology for tracking cells (or other objects) and analysing them [2, 3]. One domain that can hugely benefit from computational assistance systems is medicine. The possibilities there are numerous; such systems can double-check physicians' decisions, pre-select potentially infected patients from a large pool of test specimens, and many more. One case that attracted a lot of attention recently is the anti-nuclear antibody (ANA) test for auto-immune diseases. The workflow of this test is fairly straightforward: starting with an image containing many cells, a physician is required to identify the staining pattern those cells exhibit. Examples of such patterns are shown in figure 1. The test is based purely on a



Fig. 1. Examples of HEp-2 staining patterns

visual assessment of different staining patterns. Each pattern further maps to a specific disease. This test is known to be subjective [4]; it depends heavily on the expertise of the physician and on the varieties of reading systems and optics. The subjectiveness of the test might be significantly reduced by an intelligent system helping doctors make their decisions. In this work, we focus on this specific use case. In the last couple of years, a number of solutions to this problem have been proposed [5, 6]. However, when dealing with image data, machine learning typically provides a black-box solution. Although such a solution might be very accurate, in many cases a black-box, non-interpretable solution is not desirable. In a critical domain such as medicine, it is very important that a physician can interpret a solution provided by a computer system. Even more, it is important that a physician can understand why a program made a certain decision. Knowing precisely why a system made a certain decision may greatly help in a situation where a physician is uncertain about his or her decision. If a system used to double-check a physician's decision makes a conflicting decision, a black-box solution cannot really resolve the conflict. However, if the system could explain its decision, it would be easy to compare the reasoning steps and see where they differ. With image data, this is rarely possible at the moment. This motivates our approach to this problem. In this paper, we want to break open the black box. We propose to learn interpretable models from raw image data by introducing a feature construction step that extracts symbolic features using sub-symbolic learning. We achieve this by first extracting interesting features from medical texts and then employing deep learning methods to learn those features. This pre-defined set of features is learned in a supervised way – we know which features are interesting, but lack a way of specifying them formally. Having these interpretable features, we employ simple rule induction algorithms to learn rules describing the staining patterns. We focus on simple rule-based models because of their simplicity and interpretability. Additionally, we demonstrate how these simple models can be very helpful in this particular situation by introducing a collective classification setting. We elaborate on this later on.



The rest of this paper is structured as follows. Section 2 discusses some background and related work. Section 3 provides more information about the data set used for this case study, and outlines our approach to this problem. It also focuses on the feature extraction step: it describes how models were learned that automatically extract the symbolic feature values from images, using the manually annotated images as training examples, and it evaluates the quality of these models. In section 5, several techniques are compared for learning to classify cells based on their own symbolic description, or on the description of other cells occurring in the same image. Section 6, finally, presents our conclusions.

2 Related work

This use case has been presented as a contest at the International Conference on Pattern Recognition 2012. A summary of the results is provided in [5]. For details about the approaches we refer to that paper; however, for this work it is important to state that all approaches employ high-dimensional, pixel-based feature representations and complex classifiers such as support vector machines [14]. To our knowledge, the best performance so far was reported by Xu et al. [7]. The authors used a Linear Local Distance Coding method to extract the features, which were further fed into a linear support vector machine. The approach achieved an accuracy of 95.59%.

2.1 Deep learning and deep belief networks

When learning interpretable symbolic features from raw images, we focus our work on methods from deep learning [11], namely the deep belief network [12]. Deep learning is a relatively new approach to machine learning, often referred to as representation learning. It is built upon artificial neural networks and imitates the human brain in representing data. The main idea behind deep learning is to re-represent the data with many intermediate layers that represent a gradual abstraction of the input data. The motivation for learning representations is quite clear: the form in which data is represented is important, and the success of a classifier depends on the quality of the data used for training. The deep belief network can be seen as a multi-layer generative model where each layer consists of multiple nodes, similar to a neural network. The first layer, often referred to as the visible layer, represents the raw input data, while every higher-level layer is referred to as a hidden layer. It is trained in two steps – first unsupervised, then supervised. For the unsupervised phase, Restricted Boltzmann machines [13] are used. The Restricted Boltzmann machine is a generative, energy-based model that shares its parametrization with the neural network. Restricted Boltzmann machines are trained by maximizing the probability of the data:

arg max_W ∏_{v∈V} P(v)    (1)

where v represents a raw data instance (the visible layer of pixels when training on images), and the probability is expressed through an energy:

P(v, h) = e^{−E(v,h)} / Z    (2)
E(v, h) = −bᵀv − cᵀh − hᵀWv    (3)

where v represents the raw data instance, h the response of the hidden units, b_i and c_i are the offsets associated with a single element from v or h, W_ij are the weights associated with each pair of units from different layers, and Z is a normalization factor. After unsupervised training, the deep belief network is fine-tuned by back-propagation [16]. A minimal sketch of this two-phase training is shown below.
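As a rough illustration of the two-phase scheme described above, the sketch below stacks scikit-learn BernoulliRBM transformers and tops them with a logistic layer; this only approximates the supervised fine-tuning phase (a full DBN fine-tunes all weights with back-propagation), and all shapes and hyperparameters are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# X: cell images flattened to [0, 1]-scaled pixel vectors; y: feature labels
X = np.random.rand(200, 64 * 64)   # placeholder data
y = np.random.randint(0, 2, 200)

dbn = Pipeline([
    ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.01, n_iter=20)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.01, n_iter=20)),
    ("clf", LogisticRegression()),   # stands in for supervised fine-tuning
])
dbn.fit(X, y)
```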

3 Data and approach outline

3.1 Data

As said in the introduction, this case study focuses on a cell classification problem considered as part of a contest at the International Conference on Pattern Recognition in 2012. The original dataset is a set of 28 images, where each image contains a number of cells. These images were manually segmented into separate cells by human experts, leading to a second dataset containing 1456 images of individual cells. For these individual cells, it is known which original image they were extracted from; that is, we have information about which cells were originally on the same image. This is important information we plan to utilize; although it is trivial to extract, none of the previous approaches uses it, as they have no way to integrate it. The images in this dataset also contain a different kind of cells, namely mitotic cells. Mitotic cells are cells that have already started dividing at the moment an image is taken. For this particular use case, they are considered very important – depending on the pattern type, mitotic cells can take different forms. This information typically helps physicians in making decisions, yet none of the previous approaches uses it. The main problem with mitotic cells is that they might not appear on every image. For this particular dataset, there are approximately 70 mitotic cells compared to 1456 regular cells, while 3 out of the 28 images do not contain any mitotic cells. In that sense, the information about mitotic cells is often missing. Although not used by other methods, the dataset provides the information about the mitotic cells. We later show how methods from Inductive Logic Programming (ILP) [17], a family of rule induction methods that rely on first-order logic for data representation, allow us to elegantly incorporate this information while bypassing the problem of missing data.



Fig. 2. System scheme

3.2 Our approach

Our approach is illustrated in figure 2. Compared to a black-box model, our approach proceeds in two steps:

1. it first assigns to each image a set of symbolic features,
2. based on these symbolic features, it assigns to each image the pattern class.

The extraction of symbolic information from raw images is the key component of our system. This is done by the deep learning methods explained earlier. The set of symbolic features is pre-defined and extracted from medical texts. The goal of this case study is to investigate to what extent the relationship between raw images and their classification can be made more interpretable by building models in which the sub-symbolic component is isolated from the symbolic, interpretable component.

Extracting features As our goal is to work with features that make sense to human experts, we have searched the medical literature [8, 9] for features used by humans when classifying this type of cells. As the ANA test is based on visual interpretation, we restrict ourselves to features that describe visual properties of the cells. This led to the following list of features and their possible values:

– shape: circular, irregular
– fluorescence intensity level: positive, intermediate
– structure: homogeneous, speckled
– organelle type: dark, bright, neutral
– organelle number: none, few, lots
– texture: smooth, sparkly, blob

These six features describe purely visual properties of a cell and can easily be labelled. We have manually annotated all cells with the values for these features. These features now serve as labels for a classifier mapping a raw image to the predefined set of symbolic features. As the importance of mitotic cells was previously discussed, we include this information in our model. This feature takes as its value the type of the mitotic cells that were present on the same image as the cell of interest. All mitotic cells



on the same image are of the same type. The mitotic cells were also manually segmented by human experts. However, it is important to note a difference between the six visual features described above and the mitotic cell information: the latter cannot be derived from a cell image in isolation; information about other cells in the same original image is needed. How these features are learned is explained in section 4. Having each image now described with these features, we can run any rule induction algorithm to learn how to detect the target patterns.

Utilizing collective classification As previously mentioned, the original images consist of many cells that are later manually segmented into individual cells. The fact that all cells from the same image have to be of the same type can be further utilized to gain performance. This scenario strongly resembles collective classification [10]. In collective classification, related instances (or objects to be classified) are classified not just based on their own attribute values, but also based on the attribute values and class labels of the related instances. In this specific use case, this means that each cell is classified not only by its attributes, but also by looking at the attributes of other cells on the same image. One may argue that in this case, classifying each cell individually and taking the majority vote as the final class for each cell is enough. While that is true for high-quality images containing a lot of cells, it is not true for low-quality images containing only a couple of cells (which is more often the case). When an image is of low quality, a classifier will most likely make many mistakes. If there is a small number of cells on the same image, it might be very difficult to establish a majority vote for one class confidently. On the contrary, in the collective classification setting, where all cells are classified as a whole, the classifier's decision will be mostly influenced by the cells that can confidently be predicted as a certain class. To see how exactly collective classification might help, imagine you are given an image containing a number of cells, and you are about to classify a cell that, given its attributes, cannot be confidently assigned to a particular class by a model learned from data. For the sake of illustration, assume also that the model outputs a probability distribution over classes. If the model cannot make a confident prediction about a given cell, its output can be seen as an approximately uniform distribution over classes. However, with collective classification, we can impose the restriction that every cell on the same image has to belong to the same class. In that case, if the system figures out that there is a certain cell on the same image that can be confidently classified as a particular class b, the system will use that information to increase the probability of the uncertain cell being of class b. We omit many details here due to space restrictions and refer the reader to [21]. A sketch of this idea is given below.
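The effect described above can be approximated by a simple aggregation rule: under the hard constraint that all cells of an image share one class, one picks the class maximizing the joint probability (assuming independence across cells). This is only an illustrative simplification of the MLN-based inference actually used [20, 21]; the function below is a hypothetical helper.

```python
import numpy as np

def collective_classify(cell_probs):
    """Assign one class to all cells of an image by combining
    per-cell class probabilities under the shared-class constraint.

    cell_probs: array of shape (n_cells, n_classes), each row a
    probability distribution predicted for one cell of the image.
    """
    # Summing log-probabilities lets confident cells dominate the
    # decision, while near-uniform (uncertain) cells contribute little.
    log_scores = np.log(cell_probs + 1e-12).sum(axis=0)
    return int(np.argmax(log_scores))
```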



4 Learning symbolic features

Having defined the features used in the intermediate layer, we need to build models that extract the values of these features from cell images. None of these models are straightforward: the definitions of the features are to some extent subjective. Therefore, the models are learned from the dataset. This can be seen as supervised feature construction – we know which features we want (from the medical literature), but are unable to specify a model for them. A different model is learned for each feature, in a supervised manner, using the manual annotations as examples. For all features except Shape and Fluorescence Intensity Level (briefly, Intensity) a deep belief network [12] was trained, as these are known to work well for identifying visual properties of images. A separate network with Bernoulli units was trained for each feature. Shape and Intensity are learned in the following way:

Shape: Visual shape classification is a well-studied topic in computer vision and methods suitable for our goal already exist. We adopt the following method, motivated by Belongie et al. [15]. Each individual cell image is divided into 4-by-4 blocks, and for each block, the proportion of pixels inside the extracted segment is calculated. A support vector machine [14] with a radial basis function (RBF) kernel is then trained, using these 16 proportions as input features.

Intensity: This describes the clarity of the cells in an image. Determining the fluorescence intensity level is a separate task in the ANA workflow. The medical literature does not provide a precise definition for it, only a provisional ranking of four possibilities [4], described in terms of how easy it is to distinguish cells from the background. All approaches mentioned in Section 2 suggest recognizing only two classes – positive, when cells are clearly distinguishable from the background, and intermediate, when it is difficult to distinguish cells from the background. Our method to estimate the fluorescence intensity level works as follows. It starts from the observation that, although cells express different intensities across an image, the background is always constant and darker compared to the cells. The major assumption taken here is that each image histogram (the distribution of grayscale intensities across an image) can be segmented into two distinct parts – one representing the background and the other representing the cells. Following this intuition, we approximate every image histogram with a mixture of two Gaussian distributions. In images with positive intensity, the two components should be well separated from each other, while in images with intermediate intensity they should be relatively close. To classify cells as having positive or intermediate fluorescence intensity level, an SVM with an RBF kernel is trained that uses the mean and variance of both fitted Gaussians as inputs. A sketch of this intensity pipeline is given below.
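A minimal sketch of the intensity pipeline, assuming grayscale images arrive as 2D NumPy arrays and that scikit-learn's GaussianMixture stands in for whatever fitting procedure was actually used; preprocessing details may differ from the original system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def intensity_features(image):
    """Fit 2 Gaussians to the grayscale pixel distribution and return
    the means and variances of both components."""
    pixels = image.reshape(-1, 1).astype(float)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(pixels)
    return np.concatenate([gmm.means_.ravel(), gmm.covariances_.ravel()])

# images: list of 2D arrays; labels: 0 = intermediate, 1 = positive
def train_intensity_classifier(images, labels):
    X = np.array([intensity_features(img) for img in images])
    return SVC(kernel="rbf").fit(X, labels)
```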



5 Results

As we already said, our goal in this work is to map raw images to a set of predefined symbolic features that allow interpretable models to be used to learn the domain. Although any rule-based induction algorithm can be used, here we have focused on methods from Inductive Logic Programming (ILP) [17] and its probabilistic extension. The main reason why we focus on these models is that they allow us (1) to easily incorporate missing information (which is necessary for the mitotic cells) and (2) to make use of collective classification. We have chosen to compare FOIL [18] and Aleph [19] as ILP methods, and their probabilistic extension in Markov logic networks [20]. We leave out the details of how to train such models and point the reader to the references, but emphasize that these models use first-order logic as their knowledge representation, which makes them interpretable. We focus on answering the following questions:

1. how well does our model with interpretable features compare to black-box models from prior work?
2. how well does our model perform when information about mitotic cells is added, compared to the baseline in 1)?
3. how well does our model perform when collective classification is performed, compared to the baseline in 1)?

An important thing to notice here is that questions 2) and 3) do not allow us to perform any comparison with prior work, as to the best of our knowledge none of the previously used methods uses this particular information (mitotic cells and image location information). However, our goal is to test how much this information can help in this prediction task, together with the interpretable features we learn. We first test our approach using ground truth features – assigned by humans – to test the usefulness of the selected features. Finally, we test our approach in the full setting – we first use deep belief networks to learn the features, and then use those learned features to classify cells.

5.1 Experimental settings

For our experiments, we used the dataset from the ICPR 2012 contest¹. The original dataset is a set of 28 images, where each image contains a number of cells. These images were manually segmented into separate cells by human experts, leading to a second dataset containing 1456 images of individual cells. The correct symbolic feature values described in section 4 were manually assigned to each cell. As mentioned before, the features are designed to represent simple visual shapes, so that expert knowledge about the domain is not necessary.

¹ http://mivia.unisa.it/datasets/biomedical-image-datasets/hep2-image-dataset/



In each of our experiments, we used 10-fold cross-validation to evaluate our approach. To fully utilize the strengths of relational learners, the folds are created at the image level – folds represent non-overlapping partitions of the set of original images (each containing a number of cells). This ensures that individual cells from the same image do not appear in both the training and test sets. This slightly differs from the usually employed cross-validation setting, but it is crucial for properly testing relational learning methods. We report the accuracy of the classifiers for each experiment. The dataset with learned features was created in the same way – we use 9 folds to learn the model parameters, as proposed in section 4, and fill in the values in the remaining fold. The predictions on the left-out folds are then aggregated into a new dataset with features learned by the system. A sketch of such image-level folds is shown below.
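Image-level folds of this kind can be reproduced, for example, with scikit-learn's GroupKFold, grouping cells by their source image; the variable names below are illustrative.

```python
from sklearn.model_selection import GroupKFold

# X: per-cell feature matrix, y: per-cell pattern labels,
# image_ids: for each cell, the id of the original image it came from
def image_level_folds(X, y, image_ids, n_splits=10):
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(X, y, groups=image_ids):
        # no image contributes cells to both sides of the split
        yield train_idx, test_idx
```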

5.2 Tests with the ground truth data

We first evaluate our model using only the manually assigned features that describe the visual properties of each cell. We first exclude the mitotic cells from the feature set and classify cells using only their visual properties. The results are summarized in Table 1. Due to space limitations, we present only the accuracy of each model for each setting. The test with the ground truth data corresponds to the first row of the table. The results show that the state-of-the-art solution proposed by Xu et al. [7], with an accuracy of 95.59%, performs significantly better than the logic-based approaches chosen for our work. This is a somewhat expected result, as we have to sacrifice expressiveness to gain interpretability. As we try to use features understandable by humans, the performance is bounded by their expressiveness. It is worth noting that FOIL performs as well as the human expert on the same dataset [5]. Sophisticated image analysis methods might find enough information to separate difficult cases even without mitotic cells, but it would be extremely difficult to tailor understandable features to express those differences. This experiment answers the first question. Although our model performs worse than the state-of-the-art solution, we believe it makes a step forward in making these models interpretable to human experts. To answer the second question, we include the information about the mitotic cells in the dataset used to train the model. The results are shown in the second row of Table 1. It is immediately clear from the results that mitotic cells play an important role in the diagnostic procedure, as the difference in accuracies is substantial. In this case, the results are comparable to the state of the art. These results demonstrate that the interpretable features defined, together with the mitotic cells, are sufficient for the task, and that they sacrifice performance only slightly. Note again that this is an unfair comparison with prior work, as


Table 1. Performance of the classifiers in different settings

Settings                 Xu et al.   MLN     FOIL    Aleph
visual features          95.59       74.05   81.45   40.41
mitotic cells included   –           93.05   93.30   84.06
complete information     –           94.88   98.00   89.35
learned features         –           89      97.32   89.28

mitotic cells were not used there, but our aim is to show how this obviously important information can be easily integrated into a model using simple ILP techniques. Finally, we have tested our model in the collective classification setting. To achieve collective classification, we had to add a logical predicate SameImage(x, y) that evaluates to true when cells x and y are located on the same image (we leave out the details of how collective classification is performed in the system). The mitotic cells were also included in this experiment. The results are presented in the third row of Table 1. Not surprisingly, collective classification clearly helps and increases the performance of the system. By combining both the information about mitotic cells and collective classification, FOIL even outperforms the state-of-the-art approach. This is a very pleasing result for the following two reasons:

1. it outperforms the state-of-the-art approach while maintaining an interpretable representation that sacrifices a lot of expressivity,
2. it mimics the setup of the test in practice, and at the same time makes use of relational information other systems cannot easily incorporate.

5.3 Tests with the features learned by the system

The previous section aimed at demonstrating the suitability of the predefined set of interpretable features for this task. In this section, we evaluate our system in full. We first train the deep belief network to assign symbolic features to a given cell, as described in section 4. Then, we use those features to predict the class of each cell. As some symbolic features will be mislabelled, this evaluates the robustness of our approach given imprecise data. Mitotic cells are included in these experiments, as well as the collective classification setting, as they led to the most successful results. Table 1, final row, lists the classification performance when learning from the dataset with learned features. Compared to the dataset containing the true features, the performance drops slightly, but not dramatically. This shows that even with the noisy information that is inherent to automatic feature extraction, quite accurate classification can be obtained. More importantly, using collective classification seems to provide more stable results, as the misclassified features affect the classification accuracy only slightly.



6 Conclusion

Learning symbolic, interpretable representations from images is a very difficult task, but necessary in many domains. An example of such a domain is medical diagnostic procedures based on the visual interpretation of images. In this paper we presented a case study of detecting antibody patterns from images, demonstrating the benefit of using ILP methods for the task. The outcomes of the paper are three-fold. First, we have proposed a method that constructs interpretable features for this application domain. The construction of such interpretable models from sub-symbolic data is a non-trivial task. In our approach we first identify and define symbolic features that are interpretable to humans and demonstrate how these features can be learned automatically by means of deep belief networks. Second, we have demonstrated the benefit of using the information about mitotic cells for the task. Related approaches currently ignore this information, mainly because mitotic cells do not appear on every image and raise the question of how to represent missing information. However, ILP methods provide us with an elegant way to include this information. Finally, we have demonstrated the benefits of using collective classification for the task. Experiments show that, on the domain considered, this interpretable model can achieve accuracy comparable to black-box models, and even outperform them. This is a positive result, as the final goal of the work is to build an interpretable model that sacrifices performance as little as possible. Deep learning shows itself to be a promising approach in this direction. Within this particular application, other possible future work includes automatic mitotic cell detection and artefact removal, as well as broader experimentation with deep learning approaches and automatic segmentation of individual cells.

Acknowledgements This work is funded by the KU Leuven Research Fund (project IDO/10/012).

References
1. Radlinski, F., Joachims, T.: Query chains: Learning to rank from implicit feedback. In: International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD, 239–248 (2005)
2. Harder, N. et al.: Automatic analysis of dividing cells in live cell movies to detect mitotic delays and correlate phenotypes in time. Genome Research 19(11), 2113–2124 (2009)
3. Godinez, W. et al.: Deterministic and probabilistic approaches for tracking virus particles in time-lapse fluorescence microscopy image sequences. Medical Image Analysis 13, 325–342 (2009)


4. Rigon, A., Soda, P., Zennaro, D., Iannello, G., Afeltra, A.: Indirect immunofluorescence in autoimmune diseases: Assessment of digital images for diagnostic purpose. Cytometry Part B: Clinical Cytometry 72B, 472–477 (2007)
5. Foggia, P. et al.: Benchmarking HEp-2 cells classification methods. IEEE Trans. Med. Imaging 32, 1878–1889 (2013)
6. Agrawal, P., Vatsa, M., Singh, R.: HEp-2 cell image classification: A comparative analysis. Lecture Notes in Computer Science, Machine Learning in Medical Imaging, vol. 8184, Springer International Publishing, 195–202 (2013)
7. Xu, X. et al.: Linear Local Distance coding for classification of HEp-2 staining patterns. Winter Conference on Applications of Computer Vision, 393–400 (2014)
8. Wiik, A.S., Høier-Madsen, M., Forslid, J., Charles, P., Meyrowitsch, J.: Antinuclear antibodies: A contemporary nomenclature using HEp-2 cells. Journal of Autoimmunity 35(3), 276–290 (2010)
9. Bolon, P.: Cellular and molecular mechanisms of autoimmune disease. Toxicologic Pathology 40(2), 216–229 (2012)
10. Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., Eliassi-Rad, T.: Collective classification in network data. AI Magazine, vol. 93, 93–106 (2008)
11. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends in Machine Learning, vol. 2, Now Publishers Inc., 1–127 (2009)
12. Hinton, G.E., Osindero, S., Teh, Y.: A fast learning algorithm for deep belief nets. Neural Computation, vol. 18, MIT Press, 1527–1554 (2006)
13. Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltzmann machines. In: 25th International Conference on Machine Learning, ACM (2008)
14. Vapnik, V., Cortes, C.: Support-vector networks. Machine Learning, vol. 20, Kluwer Academic Publishers, 273–297 (1995)
15. Belongie, S., Malik, J., Puzicha, J.: Matching shapes. In: International Conference on Computer Vision, vol. 1, 454–461 (2001)
16. Murphy, K.: Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series). MIT Press (2012)
17. Lavrač, N., Džeroski, S.: Inductive Logic Programming: Techniques and Applications. Routledge (1993)
18. Quinlan, J.R.: Learning logical definitions from relations. Machine Learning, vol. 5, 239–266 (1990)
19. Muggleton, S.: Inverse entailment and Progol. New Generation Computing, vol. 13, 245–286 (1995)
20. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1-2), 107–136 (2006)
21. Crane, R., McDowell, L.: Investigating Markov logic networks for collective classification. In: ICAART (2012)


Learning to rank chemical compounds based on their multiprotein activity using Random Forests

Damian Leśniak, Michal Kowalik, and Piotr Kruk

Faculty of Mathematics and Computer Science, Jagiellonian University, Cracow, Poland.
smp.damian.lesniak@student.uj.edu.pl

Abstract. In this study we investigate the following problem from the field of drug design: suppose we are given a list of chemical compounds described using chemical fingerprints and a set of proteins. Each compound can be active or inactive towards a specific protein, and we are looking for as many active pairs as possible. Our task is to rank the compounds from the most to the least promising one before we start conducting laboratory tests, as those are often expensive and time-consuming. We use the fact that previous experiments on different compounds, available in databases, form the training set, and propose a Random Forest [1] based model which ranks the new compounds. Using this method we won the GMUM Challenge competition associated with the Machine Learning in Life Sciences Workshop at the European Conference on Machine Learning, ECML PKDD 2015.

Keywords: chemical compounds activity, chemical fingerprints, Random Forest, GMUM challenge

1 Introduction

For a biochemistry research group whose goal is to find new chemical structures that are likely to become commercial drugs, it is necessary to narrow down the set of all available chemical compounds to those worth further research. The most interesting ones are those active towards certain proteins. Obviously, the budget and human resources are limited, so it would be useful to sort the compounds under consideration, placing the most promising ones on top of the list, before starting laboratory tests. Nowadays, machine learning methods are more and more popular in the field of computer-aided drug design [2][3] and have shown their effectiveness, so it might be useful to have a statistical model that performs the task of ranking chemical compounds. To create such a model, we need to choose a representation of chemical structures that is easy to process. Chemical fingerprints [4][5] – functions constructed as a set of predicates describing some features of the compound (mostly binary, e.g. the existence or not of some substructure) – are well suited for this task. See Figure 1 for an example:


Fig. 1. Chemical compound (on the left) and sample fingerprint [6] consisting of the set of predicates φi .
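To make this concrete, a fingerprint is just the vector of predicate evaluations. The toy sketch below is purely illustrative – real fingerprints use proper substructure matching via cheminformatics toolkits, not SMILES substring tests as here:

```python
# each predicate phi_i answers one yes/no question about the compound
predicates = [
    lambda smiles: "O" in smiles,         # contains oxygen?
    lambda smiles: "N" in smiles,         # contains nitrogen?
    lambda smiles: "c1ccccc1" in smiles,  # naive aromatic-ring test
]

def fingerprint(smiles):
    return [int(phi(smiles)) for phi in predicates]

print(fingerprint("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> [1, 0, 1]
```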

We also need a training dataset. Large databases of chemical compounds and their activities towards proteins are available to researchers [7], so we assume that the chemical compounds may be arbitrary, but the proteins were tested with some other compounds, and the results are known to us. We hope that the model will capture the relation between the presence of certain substructures and activity, so that in the future we will be able to predict compounds' behaviour by looking at their fingerprints. This problem was the subject of the GMUM Challenge Competition 2015, organised by the Group of Machine Learning Research at the Jagiellonian University in Cracow in cooperation with the Institute of Pharmacology, Polish Academy of Sciences, Cracow. We will be using the metric proposed by the organisers to evaluate our results (described in detail in the following section). We define the value of a chemical compound as the sum of the values of the proteins that the compound is active towards (proteins' values may be chosen arbitrarily to meet our needs; the organisers proposed a set of values based on the number of training examples for each protein in the dataset). The ranking of compounds is built using this quantity. Our winning solution uses a Random Forest classifier, implemented in Python 2.7 with the scikit-learn [8] and NumPy [9] libraries. It tries to predict the activity of each chemical compound and then ranks the compounds based on their expected value. Other solutions considered [10] for this problem were: feed-forward neural networks, Extreme Learning Machines [11], and Support Vector Machines [12].

2 The GMUM Challenge Problem

The competitors were given a list of 14891 chemical compounds with labelled activity against a set of 24 proteins. For each compound–protein pair there were three possible labels: active (1), inactive (−1), and unknown activity (0). The matrix of labels was sparse – only around 8% of the labels were nonzero. The task was to sort a list of N = 9928 unlabelled chemical compounds so as to maximize the following score:

V(c_1, ..., c_N) := (1/N) · Σ_{i=1}^{N} [ (Σ_{j=1}^{i} v(c_j)) / (max_σ Σ_{j=1}^{i} v(c_{σ_j})) ],

where σ is any permutation of the set {1, ..., N} and

v(c) := Σ_{p∈a(c)} 1/√P(p) + 0.001 · Σ_{p∈u(c)} 1/√P(p),
where a(c) is the subset of proteins labelled as active for a given compound c, u(c) is the subset of proteins labelled with unknown activity, and P(p) is a predefined constant associated with the protein p (known to the competitors, varying from 0.00328779 to 0.15854460; the smaller the constant, the more valuable the protein). As we can see, a model which perfectly predicts the true activity of chemical compounds (so it never assigns the zero label) might not get the perfect score – it is also very important to predict whether a given unlabelled compound–protein pair has been tested for activity (the model must predict the decisions of real-life chemists). The other obvious observation is that to achieve the best score, we should sort the compounds from the most to the least valuable one. The competitors were also given a validation set – a list of 4964 unlabelled chemical compounds – to test their predictions via the competition website. The limit was set to three submissions per day. Each compound was represented by a binary fingerprint of length d (d = 6231, but it was only known that d is greater than 6000). The vectors (denoted ϕ) were then anonymized using a predefined secret hash function h in the following way:

{0, 1}^d ∋ ϕ ↦ Φ, where h : {1, ..., d} → {1, ..., 1000} and Φ_i := Σ_{j : h(j)=i} ϕ_j.

The new representation consisted of vectors of natural numbers (less than or equal to 7), and the original vectors, together with the hashing function, were kept secret from the competitors.
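The anonymization step is easy to reproduce in NumPy; the hash function below is, of course, only an illustrative stand-in for the organisers' secret one.

```python
import numpy as np

def hash_fingerprint(phi, h):
    """Collapse a binary fingerprint phi of length d into a
    1000-dimensional count vector Phi via Phi_i = sum_{j: h(j)=i} phi_j."""
    # h: array of length d with values in {0, ..., 999}
    return np.bincount(h, weights=phi, minlength=1000)

d = 6231
rng = np.random.RandomState(0)
h = rng.randint(0, 1000, size=d)      # illustrative stand-in hash
phi = rng.randint(0, 2, size=d)       # a random binary fingerprint
Phi = hash_fingerprint(phi, h)
```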

3 Methods

In this section we present all the approaches we considered while trying to solve the posed problem. We start with a few words about preprocessing the dataset, then present our best method, based on Random Forests, and follow with subsections about models using the Perceptron, SVMs, and neural networks.


3.1 Data Preprocessing

Three types of preprocessing were considered:

– cloning features – due to the hashing, each feature was actually a sum of up to seven original features; we decided to clone each feature as many times as the largest value of this particular feature in the combined dataset (training, validation, and test set) – this approach was supposed to discard those original features that were constantly equal to zero and to represent the others equally in the dataset; it gave slightly better results (a sketch of this step is shown below),
– discarding proteins – we tried to discard the most valuable proteins, as they were the least tested ones (the most valuable one had only 196 out of 14891 chemical compounds labelled as nonzero), and falsely predicting a compound to be active towards those proteins generated a large error in the v′ value – it was a rather poor way of dealing with an unbalanced dataset, but, surprisingly, in some cases this resulted in a small score improvement,
– reducing the number of features – the most important features were selected with a Random Forest (as described below), and then the final model was trained (in the best case it was a second Random Forest) – this resulted in a significant improvement of the score.

Due to the lack of time we were unable to test other kinds of preprocessing. We think that trying to deal with the class imbalance problem would certainly contribute to achieving better results.
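A minimal sketch of the cloning step, assuming the hashed features form the columns of an integer NumPy matrix; each column is repeated as many times as its maximum value over the combined data.

```python
import numpy as np

def clone_features(X_combined, X):
    """Repeat each hashed feature column of X as many times as its
    maximum value in the combined (train + validation + test) data.
    Columns that are always zero are dropped."""
    max_vals = X_combined.max(axis=0).astype(int)
    cols = [np.repeat(X[:, j:j + 1], m, axis=1)
            for j, m in enumerate(max_vals) if m > 0]
    return np.hstack(cols)
```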

3.2 Random Forest

We decided to use Random Forests for prediction and dimensionality reduction because of their simplicity and effectiveness. The algorithm was implemented using Python 2.7 with scikit-learn (the library containing the implementation of the Random Forest classifier, further abbreviated as RF) and NumPy. A trained RF model can generate a probability distribution on the set {−1, 0, 1} for each compound-protein pair (with an arbitrary compound description and one of the 24 given proteins). It can also calculate the (real-valued) importance of a feature used to describe a chemical compound – this was particularly useful for reducing the number of features. In the first pass we used an RF with 500 trees to select the most important features. Afterwards a second RF was trained using 1000 trees. Other parameters were set to default: the square root of the total number of features was considered when splitting, nodes were expanded until all leaves were pure, and bootstrap samples were used when building trees. The value of a chemical compound was calculated as:

v′(c) := Σ_{p∈Pr} (1/P(p)) · (p_a(c, p) + 0.001 p_u(c, p)),

where Pr is the set of proteins, P(p) is defined as in the previous section, p_a(c, p) is the probability that the compound c is labelled as active towards the protein p, and p_u(c, p) denotes the probability that the activity is unknown – in other words, v′ is just the expected value of v defined in the previous section. This approach gave better results than, for example, choosing the most probable label in the first place and then calculating the v value, because the evaluation metric behaves well if we replace the true value of each compound with the true value plus some small random error, and is more unstable if we try to randomly change some labels and recalculate the compounds' values. That is why we should be more concerned with estimating the compounds' true values rather than with having as many true labels as possible. Finally, the compounds were sorted according to their v′ value in descending order.
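A minimal sketch of this expected-value scoring with scikit-learn (one RF per protein; the data arrays and the constants P(p) are assumed given):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def expected_values(X_train, y_train, X_valid, P_const):
    """y_train: array of shape (n_samples, n_proteins) with labels in {-1, 0, 1}.
    P_const: dict mapping protein index to its constant P(p).
    Returns the estimated v'(c) for every compound in X_valid."""
    v = np.zeros(len(X_valid))
    for p in range(y_train.shape[1]):
        rf = RandomForestClassifier(n_estimators=1000)
        rf.fit(X_train, y_train[:, p])
        proba = rf.predict_proba(X_valid)  # columns follow rf.classes_
        classes = list(rf.classes_)
        p_active = proba[:, classes.index(1)] if 1 in classes else 0.0
        p_unknown = proba[:, classes.index(0)] if 0 in classes else 0.0
        v += (p_active + 0.001 * p_unknown) / P_const[p]
    return v  # rank compounds by v in descending order
```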

Fig. 2. Diagram representing the winning Random Forest approach: cloning features → discarding proteins → training a Random Forest and selecting the 100 best features → training a second Random Forest → predicting compounds' expected values → creating the ranking.

The results for different parameters are shown in Table 1. In the end, we decided to use the 100 most important features and discard one protein, and we achieved a score of 68.85% on the test set. Results on the validation set are visualized in Figure 3. Each point represents one chemical compound: the x-coordinate stands for its true position in the compounds ranking (with 1 being the least and 4964 the most valuable compound), and the y-coordinate represents its predicted position. The perfect prediction would be a diagonal line. The points seem to be scattered randomly, and we can observe some regions of higher density lying off the diagonal. This is due to the hashing of the features, and in Section 4 we will see that the model performs much better on the unhashed dataset. To sum up, our winning method for the GMUM Challenge competition can be described as a two-stage process: designing a model that would perform well on the unhashed dataset (a Random Forest used twice), and feeding it with a preprocessed dataset, where the preprocessing attempts to unhash the features (feature cloning).

Features    24 proteins      23 proteins      22 proteins
all         0.6617/0.6610    0.6651/0.6626    0.6498/0.6566
70          0.6697/0.6699    0.6693/0.6694    0.6598/0.6661
77          0.6741/0.6738    0.6773/0.6740    0.6687/0.6671
85          0.6770/0.6759    0.6797/0.6764    0.6724/0.6681
93          0.6742/0.6758    0.6759/0.6773    0.6702/0.6724
100         0.6734/0.6767    0.6751/0.6812    0.6701/0.6731
107         0.6719/0.6772    0.6751/0.6776    0.6690/0.6735
115         0.6731/0.6762    0.6727/0.6781    0.6632/0.6746

Table 1. Score on the validation set without/with feature cloning (RF).

Fig. 3. Results calculated on the validation set (RF) – higher valued compounds to the right in each figure.

3.3 Perceptron and SVM

The baseline solution provided by the organisers was implemented in Python 2.7 with the NumPy, SciPy, and scikit-learn libraries. 24 classifiers were used, each trying to predict whether a given compound is active or inactive towards a particular protein (the unknown state is ignored and treated as the inactive state). The classifiers were Perceptrons [13] trained in 100 iterations on the training set with other parameters set to default, using Stochastic Gradient Descent (SGD). This implementation achieved a score of 0.6351 on the validation set and 0.6244 on the test set. Using a linear SVM (with default parameters) instead of the Perceptron gave only slightly better results.
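A sketch of this per-protein baseline (our rendering with the 2015-era scikit-learn API, where Perceptron took n_iter; newer versions use max_iter):

```python
from sklearn.linear_model import Perceptron

def train_baseline(X_train, y_train):
    """One Perceptron per protein; labels 0 (unknown) are folded into -1."""
    models = []
    for p in range(y_train.shape[1]):
        y = y_train[:, p].copy()
        y[y == 0] = -1                   # unknown treated as inactive
        clf = Perceptron(n_iter=100)     # scikit-learn's SGD-based Perceptron
        models.append(clf.fit(X_train, y))
    return models
```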

3.4 Neural Networks

Neural networks were used for comparison with the RF method. To implement neural network models, we used Python 2.7 with the theanets, ffnet [14], scikit-learn, hpelm [15] and NumPy libraries, as well as the julia language with the julia-ann library. To find the best model, a few approaches were tested, including neural networks trained with the rmsprop [16] algorithm, Extreme Learning Machines (ELM), and pre-training algorithms [17]. To train and validate our models, we used the cross-validation utilities implemented in the scikit-learn library with parameters test size = 0.7 and random state = 0. For all networks we used sigmoid activation functions; for ELM we used a sigmoid function for the first layer and an rbf l2 activation function for the remaining layers. The other parameters were set to default. A few simplifications were used. The output of a network is a vector with values −1/1 (−1: protein is inactive, 1: protein is active). For each protein p we set P(p) = 1 (this works well for testing purposes on the given dataset), and the ranking along with the score are calculated as previously. Some of the best results are presented in Table 2. No experiment with neural networks ever beat the score achieved by Random Forest – using this approach we reached only around 63%.

Method              Architecture of layers       Algorithm    Score
Neural Network      (1000, 400, 400, 24)         rmsprop      52%
Neural Network      (1000, 800, 400, 24)         rmsprop      52%
Neural Network      (1000, 2000, 1000, 24)       rmsprop      54%
ELM                 (1000, 800, 400, 200, 24)    LOO          56%
NN + pretraining    (1000, 800, 400, 200, 24)    rmsprop      59%
NN + pretraining    (1000, 800, 400, 300, 24)    rmsprop      59%

Table 2. Comparison of the results (NN).

The main problem with using neural networks was insufficient computational power – each experiment took approximately 8-12 hours.
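For reference, a rough stand-in for one of these architectures using scikit-learn (a substitution on our part – the original experiments used theanets with rmsprop; here the adam solver stands in, and MLPClassifier postdates the competition-era toolchain):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.multioutput import MultiOutputClassifier

def build_net():
    """(1000, 800, 400, 24): 1000 hashed inputs, two hidden layers,
    24 protein outputs via one binary classifier per protein."""
    base = MLPClassifier(hidden_layer_sizes=(800, 400),
                         activation='logistic',  # sigmoid, as in the paper
                         solver='adam')          # stand-in for rmsprop
    return MultiOutputClassifier(base)           # fit on Y of shape (n, 24)
```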


Using Deep Neural Networks (DNNs) [18] for classification problems is becoming more and more popular. Experiments show that a DNN is capable of achieving record-breaking results [19], but DNNs need a lot of time and computational power. In this competition we did not know much about the structure of the dataset – and with limited time we were unable to find a good network architecture and a suitable set of hyper-parameters. This is the reason why we concentrated on the Random Forest approach – it is much better suited for fast exploratory analysis of big data and, additionally, more resistant to changes of the hyper-parameters.

4 Further Investigation of the Random Forest Method

After the competition was over, the unhashed fingerprints for the training and validation sets were sent to us, and we re-tested the RF approach on them. Nearly two thirds of the features were constant on the training set, so we decided to discard them. In Table 3 we present the scores for different numbers of the most important features chosen. Leaving 50 features scores slightly over 81% on the validation set. The other parameters were set as in the original approach, unless otherwise stated. The results are visualized in Figure 4.

Features    Score
all         0.7277
30          0.7915
40          0.8054
50          0.8115
60          0.8035
120         0.7929

Table 3. Score on the validation set (RF without hashing).

Fig. 4. Results calculated on the validation set (RF without hashing) – higher valued compounds to the right in each figure.

In the bottom-left part of Figure 4 we can clearly see a dense region lying below the diagonal. In our opinion this is due to the fact that the true ranking (horizontal axis) has the following structure (counting from the least valuable compounds):

– over 850 compounds valued less than 0.166 (with different sets of labels, but always inactive or with unknown activity),
– over 250 compounds valued 2.667 and active only towards the least valuable protein (with all but one label equal to zero),
– over 250 compounds valued 2.853 and active only towards the second least valuable protein,
– over 650 compounds valued 2.887 and active only towards the third least valuable protein (the anomalous region),
– nearly 3000 compounds valued more than 2.887.

Compounds' values vary from 0.058 to 120.962 with an average of 6.266. The fourth group constitutes one-seventh of the validation set (the analogous observation is true for the training set); additionally, it consists of compounds with exactly the same set of labels (an uncommon situation) – probably because of that, the model predicts the position of chemical compounds from this group with greater confidence (it is represented by the darker vertical belt of lower density over the region). On the other hand, the compounds from the previous three groups tend to be overvalued and, as a result, get higher positions in the ranking, making the fourth group fill the empty space below its true location. We have created similar charts for other methods (for example, Random Forest without reducing the number of features) and observed that improvement of the score was caused by dense regions migrating towards the diagonal. As the next step, bearing in mind the remarks from the previous paragraph, we would propose investigating the reasons for overvaluing the compounds, especially the least active ones (light regions over the diagonal; the top-left part).

When we discussed the evaluation metric introduced in the GMUM Challenge competition, we observed that the model has to predict both the activity of the compounds and the decisions made by scientists – the latter means predicting which labels are nonzero. As we are more interested in having a model that addresses the former problem, we would also like to find out how the Random Forest performs on a dataset with all labels being different from zero. In practice, creating such a dataset would be extremely expensive, but let us consider the following modification of the GMUM Challenge problem. Suppose that at the stage of generating the ranking of chemical compounds we tell our model which labels are nonzero (but, of course, we keep the information about the activity secret).


We remember that the value of a compound was calculated as follows:

v′(c) := Σ_{p∈Pr} (1/P(p)) · (p_a(c, p) + 0.001 p_u(c, p)),

where p_a(c, p) is the probability that the compound c is labelled as active towards the protein p and p_u(c, p) denotes the probability that the activity is unknown. It would be rational to replace those two values with conditional probabilities (under the condition that the activity of c towards p is unknown) calculated using the following formulas (here P denotes probability):

p_a(c, p) = P(c [labelled as] active towards p | c tested with p) · P(c tested with p),
P(c tested with p) = 1 − p_u(c, p),

assuming that the value P(c tested with p) is known. So, if the activity is unknown (the label is equal to zero), then we replace p_u(c, p) with 1 and p_a(c, p) with 0. On the other hand, if the activity is known, and we want the model to predict it, we replace p_u(c, p) with 0 and p_a(c, p) with p_a(c, p)/(1 − p_u(c, p)) (in the 0/0 case we replace the value with 1/2). We will call this model RF+ and compare its performance with the following alternatives:

– always active – this model always predicts (when asked about a nonzero label) that the compound is active; it is possibly a good approximation of the scientists' intuition, as we may assume that when they test a chemical compound they expect it to be active rather than the opposite,
– random+ – this model assigns a random probability of being active (a random real number from the [0, 1] interval instead of p_a(c, p)/(1 − p_u(c, p))) to every nonzero label it is asked to predict (it should not be mistaken for a model that ranks the compounds in a random order!),
– always inactive – similar to always active, but always predicts −1.

The results are presented in Table 4 and Table 5. As we can see, the RF+ model is significantly better than always active, so the proposed approach predicts more than just the scientific intuition. On the other hand, the high score obtained by the random+ model shows the importance of knowing which labels are nonzero.
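A sketch of the RF+ adjustment for a single compound-protein pair (our reading of the rule above; names are ours):

```python
def adjust(p_active, p_unknown, label_is_nonzero):
    """Replace (p_a, p_u) with conditional probabilities once we are told
    whether the label is nonzero (the RF+ rule). Returns (p_a, p_u)."""
    if not label_is_nonzero:      # activity unknown: the label is zero
        return 0.0, 1.0
    # activity known: condition on the pair having been tested
    if p_unknown == 1.0:          # the 0/0 case
        return 0.5, 0.0
    return p_active / (1.0 - p_unknown), 0.0
```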

Features    Score
all         0.9284
30          0.9365
50          0.9410
80          0.9418
100         0.9421
150         0.9411
200         0.9385

Table 4. RF+ score, obtained with the knowledge of which labels are equal to zero.

RF+       always active    random+    RF        always inactive
0.9421    0.9052           ≈ 0.87     0.8115    0.4732

Table 5. Comparison of different approaches.

5 Conclusions

Our experiments show that with simple machine learning models it is possible to find a good solution for the problem of ranking chemical compounds and predicting their activity. We suppose that our preliminary results might easily be extended in the future. We propose trying the following improvements:

– close investigation of the dense regions lying off the diagonal in Figure 4,
– redesigning the model so that it directly predicts the order of chemical compounds (maximizes the evaluation metric) without the middle step of estimating each compound's value,
– preprocessing the dataset to overcome issues arising from the sparsity of the labels and training a Multi-Task Deep Neural Network – DNNs outperformed other models in the Merck Molecular Activity Challenge, a similar competition held in 2012.

References

1. Breiman, L.: Random Forests. Mach. Learn. 45, 5–32 (2001)
2. Varnek, A., Baskin, I.: Machine Learning Methods for Property Prediction in Chemoinformatics: Quo Vadis? J. Chem. Inf. Model. 52, 1413–1437 (2012)
3. Schneider, G.: Virtual screening: an endless staircase? Nat. Rev. Drug Discov. 9, 273–276 (2010)
4. Yap, C.W.: PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 32, 1466–1474 (2010)
5. Raevsky, O.A.: Molecular structure descriptors in the computer-aided design of biologically active compounds. Russ. Chem. Rev. 68, 505–524 (1999)
6. Czarnecki, W.: Weighted Tanimoto Extreme Learning Machine with Case Study in Drug Discovery. IEEE Comput. Intell. M. 10, 17–27 (2015)
7. Bento, A.P. et al.: The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, 1083–1090 (2014)
8. Pedregosa, F. et al.: Scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830 (2011)
9. Van Der Walt, S., Colbert, S.C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. IEEE Comput. Sci. Eng. 13, 22–30 (2011)
10. Smusz, S., Kurczab, R., Bojarski, A.J.: A multidimensional analysis of machine learning methods performance in the classification of bioactive compounds. Chemom. Intell. Lab. Systems 128, 89–100 (2013)
11. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE International Joint Conference on Neural Networks, vol. 2, pp. 985–990. IEEE Press, Budapest (2004)
12. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
13. Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. Mach. Learn. 37, 277–296 (1999)
14. Wojciechowski, M.: Feed-forward neural network for python. http://ffnet.sourceforge.net
15. Akusok, A., Bjork, K.-M., Miche, Y., Lendasse, A.: High Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications. IEEE Access (2015)
16. Tieleman, T., Hinton, G.: Lecture 6.5 – rmsprop. COURSERA: Neural Networks for Machine Learning (2012)
17. Erhan, D. et al.: The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training. In: Van Dyk, D., Welling, M. (eds.) 12th International Conference on Artificial Intelligence and Statistics, vol. 5, pp. 153–160. JMLR, Clearwater Beach (2009)
18. Hinton, G.E., Osindero, S., Teh, Y.-W.: A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 18, 1527–1554 (2006)
19. Taigman, Y. et al.: DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1701–1708. IEEE Press, Columbus (2014)


One-Class Rotation Forest for High-Dimensional Data Classification

Bartosz Krawczyk and Michał Woźniak

Department of Systems and Computer Networks, Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wroclaw, Poland. e-mail: {bartosz.krawczyk,michal.wozniak}@pwr.edu.pl

Abstract. The advance of high-throughput techniques, such as gene microarrays and protein chips, has had a major impact on contemporary biology and medicine. Due to the high dimensionality and complexity of the data, it is impossible to analyze it manually. Therefore machine learning techniques play an important role in dealing with such data. In this paper we propose to use a one-class approach to classifying microarrays. Unlike canonical classifiers, these models rely only on objects coming from a single class distribution. They distinguish observations coming from the given class from any other possible states of the object that were unseen during the classification step. While having less information to dichotomize between classes, one-class models can easily learn the specific properties of a given dataset and are robust to difficulties embedded in the nature of the data. We show that using one-class ensembles can give results as good as those of canonical multi-class classifiers, while allowing us to deal with imbalanced distributions and unexpected noise in the data. To cope with the high dimensionality of the feature space, we propose a novel approach, the One-Class Rotation Forest. Experimental investigations, carried out on public datasets, prove the usefulness of the proposed approach.

Keywords: machine learning, one-class classification, classifier ensemble, highdimensional data, bioinformatics.

1 Introduction

Contemporary high-throughput technologies produce massive volumes of biomedical data. Transcriptional research and profiling with the usage of microarray technologies are powerful tools to gain a deep insight into the pathogenesis of complex diseases that plague modern society, such as cancer. Recent works on cancer profiling showed without a doubt that gene expression patterns can be used for high-quality cancer subtype recognition [23] – leukemias [18], melanoma [16], breast cancer [6] or prostate cancer [12], to name a few. Identifying cancer properties based on their distinct expression profiles may provide the information necessary for the breakthrough that is required for patient-tailored therapy. Currently there are no distinct rules on how individuals respond to chemotherapy, and existing chemotherapies have in most cases severe side-effects with varying medical efficiency. Due to the massive amounts of data generated by microarray experiments and their high complexity and dimensionality, one requires a decision support system to extract the meaningful information from them. Machine learning is widely used for this task [10], with two distinct areas – unsupervised [24] and supervised learning [15]. In this paper we will focus on the latter. Supervised machine learning is a promising approach for analyzing microarray results in the context of predicting patient outcome. Support Vector Machines are among the most popular classifiers used for this task [1]. Multiple Classifier Systems [25], or classifier ensembles, have gained significant attention in the bioinformatics community in recent years. Random Forest [13] and Rotation Forest [11] ensembles have displayed excellent classification accuracy for small-sample, high-dimensionality microarray datasets, outperforming single-model approaches.

Another important issue is the curse of dimensionality. Microarray data suffer from a relatively small number of objects in comparison to the feature space dimensionality, often reaching several thousands. This causes difficulties for machine learning algorithms, reducing their performance and increasing their computational complexity. Within this data flood, a major number of parameters possess small discriminative power and are irrelevant to the classification process, which makes feature selection a crucial step in microarray analysis [7]. Although there are many applications of machine learning-based decision support systems in bioinformatics, there are still many unresolved problems, such as:

– How to integrate heterogeneous data sources to achieve better insight into the mechanism behind complex diseases?
– How to organize, store, analyze and visualize high-dimensionality data obtained from the biomedical data flood?
– How to deal with the problem of high dimensionality and small sample size, which strongly affects the classification performance and may lead to overfitting, poor generalization and unstable predictors?
– How to cope with difficulties embedded in the nature of microarray data, such as noise or class imbalance, as canonical machine learning classifiers cannot cope with them easily?

In this paper the last two issues are addressed. We propose to analyze microarray data with the usage of one-class classifiers, instead of the commonly applied binary ones. To cope with the high dimensionality and complexity of the problem we apply a novel ensemble approach, the One-Class Rotation Forest. It creates base classifiers on the basis of rotating the feature space with the usage of Principal Component Analysis (PCA). This introduces diversity among base learners and assures that each classifier has a different area of competence. To deal with the numerous features we propose to retain only a small subset of the extracted principal components. This reduces the computational complexity of each model while further improving the diversity among ensemble members. Experiments, based on a set of publicly available microarray datasets, show that the proposed approach maintains good classification accuracy, while displaying improved robustness to atypical data distributions and prevalent noise.

2 One-class classification

The aim of one-class classification (OCC) is to recognize one specific class from a broader set of classes (e.g., selecting horses from all animals). The given class is known as the target class ω_T, while the remaining ones are denoted as outliers ω_O. During learning, only examples from the target class (also known as positive examples) are presented to the learner, while it is assumed that during the exploitation phase new, unseen objects from other classes may appear. OCC problems are common in the real world, where positive examples are widely available but negative ones are hard, expensive or even impossible to gather [2]. Let us consider an engine: it is quite easy and cheap to collect data about its normal work, while collecting observations about failures is expensive and sometimes impossible, because in that case we would have to spoil the engine. Such an approach is also very useful in many practical cases, especially when the target class is "stable" and the outlier class is "unstable". To illustrate this motivation, consider computer security problems such as spam filtering or intrusion detection (IDS/IPS) [14]. Among the several types of classifiers dedicated to OCC, the most popular ones concentrate on estimating a closed boundary for the given data, assuming that such a boundary will describe the target class sufficiently well [21]. The main aim of those methods is to find the optimal size of the volume enclosing the given training points. Too small a volume could lead to overfitting the model, while too big a volume might lead to extensive acceptance of outliers into the target class. Those methods rely strongly on the distance between objects [8]. Boundary methods require a smaller number of objects to properly estimate the decision criterion, which makes them a perfect tool for applications suffering from a small sample size, such as microarray classification. The well-known boundary methods are the one-class support vector machine (OCSVM) [17] and support vector data description (SVDD) [19]. In this work we will use the former.
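For concreteness, a minimal boundary-method example using scikit-learn's OneClassSVM (an illustration on synthetic data, not the paper's R-based experimental setup):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
target = rng.randn(100, 2)               # training data: target class only
outliers = rng.uniform(-6, 6, (20, 2))   # unseen during training

# nu bounds the fraction of target objects allowed outside the boundary
model = OneClassSVM(kernel='rbf', nu=0.1, gamma=0.5).fit(target)
print(model.predict(outliers))           # +1 = accepted as target, -1 = outlier
```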

3 One-Class Rotation Forest

The idea of the One-Class Rotation Forest (OC-RotF) originates from the recent proposal of the One-Class Random Forest (OC-RandF) [5]. Its authors proposed to adapt the Random Forest to the one-class learning scheme, as using a reduced feature space is an attractive property (due to the growing complexity of one-class learners in higher dimensions). However, the proposed OC-RandF has a major drawback: it uses binary trees as base learners. The authors proposed a novel scheme for generating artificial outliers in reduced feature subsets, in order to train the binary trees. They argued that the proposed classifier can outperform many other one-class learners. However, this is not a canonical one-class method; it rather transforms the one-class problem into a binary one and solves it with a two-class approach. One can see how strong a dependency lies between the classification model and the quality of the generated outliers. Additionally, binary trees create a dichotomization hyperplane, not a data description, so they operate on different principles than one-class learners. Finally, the authors presented a rather dubious experimental analysis, where no proper metric of robustness to outliers was used.

These observations led to the proposal of a novel one-class ensemble learning method that uses one-class classifiers as base learners and is in agreement with the principles of learning in the absence of counterexamples. We selected the Rotation Forest as the basis of our approach, as it preserves the advantages of the Random Forest while using unsupervised feature extraction methods that can work well with one-class classifiers. Let us present the steps for preparing the input dataset for the l-th classifier Ψ^(l) from the pool:

1. Split the feature space X randomly into S subsets. The subsets may be disjoint or intersecting. The standard Rotation Forest assumed disjoint subspaces; however, this prevented us from exploring hidden dependencies between features and limited the usage of this ensemble to small-dimensional datasets. In OC-RotF we propose to allow intersecting subsets, just as in the Random Subspaces approach. We assume that all subspaces use an identical number of features N_S.
2. For each of the S feature subsets, draw a bootstrap sample B_s of training objects equal to 75% of the original dataset.
3. Run a PCA algorithm on the s-th subset of features and the corresponding bootstrap sample of training objects. In order to cope with the high dimensionality of the data, we preserve only the first 10% of the principal components. Store the extracted coefficients for the s-th subset for the l-th classifier as a vector [y_l^(1), y_l^(2), . . . , y_l^(N_S)]. Please note that it is possible that some of the coefficients will be equal to zero.
4. Organize the obtained vectors into a sparse rotation matrix R:

R = [ [y_1^(1), y_1^(2), . . . , y_1^(N_S)]    [0]    · · ·    [0]
      [0]    [y_2^(1), y_2^(2), . . . , y_2^(N_S)]    · · ·    [0]
      ...
      [0]    [0]    · · ·    [y_S^(1), y_S^(2), . . . , y_S^(N_S)] ]     (1)

The rotation matrix will be of dimensionality d × N_S.
5. Train the l-th classifier using the objects from B_s and R as the new feature space input.

The pseudocode of the proposed OC-RotF method is given in Algorithm 1.


Algorithm 1 Overview of the OC-RotF algorithm
Require: TS – training set, X – feature space, L – size of the ensemble, S – number of feature subsets, N_S – size of feature subsets, Ψ – one-class classifier
1:  for l = 1 to L do
2:    split X into S subsets of size N_S
3:    for s = 1 to S do
4:      B_s ← bootstrap sample from TS with features from the s-th subset
5:      apply PCA on B_s to obtain coefficients
6:      retain the first 10% of principal components
7:    end for
8:    R_l ← arrange the coefficients as in Eq. (1)
9:    Ψ^(l) ← train a one-class classifier on (B_s, R_l)
10: end for
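A condensed Python sketch of one ensemble member (our own illustrative rendering, using scikit-learn's PCA and OneClassSVM; parameter names and the 75% sample drawn without replacement are our assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

def train_member(X, S=10, n_sub=50, seed=0):
    """One OC-RotF member: per-subset PCA on a bootstrap sample,
    then an OCSVM trained in the rotated space."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    blocks, subsets = [], []
    for s in range(S):
        feats = rng.choice(d, n_sub, replace=False)         # intersecting subsets allowed
        boot = rng.choice(n, int(0.75 * n), replace=False)  # 75% sample of objects
        k = max(1, n_sub // 10)                             # first 10% of components
        pca = PCA(n_components=k).fit(X[np.ix_(boot, feats)])
        blocks.append(pca.components_)                      # k x n_sub coefficients
        subsets.append(feats)
    def rotate(Z):
        # apply the block-diagonal rotation of Eq. (1) to data Z
        return np.hstack([Z[:, f].dot(B.T) for f, B in zip(subsets, blocks)])
    clf = OneClassSVM(kernel='rbf', nu=0.1).fit(rotate(X))
    return clf, rotate
```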

In order to combine the base classifiers in our OC-RotF we propose to use the product combination of the estimated supports [20], which is expressed by:

F_ωT(x) = ∏_{l=1}^{L} F_ωT^(l)(x) / ( ∏_{l=1}^{L} F_ωT^(l)(x) + ∏_{l=1}^{L} θ^(l) ),     (2)

where F_ωT^(l)(x) is the support of the l-th classifier for the object x belonging to the target class and θ^(l) is the classification threshold of the l-th classifier for accepting an object as belonging to the target class. For decomposing a binary problem (such as the microarray data considered here) we train a separate OC-RotF per class. Then the class label for a new object is established by the maximum rule over the outputted decision supports [9].
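The combination rule of Eq. (2) is a one-liner; a sketch for a single object:

```python
import numpy as np

def product_combination(supports, thetas):
    """supports: (L,) array of per-classifier target supports F^(l)(x);
    thetas: (L,) array of acceptance thresholds theta^(l)."""
    num = np.prod(supports)
    return num / (num + np.prod(thetas))
```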

4 Experimental investigations

In this section we evaluate the proposed one-class ensemble on the basis of datasets available at¹, whose details are given in Table 1. Four different datasets were used, and an additional, fifth one was generated. It was based on the Breast Cancer dataset: to test the performance of classifiers in difficult scenarios, we affected 25% of its objects with Gaussian noise, thus creating in-class outliers in the data. As the base classifier we used an OCSVM with an RBF kernel [3], and we trained 100 base classifiers. To put the obtained results into context, we tested the performance of the multi-class classifiers used for this task – a single SVM (trained with an RBF kernel and the SMO procedure), a Random Forest (consisting of 100 decision trees) and a Rotation Forest (consisting of 100 decision trees).

¹ http://datam.i2r.a-star.edu.sg/datasets/krbd/


Table 1. Statistics of the datasets used in the experiments.

dataset                   samples (class 1 / class 2)    features
Breast Cancer             78 (34 / 44)                   24481
Breast Cancer - noise     78 (34 / 44)                   24481
Central Nervous System    60 (21 / 39)                   7129
Colon Tumor               62 (22 / 40)                   6500
Lung Cancer               181 (31 / 150)                 12533
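A sketch of how such a noisy variant can be generated (illustrative; the noise scale is our assumption, and the paper's experiments were run in R rather than Python):

```python
import numpy as np

def add_inclass_outliers(X, fraction=0.25, scale=1.0, seed=0):
    """Perturb a random 25% of the objects with Gaussian noise,
    creating in-class outliers as in the Breast Cancer - noise dataset."""
    rng = np.random.RandomState(seed)
    X_noisy = X.copy()
    idx = rng.choice(len(X), int(fraction * len(X)), replace=False)
    X_noisy[idx] += rng.normal(0.0, scale, size=X_noisy[idx].shape)
    return X_noisy
```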

Additionally, we show the performance of a single OCSVM and of OC-RandF. The results are based on leave-one-out cross-validation (LOOCV). All experiments were carried out in the R environment [22], with the classification algorithms taken from dedicated packages, thus ensuring that the results achieved the best possible efficiency and that the performance was not decreased by a bad implementation. The Friedman ranking test [4] with significance level α = 0.05 was used for comparison over multiple benchmark datasets. The results, with respect to sensitivity and specificity, are given in Tab. 2. From the results one may clearly see that in the case of standard microarray datasets the proposed approach returns both specificity and sensitivity similar to those of the state-of-the-art multi-class models. However, in the case of the noisy (dataset no. 2) and imbalanced (datasets no. 4 and no. 5) data, our proposed approach is able to significantly outperform the standard classifiers. This happens due to the nature of OCC models – as they are able to learn the distinct properties of the target class, they are able to cope with in-class difficulties. Ensemble approaches are superior to single-model one-class classifiers. Using decomposition with single OCSVMs did not lead to satisfactory results. This can be explained by the increasing complexity of data description algorithms in high dimensions. Therefore, without using any approach for reducing the number of features used by each one-class classifier, it is impossible to handle microarray data efficiently. Both OC-RandF and OC-RotF offer the possibility of reducing the size of the feature space: OC-RandF achieves this by using a drawn subset of features in each node of the inducted tree, while the proposed OC-RotF retains only a small number of first principal components, thus offering a significant dimensionality reduction. When comparing these two one-class ensembles we may see that our proposal achieves much better performance, especially on noisy and imbalanced data. This is due to the operating modes of these committees. OC-RandF transforms the one-class problem into a binary one by adding artificial counterexamples; then canonical binary trees are trained. This does not output a data description, but a standard dichotomization boundary. Therefore, OC-RandF loses the desirable properties of OCC methods and becomes more similar to binary classifiers. The proposed OC-RotF satisfies the conditions of OCC by working only with objects from one class. PCA is an unsupervised method and is therefore highly suitable for one-class data.


Any type of one-class classifier can be used as a base learner for OC-RotF, thus making it a flexible framework for OCC. For standard classification problems we may retain all principal components, while for high-dimensional problems we reduce their number, as in this paper.

5 Conclusions

In this paper a novel approach for microarray analysis, based on an ensemble of one-class support vector machines, was presented. To deal with the problem of high dimensionality, which may cause difficulties for one-class models, a novel one-class ensemble method based on the Rotation Forest was introduced. Creating base classifiers with the usage of rotated and reduced feature spaces allowed for forming an efficient multiple classifier system that was able to cope with the high-dimensional nature of the data and return performance similar to state-of-the-art multi-class methods. The strong points of the proposed method were revealed when dealing with noisy and imbalanced data. In such cases the proposed One-Class Rotation Forest displayed superior quality over its competitors. The proposed approach may be an attractive tool for bioinformatics decision support systems, in which we deal with uncertain, noisy data or data coming from uneven distributions.

Acknowledgment. This work was partially supported by The Polish National Science Centre under the grant PRELUDIUM number DEC-2013/09/N/ST6/03504 and by EC under FP7, Coordination and Support Action, Grant Agreement Number 316097, ENGINE – European Research Centre of Network Intelligence for Innovation Enhancement (http://engine.pwr.wroc.pl/).

References

1. D. Bariamis, D. Maroulis, and D. K. Iakovidis. Unsupervised SVM-based gridding for DNA microarray images. Computerized Medical Imaging and Graphics, 34(6):418–425, 2010.
2. B. Cyganek. Image segmentation with a hybrid ensemble of one-class support vector machines. Volume 6076 of Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pages 254–261, 2010.
3. B. Cyganek. One-class support vector ensembles for image segmentation and classification. Journal of Mathematical Imaging and Vision, 42(2-3):103–117, 2012.
4. J. Demšar. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7:1–30, 2006.
5. C. Desir, S. Bernard, C. Petitjean, and L. Heutte. One class random forests. Pattern Recognition, 46(12):3490–3506, 2013.
6. G. Finak, N. Bertos, F. Pepin, S. Sadekova, M. Souleimanova, H. Zhao, H. Chen, G. Omeroglu, S. Meterissian, A. Omeroglu, M. Hallett, and M. Park. Stromal gene expression predicts clinical outcome in breast cancer. Nature Medicine, 14(5):518–527, 2008.
7. I. Inza, P. Larrañaga, R. Blanco, and A. J. Cerrolaza. Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine, 31(2):91–103, 2004.
8. B. Krawczyk, M. Woźniak, and B. Cyganek. Clustering-based ensembles for one-class classification. Information Sciences, 264:182–195, 2014.
9. B. Krawczyk, M. Woźniak, and F. Herrera. On the usefulness of one-class classifier ensembles for decomposition of multi-class problems. Pattern Recognition, 2015.
10. P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafé, A. Perez, and V. Robles. Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1):86–112, 2006.
11. K. Liu and D. Huang. Cancer classification using rotation forest. Computers in Biology and Medicine, 38(5):601–610, 2008.
12. C. C. Lynch, A. Hikosaka, H. B. Acuff, M. D. Martin, N. Kawai, R. K. Singh, T. C. Vargo-Gogola, J. L. Begtrup, T. E. Peterson, B. Fingleton, T. Shirai, L. M. Matrisian, and M. Futakuchi. MMP-7 promotes prostate cancer-induced osteolysis via the solubilization of RANKL. Cancer Cell, 7(5):485–496, 2005.
13. K. Moorthy and M. S. Mohamad. Random forest for gene selection and microarray data classification. Volume 295 of Communications in Computer and Information Science, pages 174–183, 2012.
14. K. Noto, C. Brodley, and D. Slonim. FRaC: A feature-modeling approach for semi-supervised and unsupervised anomaly detection. Data Mining and Knowledge Discovery, 25(1):109–133, 2012.
15. M. Ringner, C. Peterson, and J. Khan. Analyzing array data using supervised methods. Pharmacogenomics, 3(3):403–415, 2002.
16. T. Schatton, G. F. Murphy, N. Y. Frank, K. Yamaura, A. M. Waaga-Gasser, M. Gasser, Q. Zhan, S. Jordan, L. M. Duncan, C. Weishaupt, R. C. Fuhlbrigge, T. S. Kupper, M. H. Sayegh, and M. H. Frank. Identification of cells initiating human melanomas. Nature, 451(7176):345–349, 2008.
17. B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press, 2002.
18. V. S. Silveira, C. A. Scrideli, D. A. Moreno, J. A. Yunes, R. G. P. Queiroz, S. C. Toledo, M. L. M. Lee, A. S. Petrilli, S. R. Brandalise, and L. G. Tone. Gene expression pattern contributing to prognostic factors in childhood acute lymphoblastic leukemia. Leukemia and Lymphoma, 54(2):310–314, 2013.
19. D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
20. D. M. J. Tax and R. P. W. Duin. Combining one-class classifiers. In Proceedings of the Second International Workshop on Multiple Classifier Systems, MCS '01, pages 299–308, London, UK, 2001. Springer-Verlag.
21. D. M. J. Tax, P. Juszczak, E. Pekalska, and R. P. W. Duin. Outlier detection using ball descriptions with adjustable metric. In Proceedings of the 2006 Joint IAPR International Conference on Structural, Syntactic, and Statistical Pattern Recognition, SSPR'06/SPR'06, pages 587–595, Berlin, Heidelberg, 2006. Springer-Verlag.
22. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008.
23. A. V. Tinker, A. Boussioutas, and D. D. L. Bowtell. The challenges of gene expression microarrays for the study of human cancer. Cancer Cell, 9(5):333–339, 2006.
24. Y. Wang, Z. Yu, and V. Anh. Fuzzy c-means method with empirical mode decomposition for clustering microarray data. International Journal of Data Mining and Bioinformatics, 7(2):103–117, 2013.
25. M. Woźniak, M. Grana, and E. Corchado. A survey of multiple classifier systems as hybrid systems. Information Fusion, 2013. Article in Press.


Table 2. Recognition sensitivity [%] and specificity [%] for the examined methods. RandF stands for Random Forest, RotF for Rotation Forest, OC-RandF for the one-class Random Forest and OC-RotF for the proposed one-class Rotation Forest. Average ranks of the tested classifiers, according to the Friedman ranking test, are given at the bottom.

Dataset                   SVM            RandF          RotF           OCSVM          OC-RandF       OC-RotF
                          Sens  Spec     Sens  Spec     Sens  Spec     Sens  Spec     Sens  Spec     Sens  Spec
Breast Cancer             90.23 91.46    92.32 93.65    92.32 93.65    87.85 90.07    89.15 91.45    93.28 92.78
Breast Cancer - noise     74.46 83.59    77.36 84.90    80.05 85.72    75.20 82.98    85.15 87.40    89.32 90.18
Central Nervous System    85.60 94.36    88.20 95.90    88.20 95.90    82.95 90.11    85.67 93.15    87.46 93.44
Colon Tumor               78.90 91.25    81.35 94.03    82.70 93.90    80.15 92.36    83.85 94.10    85.02 93.92
Lung Cancer               61.72 93.05    65.89 95.11    67.00 94.85    69.22 92.08    72.98 94.12    75.48 95.60
Avg. rank                 4.85           2.90           2.25           5.50           3.80           1.70

