SOCIÉTÉ FRANÇAISE DE STATISTIQUE

Frédéric Bertrand, Université de Strasbourg
Jean-Jacques Droesbeke, Université Libre de Bruxelles
Gilbert Saporta, Conservatoire National des Arts et Métiers, Paris
Christine Thomas-Agnan, Université Toulouse 1 Capitole

Model choice and model aggregation

2017

Éditions TECHNIP
5 avenue de la République, 75011 PARIS
BY THE SAME PUBLISHER

SFdS books
• Méthodes robustes en statistiques, 2015, J.-J. Droesbeke, G. Saporta, C. Thomas-Agnan, Eds.
• Approches statistiques du risque, 2014, J.-J. Droesbeke, M. Maumy-Bertrand, G. Saporta, C. Thomas-Agnan, Eds.
• Modèles à variables latentes et modèles de mélange, 2013, J.-J. Droesbeke, G. Saporta, C. Thomas-Agnan, Eds.
• Approches non paramétriques en régression, 2011, J.-J. Droesbeke, G. Saporta, Eds.
• Analyse statistique des données longitudinales, 2010, J.-J. Droesbeke, G. Saporta, Eds.
• Analyse statistique des données spatiales, 2006, J.-J. Droesbeke, M. Lejeune, G. Saporta, Eds.
• Modèles statistiques pour des données qualitatives, 2005, J.-J. Droesbeke, M. Lejeune, G. Saporta, Eds.
• Méthodes bayésiennes en statistique, 2002, J.-J. Droesbeke, J. Fine, G. Saporta, Eds.
• Plans d’expériences – application à l’entreprise, 1997, J.-J. Droesbeke, J. Fine, G. Saporta, Eds.
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without the prior written permission of the publisher.
© Éditions Technip, Paris, 2017. ISBN 978-2-7108-1177-0
«On ne fait bien que ce qu’on aime. Ni la science ni la conscience ne modèlent un grand cuisinier.» (We do well only what we like to do. Neither science nor conscience molds a great cook.)

Sidonie-Gabrielle Colette (1873-1954), Prisons et Paradis
Preface

The biennial Workshop in Statistics / Journées d’étude en Statistique (JES) was organized for the 16th time in 2014 by the French Statistical Society / Société Française de Statistique (SFdS). Since the first workshop in 1984, every two years the JES have explored a particular domain of statistics, each time publishing a book intended not only for participants but also for users and teachers interested in the chosen theme. The first subject developed was Time Series Analysis (Droesbeke et al. [1989b]). In 1986 the workshop was devoted to Sampling Methods (Droesbeke et al. [1987]), which provided the first publication of the collection. These workshops were followed by Statistical Analysis of Lifespan in 1988 (Droesbeke et al. [1989a]) and Models for Multidimensional Data Analysis in 1990 (Droesbeke et al. [1992]). These four books were published by Economica. The fifth JES was organized in 1992 on ARCH Models and Applications in Finance (Droesbeke et al. [1994]). In 1994 the theme was Non Parametric Inference, more precisely Rank Statistics (Droesbeke and Fine [1996]). Both books were published by Éditions de l’Université de Bruxelles and Ellipses. In 1996 the theme was Experimental Designs (Droesbeke et al. [1997]). This seventh book of the collection was published by Technip, as were the following ones.

It is useful to note that the first seven JES were organized by the Association for Statistics and its Applications / Association pour la Statistique et ses Utilisations (ASU). This society merged with the Statistical Society of Paris / Société de Statistique de Paris (SSP) to form the French Statistical Society / Société Française de Statistique (SFdS), which has organized the JES since 1997. The first of these meetings (number 8 in the series) was organized in 1998 on the theme Bayesian Methods in Statistics (Droesbeke et al. [2002]), followed in 2000 by Statistical Models for Qualitative Data (Droesbeke et al. [2005]), Statistical Analysis of Spatial Data in 2002 (Droesbeke et al. [2006]), Statistical Analysis of Longitudinal Data in 2004 (Droesbeke and Saporta [2010]), Non Parametric Approaches in Regression in 2006 (Droesbeke and Saporta [2011]), Models with Latent Variables and Mixture Models in 2008 (Droesbeke et al. [2013]), Statistical Approaches of Risk in 2010 (Droesbeke et al. [2014]) and Robust Methods in Statistics in 2012 (Droesbeke et al. [2015]).

This book grew out of the 16th JES, organized in 2014 on the theme Model Choice and Model Aggregation. We would like to thank the lecturers who participated in these JES:
• Christophe Biernacki (Université Lille 1)
• Jean-Michel Marin (Université de Montpellier)
• Pascal Massart (Université de Paris-Sud)
• Cathy Maugis-Rabusseau (INSA de Toulouse)
• Mathilde Mougeot (Université Paris Diderot)
• Nicolas Vayatis (École Normale Supérieure de Cachan)

We also thank Christian Robert (Université Paris Dauphine), co-author of Chapters 4 and 5, and Marie-Laure Martin-Magniette (INRA) and Andrea Rau (INRA), co-authors of Chapter 10. We thank Myriam Maumy-Bertrand, who coordinated contacts with the authors. Finally, we would like to thank all those who helped us at the Conservatoire National des Arts et Métiers of Paris, the Université libre de Bruxelles, the Université de Strasbourg, the Université de Toulouse I and the Villa Clythia at Fréjus.
Frédéric Bertrand, Université de Strasbourg
Jean-Jacques Droesbeke, Université libre de Bruxelles
Gilbert Saporta, Conservatoire National des Arts et Métiers, Paris
Christine Thomas-Agnan, Université de Toulouse I
Contents

1 A MODEL SELECTION TALE (Jean-Jacques Droesbeke, Gilbert Saporta and Christine Thomas-Agnan) . . . 1
  1.1 Introduction . . . 1
  1.2 Elements of the history of words and ideas . . . 1
  1.3 Modeling in astronomy . . . 2
  1.4 Triangulation in geodesy . . . 5
  1.5 The measurement of meridian arcs . . . 7
  1.6 A model selection tale . . . 10
  1.7 A new model appears . . . 14
  1.8 Expeditions for choosing a good model . . . 17
  1.9 The control of errors . . . 18
  1.10 A final example . . . 18
  1.11 Outline of the book . . . 20

2 MODEL’S INTRODUCTION (Pascal Massart) . . . 21
  2.1 Model selection . . . 22
    2.1.1 Empirical risk minimization . . . 23
    2.1.2 The model choice paradigm . . . 26
    2.1.3 Model selection via penalization . . . 27
  2.2 Selection of linear Gaussian models . . . 30
    2.2.1 Examples of Gaussian frameworks . . . 31
    2.2.2 Some model selection problems . . . 33
    2.2.3 The least squares procedure . . . 35
  2.3 Selecting linear models . . . 35
    2.3.1 Mallows’ heuristics . . . 37
    2.3.2 Schwarz’s heuristics . . . 37
    2.3.3 A first model selection theorem for linear models . . . 38
  2.4 Adaptive estimation in the minimax sense . . . 43
    2.4.1 Minimax lower bounds . . . 45
    2.4.2 Adaptive properties of penalized estimators for Gaussian sequences . . . 54
    2.4.3 Adaptation with respect to ellipsoids . . . 55
    2.4.4 Adaptation with respect to arbitrary $\ell_p$-bodies . . . 56
  2.5 Appendix . . . 61
    2.5.1 Functional analysis: from function spaces to sequence spaces . . . 61
    2.5.2 Gaussian processes . . . 63

3 NON LINEAR GAUSSIAN MODEL SELECTION (Pascal Massart) . . . 71
  3.1 A general Theorem . . . 71
  3.2 Selecting ellipsoids and $\ell_2$ regularization . . . 76
    3.2.1 Adaptation over Besov ellipsoids . . . 77
    3.2.2 A first penalization strategy . . . 79
    3.2.3 $\ell_2$ regularization . . . 81
  3.3 $\ell_1$ regularization . . . 84
    3.3.1 Variable selection . . . 85
    3.3.2 Selecting $\ell_1$ balls and the Lasso . . . 86
  3.4 Appendix . . . 87
    3.4.1 Concentration inequalities . . . 87
    3.4.2 Information inequalities . . . 96
    3.4.3 Birgé’s Lemma . . . 98

4 BAYESIAN MODEL CHOICE (Jean-Michel Marin and Christian Robert) . . . 101
  4.1 The Bayesian paradigm . . . 101
    4.1.1 The posterior distribution . . . 101
    4.1.2 Bayesian estimates . . . 104
    4.1.3 Conjugate prior distributions . . . 104
    4.1.4 Noninformative priors . . . 105
    4.1.5 Bayesian credible sets . . . 106
  4.2 Bayesian discrimination between models . . . 107
    4.2.1 The model index as a parameter . . . 107
    4.2.2 The Bayes Factor . . . 109
    4.2.3 The ban on improper priors . . . 110
    4.2.4 The Bayesian Information Criterion . . . 112
    4.2.5 Bayesian Model Averaging . . . 113
  4.3 The case of linear regression models . . . 113
    4.3.1 Conjugate prior . . . 114
    4.3.2 Zellner’s G prior distribution . . . 114
    4.3.3 HPD regions . . . 117
    4.3.4 Calculation of evidences and Bayes factors . . . 117
    4.3.5 Variable Selection . . . 118

5 SOME COMPUTATIONAL ASPECTS OF BAYESIAN MODEL CHOICE (Jean-Michel Marin and Christian Robert) . . . 121
  5.1 Some Monte Carlo strategies to approximate the evidence . . . 121
    5.1.1 The basic Monte Carlo solution . . . 123
    5.1.2 Usual importance sampling approximations . . . 124
    5.1.3 The Harmonic mean approximation . . . 126
    5.1.4 Chib’s method . . . 127
  5.2 The bridge sampling methodology to compare embedded models . . . 127
  5.3 A Monte Carlo Markov Chain method for variable selection . . . 130
    5.3.1 The Gibbs sampler . . . 130
    5.3.2 A Stochastic Search for the Most Likely Model . . . 133

6 RANDOMIZATION AND AGGREGATION FOR PREDICTIVE MODELING WITH CLASSIFICATION DATA (Nicolas Vayatis) . . . 135
  6.1 Motivations . . . 135
  6.2 Randomness, bless our data! . . . 136
    6.2.1 A probabilistic view of classification data . . . 136
    6.2.2 Let the data go: error estimation and model validation . . . 140
  6.3 Power to the masses: aggregation principles . . . 142
    6.3.1 Voting and averaging in binary classification . . . 142
    6.3.2 A lazy way to multi-class classification . . . 143
    6.3.3 Agreement and averaging in the context of scoring . . . 144
    6.3.4 From bipartite ranking to K-partite ranking . . . 148
  6.4 Time for doers: popular aggregation meta-algorithms . . . 150
    6.4.1 Bagging . . . 151
    6.4.2 Boosting . . . 152
    6.4.3 Forests for bipartite ranking and scoring . . . 154
  6.5 Time for thinkers: Theory of aggregated rules . . . 157
    6.5.1 Aggregation of classification rules . . . 157
    6.5.2 Consistency of Forests . . . 158
    6.5.3 From bipartite consistency to K-partite consistency . . . 160

7 MIXTURE MODELS (Christophe Biernacki) . . . 165
  7.1 Mixture models as a many-purpose tool . . . 165
    7.1.1 Starting from applications . . . 165
    7.1.2 The mixture model answer . . . 168
    7.1.3 Classical mixture models . . . 170
    7.1.4 Other models . . . 175
  7.2 Estimation . . . 175
    7.2.1 Overview . . . 175
    7.2.2 Maximum likelihood and variants . . . 176
    7.2.3 Theoretical difficulties related to the likelihood . . . 179
    7.2.4 Estimation algorithms . . . 180
  7.3 Model selection in density estimation . . . 186
    7.3.1 Need to select a model . . . 186
    7.3.2 Frequentist approach and deviance . . . 189
    7.3.3 Bayesian approach and integrated likelihood . . . 194
  7.4 Model selection in (semi-)supervised classification . . . 200
    7.4.1 Need to select a model . . . 200
    7.4.2 Error rates-based criteria . . . 203
    7.4.3 A predictive deviance criterion . . . 205
  7.5 Model selection in clustering . . . 208
    7.5.1 Need to select a model . . . 208
    7.5.2 Partition-based criteria . . . 209
    7.5.3 The Integrated Completed Likelihood criterion . . . 211
  7.6 Experiments on real data sets . . . 217
    7.6.1 BIC: extra-solar planets . . . 218
    7.6.2 AICcond/BIC/AIC/BEC/$\hat{e}_{cv}$: benchmark data sets . . . 219
    7.6.3 AICcond/$\hat{e}_{cv}$: textile data set . . . 221
    7.6.4 BIC: social comparison theory . . . 222
    7.6.5 NEC: marketing data . . . 224
    7.6.6 ICL: prostate cancer data . . . 225
    7.6.7 BIC: density estimation in the steel industry . . . 228
    7.6.8 BIC: partitioning communes of Wallonia . . . 229
    7.6.9 ICLbic/BIC: acoustic emission control . . . 231
    7.6.10 ICLbic/ICL/BIC/ILbayes: a seabird data set . . . 232
  7.7 Future methodological challenges . . . 234

8 CALIBRATION OF PENALTIES (Pascal Massart) . . . 237
  8.1 The concept of minimal penalty . . . 238
    8.1.1 A small number of models . . . 239
    8.1.2 A large number of models . . . 242
  8.2 Data-driven penalties . . . 243
    8.2.1 From theory to practice . . . 243
    8.2.2 The slope heuristics . . . 244

9 HIGH DIMENSIONAL CLUSTERING (Christophe Biernacki and Cathy Maugis-Rabusseau) . . . 247
  9.1 Introduction . . . 247
  9.2 HD clustering: Curse or blessing? . . . 250
    9.2.1 HD density estimation: Curse . . . 250
    9.2.2 HD clustering: A mix of curse and blessing . . . 252
    9.2.3 Intermediate conclusion . . . 254
  9.3 Non-canonical models . . . 256
    9.3.1 Gaussian mixture of factor analysers . . . 256
    9.3.2 HD Gaussian mixture models . . . 257
    9.3.3 Functional data . . . 258
    9.3.4 Intermediate conclusion . . . 262
  9.4 Canonical models . . . 262
    9.4.1 Parsimonious mixture models . . . 263
    9.4.2 Variable selection through regularization . . . 266
    9.4.3 Variable role modelling . . . 270
    9.4.4 Co-clustering . . . 274
    9.4.5 Intermediate conclusion . . . 281
  9.5 Future methodological challenges . . . 282

10 CLUSTERING OF CO-EXPRESSED GENES (Marie-Laure Martin-Magniette, Cathy Maugis-Rabusseau and Andrea Rau) . . . 283
  10.1 Introduction . . . 283
  10.2 Model-based clustering . . . 284
  10.3 Clustering of microarray data . . . 286
    10.3.1 Microarray data . . . 286
    10.3.2 Gaussian mixture models . . . 287
    10.3.3 Application . . . 288
  10.4 Clustering of RNA-seq data . . . 296
    10.4.1 RNA-seq data . . . 296
    10.4.2 Poisson mixture models . . . 297
    10.4.3 Applications . . . 299
  10.5 Conclusion . . . 307

11 FORECASTING THE FRENCH NATIONAL ELECTRICITY CONSUMPTION: FROM SPARSE MODELS TO AGGREGATED FORECASTS (Mathilde Mougeot) . . . 313
  11.1 Functional regression models . . . 315
  11.2 Data mining using sparse approximation of the intra-day load curves . . . 317
    11.2.1 Choice of a generic dictionary . . . 318
    11.2.2 Mining and clustering . . . 319
    11.2.3 Patterns of consumption . . . 320
  11.3 Sparse modeling with adaptive dictionaries . . . 320
  11.4 Forecasting . . . 321
    11.4.1 The experts . . . 322
    11.4.2 Aggregation . . . 322
  11.5 Performances & Software . . . 323
  11.6 Conclusion and perspectives . . . 324
  11.7 Annexes . . . 325

Bibliography . . . 327

Index . . . 353
Chapter 1

A MODEL SELECTION TALE

Jean-Jacques Droesbeke, Gilbert Saporta and Christine Thomas-Agnan
1.1 Introduction
Modeling holds an important place in statistics, as numerous articles, books and encyclopedias bear out. Nevertheless, models simplify reality and, as George Box (1919-2013) liked to say, “Essentially, all models are wrong, but some are useful”. What is certain is that a large number of leaps forward in statistics have rested on the use of models, and in particular on the selection of relevant ones. The same has been true in the history of other scientific subjects. In this chapter¹ we describe a key model selection tale from history whose consequences for statistics turned out to be quite important.
1.2 Elements of the history of words and ideas
We know that from an etymological point of view, the term “model” comes from the Italian modello, which at the beginning of the 16th century meant an “illustration to be reproduced”². This word is itself derived from the Late Latin modellus, and the earlier Latin modulus (diminutive of modus: measure), with the original meaning: “arbitrary measure used to determine the proportional relationships between parts of a work of architecture”³.

¹ This paragraph and the two following it are principally based on the work of Droesbeke and Saporta [2010] as well as Lacombe and Costabel [1988] and Chapter 4 of the doctoral thesis of Armatte [1995].
² The same is true for the French modèle and the German Modell.
³ See Bachelard [1979], p. 15.
From the initial meaning⁴ of the Italian word modello, the term “model” was applied, as early as 1563, to “any simplified representation of a construction or object that is to be reproduced at a larger scale”. At the abstract level, it was applied to something or someone exhibiting to the highest degree the characteristics of a species, category or quality. In the 17th century, the word was used in the vocabulary of the arts with the meaning “someone who poses for a painter or sculptor”, from which then came “real person providing the inspiration for a writer”. It also slipped into the world of fashion before being overtaken by mannequin.

In the scientific world, the idea of a model is strongly linked to that of representing something. The idea of formalizing this representation by a mathematical expression based on the tools of logic sprang from the intellectual eruptions of the 19th century. It also took advantage of the needs of physics to match observations with theory. Nor did the world of economics stay silent on the matter: econometrics, in particular, attached importance to the use of models very early on. This was further amplified in the middle of the 20th century, particularly by U.S. laboratories which received the support of scientists fleeing Nazism. Under the pressure of events related to World War II, military research acted as a catalyst for the development of new disciplines such as game theory and operations research. In particular, it led John von Neumann and Oskar Morgenstern to publish their Theory of Games and Economic Behavior, a book that became an essential reference for many researchers of the time. Sociology, linguistics and many other areas also turned with success to the use of models. But it is from astronomy and geodesy that we pluck the fascinating example that will hold our attention from here on.
1.3 Modeling in astronomy
Astronomy appears to have been the first discipline to build a parametric model and fit it to data. The idea of using a model originated in the questions astronomers studied about the movements of celestial bodies in the sky and in the need to understand them by defining systems. Take the example of the theory that the sun revolves around the Earth. This model prevailed in ancient times before the other way around was proposed. It is not surprising that the simplest model imagined at the time assumed that the sun followed a circular orbit around the Earth at a constant speed. Another way of saying the same thing was to say that the angular position of the sun was a linear function of time, a proposal that ensured Hipparchus’ fame⁵.

⁴ The various meanings of the word model are taken from Rey et al. [1992].
⁵ Born in Nicaea in the 2nd century BC, Hipparchus is often described as the greatest astronomer of antiquity.
This model was subsequently taken up by Ptolemy, a Greek astronomer of the 2nd century AD, in his geocentric system that placed the Earth at the center of the universe for almost fourteen more centuries. This astronomer made his observations in Alexandria between 127 and 141 AD and proposed his system in his great Mathematical Syntaxis, which was later transmitted to the West in the 9th century by the Arabs as the Almagest⁶. Ptolemy’s Almagest influenced the scientific world until the early 16th century, when a Polish astronomer named Copernicus realized the shortcomings of the Ptolemaic system, leading him to propose a new theory of planetary motion, passing from geocentric to heliocentric. The year 1543 was a pivotal one in the story of Copernicus: it saw the publication of his De revolutionibus orbium coelestium but unfortunately also his death. His work, as we know, opened the door to the later work of Galileo and Newton.
Figure 1.1: Ptolemy
Figure 1.2: Copernicus
To understand the reasons for the slow transition from geocentric to heliocentric theory, one can turn to theological and philosophical arguments, but we must also realize that from antiquity to the Middle Ages astronomical measuring instruments were few: the quadrant reigned supreme, whether static or mobile, along with the astrolabe. Before the Renaissance, the measurement precision of these instruments was at most half a degree! We can understand the difficulty of using the observed data to validate a model, especially if it upset traditional thinking. We would have to wait for better tools in order to progress, which leads us to the story of Tycho Brahe.

Tycho Brahe (1546-1601) is a character who interests us for many reasons. Of Danish nationality, he came from some of the oldest noble stock of the kingdom. Educated in Leipzig “to receive the lightest of education which, in

⁶ This work refers to the work of Hipparchus; it also contains a catalog of 1028 star positions and a detailed development of rectilinear and spherical trigonometry.
Chapter 3

NON LINEAR GAUSSIAN MODEL SELECTION

Pascal Massart

As in Section 2.2, we consider the generalized linear Gaussian model. This means that, given some separable Hilbert space $\mathbb{H}$, one observes
$$Y_\varepsilon(g) = \langle f, g \rangle + \varepsilon W(g), \quad \text{for all } g \in \mathbb{H},$$
where $W$ is some isonormal process. Our purpose is to state and prove a fairly general model selection theorem that will allow us first to validate the oracle-type inequality stated in Section 2.2. Secondly, the possibility of selecting convex bodies will allow us to derive from this general theorem some risk bounds for regularization.
3.1 A general Theorem
Our purpose is to propose a model selection procedure among a collection of possibly nonlinear models. This procedure is based on a penalized least squares criterion which involves a penalty depending on some extended notion of dimension that allows us to deal with nonlinear models.

Theorem 3.1. Let $\{S_m\}_{m \in \mathcal{M}}$ be some finite or countable collection of closed convex subsets of $\mathbb{H}$. We assume that for any $m \in \mathcal{M}$, there exists some a.s. continuous version $W$ of the isonormal process on $S_m$. Assume furthermore the existence of some positive and nondecreasing continuous function $\varphi_m$ defined on $(0, +\infty)$ such that $\varphi_m(x)/x$ is nonincreasing and
$$2\,\mathbb{E}\left[\sup_{g \in S_m} \frac{W(g) - W(h)}{\|g - h\|^2 + x^2}\right] \le x^{-2} \varphi_m(x) \tag{3.1}$$
for any positive $x$ and any point $h$ in $S_m$. Let us define $D_m > 0$ such that
$$\varphi_m\!\left(\varepsilon \sqrt{D_m}\right) = \varepsilon D_m \tag{3.2}$$
and consider some family of weights $\{x_m\}_{m \in \mathcal{M}}$ such that
$$\sum_{m \in \mathcal{M}} e^{-x_m} = \Sigma < \infty.$$
Let $K$ be some constant with $K > 1$ and take
$$\operatorname{pen}(m) \ge K \varepsilon^2 \left(\sqrt{D_m} + \sqrt{2 x_m}\right)^2. \tag{3.3}$$
We set for all $g \in \mathbb{H}$, $L_\varepsilon(g) = \|g\|^2 - 2 Y_\varepsilon(g)$, and consider some collection of $\rho$-LSEs $\{\hat{f}_m\}_{m \in \mathcal{M}}$, i.e., for any $m \in \mathcal{M}$,
$$L_\varepsilon(\hat{f}_m) \le L_\varepsilon(g) + \rho, \quad \text{for all } g \in S_m.$$
Defining a penalized $\rho$-LSE as $\tilde{f} = \hat{f}_{\hat{m}}$, where $\hat{m}$ minimizes $L_\varepsilon(\hat{f}_m) + \operatorname{pen}(m)$ over $m \in \mathcal{M}$, the following risk bound holds for all $f \in \mathbb{H}$:
$$\mathbb{E}_f\left[\|\tilde{f} - f\|^2\right] \le C(K) \left[\inf_{m \in \mathcal{M}} \left(d^2(f, S_m) + \operatorname{pen}(m)\right) + \varepsilon^2 (\Sigma + 1) + \rho\right]. \tag{3.4}$$
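To fix ideas, here is a minimal numerical sketch of the selection rule behind Theorem 3.1, in the Gaussian sequence framework of Section 2.2, taking $S_D$ to be the linear span of the first $D$ coordinates (for such a linear model the extended dimension can be taken equal to $D$). The toy signal, the noise level $\varepsilon$, the constant $K$ and the weights $x_D = 2\log(D+1)$ are illustrative choices, not values from the text.

```python
import numpy as np

# Sketch of penalized least squares selection (Theorem 3.1) over the
# nested models S_D = span(e_1, ..., e_D) in a Gaussian sequence model.
rng = np.random.default_rng(0)
n_max, eps, K = 200, 0.1, 1.1                 # illustrative constants, K > 1
f = 1.0 / (1.0 + np.arange(n_max)) ** 1.5     # toy true coefficients of f
y = f + eps * rng.standard_normal(n_max)      # observed noisy coefficients

best_D, best_crit = 1, np.inf
for D in range(1, n_max + 1):
    # The LSE on S_D keeps the first D observed coordinates, so
    # L_eps(fhat_D) = ||fhat_D||^2 - 2<y, fhat_D> = -sum(y[:D]**2).
    crit = -np.sum(y[:D] ** 2)
    x_D = 2.0 * np.log(D + 1.0)               # weights: sum_D exp(-x_D) < infinity
    pen = K * eps**2 * (np.sqrt(D) + np.sqrt(2.0 * x_D)) ** 2   # cf. (3.3)
    if crit + pen < best_crit:
        best_crit, best_D = crit + pen, D

f_tilde = np.where(np.arange(n_max) < best_D, y, 0.0)  # penalized LSE
print("selected dimension:", best_D)
```

The selected dimension balances the decrease of the empirical criterion against the penalty, which is exactly the trade-off quantified by the oracle inequality (3.4) in this toy setting.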
Proof. We first recall that for every $m \in \mathcal{M}$ and any point $f \in \mathbb{H}$, the projection $f_m$ of $f$ onto the closed and convex model $S_m$ satisfies the following properties:
$$\|f - f_m\| = d(f, S_m) \tag{3.5}$$
$$\|f_m - g\| \le \|f - g\|, \quad \text{for all } g \in S_m. \tag{3.6}$$
The first property is just the definition of the projection point $f_m$ and the second one is merely the contraction property of the projection onto a closed convex set in a Hilbert space. Let us assume for the sake of simplicity that $\rho = 0$. We now fix some $m \in \mathcal{M}$ and define
$$\mathcal{M}' = \left\{m' \in \mathcal{M} : L_\varepsilon(\hat{f}_{m'}) + \operatorname{pen}(m') \le L_\varepsilon(\hat{f}_m) + \operatorname{pen}(m)\right\}.$$
By definition, for every $m' \in \mathcal{M}'$,
$$L_\varepsilon(\hat{f}_{m'}) + \operatorname{pen}(m') \le L_\varepsilon(\hat{f}_m) + \operatorname{pen}(m) \le L_\varepsilon(f_m) + \operatorname{pen}(m).$$
Let us now assume that the target $f$ belongs to the model $S_m$ (we shall relax this assumption afterwards), which of course means that $f_m = f$. Noticing that
$$L_\varepsilon(g) = \|g - f\|^2 - \|f\|^2 - 2\varepsilon W(g),$$
the preceding inequality becomes
$$\|\hat{f}_{m'} - f\|^2 \le 2\varepsilon \left[W(\hat{f}_{m'}) - W(f)\right] - \operatorname{pen}(m') + \operatorname{pen}(m). \tag{3.7}$$
For any $m' \in \mathcal{M}$, we consider some positive number $y_{m'}$ to be chosen later, define for any $g \in S_{m'}$
$$2 w_{m'}(g) = \|f - g\|^2 + y_{m'}^2$$
and finally set
$$V_{m'} = \sup_{g \in S_{m'}} \frac{W(g) - W(f)}{w_{m'}(g)}.$$
Taking these definitions into account, we get from (3.7)
$$\|\hat{f}_{m'} - f\|^2 \le 2\varepsilon\, w_{m'}(\hat{f}_{m'})\, V_{m'} - \operatorname{pen}(m') + \operatorname{pen}(m) \tag{3.8}$$
for every $m' \in \mathcal{M}'$. We now control the variables $V_{m'}$ for all possible values of $m'$ in $\mathcal{M}$. To do this we use the concentration inequality for the suprema of Gaussian processes (i.e. Proposition 3.4), which ensures that, given $z > 0$, for any $m' \in \mathcal{M}$,
$$\mathbb{P}\left[V_{m'} \ge \mathbb{E}[V_{m'}] + \sqrt{2 v_{m'} (x_{m'} + z)}\right] \le e^{-x_{m'}} e^{-z} \tag{3.9}$$
where
$$v_{m'} = \sup_{g \in S_{m'}} \operatorname{Var}\left[\frac{W(g) - W(f)}{w_{m'}(g)}\right] = \sup_{g \in S_{m'}} \frac{\|g - f\|^2}{w_{m'}^2(g)}.$$
Since $w_{m'}(g) \ge \|g - f\|\, y_{m'}$ (because $a^2 + b^2 \ge 2ab$), we have $v_{m'} \le y_{m'}^{-2}$ and therefore, summing up inequalities (3.9) over $m' \in \mathcal{M}$, we derive that, on some event $\Omega_z$ with probability larger than $1 - \Sigma e^{-z}$, for all $m' \in \mathcal{M}$,
$$V_{m'} \le \mathbb{E}[V_{m'}] + y_{m'}^{-1} \sqrt{2(x_{m'} + z)}. \tag{3.10}$$
We now use assumption (3.1) to bound $\mathbb{E}[V_{m'}]$. Indeed
$$\mathbb{E}[V_{m'}] \le \mathbb{E}\left[\sup_{g \in S_{m'}} \frac{W(g) - W(f_{m'})}{w_{m'}(g)}\right] + \mathbb{E}\left[\frac{\left(W(f_{m'}) - W(f)\right)_+}{\inf_{g \in S_{m'}} w_{m'}(g)}\right] \tag{3.11}$$
and since by (3.6), $2 w_{m'}(g) \ge \|g - f_{m'}\|^2 + y_{m'}^2$ for all $g \in S_{m'}$, we derive from (3.1) with $h = f_{m'}$ and the monotonicity assumption on $\varphi_{m'}$ that
$$\mathbb{E}\left[\sup_{g \in S_{m'}} \frac{W(g) - W(f_{m'})}{w_{m'}(g)}\right] \le y_{m'}^{-2}\, \varphi_{m'}(y_{m'}) \le y_{m'}^{-1}\, \varphi_{m'}\!\left(\varepsilon \sqrt{D_{m'}}\right) \varepsilon^{-1} D_{m'}^{-1/2}
Chapter 9

HIGH DIMENSIONAL CLUSTERING

Christophe Biernacki and Cathy Maugis-Rabusseau
9.1 Introduction
High-dimensional (HD) data sets are now frequent, mostly for technological reasons: automated variable acquisition, cheaper data storage and more powerful standard computers for quick data management. All fields are affected by this general inflation in the number of variables, only the definition of “high” being domain dependent. In marketing, this number can be of order 10², in microarray gene expression between 10² and 10⁴, in text mining 10³ or more, and of order 10⁶ for single nucleotide polymorphism (SNP) data, etc. Note also that sometimes many more variables can be involved, as is typically the case with discretized curves, for instance curves coming from temporal sequences.

Here are two related illustrations. Figure 9.1(a) displays a text mining example¹. It mixes Medline (1033 medical abstracts) and Cranfield (1398 aeronautical abstracts), making a total of 2431 documents. All the words (excluding stop words) are considered as features, making a total of 9275 unique words. The data matrix consists of documents on the rows and words on the columns, each entry giving the term frequency, that is, the number of occurrences of the corresponding word in the corresponding document. Figure 9.1(b) displays a curve example. This kneading data set comes from the Danone Vitapole Paris Research Center and concerns the quality of cookies and its relationship with the flour kneading process (Lévéder et al. [2004]). It is composed of 115 different flours for which the dough resistance is measured during the kneading process for 480 seconds. Note that the number of equispaced instants of time in the interval [0; 480] (here 241 measures) could be much larger than 241 if measures were recorded more frequently.

¹ This data set is publicly available at ftp://ftp.cs.cornell.edu/pub/smart.
Figure 9.1: Examples of high-dimensional data sets: (a) text mining: n = 2431 documents and the frequency with which each of d = 9275 unique words occurs in each document (a whiter cell indicates a higher frequency); (b) curves: n = 115 kneading curves observed at d = 241 equispaced instants of time in the interval [0; 480].
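As an aside, a documents-by-words term-frequency matrix of the kind just described is easy to build; the sketch below uses scikit-learn's CountVectorizer on three invented toy documents (placeholders, not the Medline/Cranfield abstracts themselves).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented toy documents standing in for abstracts.
docs = ["the wing lift of the aircraft",
        "gene expression in the cell",
        "the cell membrane and the gene"]

vectorizer = CountVectorizer(stop_words="english")  # exclude stop words, as above
tf = vectorizer.fit_transform(docs)                 # rows = documents, columns = words

print(vectorizer.get_feature_names_out())           # the unique words kept
print(tf.toarray())                                 # each entry is a term frequency
```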
Such a technological revolution has a huge impact on other scientific fields, societal as well as mathematical ones. In particular, high-dimensional data management brings new challenges to statisticians, since standard (low-dimensional) data analysis methods do not directly carry over to the new (high-dimensional) data sets. The reason can be twofold, and the two aspects are sometimes linked: combinatorial difficulties on the one hand, and a disastrously large increase in estimate variance on the other. Data analysis methods are essential for providing a synthetic view of data sets, allowing data summary and data exploration, for instance for future decision making. This need is even more acute in the high-dimensional setting: on the one hand, the large number of variables suggests that a lot of information is conveyed by the data but, on the other hand, such information may be hidden behind its volume.

Cluster analysis is one of the main data analysis methods. It aims at partitioning a data set $\mathbf{x} = (x_1, \ldots, x_n)$, composed of $n$ individuals lying in a space $\mathcal{X}$ of dimension $d$, into $K$ groups $G_1, \ldots, G_K$. This partition is denoted by $\mathbf{z} = (z_1, \ldots, z_n)$, lying in a space $\mathcal{Z}$, where $z_i = (z_{i1}, \ldots, z_{iK})'$ is a vector of $\{0,1\}^K$ such that $z_{ik} = 1$ if individual $x_i$ belongs to the $k$th group $G_k$, and $z_{ik} = 0$ otherwise ($i = 1, \ldots, n$, $k = 1, \ldots, K$). Figure 9.2 illustrates this principle when $d = 2$. Model-based clustering allows us to reformulate cluster analysis as a well-posed estimation problem, both for the partition $\mathbf{z}$ and for the number $K$ of groups. It considers the data $x_1, \ldots, x_n$ as $n$ i.i.d. realizations of a mixture pdf
$$f(\cdot; \theta_K) = \sum_{k=1}^{K} \pi_k f(\cdot; \alpha_k),$$
where $f(\cdot; \alpha_k)$ denotes the pdf, parameterized by $\alpha_k$, associated to the group $k$, where $\pi_k$ denotes the mixture proportion of this component ($\sum_{k=1}^{K} \pi_k = 1$, $\pi_k > 0$) and where $\theta_K = (\pi_k, \alpha_k,\ k = 1, \ldots, K)$ denotes the whole mixture parameter. From the whole data set $\mathbf{x}$ it is then possible to obtain a mixture parameter estimate $\hat{\theta}_K$ and to deduce a partition estimate $\hat{\mathbf{z}}$ from the conditional probability $f(\mathbf{z} \mid \mathbf{x}; \hat{\theta}_K)$. It is also possible to derive an estimate $\hat{K}$ from an estimate of the marginal probability $\hat{f}(\mathbf{x} \mid K)$. More details on mixture models and the related estimation of $\theta_K$, $\mathbf{z}$ and $K$ are given throughout Chapter 7.
Figure 9.2: The clustering purpose illustrated in the two-dimensional setting: the data $\mathbf{x} = (x_1, \ldots, x_n)$ (left) and the estimated partition $\hat{\mathbf{z}} = (\hat{z}_1, \ldots, \hat{z}_n)$ with $\hat{K} = 3$ (right).
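To make the estimation pipeline just described concrete, here is a minimal sketch using scikit-learn's GaussianMixture on simulated two-dimensional data echoing Figure 9.2. The simulated data, the candidate range for $K$ and the use of BIC (one classical estimate related to the marginal probability $\hat{f}(\mathbf{x} \mid K)$, cf. Chapter 7) are illustrative choices, not prescriptions from the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Simulated two-dimensional data with three groups (illustrative).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
x = np.vstack([c + rng.standard_normal((50, 2)) for c in centers])

# Estimate theta_K for each candidate K, then pick K_hat by BIC,
# a classical approximation to the log marginal probability of x given K.
models = {K: GaussianMixture(n_components=K, random_state=0).fit(x)
          for K in range(1, 7)}
K_hat = min(models, key=lambda K: models[K].bic(x))

# Deduce the partition z_hat from the conditional probabilities
# f(z | x; theta_hat): assign each point to its most probable component.
z_hat = models[K_hat].predict(x)
print("K_hat =", K_hat)
```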
Beyond the nice mathematical background it provides, model-based clustering has also led to numerous and significant practical successes in the “low-dimensional” setting, as Chapter 7 relates, with references therein. Extending the general framework of model-based clustering to the “high-dimensional” setting is thus a natural and desirable purpose. In principle, the more information we have about each individual, the better a clustering method is expected to perform. However, the structure of interest may often be contained in a subset of the available variables, and many variables may be useless or even harmful for detecting a reasonable clustering structure. It is thus important to select the relevant variables from the cluster analysis viewpoint. This is a recent research topic, in contrast to variable selection in regression and classification models (Kohavi and John [1997]; Guyon and Elisseeff [2003]; Miller [1990]). This new interest in variable selection for clustering comes from the increasingly frequent use of these methods on high-dimensional data sets, such as transcriptome data sets.

Three types of approaches dealing with variable selection in clustering have been proposed. The first one includes clustering methods with weighted variables (see for instance Friedman and Meulman [2004]) and dimension reduction methods. For the latter, McLachlan et al. [2002] use a mixture of factor analyzers to reduce the extremely high dimensionality of a gene expression problem, while a suitable Gaussian mixture family is considered in Bouveyron et al. [2007] to handle dimension reduction and data clustering simultaneously. In contrast to this first type, the last two approaches select relevant variables explicitly. The so-called “filter” approaches select the variables before a clustering analysis (see for instance Dash et al. [2002]; Jouve and Nicoloyannis [2005]). Their main weakness is the influence of this independent selection step on the clustering results. In contrast, the so-called “wrapper” approaches combine