
Chapter 9

Figure 9.1: Examples of high-dimensional data sets. (a) Text mining: n = 2431 documents and the frequency with which each of d = 9275 unique words occurs in each document (a whiter cell indicates a higher frequency). (b) Curves: n = 115 kneading curves observed at d = 241 equispaced instants of time in the interval [0, 480].

Such a technological revolution has a huge impact on other scientific fields, societal as well as mathematical. In particular, high-dimensional data management brings new challenges to statisticians, since standard (low-dimensional) data analysis methods struggle to apply directly to the new (high-dimensional) data sets. The reason can be twofold, and the two causes are sometimes linked: combinatorial difficulties, or a disastrously large increase in the variance of estimates.

Data analysis methods are essential for providing a synthetic view of data sets, allowing data summary and data exploration, for instance for future decision making. This need is even more acute in the high-dimensional setting since, on the one hand, the large number of variables suggests that a lot of information is conveyed by the data but, on the other hand, such information may be hidden behind their volume.

Cluster analysis is one of the main data analysis methods. It aims at partitioning a data set $x = (x_1, \ldots, x_n)$, composed of $n$ individuals and lying in a space $X$ of dimension $d$, into $K$ groups $G_1, \ldots, G_K$. This partition is denoted by $z = (z_1, \ldots, z_n)$, lying in a space $Z$, where $z_i = (z_{i1}, \ldots, z_{iK})'$ is a vector of $\{0, 1\}^K$ such that $z_{ik} = 1$ if individual $x_i$ belongs to the $k$th group $G_k$, and $z_{ik} = 0$ otherwise ($i = 1, \ldots, n$, $k = 1, \ldots, K$). Figure 9.2 gives an illustration of this principle when $d = 2$.

Model-based clustering allows cluster analysis to be reformulated as a well-posed estimation problem, both for the partition $z$ and for the number $K$ of groups. It considers the data $x_1, \ldots, x_n$ as $n$ i.i.d. realizations of a mixture pdf $f(\cdot; \theta_K) = \sum_{k=1}^{K} \pi_k f(\cdot; \alpha_k)$, where $f(\cdot; \alpha_k)$ indicates the pdf, parameterized by $\alpha_k$, associated with group $k$, where $\pi_k$ indicates the mixture proportion of this component ($\sum_{k=1}^{K} \pi_k = 1$, $\pi_k > 0$), and where $\theta_K = (\pi_k, \alpha_k, k = 1, \ldots, K)$ indicates the whole set of mixture parameters.
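As an illustration of this notation, the following minimal Python sketch evaluates a mixture pdf and simulates pairs (x_i, z_i) in which each z_i is a 0/1 indicator vector of length K. It is only a sketch under assumptions that the text does not make: d = 1, Gaussian components with α_k = (mean, standard deviation), and hypothetical function names and parameter values chosen for the example.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    """Univariate Gaussian density, standing in for the component pdf f(.; alpha_k)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, proportions, alphas):
    """f(x; theta_K) = sum_k pi_k * f(x; alpha_k), with alpha_k = (mean, sd)."""
    return sum(p * gaussian_pdf(x, m, s) for p, (m, s) in zip(proportions, alphas))


def sample_mixture(n, proportions, alphas, rng):
    """Draw n i.i.d. pairs (x_i, z_i); z_i is the 0/1 group indicator vector."""
    K = len(proportions)
    xs, zs = [], []
    for _ in range(n):
        # Pick the group k with probability pi_k, then draw x_i from f(.; alpha_k).
        k = rng.choices(range(K), weights=proportions)[0]
        mean, sd = alphas[k]
        xs.append(rng.gauss(mean, sd))
        zs.append([1 if j == k else 0 for j in range(K)])
    return xs, zs


# Hypothetical parameters theta_K for K = 2 (illustrative values only).
proportions = [0.3, 0.7]           # pi_1, pi_2: positive and summing to 1
alphas = [(-2.0, 1.0), (3.0, 0.5)]  # alpha_k = (mean, sd), an assumption

rng = random.Random(0)
xs, zs = sample_mixture(5, proportions, alphas, rng)
density_at_zero = mixture_pdf(0.0, proportions, alphas)
```

Each row of `zs` contains exactly one 1, which mirrors the constraint that every individual belongs to exactly one group. In a clustering problem the vectors z_i are of course unobserved; they are simulated here only to make the encoding concrete.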
From the whole data set $x$ it is then possible to obtain a mixture parameter estimate $\hat{\theta}_K$ and to deduce a partition estimate $\hat{z}$ from the conditional probability $f(z \mid x; \hat{\theta}_K)$. It is also possible to derive an estimate $\hat{K}$ from an estimate of the marginal

Model Choice and Model Aggregation, F. Bertrand - Editions Techip

For over forty years, choosing a statistical model from data has consisted in optimizing a criterion based on penalized likelihood (H. Aka...
