WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS
Editors: Noel Cressie, Garrett Fitzmaurice, David Balding, Geert Molenberghs, Geof Givens, Harvey Goldstein, David Scott, Adrian Smith, Ruey Tsay.
Sampling and Estimation from Finite Populations
Yves Tillé
Université de Neuchâtel
Switzerland
Most of this book has been translated from French by Ilya Hekimi
Original French title: Théorie des sondages: Échantillonnage et estimation en populations finies
This edition first published 2020
© 2020 John Wiley & Sons Ltd
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Yves Tillé to be identified as the author of this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
9600 Garsington Road, Oxford, OX4 2DQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Tillé, Yves, author. | Hekimi, Ilya, translator.
Title: Sampling and estimation from finite populations / Yves Tillé; most of this book has been translated from French by Ilya Hekimi.
Other titles: Théorie des sondages. English
Description: Hoboken, NJ: Wiley, [2020] | Series: Wiley series in probability and statistics applied. Probability and statistics section | Translation of: Théorie des sondages: échantillonnage et estimation en populations finies. | Includes bibliographical references and index.
Identifiers: LCCN 2019048451 | ISBN 9780470682050 (hardback) | ISBN 9781119071266 (adobe pdf) | ISBN 9781119071273 (epub)
Subjects: LCSH: Sampling (Statistics) | Public opinion polls–Statistical methods. | Estimation theory.
Classification: LCC QA276.6 .T62813 2020 | DDC 519.5/2–dc23
LC record available at https://lccn.loc.gov/2019048451
Cover Design: Wiley
Cover Image: © gremlin/Getty Images
Set in 10/12pt WarnockPro by SPi Global, Chennai, India
Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY
Contents
List of Figures
List of Tables
List of Algorithms
Preface
Preface to the First French Edition
Table of Notations
1 A History of Ideas in Survey Sampling Theory
1.1 Introduction
1.2 Enumerative Statistics During the 19th Century
1.3 Controversy on the Use of Partial Data
1.4 Development of a Survey Sampling Theory
1.5 The US Elections of 1936
1.6 The Statistical Theory of Survey Sampling
1.7 Modeling the Population
1.8 Attempt to a Synthesis
1.9 Auxiliary Information
1.10 Recent References and Development
2 Population, Sample, and Estimation
2.1 Population
2.2 Sample
2.3 Inclusion Probabilities
2.4 Parameter Estimation
2.5 Estimation of a Total
2.6 Estimation of a Mean
2.7 Variance of the Total Estimator
2.8 Sampling with Replacement
Exercises
3 Simple and Systematic Designs
3.1 Simple Random Sampling without Replacement with Fixed Sample Size
3.1.1 Sampling Design and Inclusion Probabilities
3.1.2 The Expansion Estimator and its Variance
3.1.3 Comment on the Variance–Covariance Matrix
3.2 Bernoulli Sampling
3.2.1 Sampling Design and Inclusion Probabilities
3.2.2 Estimation
3.3 Simple Random Sampling with Replacement
3.4 Comparison of the Designs with and Without Replacement
3.5 Sampling with Replacement and Retaining Distinct Units
3.5.1 Sample Size and Sampling Design
3.5.2 Inclusion Probabilities and Estimation
3.5.3 Comparison of the Estimators
3.6 Inverse Sampling with Replacement
3.7 Estimation of Other Functions of Interest
3.7.1 Estimation of a Count or a Proportion
3.7.2 Estimation of a Ratio
3.8 Determination of the Sample Size
3.9 Implementation of Simple Random Sampling Designs
3.9.1 Objectives and Principles
3.9.2 Bernoulli Sampling
3.9.3 Successive Drawing of the Units
3.9.4 Random Sorting Method
3.9.5 Selection–Rejection Method
3.9.6 The Reservoir Method
3.9.7 Implementation of Simple Random Sampling with Replacement
3.10 Systematic Sampling with Equal Probabilities
3.11 Entropy for Simple and Systematic Designs
3.11.1 Bernoulli Designs and Entropy
3.11.2 Entropy and Simple Random Sampling
3.11.3 General Remarks
Exercises
4 Stratification
4.1 Population and Strata
4.2 Sample, Inclusion Probabilities, and Estimation
4.3 Simple Stratified Designs
4.4 Stratified Design with Proportional Allocation
4.5 Optimal Stratified Design for the Total
4.6 Notes About Optimality in Stratification
4.7 Power Allocation
4.8 Optimality and Cost
4.9 Smallest Sample Size
4.10 Construction of the Strata
4.10.1 General Comments
4.10.2 Dividing a Quantitative Variable in Strata
4.11 Stratification Under Many Objectives
Exercises
5 Sampling with Unequal Probabilities
5.1 Auxiliary Variables and Inclusion Probabilities
5.2 Calculation of the Inclusion Probabilities
5.3 General Remarks
5.4 Sampling with Replacement with Unequal Inclusion Probabilities
5.5 Nonvalidity of the Generalization of the Successive Drawing without Replacement
5.6 Systematic Sampling with Unequal Probabilities
5.7 Deville’s Systematic Sampling
5.8 Poisson Sampling
5.9 Maximum Entropy Design
5.10 Rao–Sampford Rejective Procedure
5.11 Order Sampling
5.12 Splitting Method
5.12.1 General Principles
5.12.2 Minimum Support Design
5.12.3 Decomposition into Simple Random Sampling Designs
5.12.4 Pivotal Method
5.12.5 Brewer Method
5.13 Choice of Method
5.14 Variance Approximation
5.15 Variance Estimation
Exercises
6 Balanced Sampling
6.1 Introduction
6.2 Balanced Sampling: Definition
6.3 Balanced Sampling and Linear Programming
6.4 Balanced Sampling by Systematic Sampling
6.5 Method of Deville, Grosbras, and Roth
6.6 Cube Method
6.6.1 Representation of a Sampling Design in the Form of a Cube
6.6.2 Constraint Subspace
6.6.3 Representation of the Rounding Problem
6.6.4 Principle of the Cube Method
6.6.5 The Flight Phase
6.6.6 Landing Phase by Linear Programming
6.6.7 Choice of the Cost Function
6.6.8 Landing Phase by Relaxing Variables
6.6.9 Quality of Balancing
6.6.10 An Example
6.7 Variance Approximation
6.8 Variance Estimation
6.9 Special Cases of Balanced Sampling
6.10 Practical Aspects of Balanced Sampling
Exercise
7 Cluster and Two-stage Sampling
7.1 Cluster Sampling
7.1.1 Notation and Definitions
7.1.2 Cluster Sampling with Equal Probabilities
7.1.3 Sampling Proportional to Size
7.2 Two-stage Sampling
7.2.1 Population, Primary, and Secondary Units
7.2.2 The Expansion Estimator and its Variance
7.2.3 Sampling with Equal Probability
7.2.4 Self-weighting Two-stage Design
7.3 Multi-stage Designs
7.4 Selecting Primary Units with Replacement
7.5 Two-phase Designs
7.5.1 Design and Estimation
7.5.2 Variance and Variance Estimation
7.6 Intersection of Two Independent Samples
Exercises
8 Other Topics on Sampling
8.1 Spatial Sampling
8.1.1 The Problem
8.1.2 Generalized Random Tessellation Stratified Sampling
8.1.3 Using the Traveling Salesman Method
8.1.4 The Local Pivotal Method
8.1.5 The Local Cube Method
8.1.6 Measures of Spread
8.2 Coordination in Repeated Surveys
8.2.1 The Problem
8.2.2 Population, Sample, and Sample Design
8.2.3 Sample Coordination and Response Burden
8.2.4 Poisson Method with Permanent Random Numbers
8.2.5 Kish and Scott Method for Stratified Samples
8.2.6 The Cotton and Hesse Method
8.2.7 The Rivière Method
8.2.8 The Netherlands Method
8.2.9 The Swiss Method
8.2.10 Coordinating Unequal Probability Designs with Fixed Size
8.2.11 Remarks
8.3 Multiple Survey Frames
8.3.1 Introduction
8.3.2 Calculating Inclusion Probabilities
8.3.3 Using Inclusion Probability Sums
8.3.4 Using a Multiplicity Variable
8.3.5 Using a Weighted Multiplicity Variable
8.3.6 Remarks
8.4 Indirect Sampling
8.4.1 Introduction
8.4.2 Adaptive Sampling
8.4.3 Snowball Sampling
8.4.4 Indirect Sampling
8.4.5 The Generalized Weight Sharing Method
8.5 Capture–Recapture
9 Estimation with a Quantitative Auxiliary Variable
9.1 The Problem
9.2 Ratio Estimator
9.2.1 Motivation and Definition
9.2.2 Approximate Bias of the Ratio Estimator
9.2.3 Approximate Variance of the Ratio Estimator
9.2.4 Bias Ratio
9.2.5 Ratio and Stratified Designs
9.3 The Difference Estimator
9.4 Estimation by Regression
9.5 The Optimal Regression Estimator
9.6 Discussion of the Three Estimation Methods
Exercises
10 Post-Stratification and Calibration on Marginal Totals
10.1 Introduction
10.2 Post-Stratification
10.2.1 Notation and Definitions
10.2.2 Post-Stratified Estimator
10.3 The Post-Stratified Estimator in Simple Designs
10.3.1 Estimator
10.3.2 Conditioning in a Simple Design
10.3.3 Properties of the Estimator in a Simple Design
10.4 Estimation by Calibration on Marginal Totals
10.4.1 The Problem
10.4.2 Calibration on Marginal Totals
10.4.3 Calibration and Kullback–Leibler Divergence
10.4.4 Raking Ratio Estimation
10.5 Example
Exercises
11 Multiple Regression Estimation
11.1 Introduction
11.2 Multiple Regression Estimator
11.3 Alternative Forms of the Estimator
11.3.1 Homogeneous Linear Estimator
11.3.2 Projective Form
11.3.3 Cosmetic Form
11.4 Calibration of the Multiple Regression Estimator
11.5 Variance of the Multiple Regression Estimator
11.6 Choice of Weights
11.7 Special Cases
11.7.1 Ratio Estimator
11.7.2 Post-stratified Estimator
11.7.3 Regression Estimation with a Single Explanatory Variable
11.7.4 Optimal Regression Estimator
11.7.5 Conditional Estimation
11.8 Extension to Regression Estimation
Exercise
12 Calibration Estimation
12.1 Calibrated Methods
12.2 Distances and Calibration Functions
12.2.1 The Linear Method
12.2.2 The Raking Ratio Method
12.2.3 Pseudo Empirical Likelihood
12.2.4 Reverse Information
12.2.5 The Truncated Linear Method
12.2.6 General Pseudo-Distance
12.2.7 The Logistic Method
12.2.8 Deville Calibration Function
12.2.9 Roy and Vanheuverzwyn Method
12.3 Solving Calibration Equations
12.3.1 Solving by Newton’s Method
12.3.2 Bound Management
12.3.3 Improper Calibration Functions
12.3.4 Existence of a Solution
12.4 Calibrating on Households and Individuals
12.5 Generalized Calibration
12.5.1 Calibration Equations
12.5.2 Linear Calibration Functions
12.6 Calibration in Practice
12.7 An Example
Exercises
13 Model-Based Approach
13.1 Model Approach
13.2 The Model
13.3 Homoscedastic Constant Model
13.4 Heteroscedastic Model 1 Without Intercept
13.5 Heteroscedastic Model 2 Without Intercept
13.6 Univariate Homoscedastic Linear Model
13.7 Stratified Population
13.8 Simplified Versions of the Optimal Estimator
13.9 Completed Heteroscedasticity Model
13.10 Discussion
13.11 An Approach that is Both Model- and Design-based
14 Estimation of Complex Parameters
14.1 Estimation of a Function of Totals
14.2 Variance Estimation
14.3 Covariance Estimation
14.4 Implicit Function Estimation
14.5 Cumulative Distribution Function and Quantiles
14.5.1 Cumulative Distribution Function Estimation
14.5.2 Quantile Estimation: Method 1
14.5.3 Quantile Estimation: Method 2
14.5.4 Quantile Estimation: Method 3
14.5.5 Quantile Estimation: Method 4
14.6 Cumulative Income, Lorenz Curve, and Quintile Share Ratio
14.6.1 Cumulative Income Estimation
14.6.2 Lorenz Curve Estimation
14.6.3 Quintile Share Ratio Estimation
14.7 Gini Index
14.8 An Example
15 Variance Estimation by Linearization
15.1 Introduction
15.2 Orders of Magnitude in Probability
15.3 Asymptotic Hypotheses
15.3.1 Linearizing a Function of Totals
15.3.2 Variance Estimation
15.4 Linearization of Functions of Interest
15.4.1 Linearization of a Ratio
15.4.2 Linearization of a Ratio Estimator
15.4.3 Linearization of a Geometric Mean
15.4.4 Linearization of a Variance
15.4.5 Linearization of a Covariance
15.4.6 Linearization of a Vector of Regression Coefficients
15.5 Linearization by Steps
15.5.1 Decomposition of Linearization by Steps
15.5.2 Linearization of a Regression Coefficient
15.5.3 Linearization of a Univariate Regression Estimator
15.5.4 Linearization of a Multiple Regression Estimator
15.6 Linearization of an Implicit Function of Interest
15.6.1 Estimating Equation and Implicit Function of Interest
15.6.2 Linearization of a Logistic Regression Coefficient
15.6.3 Linearization of a Calibration Equation Parameter
15.6.4 Linearization of a Calibrated Estimator
15.7 Influence Function Approach
15.7.1 Function of Interest, Functional
15.7.2 Definition
15.7.3 Linearization of a Total
15.7.4 Linearization of a Function of Totals
15.7.5 Linearization of Sums and Products
15.7.6 Linearization by Steps
15.7.7 Linearization of a Parameter Defined by an Implicit Function
15.7.8 Linearization of a Double Sum
15.8 Binder’s Cookbook Approach
15.9 Demnati and Rao Approach
15.10 Linearization by the Sample Indicator Variables
15.10.1 The Method
15.10.2 Linearization of a Quantile
15.10.3 Linearization of a Calibrated Estimator
15.10.4 Linearization of a Multiple Regression Estimator
15.10.5 Linearization of an Estimator of a Complex Function with Calibrated Weights
15.10.6 Linearization of the Gini Index
15.11 Discussion on Variance Estimation
Exercises
16 Treatment of Nonresponse
16.1 Sources of Error
16.2 Coverage Errors
16.3 Different Types of Nonresponse
16.4 Nonresponse Modeling
16.5 Treating Nonresponse by Reweighting
16.5.1 Nonresponse Coming from a Sample
16.5.2 Modeling the Nonresponse Mechanism
16.5.3 Direct Calibration of Nonresponse
16.5.4 Reweighting by Generalized Calibration
16.6 Imputation
16.6.1 General Principles
16.6.2 Imputing From an Existing Value
16.6.3 Imputation by Prediction
16.6.4 Link Between Regression Imputation and Reweighting
16.6.5 Random Imputation
16.7 Variance Estimation with Nonresponse
16.7.1 General Principles
16.7.2 Estimation by Direct Calibration
16.7.3 General Case
16.7.4 Variance for Maximum Likelihood Estimation
16.7.5 Variance for Estimation by Calibration
16.7.6 Variance of an Estimator Imputed by Regression
16.7.7 Other Variance Estimation Techniques
17 Summary Solutions to the Exercises
Bibliography
Author Index
Subject Index
List of Figures
Figure 1.1 Auxiliary information can be used before or after data collection to improve estimations
Figure 4.1 Stratified design: the samples are selected independently from one stratum to another
Figure 5.1 Systematic sampling: example with inclusion probabilities $\pi_1 = 0.2$, $\pi_2$
Figure 5.2 Method of Deville
Figure 5.3 Splitting into two parts
Figure 5.4 Splitting in $M$ parts
Figure 5.5 Minimum support design
Figure 5.6 Decomposition into simple random sampling designs
Figure 5.7 Pivotal method applied on vector $\boldsymbol{\pi} = (0.3, 0.4, 0.6, 0.7)^\top$
Figure 6.1 Possible samples in a population of size $N = 3$
Figure 6.2 Fixed size constraint: the three samples of size $n = 2$ are connected by an affine subspace
Figure 6.3 None of the vertices of $K$ is a vertex of the cube
Figure 6.4 Two vertices of $K$ are vertices of the cube, but the third is not
Figure 6.5 Flight phase in a population of size $N = 3$ with a constraint of fixed size $n = 2$
Figure 7.1 Cluster sampling: the population is divided into clusters. Clusters are randomly selected. All units from the selected clusters are included in the sample
Figure 7.2 Two-stage sampling design: we randomly select primary units in which we select a sample of secondary units
Figure 7.3 Two-phase design: a sample $S_b$ is selected in sample $S_a$
Figure 7.4 The sample $S$ is the intersection of samples $S_A$ and $S_B$
Figure 8.1 In a $40 \times 40$ grid, a systematic sample and a stratified sample with one unit per stratum are selected
Figure 8.2 Recursive quadrant function used for the GRTS method with three subdivisions
Figure 8.3 Original function with four random permutations
Figure 8.4 Samples of 64 points in a grid of $40 \times 40 = 1600$ points using simple designs, GRTS, the local pivotal method, and the local cube method
Figure 8.5 Sample of 64 points in a grid of $40 \times 40 = 1600$ points and Voronoï polygons. Applications to simple, systematic, and stratified designs, the local pivotal method, and the local cube method
Figure 8.6 Interval corresponding to the first wave (extract from Qualité, 2009)
Figure 8.7 Positive coordination when $\pi_k^2 \le \pi_k^1$ (extract from Qualité, 2009)
Figure 8.8 Positive coordination when $\pi_k^2 \ge \pi_k^1$ (extract from Qualité, 2009)
Figure 8.9 Negative coordination when $\pi_k^1 + \pi_k^2 \le 1$ (extract from Qualité, 2009)
Figure 8.10 Negative coordination when $\pi_k^1 + \pi_k^2 \ge 1$ (extract from Qualité, 2009)
Figure 8.11 Coordination of a third sample (extract from Qualité, 2009)
Figure 8.12 Two survey frames $U_A$ and $U_B$ cover the population. In each one, we select a sample
Figure 8.13 In this example, the points represent contaminated trees. During the initial sampling, the shaded squares are selected. The borders in bold surround the final selected zones
Figure 8.14 Example of indirect sampling. In population $U_A$, the units surrounded by a circle are selected. Two clusters ($U_{B1}$ and $U_{B3}$) of population $U_B$ each contain at least one unit that has a link with a unit selected in population $U_A$. Units of $U_B$ surrounded by a circle are selected at the end
Figure 9.1 Ratio estimator: observations aligned along a line passing through the origin
Figure 9.2 Difference estimator: observations aligned along a line of slope equal to 1
Figure 10.1 Post-stratification: the population is divided in post-strata, but the sample is selected without taking post-strata into account
Figure 12.1 Linear method: pseudo-distance $G(w_k, d_k)$ with $q_k = 1$ and $d_k = 10$
Figure 12.2 Linear method: function $g(w_k, d_k)$ with $q_k = 1$ and $d_k = 10$
Figure 12.3 Linear method: function $F_k(u)$ with $q_k = 1$
Figure 12.4 Raking ratio: pseudo-distance $G(w_k, d_k)$ with $q_k = 1$ and $d_k = 10$
Figure 12.5 Raking ratio: function $g(w_k, d_k)$ with $q_k = 1$ and $d_k = 10$
Figure 12.6 Raking ratio: function $F_k(u)$ with $q_k = 1$
Figure 12.7 Reverse information: pseudo-distance $G(w_k, d_k)$ with $q_k = 1$ and $d_k = 10$
Figure 12.8 Reverse information: function $g(w_k, d_k)$ with $q_k = 1$ and $d_k = 10$
Figure 12.9 Reverse information: function $F_k(u)$ with $q_k = 1$
Figure 12.10 Truncated linear method: pseudo-distance $G(w_k, d_k)$ with $q_k = 1$, $d_k = 10$, $L = 0.2$, and $H = 2.5$
Figure 12.11 Truncated linear method: function $g(w_k, d_k)$ with $q_k = 1$, $d_k = 10$, $L = 0.2$, and $H = 2.5$
Figure 12.12 Truncated linear method: calibration function $F_k(u)$ with $q_k = 1$, $d_k = 10$, $L = 0.2$, and $H = 2.5$
Figure 12.13 Pseudo-distances $G^{\alpha}(w_k, d_k)$ with $\alpha = -1, 0, 1/2, 1, 2, 3$ and $d_k = 2$
Figure 12.14 Calibration functions $F_k^{\alpha}(u)$ with $\alpha = -1, 0, 1/2, 1, 2, 3$ and $q_k = 1$
Figure 12.15 Logistic method: pseudo-distance $G(w_k, d_k)$ with $q_k = 1$, $d_k = 10$, $L = 0.2$, and $H = 2.5$
Figure 12.16 Logistic method: function $g(w_k, d_k)$ with $q_k = 1$, $d_k = 10$, $L = 0.2$, and $H = 2.5$
Figure 12.17 Logistic method: calibration function $F_k(u)$ with $q_k = 1$, $L = 0.2$, and $H = 2.5$
Figure 12.18 Deville calibration: pseudo-distance $G(w_k, d_k)$ with $q_k = 1$, $d_k = 10$
Figure 12.19 Deville calibration: calibration function $F_k(u)$ with $q_k = 1$
Figure 12.20 Pseudo-distances $G^{\alpha}(w_k, d_k)$ of Roy and Vanheuverzwyn with $\alpha = 0, 1, 2, 3$, $d_k = 2$, and $q_k = 1$
Figure 12.21 Calibration function $\tilde{F}_k^{\alpha}(u)$ of Roy and Vanheuverzwyn with $\alpha = 0, 1, 2, 3$ and $q_k = 1$
Figure 12.22 Variation of the $g$-weights for different calibration methods as a function of their rank
Figure 13.1 Total taxable income in millions of euros with respect to the number of inhabitants in Belgian municipalities of 100 000 people or less in 2004 (Source: Statbel). The cloud of points is aligned along a line going through the origin
Figure 14.1 Step cumulative distribution function $\widehat{F}_1(x)$ with corresponding quartiles
Figure 14.2 Cumulative distribution function $\widehat{F}_2(y)$ obtained by interpolation of points $(y_k, F_1(y_k))$ with corresponding quartiles
Figure 14.3 Cumulative distribution function $\widehat{F}_3(x)$ obtained by interpolating the center of the risers with corresponding quartiles
Figure 14.4 Lorenz curve and the surface associated with the Gini index
Figure 16.1 Two-phase approach for nonresponse. The set of respondents $R$ is a subset of sample $S$
Figure 16.2 The reversed approach for nonresponse. The sample of nonrespondents $R$ is independent of the selected sample $S$
List of Tables
Table 3.1 Simple designs: summary table
Table 3.2 Example of sample sizes required for different population sizes and different values of $b$ for $\alpha = 0.05$ and $\widehat{P} = 1/2$
Table 4.1 Application of optimal allocation: the sample size is larger than the population size in the third stratum
Table 4.2 Second application of optimal allocation in strata 1 and 2
Table 5.1 Minimum support design
Table 5.2 Decomposition into simple random sampling designs
Table 5.3 Decomposition into $N$ simple random sampling designs
Table 5.4 Properties of the methods
Table 6.1 Population of 20 students with variables: constant, gender (1 male, 2 female), age, and a mark out of 20 in a statistics exam
Table 6.2 Totals and expansion estimators for balancing variables
Table 6.3 Variances of the expansion estimators of the means under simple random sampling and balanced sampling
Table 7.1 Block number, number of households, and total household income
Table 8.1 Means of spatial balancing measures based on Voronoï polygons $B(a_s)$ and modified Moran indices $I_B$ for six sampling designs on 1000 simulations
Table 8.2 Selection intervals for negative coordination and selection indicators in the case where the PRN falls within the interval. On the left, the case where $\pi_k^1 + \pi_k^2 \le 1$ (Figure 8.9). On the right, the case where $\pi_k^1 + \pi_k^2 \ge 1$ (Figure 8.10)
Table 8.3 Selection indicators for each selection interval for unit $k$
Table 9.1 Estimation methods: summary table
Table 10.1 Population partition
Table 10.2 Totals with respect to two variables
Table 10.3 Calibration, starting table
Table 10.4 Salaries in Euros
Table 10.5 Estimated totals using simple random sampling without replacement
Table 10.6 Known margins using a census
Table 10.7 Iteration 1: row total adjustment
Table 10.8 Iteration 2: column total adjustment
Table 10.9 Iteration 3: row total adjustment
Table 10.10 Iteration 4: column total adjustment
Table 12.1 Pseudo-distances for calibration
Table 12.2 Calibration functions and their derivatives
Table 12.3 Minima, maxima, means, and standard deviations of the weights for each calibration method
Table 14.1 Sample, variable of interest $y_k$, weights $w_k$, cumulative weights $W_k$, and relative cumulative weights $p_k$
Table 14.2 Table of fictitious incomes $y_k$, weights $w_k$, cumulative weights $W_k$, relative cumulative weights $p_k$, cumulative incomes $\widehat{Y}(p_k)$, and the Lorenz curve $\widehat{L}(p_k)$
Table 14.3 Totals necessary to estimate the Gini index
List of Algorithms
Algorithm 1 Bernoulli sampling
Algorithm 2 Selection–rejection method
Algorithm 3 Reservoir method
Algorithm 4 Sequential algorithm for simple random sampling with replacement
Algorithm 5 Systematic sampling with equal probabilities
Algorithm 6 Systematic sampling with unequal probabilities
Algorithm 7 Algorithm for Poisson sampling
Algorithm 8 Sampford procedure
Algorithm 9 General algorithm for the cube method
Algorithm 10 Positive coordination using the Kish and Scott method
Algorithm 11 Negative coordination with the Rivière method
Algorithm 12 Negative coordination with EDS method
Preface
The first version of this book was published in 2001, the year I left the École Nationale de la Statistique et de l’Analyse de l’Information (ENSAI) in Rennes (France) to teach at the University of Neuchâtel in Switzerland. This version came from several sets of course materials on sampling theory that I had taught in Rennes. At the ENSAI, the collaboration with Jean-Claude Deville was particularly stimulating.
The editing of this new edition was laborious and was done in fits and starts. I thank all those who reviewed the drafts and provided me with their comments. Special thanks to Monique Graf for her meticulous re-reading of some chapters.
The almost 20 years I spent in Neuchâtel were dotted with multiple adventures. I am particularly grateful to Philippe Eichenberger and Jean-Pierre Renfer, who successively headed the Statistical Methods Section of the Federal Statistical Office. Their trust and professionalism helped to establish a fruitful exchange between the Institute of Statistics of the University of Neuchâtel and the Swiss Federal Statistical Office.
I am also very grateful to the PhD students that I have had the pleasure of mentoring so far. Each thesis is an adventure that teaches both supervisor and doctoral student. Thank you to Alina Matei, Lionel Qualité, Desislava Nedyalkova, Erika Antal, Matti Langel, Toky Randrianasolo, Eric Graf, Caren Hasler, Matthieu Wilhelm, Mihaela Guinand-Anastasiade, and Audrey-Anne Vallée, who trusted me and whom I had the pleasure to supervise for a few years.
Neuchâtel, 2018
Yves Tillé
Preface to the First French Edition
This book contains teaching material that I started to develop in 1994. All chapters have indeed served as a support for teaching, a course, training, a workshop or a seminar. By grouping this material, I hope to present a coherent and modern set of results on the sampling, estimation, and treatment of nonresponses, in other words, on all the statistical operations of a standard sample survey.
In producing this book, my goal is not to provide a comprehensive overview of survey sampling theory, but rather to show that sampling theory is a living discipline, with a very broad scope. Although proofs have been omitted in several chapters, I have always been careful to refer the reader to bibliographical references. The abundance of very recent publications attests to the fertility of the 1990s in this area. All the developments presented in this book are based on the so-called “design-based” approach. In theory, there is another point of view based on population modeling. I intentionally left this approach aside, not out of disinterest, but to propose an approach that I deem consistent and ethically acceptable to the public statistician.
I would like to thank all the people who, in one way or another, helped me to make this book: Laurence Broze, who entrusted me with my first sampling course at the University Lille 3, Carl Särndal, who encouraged me on several occasions, and Yves Berger, with whom I shared an office at the Université Libre de Bruxelles for several years and who gave me a multitude of relevant remarks. My thanks also go to Antonio Canedo, who taught me to use LaTeX, to Lydia Zaïd, who has corrected the manuscript several times, and to Jean Dumais for his many constructive comments.
I wrote most of this book at the École Nationale de la Statistique et de l’Analyse de l’Information. The warm atmosphere that prevailed in the statistics department gave me a lot of support. I especially thank my colleagues Fabienne Gaude, Camelia Goga, and Sylvie Rousseau, who meticulously reread the manuscript, and Germaine Razé, who did the work of reproduction of the proofs. Several exercises are due to Pascal Ardilly, Jean-Claude Deville, and Laurent Wilms. I want to thank them for allowing me to reproduce them. My gratitude goes particularly to Jean-Claude Deville for our fruitful collaboration within the Laboratory of Survey Statistics of the Center for Research in Economics and Statistics. The chapters on the splitting method and balanced sampling also reflect the research that we have done together.
Bruz, 2001
Yves Tillé
Table of Notations
$\#$  cardinal (number of elements in a set)
$\ll$  much less than
$\setminus$  $A \setminus B$, complement of $B$ in $A$
$'$  function $f'(x)$ is the derivative of $f(x)$
$!$  factorial: $n! = n \times (n-1) \times \cdots \times 2 \times 1$
$\binom{N}{n}$  $\frac{N!}{n!(N-n)!}$, number of ways to choose $n$ units from $N$ units
$[a \pm b]$  interval $[a - b, a + b]$
$\approx$  is approximately equal to
$\propto$  is proportional to
$\sim$  follows a specific probability distribution (for a random variable)
$\mathbb{1}\{A\}$  equals 1 if $A$ is true and 0 otherwise
$a_k$  number of times unit $k$ is in the sample
$\mathbf{a}$  vector of $a_k$
$B_0, B_1, B_2, \dots$  population regression coefficients
$\mathbf{B}$  vector of population regression coefficients
$\beta_0, \beta_1, \beta_2, \dots$  regression coefficients for model $M$
$\boldsymbol{\beta}$  vector of regression coefficients of model $M$
$\widehat{\mathbf{B}}$  vector of estimated regression coefficients
$\widehat{\boldsymbol{\beta}}$  vector of estimated regression coefficients of the model
$C$  cube whose vertices are samples
$\mathrm{cov}_p(X, Y)$  covariance between random variables $X$ and $Y$
$\mathrm{cov}(X, Y)$  estimated covariance between random variables $X$ and $Y$
$\mathrm{CV}$  population coefficient of variation
$\widehat{\mathrm{CV}}$  estimated coefficient of variation
$d_k$  $d_k = 1/\pi_k$, expansion estimator survey weights
$\mathrm{E}_p(\widehat{Y})$  mathematical expectation under the sampling design $p(\cdot)$ of estimator $\widehat{Y}$
$\mathrm{E}_M(\widehat{Y})$  mathematical expectation under the model $M$ of estimator $\widehat{Y}$
$\mathrm{E}_q(\widehat{Y})$  mathematical expectation under the nonresponse mechanism $q$ of estimator $\widehat{Y}$
$\mathrm{E}_I(\widehat{Y})$  mathematical expectation under the imputation mechanism $I$ of estimator $\widehat{Y}$
MSE  mean square error
$f$  sampling fraction $f = n/N$
$g_k(\cdot,\cdot)$  pseudo-distance derivative for calibration
$g_k$  adjustment factor after calibration, called the $g$-weight: $g_k = w_k \pi_k = w_k / d_k$
$G_k(\cdot,\cdot)$  pseudo-distance for calibration
$h$  strata or post-strata index
$\mathrm{IC}(1 - \alpha)$  confidence interval with confidence level $1 - \alpha$
$k$ or $\ell$  indicates a statistical unit, $k \in U$ or $\ell \in U$
$K$  $C \cap Q$, intersection of the cube and constraint space for the cube method
$m$  number of clusters or primary units in the sample of clusters or primary units
$M$  number of clusters or primary units in the population
$n$  sample size (without replacement)
$n_i$  number of secondary units sampled in primary unit $i$
$n_S$  size of the sample in $S$ if the size is random
$N$  population size
$n_h$  sample size in stratum or post-stratum $U_h$
$N_h$  number of units in stratum or post-stratum $U_h$
$N_i$  number of secondary units in primary unit $i$
$N_{ij}$  population totals when $(i, j)$ is a contingency table
$\mathbb{N}$  set of natural numbers
$\mathbb{N}_+$  set of positive natural numbers with zero
$p(s)$  probability of selecting sample $s$
$p_i$  probability of sampling unit $i$ for sampling with replacement
$P$ or $P_D$  proportion of units belonging to domain $D$
$\Pr(A)$  probability that event $A$ occurs
$\Pr(A \mid B)$  probability that event $A$ occurs, given $B$ occurred
$Q$  subspace of constraints for the cube method
$r_k$  response indicator
$\mathbb{R}$  set of real numbers
$\mathbb{R}_+$  set of positive real numbers with zero
$\mathbb{R}_+^*$  set of strictly positive real numbers
$s$  sample or subset of the population, $s \subset U$
$s_y^2$  sample variance of variable $y$
$s_{yh}^2$  sample variance of $y$ in stratum or post-stratum $h$
$s_{xy}$  covariance between variables $x$ and $y$ in the sample
$S$  random sample such that $\Pr(S = s) = p(s)$
$S_y^2$  variance of variable $y$ in the population
$S_{xy}$  covariance between variables $x$ and $y$ in the population
$S_h$  random sample selected in stratum or post-stratum $h$
$S_{yh}^2$  population variance of $y$ in the stratum or post-stratum $h$
$\top$  vector $\mathbf{u}^\top$ is the transpose of vector $\mathbf{u}$
$U$  finite population of size $N$
$U_h$  stratum or post-stratum $h$, where $h = 1, \dots, H$
$v_k$  linearized variable
$v_{HT}(\widehat{Y})$  Horvitz–Thompson estimator of the variance of estimator $\widehat{Y}$
$v_{SYG}(\widehat{Y})$  Sen–Yates–Grundy estimator of the variance of estimator $\widehat{Y}$
varp ( ̂ Y )
varM ( ̂ Y )
varq ( ̂ Y )
varI ( ̂ Y )
��( ̂ Y )
��
varianceofestimator ̂ Y underthesurveydesign
varianceofestimator ̂ Y underthemodel
varianceofestimator ̂ Y underthenonresponsemechanism
varianceofestimator ̂ Y undertheimputationmechanism
varianceestimatorofestimator ̂ Y
k or ��k (S ) weightassociatedwithindividual k inthesampleaftercalibration
$x$ auxiliary variable
$x_k$ value of the auxiliary variable for unit $k$
$\mathbf{x}_k$ vector in $\mathbb{R}^p$ of the $p$ values taken by the auxiliary variables on $k$
$X$ total value of the auxiliary variable over all the units of $U$
$\widehat{X}$ expansion estimator of $X$
$\overline{X}$ mean value of the auxiliary variables over all the units of $U$
$\widehat{\overline{X}}$ expansion estimator of $\overline{X}$
$y$ variable of interest
$y_k$ value of the variable of interest for unit $k$
$y_k^*$ imputed value of $y$ for $k$ (treating nonresponse)
$Y$ total value of the variable of interest over all the units of $U$
$Y_h$ total value of the variable of interest over all the units in stratum or post-stratum $U_h$
$Y_i$ total of $y_k$ in primary unit or cluster $i$
$\widehat{Y}$ expansion estimator of $Y$
$\overline{Y}$ mean value of the variable of interest over all the units of $U$
$\overline{Y}_h$ mean value of the variable of interest over all units of stratum or post-stratum $U_h$
$\widehat{\overline{Y}}_h$ estimator of the mean value of the variable of interest over all units of stratum or post-stratum $U_h$
$\widehat{\overline{Y}}$ expansion estimator of $\overline{Y}$
$\widehat{Y}_{\text{BLU}}$ best linear unbiased estimator of total $Y$ under the model
$\widehat{Y}_{\text{CAL}}$ calibrated estimator of total $Y$
$\widehat{Y}_{\text{D}}$ difference estimator of total $Y$
$\widehat{Y}_h$ estimator of total $Y_h$ in stratum or post-stratum $U_h$
$\widehat{Y}_{\text{HAJ}}$ Hájek estimator of $Y$
$\widehat{Y}_{\text{HH}}$ Hansen–Hurwitz estimator of $Y$
$\widehat{Y}_{\text{IMP}}$ estimator used when missing values are imputed
$\widehat{Y}_{\text{OPT}}$ expansion estimator of the total in an optimal stratified design
$\widehat{Y}_{\text{POST}}$ post-stratified estimator of the total $Y$
$\widehat{Y}_{\text{PROP}}$ expansion estimator of the total in a stratified design with proportional allocation
$\widehat{Y}_{\text{REG}}$ regression estimator of total $Y$
$\widehat{Y}_{\text{REGM}}$ multiple regression estimator of total $Y$
$\widehat{Y}_{\text{REG-OPT}}$ optimal regression estimator of total $Y$
$\widehat{Y}_{\text{RB}}$ Rao–Blackwellized estimator of $Y$
$\widehat{Y}_{\text{Q}}$ ratio estimator of the total
$\widehat{Y}_{\text{STRAT}}$ expansion estimator of the total in a stratified design
$z_p$ quantile of order $p$ of a standardized normal random variable
$\alpha$ probability that the parameter is outside the interval
$\Delta_{k\ell}$ $\pi_{k\ell} - \pi_k \pi_\ell$
$\pi_k$ inclusion probability of unit $k$
$\pi_{k\ell}$ second-order inclusion probability for units $k$ and $\ell$, $\pi_{k\ell} = \Pr(k \text{ and } \ell \in S)$
$\phi_k$ response probability of unit $k$
$\sigma^2$ variance of variable $y$ in an infinite population, or variance under the model
$\rho$ correlation between $x$ and $y$ in the population
A History of Ideas in Survey Sampling Theory
1.1 Introduction
Looking back, the debates that animated a scientific discipline often appear futile. However, the history of sampling theory is particularly instructive. It is one of the specializations of statistics, which itself has a somewhat special position, since it is used in almost all scientific disciplines. Statistics is inseparable from its fields of application since it determines how data should be processed. Statistics is the cornerstone of quantitative scientific methods. It is not possible to determine the relevance of the applications of a statistical technique without referring to the scientific methods of the disciplines in which it is applied.
Scientific truth is often presented as the consensus of a scientific community at a specific point in time. The history of a scientific discipline is the story of these consensuses and especially of their changes. Since the work of Thomas Samuel Kuhn (1970), we have considered that science develops around paradigms that are, according to Kuhn (1970, p. 10), “models from which spring particular coherent traditions of scientific research.” These models have two characteristics: “Their achievement was sufficiently unprecedented to attract an enduring group of adherents away from competing modes of scientific activity. Simultaneously, it was sufficiently open-ended to leave all sorts of problems for the redefined group of practitioners to resolve.” (Kuhn, 1970, p. 10).
Many authors have proposed a chronology of discoveries in survey theory that reflect the major controversies that have marked its development (see among others Hansen & Madow, 1974; Hansen et al., 1983; Owen & Cochran, 1976; Sheynin, 1986; Stigler, 1986). Bellhouse (1988a) interprets this timeline as a story of the great ideas that contributed to the development of survey sampling theory. Statistics is a peculiar science. With mathematics for tools, it allows the methodology of the other disciplines to be finalized. Because of the close correlation between a method and the multiplicity of its fields of action, statistics is based on a multitude of different ideas from the various disciplines in which it is applied.
The theory of survey sampling plays a preponderant role in the development of statistics. However, the use of sampling techniques has been accepted only very recently. Among the controversies that have animated this theory, we find some of the classical debates of mathematical statistics, such as the role of modeling and a discussion of estimation techniques. Sampling theory was torn between the major currents of statistics and gave rise to multiple approaches: design-based, model-based, model-assisted, predictive, and Bayesian.
1.2 Enumerative Statistics During the 19th Century
Several medieval attempts to extrapolate partial data to an entire population are described in Droesbeke et al. (1987). In 1783, in France, Pierre Simon de Laplace (see 1847) presented to the Academy of Sciences a method to determine the number of inhabitants from birth registers using a sample of regions. He proposed to calculate, from this sample of regions, the ratio of the number of inhabitants to the number of births and then to multiply it by the total number of births, which could be obtained with precision for the whole population. Laplace even suggested estimating “the error to be feared” by referring to the central limit theorem. In modern terms, his proposal amounts to a ratio estimator that uses the total number of births as auxiliary information. Survey methodology as well as probabilistic tools were thus known before the 19th century. However, never during this period was there a consensus about their validity.
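Laplace's calculation described above can be sketched in a few lines. This is only an illustration in Python: the function name and the regional figures are hypothetical, not historical data, and the sketch makes no attempt to compute "the error to be feared".

```python
def laplace_population_estimate(sample_inhabitants, sample_births, total_births):
    """Ratio estimate of the population size.

    The ratio inhabitants / births is computed on the sampled regions and
    then multiplied by the number of births known for the whole country.
    """
    ratio = sum(sample_inhabitants) / sum(sample_births)
    return ratio * total_births


# Hypothetical figures for three sampled regions (assumed for illustration).
inhabitants = [28_000, 52_000, 20_000]
births = [1_000, 2_000, 1_000]

# 100 000 inhabitants for 4 000 births gives a ratio of 25 inhabitants per birth.
estimate = laplace_population_estimate(inhabitants, births, total_births=1_000_000)
print(estimate)
```

The estimate is only as good as the assumption that the ratio of inhabitants to births is roughly the same in the sampled regions and in the rest of the country.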
The development of statistics (etymologically, from German: analysis of data about the state) is inseparable from the emergence of modern states in the 19th century. One of the most outstanding personalities in the official statistics of the 19th century is the Belgian Adolphe Quételet (1796–1874). He knew of Laplace’s method and maintained a correspondence with him. According to Stigler (1986, pp. 164–165), Quételet was initially attracted to the idea of using partial data. He even tried to apply Laplace’s method to estimate the population of the Netherlands in 1824 (which Belgium was a part of until 1830). However, it seems that he then rallied to a note from Keverberg (1827) which severely criticized the use of partial data in the name of precision and accuracy:
In my opinion, there is only one way to arrive at an exact knowledge of the population and the elements of which it is composed: it is that of an actual and detailed enumeration; that is to say, the formation of nominative states of all the inhabitants, with indication of their age and occupation. Only by this mode of operation can reliable documents be obtained on the actual number of inhabitants of a country, and at the same time on the statistics of the ages of which the population is composed, and the branches of industry in which it finds the means of comfort and prosperity.¹
In one of his letters to the Duke of Saxe-Coburg Gotha, Quételet (1846, p. 293) also advocates for an exhaustive enumeration:
La Place had proposed to substitute for the census of a large country, such as France, some special censuses in selected departments where this kind of operation might have more chances of success, and then to carefully determine the ratio of the population either to births or to deaths. By means of these ratios and of the births and deaths of all the other departments, figures which can be ascertained with sufficient accuracy, it is then easy to determine the population of the whole kingdom. This way of operating is very expeditious, but it supposes an invariable ratio passing from one department to another. [···] This indirect method must be avoided as much as possible, although it may be useful in some cases, where the administration would have to proceed quickly; it can also be used with advantage as a means of control.²
¹ Translated from French: “À mon avis, il n’existe qu’un seul moyen de parvenir à une connaissance exacte de la population et des élémens dont elle se compose: c’est celle d’un dénombrement effectif et détaillé; c’est-à-dire, de la formation d’états nominatifs de tous les habitans, avec indication de leur âge et de leur profession. Ce n’est que par ce mode d’opérer, qu’on peut obtenir des documens dignes de confiance sur le nombre réel d’habitans d’un pays, et en même temps sur la statistique des âges dont la population se compose, et des branches d’industrie dans lesquelles elle trouve des moyens d’aisance et de prospérité.”
It is interesting to examine the argument used by Quételet (1846, p. 293) to justify his position.
To not obtain the faculty of verifying the documents that are collected is to fail in one of the principal rules of science. Statistics is valuable only by its accuracy; without this essential quality, it becomes null, dangerous even, since it leads to error.³
Again, accuracy is considered a basic principle of statistical science. Despite the existence of probabilistic tools and despite various applications of sampling techniques, the use of partial data was perceived as a dubious and unscientific method. Quételet had a great influence on the development of official statistics. He participated in the creation of a section for statistics within the British Association for the Advancement of Science in 1833 with Thomas Malthus and Charles Babbage (see Horvàth, 1974). One of its objectives was to harmonize the production of official statistics. He organized the International Congress of Statistics in Brussels in 1853. Quételet was well acquainted with the administrative systems of France, the United Kingdom, the Netherlands, and Belgium. He probably contributed to the idea that the use of partial data is unscientific.
Some personalities, such as Malthus and Babbage in Great Britain, and Quételet in Belgium, contributed greatly to the development of statistical methodology. On the other hand, the establishment of a statistical apparatus was a necessity in the construction of modern states, and it is probably not a coincidence that these personalities come from the two countries most rapidly affected by the industrial revolution. At that time, the statistician’s objective was mainly to make enumerations. The main concern was to inventory the resources of nations. In this context, the use of sampling was unanimously rejected as an inexact and fundamentally unscientific procedure. Throughout the 19th century, the discussions of statisticians focused on how to obtain reliable data and on the presentation, interpretation, and possibly modeling (adjustment) of these data.
² Translated from French: “La Place avait proposé de substituer au recensement d’un grand pays, tel que la France, quelques recensements particuliers dans des départements choisis, où ce genre d’opération pouvait avoir plus de chances de succès, puis d’y déterminer avec soin le rapport de la population soit aux naissances soit aux décès. Au moyen de ces rapports des naissances et des décès de tous les autres départements, chiffres qu’on peut constater avec assez d’exactitude, il devient facile ensuite de déterminer la population de tout le royaume. Cette manière d’opérer est très expéditive, mais elle suppose un rapport invariable en passant d’un département à un autre. [···] Cette méthode indirecte doit être évitée autant que possible, bien qu’elle puisse être utile dans certains cas, où l’administration aurait à procéder avec rapidité; on peut aussi l’employer avec avantage comme moyen de contrôle.”
³ Translated from French: “Ne pas se procurer la faculté de vérifier les documents que l’on réunit, c’est manquer à l’une des principales règles de la science. La statistique n’a de valeur que par son exactitude; sans cette qualité essentielle, elle devient nulle, dangereuse même puisqu’elle conduit à l’erreur.”
1.3 Controversy on the Use of Partial Data
In 1895, the Norwegian Anders Nicolai Kiær, Director of the Central Statistical Office of Norway, presented to the Congress of the International Statistical Institute (ISI) in Bern a work entitled Observations et expériences concernant des dénombrements représentatifs (Observations and experiments on representative enumeration) on a survey conducted in Norway. Kiær (1896) first selected a sample of cities and municipalities. Then, in each of these municipalities, he selected only some individuals using the first letter of their surnames. He applied a two-stage design, but the choice of the units was not random. Kiær argued for the use of partial data if it is produced using a “representative method”. According to this method, the sample must be a reduced-size representation of the population. Kiær’s concept of representativeness is linked to the quota method. His speech was followed by a heated debate, and the proceedings of the Congress of the ISI reflect a long dispute. Let us take a closer look at the arguments from two opponents of Kiær’s method (see ISI General Assembly Minutes, 1896).
Georg von Mayr (Prussia) [···] It is especially dangerous to call for this system of representative investigations within an assembly of statisticians. It is understandable that for legislative or administrative purposes such limited enumeration may be useful – but then it must be remembered that it can never replace complete statistical observation. It is all the more necessary to support this point, that there is among us in these days a current among mathematicians who, in many directions, would rather calculate than observe. But we must remain firm and say: no calculation where observation can be done.⁴
Guillaume Milliet (Switzerland). I believe that it is not right to give, through a resolution of the congress, the representative method (which can only be an expedient) an importance that serious statistics will never recognize. No doubt, statistics made with this method, or, as I might call it, statistics pars pro toto, has given us here and there interesting information; but its principle is so much in contradiction with the demands of the statistical method that, as statisticians, we should not grant to imperfect things the same right of bourgeoisie, so to speak, that we accord to the ideal that scientifically we propose to reach.⁵
⁴ Translated from French: “C’est surtout dangereux de se déclarer pour ce système des investigations représentatives au sein d’une assemblée de statisticiens. On comprend que pour des buts législatifs ou administratifs un tel dénombrement restreint peut être utile – mais alors il ne faut pas oublier qu’il ne peut jamais remplacer l’observation statistique complète. Il est d’autant plus nécessaire d’appuyer là-dessus, qu’il y a parmi nous dans ces jours un courant au sein des mathématiciens qui, dans de nombreuses directions, voudraient plutôt calculer qu’observer. Mais il faut rester ferme et dire: pas de calcul là où l’observation peut être faite.”
⁵ Translated from French: “Je crois qu’il n’est pas juste de donner par un vœu du congrès à la méthode représentative (qui enfin ne peut être qu’un expédient) une importance que la statistique sérieuse ne reconnaîtra jamais. Sans doute, la statistique faite avec cette méthode ou, comme je pourrais l’appeler, la statistique, pars pro toto, nous a donné ça et là des renseignements intéressants; mais son principe est tellement en contradiction avec les exigences que doit avoir la méthode statistique, que, comme statisticiens, nous ne devons pas accorder aux choses imparfaites le même droit de bourgeoisie, pour ainsi dire, que nous accordons à l’idéal que scientifiquement nous nous proposons d’atteindre.”
The content of these reactions can again be summarized as follows: since statistics is by definition exhaustive, renouncing complete enumeration denies the very mission of statistical science. The discussion does not concern the method proposed by Kiær, but the definition of statistical science. However, Kiær did not let go, and continued to defend the representative method in 1897 at the congress of the ISI at St. Petersburg (see Kiær, 1899), in 1901 in Budapest, and in 1903 in Berlin (see Kiær, 1903, 1905). After this date, the issue was no longer mentioned at the ISI Congress. However, Kiær obtained the support of Arthur Bowley (1869–1957), who then played a decisive role in the development of sampling theory. Bowley (1906) presented an empirical verification of the application of the central limit theorem to sampling. He was the true promoter of random sampling techniques, developed stratified designs with proportional allocations, and used the law of total variance. It was necessary to wait until the end of the First World War and the emergence of a new generation of statisticians for the problem to be rediscussed within the ISI. On this subject, we cannot help but quote Max Planck’s reflection on the appearance of new scientific truths: “a new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it” (quoted by Kuhn, 1970, p. 151).
In 1924, a commission (composed of Arthur Bowley, Corrado Gini, Adolphe Jensen, Lucien March, Verrijn Stuart, and Frantz Zizek) was created to evaluate the relevance of using the representative method. The results of this commission, entitled “Report on the representative method of statistics”, were presented at the 1925 ISI Congress in Rome. The commission accepted the principle of survey sampling as long as the methodology is respected. Thirty years after Kiær’s communication, the idea of sampling was officially accepted. The commission laid the foundation for future research. Two methods are clearly distinguished: “random selection” and “purposive selection”. These two methods correspond to two fundamentally different scientific approaches. On the one hand, the validation of random methods is based on the calculation of probabilities that allows confidence intervals to be built for certain parameters. On the other hand, the validation of the purposive selection method can only be obtained through experimentation by comparing the obtained estimations to census results. Therefore, random methods are validated by a strictly mathematical argument while purposive methods are validated by an experimental approach.
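To make the contrast concrete, the following Python sketch shows how random selection leads to a confidence interval through a purely probabilistic calculation. It assumes simple random sampling without replacement, the usual variance estimator $(1 - n/N)\, s_y^2 / n$ for the sample mean, and a normal approximation with $z = 1.96$; the function name and the population values are hypothetical and serve only as an illustration.

```python
import math
import random
import statistics


def srs_mean_confidence_interval(population, n, z=1.96, rng=None):
    """Return (lower, estimate, upper) for the population mean.

    Assumes simple random sampling without replacement of n units and a
    normal approximation for the distribution of the sample mean.
    """
    rng = rng or random.Random()
    sample = rng.sample(population, n)      # random selection without replacement
    N = len(population)
    mean = statistics.fmean(sample)         # estimator of the population mean
    s2 = statistics.variance(sample)        # sample variance s_y^2
    var_hat = (1 - n / N) * s2 / n          # includes the finite population correction
    half_width = z * math.sqrt(var_hat)
    return mean - half_width, mean, mean + half_width


# Hypothetical population of 1000 values (assumed for illustration only).
population = [float(v) for v in range(1, 1001)]
low, est, high = srs_mean_confidence_interval(population, n=100, rng=random.Random(1))
print(low, est, high)
```

No comparison with a census is needed to obtain the interval. A purposive sample offers no such calculation, which is why its validation had to rely on comparing estimates with census results.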
1.4 Development of a Survey Sampling Theory
The report of the commission presented to the ISI Congress in 1925 marked the official recognition of the use of survey sampling. Most of the basic problems had already been posed, such as the use of random samples and the calculation of the variance of the estimators for simple and stratified designs. The acceptance of the use of partial data, and especially the recommendation to use random designs, led to a rapid mathematization of this theory. At that time, the calculation of probabilities was already known. In addition, statisticians had already developed a theory for experimental statistics. Everything was in place for the rapid progress of a fertile field of research: the construction of a statistical theory of survey sampling.