
A Concise Introduction to

MATHEMATICAL STATISTICS

DRAGI ANEVSKI


Copying prohibited. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. The papers and inks used in this product are eco-friendly. Art. No 39387 ISBN 978-91-44-11575-7 Edition 1:1 © The Author and Studentlitteratur 2017 www.studentlitteratur.se Studentlitteratur AB, Lund Cover design by Jens Martin/Signalera Cover image by Dragi Anevski Author photo by Bengt Jakobsson Printed by Holmbergs i Malmö AB, Sweden 2017


CONTENTS

Preface 11

Mathematical Statistics 15
    What is unique about the subject? 15
    Very short historical overview and the development of the subject 16
    Two important results 16
    Some important areas in mathematical statistics 17
    Important applications and connections to other areas 19
        Philosophy 19
        Physics 20
        Chemistry 20
        Engineering science 20
        Medicine 21
        Economy 21
        Biology 22
        Law 22
    The Nobel Prize and the Fields medal 23

PART I  Probability Theory

CHAPTER 1  An introductory example and overview 27
    The rolling of a dice 27

CHAPTER 2  Set theory and introduction to probability 33
    2.1 Set theory and events 33
    2.2 Probabilities 37
    2.3 Probabilities, σ-algebras 43
    2.4 A σ-algebra as a container of information about an experiment 50
    2.5 Exercises 53

CHAPTER 3  Conditional probability and independence 55
    3.1 Conditional probability 55
    3.2 Independent events 59
    3.3 Exercises 61

CHAPTER 4  Random variables 63
    4.1 The distribution of a random variable 66
    4.2 Counting measure, length measure, and integrals with such 70
    4.3 The density of a r.v. 73
    4.4 The quantiles 78
    4.5 Exercises 80

CHAPTER 5  Random vectors 83
    5.1 Counting measures and length measures on R^n and integrals with such 87
    5.2 The density of a random vector 90
        Discrete random vectors 91
        Continuous random vectors 93
        Mixed discrete and continuous r.v. 96
    5.3 Independent r.v.'s 98
    5.4 Conditional distributions 100
    5.5 Exercises 105

CHAPTER 6  Functions of random variables and of random vectors 107
    6.1 Functions of one random variable 108
    6.2 Functions of a random vector 116
        Maxima and minima 118
        Sums and convolutions 122
        General real valued functions of a random vector 125
        Vector valued functions of random vectors 126
    6.3 Exercises 128

CHAPTER 7  Expectation 129
    7.1 The Riemann-Stieltjes integral on R 130
    7.2 Application to probability theory 138
    7.3 The Riemann-Stieltjes integral on R^n 147
    7.4 Applications to probability theory 151
    7.5 Expectations and covariances of random vectors 162
    7.6 The expectation of positive r.v.'s 164
    7.7 Exercises 165

CHAPTER 8  Conditional expectation 169
    8.1 The definition and properties of a conditional expectation 169
    8.2 Exercises 176

CHAPTER 9  Examples of distributions 179
    9.1 The Bernoulli distribution 179
    9.2 The binomial distribution 180
    9.3 The geometric distribution 183
    9.4 The exponential distribution 184
    9.5 The discrete uniform distribution 185
    9.6 The continuous uniform distribution 186
    9.7 The Poisson distribution 186
    9.8 The Gaussian (normal) distribution 188
    9.9 The multinomial distribution 192
    9.10 The multivariate Gaussian (normal) distribution 195
    9.11 Exercises 199

CHAPTER 10  Stochastic convergence 201
    10.1 Convergence of random variables 202
        Sure convergence 202
        Almost sure convergence 203
        Convergence in p'th mean 203
        Convergence in probability 204
        The interpretation of a probability: consequences of the LLN 206
        Convergence in distribution 207
        The prevalence of the Gaussian distribution: consequences of the central limit theorem 210
        The prevalence of the Poisson distribution 211
    10.2 Exercises 213

CHAPTER 11  Stochastic processes 215
    11.1 Introduction 215
    11.2 Stochastic processes 215
    11.3 The distribution of a stochastic process 220
    11.4 Three important classes of processes 221
        Stationary processes 221
        Markov processes 222
        Martingales 223
    11.5 Two important processes 224
        The partial sum process 224
        The empirical process 224
    11.6 Exercises 226

PART II  Inference Theory

CHAPTER 12  An introductory example and overview 229
    A coin toss experiment 230

CHAPTER 13  Statistics 235
    13.1 A statistic seen as a function of the data sample 236
    13.2 A statistic seen as a function(al) of the distribution function 238
    13.3 Properties of estimators 245
        Finite-sample properties 245
        Asymptotic properties 249
    13.4 The standard error 252
    13.5 Exercises 253

CHAPTER 14  Methods to obtain estimators 257
    14.1 The plug-in estimator 257
    14.2 The maximum likelihood method 257
    14.3 The least squares estimator 266
    14.4 Extensions and modifications 269
        The ML estimator 269
        The LS estimator 272
    14.5 Exercises 274

CHAPTER 15  Confidence intervals 277
    15.1 Pivot functions 279
    15.2 Joint confidence intervals for several parameters 284
    15.3 Exercises 286

CHAPTER 16  Tests 289
    16.1 The power of a test and the power function 293
    16.2 Composite null hypothesis 295
    16.3 p-values 298
    16.4 Multiple testing 301
    16.5 Exercises 303

CHAPTER 17  Normal approximation of estimators 307
    17.1 Normal approximation of linear functionals 307
    17.2 Normal approximation of the binomial distribution 310
    17.3 Normal approximation of the Poisson distribution 311
        A note on the Gaussian approximation 312
    17.4 Exercises 313

CHAPTER 18  Applications to some common situations 315
    18.1 Data from a Gaussian distribution 315
        One sample 315
        Several samples 316
        Observation in pairs 318
    18.2 Binomial data 319
        One sample 319
        Two samples, inference for difference in success probability 321
    18.3 Poisson data 322
    18.4 Exercises 324

CHAPTER 19  Test-based intervals and the confidence interval method 327
    19.1 The confidence interval method 327
    19.2 Test based confidence intervals 329
    19.3 Exercises 331

CHAPTER 20  Parametric, semi-parametric and non-parametric estimation problems 333

CHAPTER 21  The empirical distribution function 337
    21.1 Some more advanced properties of the empirical distribution function 339

CHAPTER 22  Some nonparametric inference problems 343
    22.1 Introduction 343
    22.2 Density function estimation 345
    22.3 Regression function estimation 350
    22.4 One and k-sample tests 352
        Why a nonparametric test? 353
    22.5 Estimating the survival function in survival analysis 355
    22.6 Exercises 359

CHAPTER 23  Linear regression 363
    23.1 The least squares estimator 365
    23.2 Normal linear model 367
    23.3 The ML estimator in the normal linear model 370
        Test for and confidence interval at a point y(x0) on the plane in a normal linear model 371
    23.4 Residual analysis and model fit 372
    23.5 Testing of a model and model choice 375
        All subsets regression 376
        Stepwise forward or backward regression 377
    23.6 Prediction intervals 379
    23.7 Some further interesting regression problems 380
        Dichotomous response and logistic regression 381
        Time to an event as response variable and regression models in survival analysis 382
    23.8 Exercises 384

CHAPTER 24  Introduction to inference for stochastic processes 387
    24.1 Introduction 387
        The inference problem 389
    24.2 Inference for Poisson processes 389
        The definition and further properties 389
        Inference for λ 390
    24.3 Inference for Markov chains 391
        The definition and further properties 391
        Inference for the transition probabilities 394
    24.4 Exercises 397

APPENDIX A  Some useful results from analysis 399
    A.1 Maps of sets 399
    A.2 Continuity 400
    A.3 Measurability of a r.v. 401

APPENDIX B  Distributions arising from the Gaussian distribution 405

APPENDIX C  The Riemann integral 413
    Integration on R 413
    Integration on R^n 416

Bibliography 419


CHAPTER 3

Conditional probability and independence

3.1 Conditional probability

We have defined a probability as a measure of how likely an event is. Here the events are the sets in F, and Ω is the basis or reference set, which restricts what can happen in the following particular sense: we know for sure that we are in Ω, or, put differently, Ω is a certain event. Now assume that we would like to model the situation that we know for sure that we are in the subset B ⊆ Ω, for some given B ∈ F. How then should we model the probability that another event A ∈ F will happen, given that B has happened? Let us denote this probability by P(A∣B) and see what properties are desirable. First it seems reasonable to demand that

a)

P(B∣B) = 1,

in analogy with the unconditional case P(Ω) = 1, since the set B now replaces the outcome space Ω. Furthermore, countable additivity¹ b)

P(∪_{i=1}^∞ A_i ∣ B) = ∑_{i=1}^∞ P(A_i ∣ B)

if A_i ∩ A_j = ∅ when i ≠ j, seems reasonable. We next give a definition of such a conditional probability. 1 This is actually a very strong assumption, so much so that one calls such a conditional probability a “regular” conditional probability. The problem is that for some uncountable Ω's it is not possible to obtain a regular conditional probability. It is straightforward and elementary to show the result for countable Ω's, and there is no problem when Ω is a Euclidean space; then it is always possible to make a definition so that b) is satisfied. We do not worry about this here and use Definition 3.1 freely.


FIGURE 3.1

The conditional probability of A given that B has occurred.

Definition 3.1 Let B ∈ F be given and assume that P(B) > 0. Then we define the function P(⋅∣B) ∶ F → [0,1] by

P(A∣B) = P(A ∩ B) / P(B),

for A ∈ F. We call P(A∣B) the conditional probability for A given B.

Note the difference between P(A ∩ B) and P(A∣B): in both cases we are making the restriction that we are in the set B, however in the second case we normalise with P(B). Note also that P(A∣B) is undefined for B such that P(B) = 0.

Lemma 3.2 The conditional probability P(⋅∣B) in Definition 3.1 satisfies a) and b).

Proof. Left as an exercise. (Use the properties of a probability and Definition 3.1.)

If we can write Ω = B_1 ∪ . . . ∪ B_n with B_i ∩ B_j = ∅ for i ≠ j, we call B_1, . . . , B_n a partition of Ω.

Lemma 3.3 (The law of total probability) If B_1, . . . , B_n is a partition of Ω, with P(B_i) > 0 for all i, then

P(A) = ∑_{i=1}^n P(A∣B_i)P(B_i).


FIGURE 3.2

The law of total probability.

Proof. This follows by additivity and the definition of a conditional probability:

P(A) = P(A ∩ Ω) = P(A ∩ (B_1 ∪ . . . ∪ B_n)) = P((A ∩ B_1) ∪ . . . ∪ (A ∩ B_n)) = P(A ∩ B_1) + . . . + P(A ∩ B_n) = P(A∣B_1)P(B_1) + . . . + P(A∣B_n)P(B_n),

since the sets A ∩ B_1, . . . , A ∩ B_n are disjoint.

In fact, we have the same result for infinite partitions.

Lemma 3.4 Assume that Ω = ∪_{i=1}^∞ B_i with B_i ∩ B_j = ∅ when i ≠ j. Assume also that P(B_i) > 0 for all i ≥ 1. Then for any A ∈ F

P(A) = ∑_{i=1}^∞ P(A∣B_i)P(B_i).

Proof. The proof is analogous to the proof of the previous lemma,

P(A) = P(A ∩ Ω) = P(A ∩ ∪_{i=1}^∞ B_i) = P(∪_{i=1}^∞ (A ∩ B_i)) = ∑_{i=1}^∞ P(A ∩ B_i) = ∑_{i=1}^∞ P(A∣B_i)P(B_i).


Example 3.5 We are interested in the probability that an individual in Sweden dies of a cardiovascular disease. It is known that this probability is 0.4 for smokers and 0.1 for non-smokers. The probability of a person in Sweden being a smoker is 0.1. We now make a probabilistic model of this: Let

A = {death in cardiovascular disease}, B_1 = {smoker}, B_2 = {non-smoker}.

Let Ω be the set of people in Sweden. We have that Ω = B_1 ∪ B_2 and B_1 ∩ B_2 = ∅, and that A ⊆ Ω. We want to make sure that A, B_1, B_2 are events, and we can for instance let it be given that F = σ({A, B_1}). Then the probability data given above are

P(A∣B_1) = 0.4, P(A∣B_2) = 0.1, P(B_1) = 0.1, P(B_2) = 1 − P(B_1) = 0.9.

Using Lemma 3.3, we get

P(A) = P(A∣B_1)P(B_1) + P(A∣B_2)P(B_2) = 0.4 ⋅ 0.1 + 0.1 ⋅ 0.9 = 0.13.

As can be seen from this example, the lemma is very handy for calculating probabilities of events A that can be complicated: one calculates the probabilities of A restricted to each of the partition sets B_i, which is usually easier, and weights those probabilities with the weights P(B_i). An example with an infinite partition is given in Example 3.10 below.

The next theorem gives a reverse to the law of total probability, and is one of the most famous and useful theorems in mathematical statistics.

Lemma 3.6 (Bayes theorem) Let B_1, . . . , B_n be a partition with P(B_i) > 0 for all i. Then

P(B_k∣A) = P(A∣B_k)P(B_k) / ∑_{i=1}^n P(A∣B_i)P(B_i).


Proof. We have

P(B_k∣A) = P(B_k ∩ A) / P(A) = P(A∣B_k)P(B_k) / ∑_{i=1}^n P(A∣B_i)P(B_i).

Bayes theorem gives us a way to calculate the probability of any partition set B_i, if we know that (the complicated event) A has occurred.

Example 3.7 (Example 3.5 continued.) We are interested in the probability that a person who has died of a cardiovascular disease is in fact a smoker. Bayes theorem gives us

P(B_1∣A) = P(A∣B_1)P(B_1) / (P(A∣B_1)P(B_1) + P(A∣B_2)P(B_2)) = 0.4 ⋅ 0.1 / (0.4 ⋅ 0.1 + 0.1 ⋅ 0.9) = 0.31.

We note that the (unconditional) probability of being a smoker is 0.1, while the probability of being a smoker conditional on having died of a cardiovascular disease is 0.31.
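The arithmetic in Examples 3.5 and 3.7 is easy to reproduce numerically. The following short Python sketch is my own illustration, not part of the book; the function names total_probability and bayes are made up for this example. It computes P(A) by the law of total probability (Lemma 3.3) and P(B_1∣A) by Bayes theorem (Lemma 3.6) for a finite partition.

```python
# A minimal sketch (not from the book): law of total probability and
# Bayes theorem for a finite partition, applied to Examples 3.5 and 3.7.

def total_probability(p_A_given_B, p_B):
    """Law of total probability: P(A) = sum_i P(A | B_i) P(B_i)."""
    return sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))

def bayes(k, p_A_given_B, p_B):
    """Bayes theorem: P(B_k | A) = P(A | B_k) P(B_k) / sum_i P(A | B_i) P(B_i)."""
    return p_A_given_B[k] * p_B[k] / total_probability(p_A_given_B, p_B)

# Example 3.5 / 3.7: B_1 = smoker, B_2 = non-smoker, A = death in cardiovascular disease.
p_A_given_B = [0.4, 0.1]   # P(A | B_1), P(A | B_2)
p_B = [0.1, 0.9]           # P(B_1), P(B_2)

print(total_probability(p_A_given_B, p_B))  # 0.13, as in Example 3.5
print(bayes(0, p_A_given_B, p_B))           # about 0.31, as in Example 3.7 (index 0 is B_1)
```

The same two functions apply to any finite partition B_1, . . . , B_n, as long as the conditional probabilities and the weights P(B_i) are listed in the same order.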

3.2 Independent events

In Example 3.5 the unconditional probability of A and the conditional probability of A given B_1 were different. The case when they are the same is important.

Definition 3.8 Two events A, B ∈ F are called independent if

P(A ∩ B) = P(A) ⋅ P(B).

A collection of events {A_i ∈ F ∶ i ∈ I} is called independent if P(∩_{j∈J} A_j) = ∏_{j∈J} P(A_j) for every finite subset J ⊆ I.

Independence of random events is a very common assumption to make (perhaps as common as the assumption of linearity in analysis).

Example 3.9 Assume we perform the same experiment two times; each time the event A can occur. Let A_1 denote that the event occurs the first time and A_2 that the event occurs the second time. Then we can define the universal set as

Ω = (A_1 ∪ A_1^c) ∩ (A_2 ∪ A_2^c) = (A_1 ∩ A_2) ∪ (A_1 ∩ A_2^c) ∪ (A_1^c ∩ A_2) ∪ (A_1^c ∩ A_2^c),

and we note that in this case the power set P(Ω) is actually a sensible σ-algebra. (Exercise: Does A_1 lie in this Ω? It should . . .) Then we say that the experiments are independent, and that the events A_1, A_2 are independent, if P(A_1 ∩ A_2) = P(A_1) ⋅ P(A_2). For instance, the experiment can be to toss a coin, and the event can be A = {head}. Then, if the events are independent and the coin is fair (so that P(A) = 1/2 is a sensible model for this experiment), the probability to get two heads in a row is

P(A_1 ∩ A_2) = P(A_1) ⋅ P(A_2) = 1/2 ⋅ 1/2 = 1/4.

Example 3.10 Assume we have the coin toss experiment in Example 2.3, where we toss a coin until the first head turns up. We have constructed the outcome space as Ω = {ω_1, ω_2, . . .}, where

ω_i = {the first i − 1 coin tosses result in tails and the i'th coin toss results in head}.

We wrote the event A = {the number of coin tosses needed is even} as

A = {ω_2, ω_4, ω_6, . . .} = ∪_{i=1}^∞ {ω_2i} = ∪_{i=1}^∞ A_i,

with A_i = {ω_2i}. Since Ω is discrete we can use the power set P(Ω) as the σ-algebra. Now let the probability of head in a single coin toss be p, for some fixed p ∈ (0,1). We claim that a good probabilistic model for this experiment is that the coin tosses are independent, so that P({ω_i}) = (1 − p)^{i−1} p. Thus a reasonable model for the experiment is that P(A_i) = (1 − p)^{2i−1} p. Then what is P(A)?


Clearly A_i ∩ A_j = ∅ if i ≠ j, so the events are disjoint. Therefore the set A = ∪_{i=1}^∞ A_i (which is an event) has probability

P(A) = ∑_{i=1}^∞ P(A_i) = ∑_{i=1}^∞ (1 − p)^{2i−1} p.

Let q = 1 − p. We get

P(A) = ∑_{i=1}^∞ q^{2i−1} p
     = (p/q) ∑_{i=1}^∞ q^{2i}
     = (p/q) q^2 ∑_{i=0}^∞ (q^2)^i
     = pq ⋅ 1/(1 − q^2)
     = q/(1 + q)
     = (1 − p)/(2 − p),

if 0 < p < 1. (Check that P(A) = 0 if p = 0 or p = 1.) For which p is this probability maximal?
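As a sanity check of the formula P(A) = (1 − p)/(2 − p), one can simulate the experiment. The sketch below is my own illustration, not from the text: it repeats the toss-until-first-head experiment many times and estimates the probability that the number of tosses needed is even.

```python
# A minimal simulation sketch (not from the book) for Example 3.10:
# toss a coin with head probability p until the first head appears, and
# estimate the probability that the number of tosses needed is even.
import random

def tosses_until_first_head(p):
    """Number of tosses needed until the first head, with head probability p."""
    n = 1
    while random.random() >= p:  # this toss is tails, with probability 1 - p
        n += 1
    return n

def estimate_even_probability(p, repetitions=100_000):
    """Monte Carlo estimate of P(number of tosses needed is even)."""
    even = sum(1 for _ in range(repetitions)
               if tosses_until_first_head(p) % 2 == 0)
    return even / repetitions

p = 0.5
print(estimate_even_probability(p))  # should be close to (1 - p) / (2 - p)
print((1 - p) / (2 - p))             # exact value from Example 3.10: 1/3 for p = 0.5
```

Trying a few different values of p in this sketch also gives a feeling for the closing question above.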

© T H E A U T H O R A N D S T U D E N T L I T T E R AT U R

61


Dragi Anevski is senior lecturer in mathematical statistics at the Centre for Mathematical Sciences at Lund University. His main research area is in inference theory, in particular in nonparametric inference and in limit distributions for statistical functionals.

A Concise Introduction to Mathematical Statistics

This book gives a thorough introduction to mathematical statistics. The text is unique as an introductory text, mainly through its use of the Riemann-Stieltjes integral. This enables a unified treatment of basic concepts in probability and inference theory, in a mathematically rigorous manner, without the use of measure theory. The approach differentiates this book from other introductory texts, which do not give a unified approach to basic concepts, as well as from advanced texts, which give a unified approach relying on advanced mathematics. The treatment of probability theory differs from comparable books in that basic concepts are discussed rigorously but without the use of Lebesgue integration. This allows one to concentrate on the basic concepts of mathematical statistics, without sacrificing mathematical stringency.

The approach also enables a concise definition of the plug-in estimator in inference theory. Arguably, the plug-in estimator is the most natural and intuitive estimator possible. Its introduction is, however, mathematically advanced, and typically covered in PhD level texts. Using the Riemann-Stieltjes integral, the introduction becomes elementary.

The book is intended for students at the Faculty of Science and Faculty of Engineering who have taken a full year of basic mathematics courses, including real analysis and linear algebra.

Art.nr 39387

studentlitteratur.se

