Study Guide: Introduction to Regression Modeling
Joshuah Touyz
Draft date: November 28, 2009


Contents

Preface

1 A simple linear regression model
  1.1 Introduction
  1.2 Definitions and nomenclature
    1.2.1 µ(x) and its derivatives
    1.2.2 Model Nomenclature
    1.2.3 Gauss's Theorem and Summary
  1.3 Parameter Estimation
    1.3.1 Maximum Likelihood Estimation
    1.3.2 Least Squares Estimates

2 Linear Algebra and Matrix Review
  2.1 Introduction
  2.2 Important properties of linear combinations of random variables
  2.3 Bridging the gap between linear algebra and regression
  2.4 Using least squares to estimate β̂
  2.5 Concept of regression via a picture
  2.6 Estimators

3 Multivariate Normal
  3.1 The distribution of β̃ in the MLR case
  3.2 Distribution of residuals
  3.3 Prediction and mean distributions
    3.3.1 The distribution of ym
    3.3.2 The distribution of Yp
    3.3.3 Summary of distributions
    3.3.4 Putting it together with some examples

4 Confidence Intervals
  4.0.5 Problems with estimating (page 34)

5 Hypothesis Testing
  5.1 Introduction
  5.2 Some preliminary work
    5.2.1 Formulation of tests and conclusions
    5.2.2 Forms of hypothesis tests
    5.2.3 p-value
    5.2.4 Tests for testing hypotheses

6 ANOVA
  6.1 Introduction
  6.2 Some notation
  6.3 Models in ANOVA
    6.3.1 Intercept only model

7 Some notes for test 2


Preface

These notes are an attempt to provide some structure, in electronic form, to the University of Waterloo's actuarial science class Stat 331 (Introduction to Regression Modeling). They aim to be a comprehensive study pack for the concepts presented in class and will hopefully bring an increased level of clarity to students who may otherwise find the concepts difficult.


Chapter 1 A simple linear regression model

1.1 Introduction

A simple linear regression model is used to explain a response through a random linear relation. It has one predictor variable and is thus considered "simple". Mathematically it is written as:

y = µ + ε = β0 + β1 x + ε,   ε ~ N(0, σ²)   (1.1)

In the model above:

• β0 is the intercept
• β1 is the slope
• ε is the random error
• y is the response variable; it describes the distribution of the possible responses
• x is the explanatory variate (also called the regressor variable)

The regressor variable is under the experimenter's control. In reality, when dealing with data sets, the continuous linear model above reduces to a data plot, i.e.:

Remark 1.1.1   yi = µi + εi = β0 + β1 xi + εi,   εi ~ N(0, σ²).

The goal of linear regression is to fit continuous or mixed models to data points.



In the model it is further assumed that the random errors εi are independent and normally distributed with mean 0 (E[εi] = 0), variance σ² (Var[εi] = σ²) and covariance 0 (Cov(εi, εj) = 0 for i ≠ j). Since yi is a function of the random variable εi, it must also be a random variable; that is, the yi are drawn from a probability distribution that is normal with:

E[Yi] = E[β0 + β1 xi + εi] = β0 + β1 xi = µi
Var[Yi] = Var[β0 + β1 xi + εi] = Var[εi] = σ²
Yi ~ N(µi, σ²),   where µi = β0 + β1 xi

As mentioned above, the goal of linear regression is to fit models to data. To do so we must ascertain the parameters in the equation above. This is done through parameter estimation, but before doing so some definitions and nomenclature have to be introduced.
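These distributional facts can be checked with a short simulation. The sketch below uses invented parameter values (β0 = 1, β1 = 2, σ = 0.5, not from the notes) and verifies that the deviations of the responses from their means behave like N(0, σ²) errors:

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma = 1.0, 2.0, 0.5        # hypothetical parameter values
x = np.linspace(0.0, 10.0, 5000)           # explanatory variate, set by the experimenter
eps = rng.normal(0.0, sigma, size=x.size)  # random errors, iid N(0, sigma^2)
y = beta0 + beta1 * x + eps                # responses y_i = beta0 + beta1*x_i + eps_i

# Since E[Y_i] = beta0 + beta1*x_i and Var[Y_i] = sigma^2, the deviations
# y_i - (beta0 + beta1*x_i) should have mean ~0 and variance ~sigma^2:
dev = y - (beta0 + beta1 * x)
print(dev.mean(), dev.var())
```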

1.2 Definitions and nomenclature

It can sometimes get confusing when considering all the different variables that go into modelling. This section tries to elucidate some of the differences between random and non-random values, as well as contextualize the different variables used in the upcoming models. One thing to note is that although y is a random variable it is not capitalized; later on capital letters, especially Y, will be used to represent matrices. Below is a list of definitions (for completeness the definition of y is repeated alongside the others):

• y is the response variable; it describes the distribution of the possible responses. y is random.
• y1, y2, ..., yn are the response data values seen when drawing a sample. They are non-random.
• ŷ, sometimes written with subscripts ŷ1, ..., ŷn, is the estimated response for a built model.
• µ(x) is a function of the explanatory variates x1, ..., xn. It represents the average response. Note that x = [x1, ..., xn]ᵀ.
• The xi are independent variables of the model but are not random. We assume the experimenter sets the value of x.
• ε is the error term. It is assumed to be a random variable with ε ~ N(0, σ²). Its function is to describe the variability about the mean.
• y = β0 + β1 x + ε is a simple linear model, whereas y = β0 + β1 (x − x̄) + ε is a simple centered model.


1.2.1 µ(x) and its derivatives

µ(x) can have infinitely many forms since it depends on the sampled data points [x1, ..., xn]. Furthermore the interpretation of µ(x) varies depending on the model, for example:

µ(x) = β0 + β1 x + β2 x²       (quadratic form)
µ(x) = β0 + β1 x1 + β2 x2      (planar form with 2 different explanatory variates)

For the moment let's consider a simple linear model (µ(x) = β0 + β1 x) and apply interpretations to its derivatives. The βs: the derivative of µ(x) with respect to x yields

dµ(x)/dx = β1

Hence we can think of β1 as the change in µ(x) (and hence Y) for a unit change in x; that is, β1 can be thought of as the slope.

1.2.2 Model Nomenclature

Consider a response y and explanatory variates x1, x2, ..., xn. Then:

A first order model is defined as

y = β0 + β1 x1 + β2 x2 + ... + βn xn

There is only one x in each term and the exponents of the xs are all 1.

A second order model denotes a model having a combination of products xi xj, where i may equal j:

y = β0                                        (intercept term)
  + β1 x1 + β2 x2 + ... + βn xn               (first order terms)
  + β11 x1² + β22 x2² + ... + βnn xn²         (second order terms)
  + β12 x1 x2 + β13 x1 x3 + ...               (second order terms: interactions)
  + ε                                         (random error)

We can further expand on the order of models, but the most complicated model we will deal with for now is the second order model. Let's look further at interactions between components of a model. Consider the simple second order model with interactions:

y = β0 + β1 x1 + β2 x2 + β12 x1 x2 + ε,   ε ~ N(0, σ²)


If β12 = 0 the interaction between x1 and x2 does not exist. Our response becomes y = β0 + β1 x1 + β2 x2 + ε and is said to be additive with respect to x1 and x2. Interaction between components leads to a synergistic effect that is non-additive, since the two components together produce an even greater effect than the sum of their individual effects.

1.2.3 Gauss's Theorem and Summary

An important theorem for linear regression is Gauss's theorem.

Theorem 1.2.1 (Gauss's Theorem) Suppose we have n independent and identically distributed normal random variables Y1, ..., Yn with mean µ and variance σ². Given Z = Σ_{i=1}^n Yi, then Z ~ N(nµ, nσ²); that is, Z is normally distributed with mean nµ and variance nσ². Similarly, if Z = Σ_{i=1}^n ai Yi with ai ∈ ℝ, then Z ~ N(µ Σ_{i=1}^n ai, σ² Σ_{i=1}^n ai²). A generalization is developed later for the multivariate normal under the Gauss-Markov theorem.

To summarize:

• y = µ(x) + ε, ε ~ N(0, σ²) is a model, whereas
• ŷ = µ̂(x) = β̂0 + β̂1 x is an estimate for the model.
• y is normally distributed: Y ~ N(β0 + β1 x, σ²)
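Gauss's theorem can be illustrated numerically. The sketch below (the coefficients ai and parameters are invented for the example) checks that Z = Σ ai Yi has the stated mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0
a = np.array([1.0, -2.0, 0.5, 4.0, 1.5])   # arbitrary coefficients a_i

# 200,000 replications of Z = sum_i a_i * Y_i with Y_i iid N(mu, sigma^2)
Y = rng.normal(mu, sigma, size=(200_000, a.size))
Z = Y @ a

# Gauss's theorem: Z ~ N(mu * sum(a_i), sigma^2 * sum(a_i^2))
print(Z.mean(), mu * a.sum())
print(Z.var(), sigma**2 * (a**2).sum())
```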

1.3 Parameter Estimation

There are two well known methods to estimate parameters for simple linear regression models:

• the method of maximum likelihood, and
• the method of least squares.

These two methods can be shown to yield the same results for β0 and β1, but not for σ². The first method considered is the method of maximum likelihood.

1.3.1 Maximum Likelihood Estimation

In maximum likelihood estimation we consider the joint pdf of the data points with respect to the unknown parameters. The joint pdf, when considered from a parametric point of view, is known as the likelihood function, and for the simple linear regression model it may be written as:

L(β0, β1, σ² | y) = (1/√(2π))ⁿ σ⁻ⁿ exp[ −(1/(2σ²)) Σ_{i=1}^n (yi − µi)² ],   where y = (y1, ..., yn)   (1.2)

A related concept is the log-likelihood function, which as the name suggests is the log of the likelihood function:

ℓ(β0, β1, σ² | y) = ln L(β0, β1, σ² | y) = −(n/2) ln(2π) − n ln(σ) − (1/(2σ²)) Σ_{i=1}^n (yi − µi)²   (1.3)

Generally the log-likelihood function is easier to work with when deriving parameter estimates; this is because the logarithm is a monotone (bijective) transformation of the likelihood function, so both are maximized at the same point. The goal of MLE is to select the estimates of the parameters such that the likelihood function is maximized. To maximize with respect to one of the parameters we take the derivative of the function with respect to that parameter and set the resulting equation equal to 0. This is more easily done with the log-likelihood function; for the simple linear model the relevant derivatives are:

∂ℓ(β0, β1, σ² | y)/∂β0,   ∂ℓ(β0, β1, σ² | y)/∂β1,   ∂ℓ(β0, β1, σ² | y)/∂σ

Notice that maximizing with respect to β0 and β1 is equivalent to minimizing the distance (1/(2σ²)) Σ_{i=1}^n (yi − µi)². As we will see in the following section, this is in fact the method of least squares, and it yields the same parameter estimates of β0 and β1 as MLE; for σ² it will be slightly different. For now, the steps for finding an MLE are summarized in the following recipe:

1. Determine the distribution of Yi, f(yi). This will be provided through the model being used.
2. Assume y1, ..., yn are observed.
3. Find the joint pdf f(y1, ..., yn | β0, β1, σ²).
4. Reorganize the joint pdf to get the likelihood function L(β0, β1, σ² | y1, ..., yn).


5. Find ℓ(β0, β1, σ² | y1, ..., yn) by taking the ln of L(β0, β1, σ² | y1, ..., yn).
6. Differentiate with respect to each βi, i.e. find ∂ℓ(β0, β1, σ² | y1, ..., yn)/∂βi.
7. Set ∂ℓ(β0, β1, σ² | y1, ..., yn)/∂βi = 0 and solve for β̂i.

Example 1.3.1 Suppose we want to develop a relationship between two variables x and y through the model y = β0 + β1(x − x̄) + ε, where y is the response variate and x is the explanatory variate. Using MLE, determine estimates for β0, β1 and σ².

Solution 1.3.2

1. We follow the recipe, so the first thing to do is identify a distribution for yi. It is given by yi = β0 + β1(xi − x̄) + εi, where εi ~ N(0, σ²), so Yi ~ N(β0 + β1(xi − x̄), σ²).

2. Now find the likelihood function for β0, β1 and σ. Assuming we have n observations:

L(β0, β1, σ | y) = (1/√(2π))ⁿ σ⁻ⁿ exp[ −(1/(2σ²)) Σ_{i=1}^n (yi − β0 − β1(xi − x̄))² ],   where y = (y1, ..., yn)

3. Taking the ln of the function:

ℓ(β0, β1, σ | y) = −(n/2) ln(2π) − n ln(σ) − (1/(2σ²)) Σ_{i=1}^n (yi − β0 − β1(xi − x̄))²

4. Taking the derivative with respect to β0, β1 and σ:

∂ℓ/∂β0 = (1/σ²) Σ_{i=1}^n (yi − β0 − β1(xi − x̄)) = (nȳ − nβ0 − nβ1 x̄ + nβ1 x̄)/σ² = (nȳ − nβ0)/σ²

For β1:

∂ℓ/∂β1 = (1/σ²) Σ_{i=1}^n (yi − β0 − β1(xi − x̄))(xi − x̄)
       = (1/σ²) [ Σ_{i=1}^n yi(xi − x̄) − β0 Σ_{i=1}^n (xi − x̄) − β1 Σ_{i=1}^n (xi − x̄)² ]

Now for σ:

∂ℓ/∂σ = −n/σ + (1/σ³) Σ_{i=1}^n (yi − β0 − β1(xi − x̄))²

5. Setting the equations equal to 0 and solving for β̂0, β̂1 and σ̂:

(nȳ − nβ̂0)/σ² = 0 ⇒ β̂0 = ȳ

Σ_{i=1}^n yi(xi − x̄) − β̂1 Σ_{i=1}^n (xi − x̄)² = 0   (using Σ_{i=1}^n (xi − x̄) = 0)

⇒ β̂1 = Σ_{i=1}^n yi(xi − x̄) / Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n (yi − ȳ)(xi − x̄) / Σ_{i=1}^n (xi − x̄)² = Sxy/Sxx

−n/σ̂ + (1/σ̂³) Σ_{i=1}^n (yi − β̂0 − β̂1(xi − x̄))² = 0

⇒ σ̂² = (1/n) Σ_{i=1}^n (yi − β̂0 − β̂1(xi − x̄))² = W(y, x, β̂0, β̂1)/n

Note that Sab = Σ_i (ai − ā)(bi − b̄). This is different from the standard deviation and covariance, although they have similar formulas; for a simple linear model the sample estimates are given by:

ŝ = √(Sxx/(n − 1)),   Cov̂(X, Y) = Sxy/(n − 1)

There are some "tricks" that were used when solving above; they are useful and will reappear throughout the course. They are:

Σ_{i=1}^n (xi − x̄) = 0,   Σ_{i=1}^n xi = nx̄,   Σ_{i=1}^n yi(xi − x̄) = Sxy
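The closed-form solutions β̂0 = ȳ, β̂1 = Sxy/Sxx and σ̂² = W/n from the example translate directly into code. A sketch with invented data drawn from the centered model (the true values 2.0, 1.3 and 0.4 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 5.0, 100)
xbar = x.mean()
y = 2.0 + 1.3 * (x - xbar) + rng.normal(0.0, 0.4, 100)  # invented true values

# "Tricks": sum(x_i - xbar) = 0, sum(x_i) = n*xbar, sum(y_i*(x_i - xbar)) = Sxy
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum(y * (x - xbar))          # equals sum((y_i - ybar)(x_i - xbar))

b0_hat = y.mean()                     # beta0-hat = ybar
b1_hat = Sxy / Sxx                    # beta1-hat = Sxy/Sxx
W = np.sum((y - b0_hat - b1_hat * (x - xbar)) ** 2)
sigma2_mle = W / x.size               # the MLE divides W by n
print(b0_hat, b1_hat, sigma2_mle)
```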

Remark 1.3.3 The following remark addresses some issues that arise from confusion with notation:

• Parameters (β0, β1, ..., σ, ...): constant values representing the study population values (note that (study population) ⊆ (sample distribution)).
• Estimates (β̂0, β̂1, ..., σ̂): values (constants) found using the sample data; least squares or MLE are used to calculate them. Accordingly β̂i changes based on the sample obtained.
• Estimators (β̃0, β̃1, ..., σ̃): random variables. Every time a sample is chosen from the study population a different estimate is obtained; β̂i is only one realization of β̃i. Since β̃i is a random variable it has an associated pdf and cdf.

It should be noted that MLEs are often biased.

Definition 1.3.4 An estimator β̃i is said to be unbiased if E[β̃i] = βi.

Informally, an estimator is unbiased if over many samplings the average value of the estimates tends to the true value of the parameter.

1.3.2 Least Squares Estimates

Least squares estimation is based on minimizing the distances between the estimated regression model and the values of yi. A residual represents the distance between the observed yi and the estimated ŷi; it is often denoted r̂i and is a very useful concept in regression modelling. It is represented as:

r̂i = yi − ŷi,   or alternatively   r̂(xi) = y(xi) − ŷ(xi)

Accordingly, least squares estimation can be restated as "minimizing the sum of squared residuals with respect to the model parameters". Residuals are sometimes called errors. How might we minimize the distances? First a constraint is placed on the residuals, then an objective function is constructed, at which point Lagrangians can be used to find minimizing values for the various parameters in the model. For a simple linear model we have the following objective function along with its constraint:

min Σ_{i=1}^n r̂i² = Σ_{i=1}^n (yi − ŷi)²   subject to   Σ_{i=1}^n r̂i = 0   (1.4)

Several points to note:

• Generally when using Lagrangians with LS estimates only the first derivative is taken, to show that a critical point exists; it is assumed that this point is a minimum. To check this assumption the second derivative may be considered in the univariate case, or the Hessian in multivariate cases.
• W(β1, β2, ...) is often used to represent Σ_{i=1}^n r̂i².
• The assumptions for least squares are the same as those for MLE.

Below two examples are provided for clarity.

Example 1.3.5 Suppose we are given the model y = β + ε, ε ~ N(0, σ²). Then the least squares estimate for β is found by minimizing W with respect to β̂, subject to the constraint Σ_{i=1}^n r̂i = 0:

dW(y, β̂)/dβ̂ = d[Σ_{i=1}^n (yi − β̂)²]/dβ̂ = −2 Σ_{i=1}^n (yi − β̂) = 0 ⇒ β̂ = ȳ

Consequently the estimator is β̃ = Ȳ.

Example 1.3.6 Suppose we are given a slightly more complicated model: y = βx + ε. The least squares estimate β̂ is given by:

dW/dβ̂ = d[Σ_{i=1}^n (yi − β̂xi)²]/dβ̂ = −2 Σ_{i=1}^n (yi − β̂xi)xi = 0

⇒ Σ_{i=1}^n xi yi = β̂ Σ_{i=1}^n xi² ⇒ β̂ = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi²

In general the parameter estimates are the same for least squares as they are for maximum likelihood. An exception, however, is the variance estimate: σ̂²_LS ≠ σ̂²_MLE. It can further be shown that σ̂²_LS is an unbiased estimate whereas σ̂²_MLE is a biased estimate. The table below summarizes the idea:


Maximum Likelihood Estimates                       Least Squares Estimates
Goal: find the most probable β based on the data   Goal: find the βs that minimize W (the residuals)
Sometimes biased                                   Unbiased
σ̂² = W/n                                           σ̂² = W/(n − p)

Table 1.1: Comparison of MLE and LS estimates (n is the sample size, p is the number of parameters)
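The n versus n − p distinction in Table 1.1 can be seen by simulation: with a small sample, W/n systematically underestimates σ² while W/(n − p) does not. A sketch with invented values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 10, 2, 4.0                  # small n makes the bias visible
x = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), x])

mle, ls = [], []
for _ in range(20_000):
    y = 1.0 + 2.0 * x + rng.normal(0.0, np.sqrt(sigma2), n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    W = np.sum((y - X @ beta_hat) ** 2)    # sum of squared residuals
    mle.append(W / n)                      # MLE estimate (biased)
    ls.append(W / (n - p))                 # LS estimate (unbiased)

# E[W] = sigma^2 * (n - p), so E[W/n] = sigma^2 * (n-p)/n while E[W/(n-p)] = sigma^2
print(np.mean(mle), np.mean(ls))
```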


Chapter 2 Linear Algebra and Matrix Review

2.1 Introduction

In this section we review some necessary linear algebra, since much of regression works with matrices and vectors. A notational word before beginning: capital letters will denote matrices whereas boldfaced symbols will denote vectors. For example, A is a matrix whereas x is a vector. Three important types of matrices commonly used in regression are:

• Symmetric matrix: A = Aᵀ, where Aᵀ is the transpose of A
• Orthogonal matrix: AᵀA = AAᵀ = I, where I is the identity matrix
• Idempotent matrix: A² = A; note this means Aⁿ = A for all n ≥ 1 provided A is idempotent

The rest of this section brings together linear algebra and multivariate regression models.

2.2 Important properties of linear combinations of random variables

Let a1, a2, ..., an be constants and let Y1, Y2, ..., Yn be random variables. Furthermore, let a = [a1 a2 ... an]ᵀ represent a vector of constants and Y = [Y1, Y2, ..., Yn]ᵀ a random vector. Then given Z = Σ_{i=1}^n ai Yi, the following properties hold:


1. Expectation is a linear operator:

E[Z] = E[Σ_{i=1}^n ai Yi] = Σ_{i=1}^n ai E[Yi] = aᵀ E[Y]

2. If Σ represents the covariance matrix of Y then:

Var[Z] = Var[Σ_{i=1}^n ai Yi] = Σ_{i=1}^n Σ_{j=1}^n ai aj Cov(Yi, Yj) = aᵀ Σ a,   where Cov(Yi, Yi) = Var[Yi]

3. When the Yi, i ∈ {1, 2, ..., n}, are independent and identically distributed with common variance σ², Var[Z] reduces to:

Var[Z] = Var[Σ_{i=1}^n ai Yi] = aᵀ Σ a = σ² aᵀ a

These properties will turn out to be important for multiple linear regression models.
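Properties 1 and 2 are easy to check numerically. The sketch below uses an invented mean vector and covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
a = np.array([1.0, -1.0, 2.0])
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 3.0]])   # an invented covariance matrix

# Draw Y ~ MVN(mu, Sigma) many times and form Z = a^T Y
Y = rng.multivariate_normal(mu, Sigma, size=300_000)
Z = Y @ a

print(Z.mean(), a @ mu)          # property 1: E[Z] = a^T E[Y]
print(Z.var(), a @ Sigma @ a)    # property 2: Var[Z] = a^T Sigma a
```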

2.3 Bridging the gap between linear algebra and regression

The models we have worked with so far have all been simple. They are usually made up of three parts: a response, a deterministic part and a random part. Mathematically this looks like:

Response = Deterministic part + Random part ⇔ y = µ(x) + ε

This is somewhat sloppy. Realistically each response should be modelled by a formula that accounts for a single data point, which means that when all the responses are put together there will be n equations:

yi = µ(xi) + εi,   i = 1, ..., n

which leads to

y1 = µ(x1) + ε1
y2 = µ(x2) + ε2
...
yn = µ(xn) + εn

or in vector form

[y1, y2, ..., yn]ᵀ = [µ(x1), µ(x2), ..., µ(xn)]ᵀ + [ε1, ε2, ..., εn]ᵀ


Consider the simple linear model y = β0 + β1 x + ε along with its i = 1, ..., n data points. It is more accurately modelled by a matrix and a set of vectors:

y = µ(x) + ε = Xβ + ε,   where X = [1 x1; 1 x2; ...; 1 xn],   β = [β0, β1]ᵀ,   ε = [ε1, ..., εn]ᵀ

so that row i reads yi = β0 + β1 xi + εi. Similarly, for a quadratic model y = β0 + β1 x + β2 x² + ε the design matrix has rows [1, xi, xi²] and β = [β0, β1, β2]ᵀ, again giving Y = Xβ + ε.

A couple of things to note about the various vectors above:

1. A vector is a column vector, i.e. x = [x1, ..., xn]ᵀ.
2. y and ε are both random vectors.
3. X is not random, but is capitalized because it is a matrix.
4. The column of 1s represents the intercept with the y axis.

In single variable regression the simple linear model was denoted y = β0 + β1 x + ε; with estimates it became ŷ = β̂0 + β̂1 x. Notice that once we have an estimate for our model it no longer has a random component. In multivariate regression the notation for the estimate will be Ŷ = X β̂; notice again that in this estimate for Ŷ there is no random component.
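Building these design matrices explicitly is mechanical; a sketch with invented data values and pretend estimates:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.5])           # invented explanatory variates

# Simple linear model: rows [1, x_i]; the column of 1s carries the intercept
X_lin = np.column_stack([np.ones_like(x), x])

# Quadratic model: rows [1, x_i, x_i^2]
X_quad = np.column_stack([np.ones_like(x), x, x**2])

beta_hat = np.array([1.0, 2.0])              # pretend estimates (beta0-hat, beta1-hat)
y_hat = X_lin @ beta_hat                     # Y-hat = X beta-hat: no random component
print(X_quad.shape)                          # (n rows, p columns)
print(y_hat)                                 # [2.0, 3.0, 5.0, 8.0]
```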

2.4 Using least squares to estimate β̂

When it comes to using least squares with matrices there are two important rules that allow us to derive the appropriate estimates.

Rule 1: Let u and v be n × 1 vectors. Then their scalar inner product is given by

c = u · v = uᵀ v = Σ_{i=1}^n ui vi

Differentiating c with respect to the entries of u gives the vector

∂c/∂u = [∂c/∂u1, ..., ∂c/∂un]ᵀ = v

or, entry by entry, ∂c/∂ui = vi for i = 1, ..., n.

Rule 2: Let u be an n × 1 vector and A a symmetric n × n matrix. Then the scalar (quadratic) form of the triple product of u and A is given by

q = uᵀ A u

and the derivative of q with respect to u is given by

∂q/∂u = 2Au

Both these rules come in handy when taking matrix and vector derivatives. For now we summarize the differences between simple and multiple regression models, then go on to derive the estimate β̂ under multiple regression using least squares.

                            Simple Regression                Multiple Regression
Model                       y = µ(x) + ε = β0 + β1 x + ε     Y = Xβ + ε
Estimated line of best fit  ŷ = µ̂(x)                         Ŷ = X β̂
Residuals                   r̂ = y − ŷ                        R = Y − Ŷ
W (SSE)                     W = Σ_{i=1}^n r̂i²                W = R · R = RᵀR

Table 2.1: Simple versus multiple regression

Now to find β̂, expand W = RᵀR:

W = (y − X β̂)ᵀ(y − X β̂)
  = (yᵀ − β̂ᵀXᵀ)(y − X β̂)
  = yᵀy − yᵀX β̂ − β̂ᵀXᵀy + β̂ᵀXᵀX β̂


Taking the derivative with respect to β̂:

∂W/∂β̂ = −(yᵀX)ᵀ − Xᵀy + 2XᵀX β̂ = −2Xᵀy + 2XᵀX β̂

Setting this equal to 0 and solving for β̂:

XᵀX β̂ = Xᵀy ⇒ β̂ = (XᵀX)⁻¹Xᵀy

For the equation above to be valid XᵀX must be invertible; that is, X must be "full rank".

Definition 2.4.1 A matrix is said to be full rank if its columns are linearly independent.

Our model is Y = Xβ + ε, consisting of elements Yi; note that Yi is a random variable representing the ith response. β = [β1, ..., βp]ᵀ is a column vector of the parameters; these are constants from the study population. ε = [ε1, ..., εn]ᵀ is the vector of errors. X is a matrix of explanatory variates and each element of X is assumed to be a constant. As for the dimensions of X, we want the number of columns to equal the number of parameters (p), while the number of rows equals the number of data values (n). We really want n > p, but this does not always happen.

Example 2.4.2 Suppose we wanted to model the relationship between x and y through the simple linear regression model y = β0 + β1 x + ε, and we are given the following information:

x   1  2  3
y   3  1  2

Use the model to find the parameters β̂0 and β̂1 without matrices, then using matrices find the following:

(a) XᵀX
(b) (XᵀX)⁻¹
(c) (XᵀX)⁻¹Xᵀ
(d) β̂, using the results from (a)-(c)
(e) the estimated residuals (r̂1, r̂2, r̂3)
(f) How can it be checked that the residuals were calculated correctly?
(g) Recall that ŷ = X(XᵀX)⁻¹Xᵀy represents the line of best fit. If P = X(XᵀX)⁻¹Xᵀ, find P.

Solution 2.4.3 Without matrices, we can estimate β0 and β1 through least squares. Using the estimates from the previous examples (here x̄ = 2, ȳ = 2, Sxy = −1, Sxx = 2):

β̂1 = Sxy/Sxx = −1/2,   β̂0 = ȳ − β̂1 x̄ = 2 + 0.5(2) = 3

Finding β̂ with matrices: writing out our model in vector/matrix notation, each row reads yi = β0 + β1 xi + εi, so

X = [1 1; 1 2; 1 3],   y = [3, 1, 2]ᵀ

(a) XᵀX = [3 6; 6 14]

(b) (XᵀX)⁻¹ = 1/(3(14) − 6²) [14 −6; −6 3] = (1/6)[14 −6; −6 3]

(c) (XᵀX)⁻¹Xᵀ = (1/6)[14 −6; −6 3][1 1 1; 1 2 3] = [4/3 1/3 −2/3; −1/2 0 1/2]

(d) β̂ = (XᵀX)⁻¹Xᵀy = [4/3 1/3 −2/3; −1/2 0 1/2][3, 1, 2]ᵀ = [3, −1/2]ᵀ ⇒ ŷ = 3 − x/2

This is the same model we had without matrices.

(e) Now let's consider the residuals:

r̂1 = y1 − ŷ1 = 3 − (3 − 0.5(1)) = 0.5
r̂2 = y2 − ŷ2 = 1 − (3 − 0.5(2)) = −1
r̂3 = y3 − ŷ3 = 2 − (3 − 0.5(3)) = 0.5

(f) To see if these residuals are correct they must sum to 0 (based on our initial constraint):

Σ r̂i = 0.5 − 1 + 0.5 = 0

(g) To find P, all we need to do is multiply the matrix found in (c) by X:

P = X(XᵀX)⁻¹Xᵀ = (1/6)[5 2 −1; 2 2 2; −1 2 5]

There are a couple of things to note in this example with respect to the linear algebra and general properties of matrices in regression:

• P is symmetric: Pᵀ = (X(XᵀX)⁻¹Xᵀ)ᵀ = X((XᵀX)⁻¹)ᵀXᵀ = X((XᵀX)ᵀ)⁻¹Xᵀ = X(XᵀX)⁻¹Xᵀ = P
• P is idempotent: P² = X(XᵀX)⁻¹(XᵀX)(XᵀX)⁻¹Xᵀ = X(XᵀX)⁻¹Xᵀ = P
• Since P is both symmetric and idempotent it is a projection matrix.
• If the columns of X are not linearly independent then XᵀX will not be invertible (it is singular).
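The hand computations in Example 2.4.2, and the claimed properties of P, can be verified with a few lines of NumPy:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([3.0, 1.0, 2.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y           # beta-hat = (X^T X)^{-1} X^T y
P = X @ XtX_inv @ X.T                  # the projection matrix
resid = y - X @ beta_hat

print(beta_hat)                        # [3.0, -0.5], as derived by hand
print(resid, resid.sum())              # [0.5, -1.0, 0.5], summing to 0
print(np.allclose(P, P.T), np.allclose(P @ P, P))  # symmetric, idempotent
```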

2.5 Concept of regression via a picture

We project y onto R(X), the space spanned by the columns of the matrix X, using ŷ = Py. Note:

• ŷ lies in R(X)
• R(X) is denoted by the span of the columns of X
• When we minimize W = RᵀR we are really minimizing the length of the vector R
• When we set ∂W/∂β̂ = 0 we are making R perpendicular to R(X): R is smallest when it is orthogonal to R(X)


• This means every column of X is perpendicular to R, i.e. Xᵀ(y − ŷ) = XᵀR = 0:

Xᵀ(y − ŷ) = Xᵀy − XᵀX β̂ = Xᵀy − XᵀX((XᵀX)⁻¹Xᵀy) = Xᵀy − Xᵀy = 0

2.6 Estimators

Relating parameters, estimates and estimators can be tricky; graphically we can represent the relation something like this:

Study population parameters (σ², β1, ...) → (sampling + LS/MLE) → estimates (σ̂², β̂1, ...) → (treating the data as random) → estimators (σ̃², β̃1, ...)

How do you find an estimator? Let our parameter be θ and our estimate for θ be θ̂(y, x); then our estimator is θ̃(Y, x). Note that Y is capitalized when going from an estimate to an estimator. This is done because Y is random (Y depends on the random variable ε); x is assumed to be constant in this context. For clarification consider the table below:

Model          Estimate of β1                          Estimator of β1
Y = βx + ε     β̂1 = Σ_{i=1}^n yi xi / Σ_{i=1}^n xi²    β̃1 = Σ_{i=1}^n Yi xi / Σ_{i=1}^n xi²

We can construct confidence intervals and hypothesis tests for estimators. The method is straightforward:

1. Check whether the Gauss-Markov theorem holds.
2. Find the expected value of the estimator.
3. Find the variance of the estimator.

Suppose we are using the simple linear model Y = β0 + β1 x + ε, ε ~ N(0, σ²). It can be shown that our estimators under this model are:

β̃0 = Ȳ − β̃1 x̄   (2.1)
β̃1 = SxY/Sxx = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ) / Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n (xi − x̄)Yi / Σ_{i=1}^n (xi − x̄)²   (2.2)

Notice that β̃1 is random because it contains Y terms. Using the simple linear model above, we will derive confidence intervals and distributions for β̃1, β̃0 and σ̃².


Confidence interval and distribution of β̃1:

1. The Gauss-Markov theorem holds: β̃1 = Σ_{i=1}^n (xi − x̄)Yi / Σ_{i=1}^n (xi − x̄)² is a linear combination of random variables that are normal, so it is normal as well.

2. The expected value of β̃1 is:

E[β̃1] = E[(1/Sxx) Σ_{i=1}^n (xi − x̄)Yi] = (1/Sxx) Σ_{i=1}^n (xi − x̄)E[Yi] = (1/Sxx) Σ_{i=1}^n (xi − x̄)(β0 + β1 xi)
      = (β0 Σ_{i=1}^n (xi − x̄) + β1 Σ_{i=1}^n (xi − x̄)xi)/Sxx = β1 Sxx/Sxx = β1

Clearly β̃1 is unbiased.

3. The variance of β̃1:

Var[β̃1] = Var[(1/Sxx) Σ_{i=1}^n (xi − x̄)Yi] = (1/Sxx²) Σ_{i=1}^n (xi − x̄)² Var[Yi] = σ² Sxx/Sxx² = σ²/Sxx

Accordingly the confidence interval for β1 is given by β̂1 ± cσ√(1/Sxx).

Confidence interval and distribution of β̃0:

1. The Gauss-Markov theorem holds, as can be seen from β̃0 = Ȳ − β̃1 x̄.
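The facts E[β̃1] = β1 and Var[β̃1] = σ²/Sxx can be checked by repeatedly re-sampling responses from a fixed design; the parameter values below are invented:

```python
import numpy as np

rng = np.random.default_rng(11)
beta0, beta1, sigma = 1.0, 2.0, 1.5     # invented true values
x = np.linspace(0.0, 4.0, 20)           # fixed design points
Sxx = np.sum((x - x.mean()) ** 2)

b1_tilde = np.empty(50_000)
for k in range(b1_tilde.size):
    Y = beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)
    b1_tilde[k] = np.sum((x - x.mean()) * Y) / Sxx   # one realization of beta1-tilde

print(b1_tilde.mean(), beta1)              # unbiased: mean ~ beta1
print(b1_tilde.var(), sigma**2 / Sxx)      # variance ~ sigma^2 / Sxx
```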




Chapter 3 Multivariate Normal

Often we want to model in more than one dimension. The multivariate normal model is used to describe such measures. Suppose Yi ~ N(µi, σi²); then the multivariate normal density is given by:

f(y) = (2π)^(−n/2) |Σ|^(−1/2) exp[ −(1/2)(y − µ)ᵀ Σ⁻¹ (y − µ) ]   (3.1)

In the equation above Σ is the covariance matrix, y is the vector of the yi and µ is the vector of means. It is interesting to note that the covariance controls the shape of the contours of the distribution. Under independence the bivariate normal density factors as f(x, y) = f(x)f(y) and its covariance term is Cov(X, Y) = 0. In general, Cov(X, Y) = 0 says nothing about independence; for the multivariate normal, however, zero covariance does imply independence. We'll show this by proving it for the simplest example and infer that it holds for the rest.

Proof 3.0.1 Let Y = [Y1, Y2]ᵀ be a multivariate normal with no covariance terms, i.e. Y ~ MVN([µ1, µ2]ᵀ, Σ) with Σ = [σ1² 0; 0 σ2²]. Then |Σ| = σ1²σ2² and Σ⁻¹ = [1/σ1² 0; 0 1/σ2²]. We wish to show that in the case of an MVN a covariance of 0 implies independence (X⊥Y); by expanding f(y1, y2) this turns out to be the case:

f(y1, y2) = (1/(2π)) (σ1²σ2²)^(−1/2) exp[ −(1/2)( (y1 − µ1)²/σ1² + (y2 − µ2)²/σ2² ) ]
          = (1/(2πσ1²))^(1/2) exp[ −(y1 − µ1)²/(2σ1²) ] · (1/(2πσ2²))^(1/2) exp[ −(y2 − µ2)²/(2σ2²) ]
          = f(y1) f(y2)

So for a bivariate normal distribution, when the covariance is 0 independence is implied. To illustrate some of the properties of a multivariate normal distribution, an explicit case of a trivariate normal distribution is considered.

Example 3.0.2 Suppose Y ~ MVN(µ, Σ) where

µᵀ = [E[Y1], E[Y2], E[Y3]] = [1, 2, 3]

Σ = [ Var[Y1]      Cov(Y1,Y2)   Cov(Y1,Y3) ]   [ 6  −8   2 ]
    [ Cov(Y2,Y1)   Var[Y2]      Cov(Y2,Y3) ] = [−8   5  −1 ]
    [ Cov(Y3,Y1)   Cov(Y3,Y2)   Var[Y3]    ]   [ 2  −1   8 ]

Three questions can be asked when considering an MVN distribution:

1. What are the marginal distributions of the MVN?
   • Y1 ~ N(E[Y1] = 1, Var[Y1] = 6)
   • Y2 ~ N(E[Y2] = 2, Var[Y2] = 5)
   • Y3 ~ N(E[Y3] = 3, Var[Y3] = 8)

2. Find the distribution of S = Σ_{i=1}^n ai Yi, ai ∈ ℝ. For example S = Y1 − Y2 is normally distributed with µ = E[S] and σ² = Var[S].

3. Find E[S] and Var[S]. Using S = Y1 − Y2:

E[S] = E[Y1] − E[Y2] = 1 − 2 = −1
Var[S] = Var[Y1] + Var[Y2] − 2Cov(Y1, Y2) = 6 + 5 − 2(−8) = 27
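The three computations in Example 3.0.2 are just vector arithmetic; with a = [1, −1, 0] encoding S = Y1 − Y2:

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[ 6.0, -8.0,  2.0],
                  [-8.0,  5.0, -1.0],
                  [ 2.0, -1.0,  8.0]])

# Marginals: Y_i ~ N(mu_i, Sigma_ii)
print(list(zip(mu, np.diag(Sigma))))

# S = Y1 - Y2 corresponds to a = [1, -1, 0]; S ~ N(a^T mu, a^T Sigma a)
a = np.array([1.0, -1.0, 0.0])
print(a @ mu)          # E[S] = 1 - 2 = -1
print(a @ Sigma @ a)   # Var[S] = 6 + 5 - 2(-8) = 27
```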


3.1 The distribution of β̃ in the MLR case

Let U_{p×1} = A_{p×n} Y_{n×1} where A is a matrix of constants and Y is random. Then

\[
E[U] = E[AY] = A\,E[Y] \qquad Var[U] = A\,Var[Y]\,A^T
\]

Recall that β̂ = (XᵀX)⁻¹Xᵀy; the observed yᵢ are not random, so neither is β̂. The estimator β̃, on the other hand, is random, since the Yᵢ are random in β̃ = (XᵀX)⁻¹XᵀY. Accordingly we can find β̃'s distribution, its expectation and its variance:

\[
E[\tilde\beta] = E[(X^TX)^{-1}X^T Y] = (X^TX)^{-1}X^T E[Y] = (X^TX)^{-1}X^T X\beta
\]

since Y = Xβ + ǫ with ǫ ∼ N(0, σ²I),

\[
= (X^TX)^{-1}(X^TX)\beta = \beta
\]

so β̃ is unbiased. For the variance:

\[
\begin{aligned}
Var[\tilde\beta] &= Var[(X^TX)^{-1}X^T Y] = ((X^TX)^{-1}X^T)\,Var[Y]\,((X^TX)^{-1}X^T)^T \\
&= ((X^TX)^{-1}X^T)\begin{bmatrix} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{bmatrix}((X^TX)^{-1}X^T)^T \quad \text{since } \epsilon_i \perp \epsilon_j,\ i \neq j \\
&= \sigma^2((X^TX)^{-1}X^T)\,I\,((X^TX)^{-1}X^T)^T \\
\Rightarrow Var[\tilde\beta] &= \sigma^2(X^TX)^{-1}
\end{aligned}
\]

In the single-variate case, Gauss's theorem said a linear combination of normal random variables is also normal. Extending the idea to the multidimensional case gives the Gauss-Markov theorem:

Theorem 3.1.1 The Gauss-Markov theorem states that if Y is a multivariate normal with E[Y] = µ and Var[Y] = Σ, then the linear combination U_{p×1} = A_{p×n}Y_{n×1} satisfies U ∼ MVN(Aµ, AΣAᵀ) (provided AΣAᵀ is invertible, so that the density exists).

As in the univariate case, the idea is extended to β̃ and β̃ᵢ:

\[
\tilde\beta \sim MVN(\beta, \sigma^2(X^TX)^{-1})
\]

\[
\tilde\beta_i \sim N(\beta_i, \sigma^2(X^TX)^{-1}_{ii})
\]

3.2 Distribution of residuals

Recall that the estimated residual in the multivariate case was given by

\[
\hat R = y - \hat y = y - X\hat\beta
\]

However, the residuals change with each sample, so what is the distribution of the residual estimator? First write out the equation for the residual estimator:

\[
\tilde R = Y - X\tilde\beta
\]

R̃ is a linear combination of Y and β̃, so by the Gauss-Markov theorem R̃ is multivariate normal. We can find its expectation and variance:

\[
E[\tilde R] = E[Y - X\tilde\beta] = E[Y] - X E[\tilde\beta] = X\beta - X\beta = 0
\]

\[
\begin{aligned}
Var[\tilde R] &= Var[Y - X\tilde\beta] = Var[Y - X(X^TX)^{-1}X^T Y] \\
&= Var[Y - PY] = Var[(I-P)Y] = (I-P)\,Var(Y)\,(I-P)^T \\
&= \sigma^2(I-P) \qquad \text{since } (I-P) \text{ is a projection matrix} \Rightarrow \text{symmetric and idempotent}
\end{aligned}
\]

This means that the residual estimator has a normal distribution with

\[
Var[\tilde R] = \sigma^2(I-P) \qquad (3.2)
\]

This is different from what we would expect if R̃ were treated as an error term rather than a residual. The error term's distribution is ǫ ∼ MVN(0, σ²I), which is clearly different from R̃'s. The only time R̃ and ǫ share the same distribution is when P = 0.
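The two properties of I − P used in the variance derivation (symmetry and idempotence) can be checked numerically. A minimal sketch with an invented 3×2 design matrix, not from the text:

```python
import numpy as np

# Hypothetical 3x2 design matrix: an intercept column plus one covariate
X = np.array([[1.0, 1.0],
              [1.0, 4.0],
              [1.0, 6.0]])

# Hat (projection) matrix P = X (X^T X)^{-1} X^T and the residual-maker I - P
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(3) - P

# (I - P) is symmetric and idempotent, which is exactly what collapses
# (I - P) Var(Y) (I - P)^T into sigma^2 (I - P)
assert np.allclose(M, M.T)    # symmetric
assert np.allclose(M @ M, M)  # idempotent
```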

3.3 Prediction and mean distributions

Like the simple linear regression model, the multiple linear regression model is made up of a deterministic and a random part. The deterministic part (the line of best fit) can be thought of as a mean across the data. As before, let's split the model into three parts: the predictor function, which we'll call Y_p; the mean response y_m (which is equivalent to Xβ̃); and the random error ǫ:

\[
\underbrace{Y_p}_{\text{Response}} = \underbrace{y_m}_{\text{Deterministic part}} + \underbrace{\epsilon}_{\text{Random part}}, \qquad \epsilon \sim N(0, \sigma^2)
\]

We already know the distribution of ǫ, but we can now find the distributions of Y_p and y_m.

3.3.1 The distribution of y_m

The estimate of y_m is given by ŷ = Xβ̂, which means its estimator is ỹ = Xβ̃ (this can also be written y_m, but for clarity a tilde is included). Since ỹ is a linear combination of a MVN, by the Gauss-Markov theorem it is also a MVN with

\[
E[\tilde y] = E[X\tilde\beta] = X E[\tilde\beta] = X\beta
\]
\[
Var[\tilde y] = Var[X\tilde\beta] = X\,Var[\tilde\beta]\,X^T = \sigma^2 X(X^TX)^{-1}X^T = \sigma^2 P
\]

That is, ỹ_m ∼ MVN(Xβ, σ²P).

3.3.2 The distribution of Y_p

Y_p is the predictor function: it predicts the next most likely point to occur given x. Since Y_p, like y_m, is a linear combination of a multivariate normal, it is also a multivariate normal. Notice that a tilde is not included on top of Y_p; in this case it is not necessary, since it is explicit that Y_p is a random variable and represents an estimator. It would not be wrong to include a tilde on top of Y_p, and this is done in the summary table for completeness. Y_p's mean and variance are given below:

\[
E[Y_p] = E[X\tilde\beta + \epsilon] = X E[\tilde\beta] + E[\epsilon] = X E[\tilde\beta] = X\beta
\]
\[
Var[Y_p] = Var[X\tilde\beta + \epsilon] = Var[X\tilde\beta] + Var[\epsilon] = \sigma^2 P + \sigma^2 I = \sigma^2(P + I)
\]

Note that the variance of Y_p is larger than that of ỹ_m: it exceeds it by exactly σ²I, the contribution of the random error ǫ.

3.3.3 Summary of distributions

Several distributions have been presented, and keeping them straight from one another can be difficult. The tables below summarize the mean, variance and distribution of each estimator.

Estimator | Mean | Variance             | Distribution
ǫ         | 0    | σ²                   | Normal
β̃₁        | β₁   | σ²/s_xx              | Normal
β̃₀        | β₀   | σ²(1/n + x̄²/s_xx)    | Normal
β̃ᵢ        | βᵢ   | σ²(XᵀX)⁻¹ᵢᵢ          | Normal

Table 3.1: Summary of distributions of the simple linear model with independent and identically distributed ǫs


Estimator | Mean | Variance    | Distribution
ǫ         | 0    | σ²I         | Multivariate Normal
R̃         | 0    | σ²(I − P)   | Multivariate Normal
β̃         | β    | σ²(XᵀX)⁻¹   | Multivariate Normal
ỹ_m       | Xβ   | σ²P         | Multivariate Normal
Ỹ_p       | Xβ   | σ²(P + I)   | Multivariate Normal

Table 3.2: Summary of distributions of the multiple regression model with independent and identically distributed ǫs
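Table 3.2's claim that Var[β̃] = σ²(XᵀX)⁻¹ can be checked by simulation. A sketch with invented X, β and σ (none of these values come from the guide):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented design, coefficients and error standard deviation
X = np.array([[1.0, 1.0], [1.0, 4.0], [1.0, 6.0]])
beta = np.array([0.5, 1.0])
sigma = 2.0
XtX_inv = np.linalg.inv(X.T @ X)

# Repeatedly draw Y = X beta + eps and recompute the least squares estimate
betas = np.empty((20_000, 2))
for i in range(betas.shape[0]):
    Y = X @ beta + sigma * rng.standard_normal(3)
    betas[i] = XtX_inv @ X.T @ Y

# The empirical covariance of the estimates approaches sigma^2 (X^T X)^{-1}
print(np.cov(betas.T))
print(sigma**2 * XtX_inv)
```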

3.3.4 Putting it together with some examples

Example: Normal Suppose the model y = xβ + ǫ, ǫ ∼ N(0, σ²), is used to model the relationship between time x (in hours) and production y. Under this model β̃ ∼ N(β, σ²/∑ᵢ₌₁ⁿ xᵢ²) and the following information is provided with respect to x and y:

x | 1 | 4 | 6
y | 1 | 2 | 3

Several common questions are asked with respect to this type of model; we will answer them one at a time:

1. What is the mean production if an individual works 9 hours?
   The line of best fit is ŷ = xβ̂ = (27/53)x, so for 9 hours of work there will be ŷ = (27/53)·9 = 4.58 units of production (if this is required to be a whole number, there will be 4 units of production).

2. What is the predicted number of production units if a labourer works 5 hours?
   The predicted number of production units will be ŷ = xβ̂ = (27/53)·5 ≈ 2.55.

3. What is the distribution of the

   • mean: The estimator of the mean is ỹ_m = x_m β̃; since β̃ is normal, x_m β̃ is normal too by Gauss's theorem. The mean and variance are given by:

\[
E[\tilde y_m] = E[x_m\tilde\beta] = x_m E[\tilde\beta] = x_m\beta
\qquad
Var[\tilde y_m] = Var[x_m\tilde\beta] = x_m^2\,\frac{\sigma^2}{\sum_{i=1}^n x_i^2}
\]

   • predictive value: The estimator of the predictive value is given by ỹ_p = x_m β̃ + ǫ; as before this is a linear combination of normals, so by Gauss's theorem the predictive distribution is also normal. Its mean and variance are given by:

\[
E[\tilde y_p] = E[x_m\tilde\beta + \epsilon] = x_m E[\tilde\beta] = x_m\beta
\qquad
Var[\tilde y_p] = x_m^2\,\frac{\sigma^2}{\sum_{i=1}^n x_i^2} + \sigma^2 = \sigma^2\left(\frac{x_m^2}{\sum_{i=1}^n x_i^2} + 1\right)
\]
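The numbers in this example can be reproduced in a few lines. A minimal sketch:

```python
import numpy as np

# Data from the example: hours worked (x) and units produced (y)
x = np.array([1.0, 4.0, 6.0])
y = np.array([1.0, 2.0, 3.0])

# Least squares slope for the no-intercept model y = x beta + eps
beta_hat = (x @ y) / (x @ x)  # = 27/53

mean_9h = 9 * beta_hat  # mean production for 9 hours of work, approx 4.58
pred_5h = 5 * beta_hat  # predicted production for 5 hours, approx 2.55

print(beta_hat, mean_9h, pred_5h)
```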

Example: Multivariate normal Suppose that we wish to model the relationship between IQ, study hours and final grade. The following information for 5 students is provided:

IQ          | 104   | 126   | 102   | 115   | 107
Study Hours | 8.4   | 7.2   | 6.7   | 7.6   | 8.4
Grade       | 78.87 | 89.67 | 85.02 | 88.30 | 92.41

Table 3.3: Data for IQ, study hours and final grade

Given that we are using the model

\[
Y = \underbrace{\beta_0 + \beta_{IQ}(IQ) + \beta_{SH}(SH)}_{\mu(IQ,\,SH)} + \epsilon \qquad \text{where } \epsilon \sim N(0, \sigma^2)
\]

answer the following questions:

1. What is the least squares estimate for β = [β₀ β_IQ β_SH]ᵀ?
   The least squares estimate for β is given by β̂ = (XᵀX)⁻¹XᵀY, where

\[
X = \begin{bmatrix} 1 & 104 & 8.4 \\ 1 & 126 & 7.2 \\ 1 & 102 & 6.7 \\ 1 & 115 & 7.6 \\ 1 & 107 & 8.4 \end{bmatrix},
\qquad
Y = \begin{bmatrix} 78.87 \\ 89.67 \\ 85.02 \\ 88.30 \\ 92.41 \end{bmatrix}
\]

   So the least squares estimate for β is

\[
\hat\beta = [\hat\beta_0 = 57.52 \quad \hat\beta_{IQ} = 0.2560 \quad \hat\beta_{SH} = 0.1263]^T
\]

2. What is the mean grade an individual will achieve if their IQ is 110 and they study 9 hours?
   The individual's mean grade will be:

\[
\hat y_m = x_m^T\hat\beta = 57.52 + 0.2560(110) + 0.1263(9) = 86.8167 \approx 86.82
\]

3. What is the predicted grade for an individual if their IQ is 110 and they study 9 hours?
   The individual's predicted grade is the same point estimate:

\[
\hat y_p = x_p^T\hat\beta = 57.52 + 0.2560(110) + 0.1263(9) = 86.8167 \approx 86.82
\]

4. What is the distribution of the:

   • Mean: The estimator of the mean is Ỹ_m = x_mᵀβ̃. Since β̃ is multivariate normal, by the Gauss-Markov theorem so is Ỹ_m. Accordingly its mean and variance are given by:

\[
E[\tilde Y_m] = E[x_m^T\tilde\beta] = x_m^T E[\tilde\beta] = x_m^T\beta
\qquad
Var[\tilde Y_m] = x_m^T\,Var[\tilde\beta]\,x_m = \sigma^2\,x_m^T(X^TX)^{-1}x_m
\]

   (here x_m is a vector, so x_mᵀ(XᵀX)⁻¹x_m is a scalar, not the matrix P).

   • Predictive: The estimator of the predictor is Ỹ_p = x_pᵀβ̃ + ǫ. Since Ỹ_p is a linear combination of multivariate normals, by the Gauss-Markov theorem so is Ỹ_p. Since β̃ ⊥ ǫ, Ỹ_p's mean and variance are given by:

\[
E[\tilde Y_p] = E[x_p^T\tilde\beta + \epsilon] = x_p^T E[\tilde\beta] + E[\epsilon] = x_p^T\beta
\]
\[
Var[\tilde Y_p] = Var[x_p^T\tilde\beta] + Var[\epsilon] = \sigma^2\,x_p^T(X^TX)^{-1}x_p + \sigma^2
\]

5. Note: When both σ² and β are unknown they are replaced by their estimates, and the mean and predictive distributions become:

\[
\tilde Y_m \sim N(x_m^T\hat\beta,\ \hat\sigma^2\,x_m^T(X^TX)^{-1}x_m)
\qquad
\tilde Y_p \sim N(x_p^T\hat\beta,\ \hat\sigma^2\,x_p^T(X^TX)^{-1}x_p + \hat\sigma^2)
\]
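The least squares estimate in question 1 can be verified numerically. The sketch below uses np.linalg.lstsq instead of forming (XᵀX)⁻¹ explicitly, a standard numerical precaution:

```python
import numpy as np

# Design matrix (intercept, IQ, study hours) and grades from Table 3.3
X = np.array([[1, 104, 8.4],
              [1, 126, 7.2],
              [1, 102, 6.7],
              [1, 115, 7.6],
              [1, 107, 8.4]], dtype=float)
y = np.array([78.87, 89.67, 85.02, 88.30, 92.41])

# lstsq solves the least squares problem without inverting X^T X explicitly
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # approx [57.52, 0.256, 0.126]

# Mean/predicted grade for an individual with IQ 110 who studies 9 hours
x_new = np.array([1.0, 110.0, 9.0])
print(x_new @ beta_hat)  # approx 86.82
```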


Chapter 4

Confidence Intervals

Throughout the previous sections we discussed the distributions of estimators so that confidence intervals may be constructed. The goal of a confidence interval is explanatory: it provides an idea of plausible parameter values. That is, it gives an approximate interval where your parameter should land given that an experiment has been conducted. For example, suppose that Y is a normal random variable with unknown mean µ and known variance σ². To find the mean µ, a sample is drawn from the population to obtain an estimate µ̂. Having obtained the estimate µ̂, a confidence interval can be constructed so that the probability of the real µ falling within a certain range is (1 − p):

\[
\Pr\left( -z < \frac{\mu - \hat\mu}{\sigma} < z \right) = 1 - p
\]

Alternatively, by assigning a confidence level to the interval an artificial bound is set on how high or how low µ may plausibly be; the confidence interval can be derived from the equation above:

\[
\Pr\left( \left|\frac{\mu - \hat\mu}{\sigma}\right| < z_{\alpha/2} \right) = 1 - \alpha \iff \hat\mu \pm z_{\alpha/2}\,\sigma
\]

This is different from a probability interval, which uses the true µ, i.e. µ ± z_{α/2}σ.

Note The confidence level is a probability statement about the interval, not about µ: as soon as µ is replaced with µ̂, the statement is no longer a probability for µ but an estimate of a range for it.


4.0.5 Problems with estimating

When estimating the value of θ we do so by taking the expected value of the random variable θ̃. By constructing a probability interval we say that θ will fall in a certain range of values with a probability determined by c. Using the estimator θ̃, the interval is written as:

\[
E[\tilde\theta] \pm c\,(Var(\tilde\theta))^{1/2}
\]

This notation, however, assumes perfect knowledge of both E[θ̃] and (Var(θ̃))^{1/2}, which is not the case. When E[θ̃] and (Var(θ̃))^{1/2} are estimated, hats are affixed to them, that is Ê[θ̃] and V̂ar(θ̃). A problem, however, arises in the estimation of the standard error (V̂ar(θ̃))^{1/2}. To see why, consider the following:

• σ² is typically unknown.
• Since σ² is unknown it has to be estimated; when σ² is estimated, the normality assumption falls apart for a small sample (< 30).
• Suppose Xᵢ ∼ N(µ, σ²), so that Zᵢ = (Xᵢ − µ)/σ ∼ N(0, 1) is the standard normal for observation i. Then:
  a. Z₁, Z₂, ..., Z_d are independent and identically distributed N(0, 1) random variables;
  b. Z₁² ∼ χ²₁ is chi-squared with 1 degree of freedom;
  c. Z₁² + ... + Z_d² ∼ χ²_d is chi-squared with d degrees of freedom.

If Z ∼ N(0, 1) and c ∼ χ²_d with Z ⊥ c, then

\[
t_d = \frac{Z}{(c/d)^{1/2}}
\]

that is, the standardized quantity with σ̂ in place of σ follows a t-distribution with d degrees of freedom.
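The construction Z/(c/d)^{1/2} can be illustrated by simulation; a sketch with d = 10 degrees of freedom (the value of d and the sample size are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10          # degrees of freedom (our choice)
n = 200_000     # number of simulated draws

Z = rng.standard_normal(n)  # N(0, 1) draws
c = rng.chisquare(d, n)     # chi-squared draws with d degrees of freedom
T = Z / np.sqrt(c / d)      # t-distributed with d degrees of freedom

# A t_d variable is symmetric about 0 with variance d/(d-2) > 1,
# i.e. heavier-tailed than the standard normal
print(T.mean(), T.var(), d / (d - 2))
```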


Chapter 5

Hypothesis Testing

5.1 Introduction

In the engineering, biological and social sciences it is common to conduct hypothesis tests to check the validity of an idea. A hypothesis test cannot tell an individual that their idea is right; it can only tell them that it is wrong.

5.2 Some preliminary work

5.2.1 Formulation of tests and conclusions

Hypothesis tests are formulated in terms of two hypotheses: the null hypothesis (H₀) and the alternative hypothesis (H_a). The null hypothesis represents the idea which researchers wish to disprove in favour of the alternative (which is in fact the idea of interest). For example, suppose a hot-dog superstore believes that its customers eat at least 5 hot dogs per week. To check this assumption a hypothesis test is conducted. The hypotheses are given as:

H₀: µ ≤ 5 (customers eat 5 hot dogs or fewer per week)
H_a: µ > 5 (customers eat more than 5 hot dogs per week)

The hypothesis used to conduct the test is the null, not the alternative. This may seem counterintuitive, since we wish to prove that H_a is true. The thinking behind hypothesis testing is solipsistic, that is, "nothing can be proven with certainty; things can only be disproven with certainty." In that vein


statisticians arrive at one of two conclusions and use formal language to describe whether the null is rejected or not:

• Null hypothesis not rejected: If the data are insufficient to reject the null hypothesis, the statistician does not conclude that the alternative is false, but rather that there is currently insufficient evidence to reject the null hypothesis.

• Null hypothesis rejected: If, however, there is strong evidence in favour of the alternative hypothesis, the statistician notes that the null is rejected in favour of the alternative hypothesis.

5.2.2 Forms of hypothesis tests

There are only a limited number of ways in which the null and alternative hypotheses may be formulated with respect to one another. Convention often denotes the null hypothesis as H₀: θ = C₀, where θ is some test statistic and C₀ is the value to which it is compared. This method of presenting the null hypothesis is correct but incomplete, since it does not indicate whether the alternative is less than or greater than θ, in which case context defines the specific nature of H₀. For a complete description the less-than and greater-than signs should be included. Suppose θ is some statistic and C₀ is the value to which the statistic is being compared; the various ways of formulating a hypothesis test are outlined below:

Alternative Hypothesis | Null Hypothesis | Short-Hand Null Hypothesis
H_a: θ ≠ C₀            | H₀: θ = C₀      | H₀: θ = C₀
H_a: θ > C₀            | H₀: θ ≤ C₀      | H₀: θ = C₀
H_a: θ < C₀            | H₀: θ ≥ C₀      | H₀: θ = C₀

Table 5.1: Various ways of defining a hypothesis test

5.2.3 p-value

Assuming the null hypothesis is true, the p-value represents the probability of observing a test statistic at least as far (in absolute terms) from C₀ as the one obtained. The p-value is a useful gauge of whether or not the null hypothesis should be rejected given a certain confidence level.

Example Suppose Bob's hot dog house wishes to test whether customers eat more than 5 hot dogs on average per week. Given that the estimated average number of hot dogs eaten per week is 4.5 and the known standard deviation is 1.5, determine the p-value using a normal approximation.

Solution: The null hypothesis is H₀: µ ≤ 5, so with µ̂ = 4.5:

\[
p = \Pr\left( Z \geq \frac{\hat\mu - 5}{1.5} \right) = \Pr\left( Z \geq \frac{4.5 - 5}{1.5} \right) = \Pr(Z \geq -1/3) = \Phi(1/3) \approx 0.63
\]

Since p is not sufficiently small, the null hypothesis cannot be rejected.

Example Suppose Bob's hot dog house made a mistake and the estimated number of hot dogs should have been 8 instead of 4.5. Would this change the p-value sufficiently to conclude, at a 5% level of significance, that people eat more than 5 hot dogs per week?

Solution: As before, the p-value is computed as:

\[
p = \Pr\left( Z \geq \frac{8 - 5}{1.5} \right) = \Pr(Z \geq 2) = 1 - \Phi(2) = 1 - 0.9773 = 0.0227
\]

Since the p-value is less than 5%, the null hypothesis is rejected in favour of the alternative hypothesis.
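The two p-values above can be reproduced with the standard normal CDF. A minimal sketch using only the standard library (Phi and p_value_upper are our helper names):

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_value_upper(mu_hat, mu0=5.0, sigma=1.5):
    # One-sided p-value for H0: mu <= mu0 versus Ha: mu > mu0
    z = (mu_hat - mu0) / sigma
    return 1.0 - Phi(z)

print(p_value_upper(4.5))  # approx 0.63: cannot reject H0
print(p_value_upper(8.0))  # approx 0.023: reject H0 at the 5% level
```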

5.2.4 Tests for testing hypotheses

In the previous section on p-values, a procedure for approaching hypothesis tests was informally introduced. In this section a step-by-step recipe is developed. There is a slight difference in approach between known and unknown σ.

Hypothesis test recipe Suppose θ represents the test statistic of interest and C₀ is the value against which θ is being compared.

• Determine the hypotheses of interest: H₀: θ ≥ C₀, H₀: θ ≤ C₀ or H₀: θ = C₀.

• Use one of the formulas in Table 5.2 depending on whether or not σ is known. Notice that when σ is unknown it has to be estimated; consequently a t-distribution rather than a normal distribution is used to find the p-value. The dof (degrees of freedom) is determined by the number of observations, the number of parameters and the number of constraints placed on the model. In general the degrees of freedom are given by n − p + k, where n is the number of observations, p (not to be confused with the p-value) is the number of parameters and k is the number of constraints.

• Having identified the appropriate formula, plug in the values for θ̂, C₀ and either σ or σ̂, and find the p-value from either the normal or the t-table.


Null Hypothesis | Known σ (z = (θ̂ − C₀)/(σ/√n)) | Unknown σ (t = (θ̂ − C₀)/(σ̂/√n))
H₀: θ ≥ C₀      | p = Pr(Z ≤ z)                  | p = Pr(T_dof ≤ t)
H₀: θ ≤ C₀      | p = Pr(Z ≥ z)                  | p = Pr(T_dof ≥ t)
H₀: θ = C₀      | p = Pr(|Z| ≥ |z|)              | p = Pr(|T_dof| ≥ |t|)

Table 5.2: Hypothesis tests and formulas for known and unknown σ

• Draw a conclusion based on a significance level (either 0.1, 0.05, 0.025, 0.01 or 0.005). The significance level determines how the p-value is described:

1. Somewhat significant: 0.05 < p ≤ 0.1
2. Very significant: 0.025 < p ≤ 0.05
3. Highly significant: 0.01 < p ≤ 0.025
4. Extremely significant: 0.005 < p ≤ 0.01


Chapter 6

ANOVA

6.1 Introduction

The models dealt with so far were composed of a deterministic and a random part:

\[
Y = \underbrace{\mu(x)}_{\text{Deterministic}} + \underbrace{\epsilon}_{\text{Random}}
\]

Both the deterministic and random parts of the model contribute to the overall variability of Y. It is possible to split up these terms and consider the variances individually. ANOVA is an acronym for Analysis of Variance, and its purpose is to decompose the variability of Y into its constituent parts.

6.2 Some notation

Before looking at the various linear models under ANOVA, some new notation has to be introduced. Let Y = [Y₁ Y₂ ... Yₙ]ᵀ, Ŷ = [Ŷ₁ Ŷ₂ ... Ŷₙ]ᵀ and r = [r₁ r₂ ... rₙ]ᵀ, where rᵢ ∼ N(0, σ²), i ∈ {1, 2, ..., n}, for any linear model. Now recall that

\[
r = Y - \hat Y \iff Y = \hat Y + r
\]

Taking the quadratic form of both sides:

\[
Y^T Y = (\hat Y + r)^T(\hat Y + r) = r^T r + \underbrace{2\,r^T\hat Y}_{0} + \hat Y^T\hat Y
\]

\[
\underbrace{Y^T Y}_{SS(Tot)} = \underbrace{r^T r}_{SS(Res)} + \underbrace{\hat Y^T \hat Y}_{SS(Reg)}
\]

Instead of writing YᵀY, rᵀr and ŶᵀŶ we write SS(Tot), SS(Res) and SS(Reg). That is, we have split the total sum of squares into its constituent parts.

A Note on Dimension Suppose there are n values in our sample and that our model is made up of k parameters (with n > k). Then:

• Y has dimension n
• Xβ̂ has dimension k, since the matrix X has one column for each of the k parameters

Begin: Aside What are degrees of freedom? Degrees of freedom measure the ability of variables to take on non-specified values. Consider the following example:

Example Suppose we have 3 data values x, y and z that are unassigned, but we know that their average is some value c. Mathematically:

\[
\frac{x + y + z}{3} = c
\]

Then x, y and z are said to have 2 degrees of freedom, because if values were assigned to x and y then the value of z would be known. Similarly, if x alone were assigned a value, the degrees of freedom would decrease by 1, so y and z would have only one degree of freedom between them: assigning a value to either would implicitly assign a value to the other. When both x and y have assigned values, there are 0 degrees of freedom remaining. Knowing that fixing the average reduces the degrees of freedom, we can take another look at why an estimated σ, where the average is estimated, has a smaller number of degrees of freedom; compare the following:

\[
\hat\sigma^2 = \frac{\sum_{i=1}^n (y_i - \mu)^2}{n}
\qquad
\hat\sigma^2 = \frac{\sum_{i=1}^n (y_i - \bar y)^2}{n-1}
\]

That is, every time we place a linear constraint on the variables, the degrees of freedom are reduced by 1. In this case our linear constraint is \sum_{i=1}^n y_i / n = \bar y. End: Aside

6.3 Models in ANOVA

In the notation section SS(Tot), SS(Res) and SS(Reg) were introduced. Of particular importance is SS(Reg), which can be further simplified.


6.3.1 Intercept only model


Chapter 7

Some notes for test 2

• A confidence interval refers to the average expected value over many components; a prediction interval refers to a single or individual component. A confidence interval is based on the estimator

\[
\tilde Y_m = \tilde\mu(x), \qquad \tilde Y_m \sim N(\mu(x),\ \sigma^2\,x^T(X^TX)^{-1}x)
\]

where x = [x₁ x₂ ... xₙ]ᵀ is a vector of covariate values. Similarly, the prediction interval is based on the estimator

\[
\tilde Y_p = \underbrace{\tilde\mu(x)}_{\tilde Y_m} + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)
\]

so that Ỹ_p ∼ N(µ(x), σ²xᵀ(XᵀX)⁻¹x + σ²).

• When estimating the value of σ we use s to denote an estimate; alternatively σ̂ could be used. s² is the average of the residual sum of squares (also called the mean squared error, MSE) when we are conducting an ANOVA experiment, whereas σ̂ is known as the total standard deviation when ANOVA is not being used.

• ANOVA seeks to answer the question of whether β₁ = β₂ = ... = βₙ = 0; that is, ANOVA asks whether at least one slope value is pertinent to the prediction of the response variate.


• In ANOVA,

\[
r^T r = (y - X\hat\beta)^T(y - X\hat\beta)
\]

\[
\begin{aligned}
\hat Y^T \hat Y &= (X\hat\beta)^T(X\hat\beta) = (X\hat\beta)^T(X(X^TX)^{-1}X^T Y) \\
&= (X(X^TX)^{-1}X^T Y)^T(X(X^TX)^{-1}X^T Y) \\
&= Y^T X(X^TX)^{-1}\underbrace{(X^TX)(X^TX)^{-1}}_{I}X^T Y \\
&= Y^T X(X^TX)^{-1}X^T Y = \hat\beta^T X^T Y = Y^T X\hat\beta
\end{aligned}
\]

• The R-squared is a statistical measure of the linear relation between inputs and outputs. It is given by:

\[
R^2 = \frac{SS(Reg)}{SS(Tot)} = 1 - \frac{SS(Res)}{SS(Tot)}
\]

• The adjusted (or penalized) R-squared is a modified version of the R-squared which accounts for the degrees of freedom used by the parameters. It is given by:

\[
R^2_{adj} = 1 - \frac{SS(Res)/(n - p - 1)}{SS(Tot)/(n - 1)}
\]

• R² is a measure of the strength of linear association; it is not a goodness-of-fit measure. As you add inputs, R² increases regardless of whether or not those inputs are useful to the model.

• When dealing with a simple linear model, the R-squared value is just the square of the coefficient of correlation, corr(x, y)².

• The p-value is the probability of seeing a value as extreme as (or more extreme than) β̂ᵢ, assuming βᵢ = 0.

• The parameter σ² can be estimated via (n − 1)S²/σ² ∼ χ²_{n−1}; estimators behave in a similar manner, with (n − p − 1)σ̃²/σ² ∼ χ²_{n−p−1}. When the square root of a χ² random variable divided by its degrees of freedom appears in the denominator of a standard normal, the ratio becomes t-distributed with that many degrees of freedom. That is:

\[
\left( \frac{(n-p-1)\tilde\sigma^2/\sigma^2}{n-p-1} \right)^{1/2} = \frac{\tilde\sigma}{\sigma},
\qquad
\frac{\hat\beta_i - \beta_i}{s\,\sqrt{(X^TX)^{-1}_{ii}}} \sim t(n-p-1)
\]

This is important for the construction of confidence intervals.


• For a univariate simple linear model, SS(Tot) can be decomposed as follows:

\[
SS(Tot) = \sum (y_i - \bar y)^2 = \sum (y_i - \hat\mu_i + \hat\mu_i - \bar y)^2 = \underbrace{\sum (y_i - \hat\mu_i)^2}_{SS(Res)} + \underbrace{\sum (\hat\mu_i - \bar y)^2}_{SS(Reg)}
\]

SS(Reg) can be further decomposed (using β̂₀ = ȳ − β̂₁x̄):

\[
SS(Reg) = \sum_{i=1}^n (\hat\mu_i - \bar y)^2 = \sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 x_i - \bar y)^2 = \sum_{i=1}^n (\bar y - \hat\beta_1\bar x + \hat\beta_1 x_i - \bar y)^2 = \hat\beta_1^2 \sum_{i=1}^n (x_i - \bar x)^2
\]
• Note that: s2 = SS(Res) = M SE n−p−1 • An F distribution is the ratio of two χ2 distributions • Suppose we are interested in a linear combination of coeffecients i.e. θ = β0 + c1 β1 + c2 β2 + ... + cn βn and there everage effect. This is equivalent to Ym = cβ̂ with corresponding distribution N(θ, σ 2 c(X T X)−1 Xc). Similarly if we wanted to predict the next upcomgin result then we would have Yp = cβ̂ ∼ (θ, σ 2 c(X T X)−1 Xc + σ 2 ) • Note

