Introductory Econometrics


ECON5336 Introductory Econometrics
Department of Economics
The University of Texas - Arlington
Dr. Craig A. Depken, II
Spring 2007

Subject to Change and Alteration During the Semester
Caveat Emptor: Notes not guaranteed 100% accurate



Contents

1 Introduction
  1.1 What is econometrics?
  1.2 Statistics vs. Econometrics
  1.3 An example of econometric analysis
  1.4 Example: Income and Crime
2 Statistical Review
  2.1 Introduction
  2.2 Random Variables and Distributions
  2.3 The Moments of Distributions
  2.4 Basic Distributions
  2.5 Statistical Properties of Estimators
  2.6 Determining Unbiasedness
  2.7 Using STATA for basic statistics
    2.7.1 Starting STATA
    2.7.2 Reading data into STATA
    2.7.3 A Working Example: Student Spending and Voting Patterns
3 The Simple Regression Model
  3.1 Basic Notation and Assumptions
  3.2 Deriving the Estimators of the Simple Regression Model
  3.3 Example: Wage Model
  3.4 Properties of the OLS Estimators
  3.5 Hypotheses Testing
  3.6 Goodness of Fit Measures
  3.7 Example: The CAPM Model
4 Matrix Algebra
5 The Classical Model: Ordinary Least Squares
  5.1 Introduction and Basic Assumptions
  5.2 Aside: Matrix Differentiation
  5.3 Derivation of the OLS Estimator
  5.4 The Properties of the OLS Estimator
  5.5 Multiple Regression Example: The Price of Gasoline
  5.6 Multiple Regression Example: Software Piracy and Economic Freedom
6 Possible Problems in Regression
  6.1 Omitted Variables
  6.2 Measurement Error
  6.3 Multicolinearity
  6.4 Specification Errors
  6.5 Example: Hedonic Models
7 Functional Forms
  7.1 Linear vs. Non-linear models
  7.2 Log-Linear Functions
  7.3 Other Functional Forms
  7.4 Testing for the Appropriate Functional Form
  7.5 The Polynomial Model
  7.6 Example: Optimal Store Size
8 Dummy Variables and Functional Forms
  8.1 Dummy Variables
  8.2 Interaction Terms
  8.3 Time Trends
  8.4 Example: The Taft-Hartley Act of 1947
9 Hypothesis Testing
  9.1 Tests on a Single Parameter
  9.2 Comparing Two Parameters
  9.3 Tests Involving Multiple Parameters
    9.3.1 Statistical Significance of the Model
    9.3.2 The Significance of Advertising
  9.4 Example: University Library Staff and School Size
    9.4.1 Example: A Firm Production Function
  9.5 Some other tests
  9.6 Test for Structural Break
10 Generalized Least Squares
  10.1 Motivation
  10.2 Implications for the OLS Estimator
  10.3 Generalized Least Squares
  10.4 Feasible Generalized Least Squares
11 Heteroscedasticity
  11.1 Motivation
  11.2 Generalized Least Squares
  11.3 Weighted Least Squares
  11.4 Adjustment of Standard Errors
  11.5 A Specific Example of heteroscedasticity
  11.6 Tests for heteroscedasticity
    11.6.1 Park Test
    11.6.2 Goldfeld-Quandt Test
    11.6.3 Breusch-Pagan Test
    11.6.4 White Test
  11.7 General Corrections
  11.8 Example: Major League Baseball Attendance 1991-2003
12 Autocorrelation
  12.1 Motivation
  12.2 AR(1) Errors
  12.3 Tests for Autocorrelation
  12.4 Correcting an AR(1) Process
    12.4.1 What if ρ is unknown?
  12.5 Large Sample Fix
  12.6 Forecasting in the AR(1) Environment
  12.7 Example: Gasoline Retail Prices
  12.8 Example: Presidential approval ratings
13 Stochastic Regressors
  13.1 Instrumental Variables
  13.2 How to test for randomness in X?
  13.3 Example: Wages for Married Women
  13.4 Example: Cigarette smoking and birth weight
14 Seemingly Unrelated Regressions
  14.1 Several Examples
  14.2 The SUR approach
  14.3 Goodness of Fit and Hypothesis Testing
  14.4 Unbalanced SUR
  14.5 Example: Production Function and Cost
  14.6 Example: Automobile characteristics and price
15 Simultaneous Equations
  15.1 A Generalization
    15.1.1 Inverting a 2x2 Matrix
  15.2 The Identification Problem
  15.3 Estimation
  15.4 Example: Math and Science Test Scores
16 Additional Topics: Typically taught in more detail in ECON5329
  16.1 Limited Dependent Variables
    16.1.1 Linear Probability Model
    16.1.2 Probit Model
    16.1.3 Logit Model
  16.2 Panel Data
    16.2.1 Fixed Effects Model
    16.2.2 Random Effects Model
17 Short Review


1 Introduction

1.1 What is econometrics?

• Econometrics: A working definition of econometrics is the statistical and applied field that attempts to test economic theory using real-world data.
• Econometrics can be used in two different, but not necessarily mutually exclusive, ways. Econometrics can be used as a predictive tool, i.e., given hypothetical values of certain variables, what is the forecasted value of a variable of interest? Econometrics can also be used as an explanatory tool, to test or refute economic theory.
• The basic process of econometrics entails the following steps:
  1. Development of a testable hypothesis from economic theory
  2. Specification of the hypothesis in mathematical form
  3. Specification of the statistical or econometric model
  4. Data collection
  5. Parameter estimation
  6. Hypothesis testing, forecasting/prediction
  7. Policy analysis or behavioral choice
• Note: Econometrics is a mathematical field. The theories involved have no true economic meaning in the same sense as perfect competition or utility theory. However, econometric theory is often driven by economic theory.



1.2 Statistics vs. Econometrics

• Econometrics begins with an economic model, i.e., there is some a priori relationship which is assumed, based upon a theoretical model, which is then tested using real-world data.
• Statistics often just looks for correlations, analysis of variance, etc., but has little formal theoretical underpinning. Statistics often uses "descriptive" models which may or may not have much to do with formal economic models.
• For example, consider the paper by Pollard (available at the course web site) in which he investigates whether moving to a new stadium correlates with higher or lower winning percentages in professional baseball, basketball, and hockey.
• At the bottom of page 972 is a table of results in which he shows that the average home winning percentage is greater before moving than after moving.
• Pollard attributes these results to
  – Lack of familiarity with local playing conditions.
  – Playing in front of larger crowds makes players perform worse.
  – Players do not want to "protect their turf" as much in a new stadium.
• I don't argue that Pollard is wrong in his calculations. The calculations are easy to replicate.
• I might argue that his discussion omits several important points that might have changed his conclusions.
• Given this correlation, most economists would look for an economic explanation for it.


• Perhaps knowledge that attendance tends to increase when a team moves to a new stadium, regardless of the quality of the team, might provide an incentive to the team owner to reduce team quality.
• In other words, the drop in team winning in a new stadium might not be psychological or sociological but rather economic. Notice that the two theories are consistent with the same result, i.e., a drop in team performance. However, econometricians would argue that our economic models and econometric techniques allow us to determine which is the "true" cause of the decline in team winning.

1.3 An example of econometric analysis

• Consider a basic economic theory. The Law of Demand states that, ceteris paribus, as price increases the quantity demanded will decrease.
• Ceteris paribus means "all other things remaining the same." This is an easy assumption to invoke in theory-world, on the chalkboard, but it is an entirely different issue when it comes to the real-world data used to test the theory.
• By its nature, the law of demand is just a theory. Theories are made to be tested on real data. Thus, one part of econometrics is to test the theories that are developed on the chalkboard with data generated on the assembly line.
• Theoretically, the law of demand implies that there should be an inverse relationship between price and quantity demanded.
• Consider a demand function given as P = f(Q), or a specific linear form P = a + bQ, where b < 0 is theoretically predicted. In a theory class, we would simply draw a continuous demand curve which would be deterministic, with a one-to-one mapping between price and quantity, i.e., each price matches one and only one quantity. The demand curve would look like:

[Figure: a downward-sloping linear demand curve D drawn in (Q, P) space]

The picture is useful because we can determine the price (quantity) at a given quantity (price).
• If we are hired by a firm to estimate how quantity changes with a change in price, would we trust this picture? Would we base our reputation and our client's profitability on this picture alone? Probably not.
• Demand theory suggests that other variables can influence the quantity demanded. Variables such as income, preferences, weather, expected prices, prices of other goods, and population can all alter the quantity demanded at each and every price.
• Somehow, these issues have to be controlled for to accurately gauge the effect of price on quantity. This is the essence of what the econometrician tries to do (although it isn't often as simple as it sounds).
• The deterministic demand curve from theory class may not even exist as we claim. It is rare that economic data of any kind follow linear patterns in their natural state. Consider the annual normalized real property damages from hurricanes (1900-2004):



[Figure: adjusted real hurricane-related damages (millions of dollars), plotted annually from 1900 to 2004]

• Suppose a firm gathers price and quantity data for us over a period of six days. During this period the firm adjusts price and records the quantity demanded at the end of the day. The data and the subsequent deterministic demand curve might look like this:

  P   2   3   4   5   6   7
  Q   2   6   3   2   3   1


[Figure: the six price-quantity observations plotted and connected dot-to-dot, with quantity on the horizontal axis (2 to 6) and price on the vertical axis (1 to 7)]

If we graph the data and play connect-the-dots, we find that the deterministic function has a zig-zag shape. This does not look anything like the demand curve that we drew before.
• The zig-zag effect arises because there are multiple prices that have the same quantity associated with them, e.g., P = 2 and P = 5 both have Q = 2. Why?
• Do we really want to assume a deterministic function when we look at the data we have? Perhaps there are things that we must control for before we are happy with the demand model represented by the data we have gathered.
• What may be involved? We have mentioned some possibilities (income, population, preferences, etc.), but others could also exist.
• If the only data we have is price and quantity, then it seems that we have a bit of a problem. The deterministic approach is not going to get us to where we want to be. For instance, what price is profit maximizing for the firm? The zig-zag demand curve is difficult to analyze for these questions. Perhaps the deterministic approach is not the best to use.
• As an alternative, we introduce the idea of a stochastic approach. This is to say that there are some variables not controlled for that introduce some "randomness" to the deterministic picture we normally draw in theory class.
• This randomness is controlled for by including a random variable in the deterministic function in order to make it a stochastic function.
• Now instead of P = a + bQ we have P = a + bQ + ε, where ε is a stochastic (random) variable that controls for those variables we have not included in our story.
• The nice thing about ε is that it allows us to "fit" the theory we have to the data that we have collected. This introduces potential problems. Are the data the correct data to test the hypothesis (Morgenstern's argument, 1951)? Can we truly accept or refute a hypothesis?
• Why include the error term?
  1. There is an unpredictable element of randomness in human behavior.
  2. It captures the effect of a large number of omitted variables (measurable and unmeasurable).
  3. It captures measurement errors in the dependent and independent variables.
• Instead of the zig-zag deterministic function, we try to fit the theory to the data.[1] Thus, we get something like this:

[1] This is, perhaps, a dangerous statement. Some applied economists insist that best practice is to fit the data to the theory in order to test the theory. On the other hand, if the data are generated by an unknown, but stable, process, perhaps it is best to take the data as "given" and the theory as "speculative."



[Figure: scatter plot of the six price-quantity observations together with the fitted values from the estimated linear demand curve (quantity on the horizontal axis, price on the vertical axis)]

When we estimate α and β, we find α̂ = 5.932 and β̂ = −0.5062. How do we find α̂ and β̂? That is what we will do in this class, among other things. Not every data point lies on the stochastic demand curve. The difference between the "fitted" demand curve and the "actual" data is the realization of ε, denoted ε̂.
• Each actual data point has a different value of ε associated with it. This is why ε is called stochastic: it changes with each observation from reality.
• The main drive of econometrics is three-fold:
  1. What to put on the left-hand side (dependent variable or regressand)
  2. What to put on the right-hand side (independent variable(s) or regressor(s))
  3. How to control for ε and possible problems in ε.
• Note that (1) and (2) determine (3).
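As a preview of what is ahead, here is a minimal STATA sketch that reproduces these two estimates from the six observations above by entering the data by hand and regressing price on quantity. The variable names P and Q are chosen here for illustration.

clear
input P Q
2 2
3 6
4 3
5 2
6 3
7 1
end
regress P Q            // intercept approx. 5.93, slope approx. -0.51
predict Phat           // fitted values of P
predict ehat, resid    // fitted residuals, the realized values of epsilon-hat
list P Q Phat ehat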


• Recall: P = a + bQ + ε.
• Here we put price on the left-hand side, i.e., it is the dependent variable. This may or may not be an accurate assumption.
• We put quantity on the right-hand side, i.e., it is the independent variable. We may have problems here because we may not want to assume that quantity is independent in the model.
• Furthermore, perhaps there are more variables than quantity that influence price. Perhaps we want to include other independent variables on the right-hand side.
• We need to worry about ε: is it truly stochastic? What are its statistical properties, i.e., its distribution?
• The statistical properties of ε are determined by the right-hand side and left-hand side variables that the econometrician includes.
• We never really know what the actual ε looks like. Rather, we observe individual values of ε drawn from the overall distribution of ε. From the observed values of ε we attempt to determine the statistical properties of the actual ε.
• We will approach these three issues in this course. We start easy and progress to more sophisticated approaches. This is essentially a math class, but we will apply our knowledge to economic issues.
• Because this is a math class, we need to review some basic math that will be used in this class. Needed are basic matrix algebra and some basic statistical analysis.
• Note: When confused, always go back to the original premise of econometrics:



  Theory:         P = a + bQ
  Econometrics:   P = a + bQ + ε
                  P = â + b̂Q + ε̂     (hats indicate fitted or estimated values)
                  P̂ = â + b̂Q
                  ε̂ = P − P̂

• The "hats" imply an estimate of the theoretically based parameter. We don't really know what the actual a, b, and ε are. If we did, then the entire econometric exercise would be rather pointless. However, given data we can estimate a, b, and ε and try to make statistical inferences based upon these parameter estimates.
• It is important to remember that parameter estimates will change with the data that we have to use. In other words, the parameters we estimate are dependent upon the particular sample of data we have.
• Thus, we don't put too much literal faith in a particular estimate. One study is not the be-all end-all on the subject. However, there is a field of econometrics called meta-analysis which attempts statistical analysis of analyses.
• Rather, we look for consistency in results, e.g., that minimum wage increases cause an increase in unemployment. This is a general result, although there have been specific instances of violations of this general result.
• Parameter estimates are calculated using parameter estimators. An estimator is a function that is always the same. An estimate is the result of applying a particular data sample to the estimator function.
• This is an important point!!! The estimator is an innocent tool of the researcher.
• The estimator has NO idea of the context in which it is being used!!


• It is thus mandatory that the researcher not abuse the innocence of the estimator. Unfortunately this is not always the case.
• There are (at least) three types of data we encounter in econometrics, two of which are the focus of this course: cross-sectional data and time-series data.[2]
  1. Cross-sectional data is a snapshot in time of a number of different economic agents, be they individuals, households, firms, states, or countries. An example would be the demand for laptops in 2004 in the United States.
  2. Time-series data follows a particular variable over a length of time, e.g., the price of higher education in the United States from 1970 through 2005.
• For the most part, the econometric methodology we develop in this class is applicable to either time-series or cross-sectional data. There are some advanced issues when it comes to time-series data, but they are not part of this course. That said, we will address specific problems with these two types of data towards the end of the class.

[2] The third type is panel data, which combines cross-section and time-series data. Panel data is beyond the scope of this class.

1.4 Example: Income and Crime

• We might postulate that counties with higher per-capita incomes tend to have lower rates of crime per capita.
• We walk through the steps of econometric analysis:
  1. Theoretical hypothesis from economic theory: higher income ⇒ lower crime, i.e., crime is an inferior good (overall).
  2. Specification of the hypothesis in mathematical form:


     CRIME_i = f(INCOME_i), where f′(INCOME) < 0 and i indexes different counties.
  3. Specification of the statistical or econometric model:

     CRIME_i = α + β·INCOME_i + ε_i

  4. Data collection: County crime patterns for the 90 largest counties in the United States from the Bureau of Justice Statistics. The data have been saved as countycrime1.dta. The data include per-capita income, which I adjust to be measured in thousands, i.e., replace pcinc = pcinc/1000, and I also measure total crimes per 1,000 capita, i.e., gen crimepc = totcrime96/(pop96/1000).



  5. Estimation: Obtain α̂ and β̂ via the STATA command reg crimepc pcinc. The following results are obtained:

. reg crimepc pcinc

      Source |       SS       df       MS              Number of obs =      78
-------------+------------------------------           F(  1,    76) =    2.78
       Model |  2096.76525     1  2096.76525           Prob > F      =   0.099
    Residual |  57271.5435    76  753.572941           R-squared     =   0.035
-------------+------------------------------           Adj R-squared =   0.022
       Total |  59368.3088    77  771.016997           Root MSE      =   27.45

------------------------------------------------------------------------------
     crimepc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       pcinc |  -1.123589   .6735894    -1.67   0.099    -2.465159     .217980
       _cons |   83.68021   14.97283     5.59   0.000     53.85923     113.501
------------------------------------------------------------------------------

     – For the moment we will ignore most of the information in this table (although in a few weeks you will be able to understand all of the information presented here).
     – The intercept is 83.68, which implies that if income were zero, there would be approximately 84 crimes per 1,000 people in a given county.
     – For every 1,000 dollars in per-capita income, there are 1.12 fewer crimes per 1,000 people.
  6. Hypothesis testing, forecasting and prediction: for a given INCOME, how much crime will there be? If a county's income were $24,000 per year, the sample predicts a crime rate of approximately 83.68 − 1.124 × 24 ≈ 56.7 crimes per 1,000 people. You can also do this with the lincom command, which calculates linear combinations of parameters. In this case the command would be

. lincom _b[_cons]+24*_b[pcinc]


which would yield the following:

 ( 1)  24 pcinc + _cons = 0

------------------------------------------------------------------------------
     crimepc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   56.71407   3.459774    16.39   0.000     49.82334     63.6048
------------------------------------------------------------------------------

  7. Policy analysis or behavioral choice: Does more income reduce crime? Can wealth transfers help reduce crime?
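Collecting the commands above, a minimal do-file sketch for this example might look like the following. It assumes the dataset countycrime1.dta (with the variables pcinc, totcrime96, and pop96 named as in the text) is in the current working directory.

clear
use countycrime1.dta
replace pcinc = pcinc/1000              // per-capita income in thousands of dollars
gen crimepc = totcrime96/(pop96/1000)   // crimes per 1,000 people
reg crimepc pcinc                       // estimate alpha-hat and beta-hat
lincom _b[_cons] + 24*_b[pcinc]         // predicted crime rate at $24,000 per-capita income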



2 Statistical Review

2.1 Introduction

• The basis of econometrics is ε, which is a stochastic or random variable. Therefore, ex ante (before estimation) we know little about ε. Part of econometric analysis is to discover more about the statistical properties of ε. With knowledge of ε we can control for several possible problems.
• In statistics it is common to invoke or assume normality. This may or may not be a good assumption in applied economics because of the unique nature of economic data. Often we will invoke the normality assumption, but it is not always valid.
• There are possible problems with economic data that may not be conducive to normality, e.g., one-sided variables, non-continuous variables, etc.
  1. Price and quantity are often one-sided variables. That is, negative quantity or price doesn't make much sense.
  2. Income or profit can be two-sided. That is, an individual may have negative income or a firm a negative profit (i.e., a loss).
  3. Race and gender are discontinuous. Except in rare instances, an individual's race and gender never change. One is either male or female, not some combination of the two.
  4. Education is another good example of a non-continuous variable. Most people do not obtain continuous levels of formal education. Some may drop out of high school and obtain, say, nine or ten years of education. College graduates get 16 years of education (approximately) and graduate students upwards of 21 years. Note that the education level is incremented by units of one (normally) and eventually reaches some maximum level.
  5. Some variables are difficult to measure. For example, how does one measure motivation, preferences, laziness, happiness, or success? At times theory employs variables that may be impossible to measure objectively.
• Some statistical review is necessary for this course.

2.2 Random Variables and Distributions

• If X is a random variable (r.v.) then it is the result of an experiment in which the outcome is unknown ex ante, i.e., before the experiment is undertaken.
• The result of tossing a coin is a simple example. Before we toss the coin we know the possible outcomes: heads, tails, perhaps neither (i.e., landing on the side of the coin, or failing to return to earth, however improbable). The point is that before the coin toss, we don't know the outcome that will actually happen.
• A discrete random variable is one in which the set of possible outcomes is finite (e.g., rolling a die).
• A dummy variable is a special type of discrete variable which is restricted to one of two possible values, e.g., Male = 1 and Female = 0.
• A continuous random variable is one in which the set of possible outcomes is infinite (e.g., time, income, etc.).
• A probability distribution is a listing of all values of X and their associated probabilities:

  f(x) = Prob(X = x), where 0 ≤ Prob(X = x) ≤ 1 and Σ_i f(x_i) = 1

• In a continuous context, Prob(X = x) = 0. Thus we need another way to describe the distribution of X.
• A Probability Density Function (PDF) is such that f(x) ≥ 0 and

  Prob(a ≤ X ≤ b) = ∫_a^b f(x) dx ≥ 0,

or the probability that X falls within a given range.

[Figure: a bell-shaped probability density function f(x) with the area under the curve between a and b representing Prob(a ≤ X ≤ b)]

and

  ∫_{−∞}^{+∞} f(x) dx = 1.


• A Cumulative Density Function (CDF) is Prob(X ≤ a) ≡ F(a).

[Figure: a density function with the area to the left of a shaded, illustrating the cumulative density function]

Note: Prob(X = a) = 0, so that

  Prob(X ≤ a) = Prob(X < a) ≡ F(a)
• For a discrete random variable: If we rank all possible values of X along the real number line,

F (X) =

X

f (x) and thus f (xi ) = F (xi ) − F (xi−1 )

X≤x

• For a continuous random variable: Z

X

F (X) =

f (t)dt and f (x) = −∞

where F (X) satisfies the following: 1. 0 ≤ F (X) ≤ 1 2. if x > y then F (x) ≥ F (y) 3. F (+∞) = 1 4. F (−∞) = 0 5. P rob(a ≤ X ≤ b) = F (b) − F (a)

18

dF (x) dx


• Some basic rules on summation:
  1. General definition: Σ_{i=1}^{N} x_i = x_1 + x_2 + x_3 + . . . + x_N
  2. If k is a constant then Σ_{i=1}^{N} k = Nk
  3. If k is a constant then Σ_{i=1}^{N} k x_i = k Σ_{i=1}^{N} x_i
  4. If x and y are both random variables, Σ_{i=1}^{N} (x_i + y_i) = Σ x_i + Σ y_i
  5. Σ X_i Y_i ≠ (Σ X_i)(Σ Y_i) unless all X_i or Y_i are equal to zero.

2.3 The Moments of Distributions

• Much of statistical analysis centers on what are called "moments," which are characteristics of distributions.
• The first four moments are:
  1. 1st moment: mean or average
  2. 2nd moment: variance, or dispersion around the mean
  3. 3rd moment: skewness, or asymmetry of the distribution
  4. 4th moment: kurtosis, or the peakedness of the distribution


• The mean of a random variable X is µ_x:

  µ_x = Σ_i x_i f(x_i), where f(x_i) = Prob(X = x_i)

If all observations have equal probability, then f(x_i) = 1/N and the mean can be rewritten as

  µ_x = X̄ = (1/N) Σ_i x_i

Example:

  x_i            1     2     3
  Pr(X = x_i)   0.3   0.5   0.2

The mean of X is

  X̄ = Σ_i x_i f(x_i) = (1)(0.3) + (2)(0.5) + (3)(0.2) = 1.9

What if we mistakenly treated all possible x_i as having the same probability?

  X̄ = (1/N) Σ_i x_i = (1/3)(6) = 2

• This is an example of possible differences between small and large samples. If we had a large sample such that we could safely assume each possible outcome had equal probabilities, then the differences between the two calculations may not be very large.
• However, in a small sample, individual probabilities may differ and the effect of mistakenly assuming equal probabilities can be dramatic.
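The same two calculations can be checked in STATA, whose display command works as a simple calculator:

display 1*0.3 + 2*0.5 + 3*0.2    // probability-weighted mean: 1.9
display (1 + 2 + 3)/3            // equal-weight mean: 2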



Here, we were off by about 5% when we mistakenly assumed equal probabilities. Some might think this isn't a big deal. However, what if we are predicting GDP or the amount of fuel needed to reach the planet Mars?
• We often use the expectation operator to deal with the mean. The expectation operator is a short-cut notation for the mean and is denoted with E[·].
• Some rules for expectation operators:
  1. If X is a random variable, E[X] = Σ_i x_i f(x_i), where f(x_i) = Prob(X = x_i). Another way to interpret the expectation operator is that it replaces the summation notation of the mean, i.e., E[·] = (1/N) Σ [·].
  2. If a is a constant then E[a] = a.
  3. If a and b are constants and X is a r.v., then E[a + bX] = E[a] + E[bX] = a + bE[X].
  4. If a is a constant then E[(aX)²] = E[a²X²] = a²E[X²].
  5. If X and Y are r.v.'s then E[X + Y] = E[X] + E[Y].
• The variance of a random variable X is a measure of the dispersion of the values around the mean. The higher the variance, the more dispersed the values are around the mean. The variance is defined as

  var(X) = σ_x² = (1/N) Σ (x_i − X̄)² = E[(X − E[X])²]

Note: The standard deviation is often used in place of the variance. The standard deviation is the positive square root of the variance, σ_x = √(σ_x²).
• The covariance of two random variables measures the linear association between the two variables, assuming that we have the same number of observations for X and Y:

  cov(X, Y) = E[(X − X̄)(Y − Ȳ)]

  cov(X, Y) = Σ_{i=1}^{N} Σ_{j=1}^{N} ρ_{ij} (x_i − X̄)(y_j − Ȳ)

where ρ_{ij} is the probability of both x_i and y_j occurring at the same time.
• If the probability that any particular X_i and Y_i are chosen simultaneously is equal (this might not always be the case) then the covariance can be rewritten as:

  cov(X, Y) = (1/N) Σ (x_i − X̄)(y_i − Ȳ)

Note: If X ≡ Y then cov(X, Y) ≡ var(X).
• Some general results on covariance:
  1. If X and Y are always above and below their means at the same time, then the covariance is positive.
  2. If X is above its mean when Y is below its mean, and vice versa, then the covariance is negative.
  3. The value of the covariance depends on the units in which X and Y are measured; it is not scale free.
  4. To standardize the covariance, we have the correlation coefficient:

       ρ(X, Y) = cov(X, Y) / (σ_x σ_y)

  5. If X and Y are independent, then cov(X, Y) = 0; the correlation coefficient always lies in [−1, +1].
  6. Zero covariance does not necessarily mean independence. Assume the following probability distribution for X and Y:

       X          -2    -1     0     1     2
       Y           4     1     0     1     4
       ρ_{x,y}    0.2   0.2   0.2   0.2   0.2

     [Figure: the five (X, Y) points plotted in the plane trace out the parabola Y = X²]

     The mean of X is 0 and the mean of Y is 2. The covariance is

       cov(X, Y) = Σ x_i (y_i − 2) = (−2)(2) + (−1)(−1) + (0)(−2) + (1)(−1) + (2)(2) = 0,

     although the relationship between X and Y is clearly Y = X². This is because the covariance measures only the linear association between X and Y.

• Let's look again at the variance using the expectation operator:

  var(X) = (1/N) Σ (x_i − X̄)²   where X̄ ≡ E[X] = (1/N) Σ x_i
         = (1/N) Σ (x_i² − 2x_i X̄ + X̄²)
         = (1/N) [ Σ x_i² − 2X̄ Σ x_i + N X̄² ]
         = (1/N) Σ x_i² − 2X̄ (1/N) Σ x_i + X̄²
         = E[X²] − 2X̄ X̄ + X̄²

  var(X) = E[X²] − E[X]²

• We can derive the variance of X differently:

  var(X) = E[(X − E[X])²]
         = E[X² − 2X E[X] + E[X]²]
         = E[X²] − 2E[X]E[X] + E[E[X]²]
         = E[X²] − 2E[X]² + E[X]²

  var(X) = E[X²] − E[X]²

Note that regardless of how we set up the problem we end up with the same result.
• What if we take the variance of Z = a + cX?

  var(Z = a + cX) = E[((a + cX) − E[a + cX])²] = c² var(X)

• Lemma: Σ (x_i − E[X]) = 0.
  Proof:

    Σ x_i − Σ E[X] = 0                      (divide both sides by N)
    (1/N) Σ x_i − (1/N) Σ E[X] = 0          (note that E[X] is a constant)
    E[X] − (N/N) E[X] = 0

• Theorem: cov(X, Y) = E[XY] − E[X]E[Y].
  Proof:

    cov(X, Y) = (1/N) Σ (x_i − E[X])(y_i − E[Y])
              = (1/N) Σ (x_i y_i − x_i E[Y] − E[X] y_i + E[X]E[Y])
              = (1/N) Σ x_i y_i − (1/N) Σ x_i E[Y] − (1/N) Σ E[X] y_i + (1/N) Σ E[X]E[Y]
              = (1/N) Σ x_i y_i − E[Y] (1/N) Σ x_i − E[X] (1/N) Σ y_i + (N/N) E[X]E[Y]
              = E[XY] − E[Y]E[X] − E[X]E[Y] + E[X]E[Y]

    cov(X, Y) = E[XY] − E[X]E[Y]

• We can go about deriving the covariance a different way:

  cov(X, Y) = E[(X − E[X])(Y − E[Y])]
            = E[XY − X E[Y] − E[X] Y + E[X]E[Y]]
            = E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y]

  cov(X, Y) = E[XY] − E[X]E[Y]

• A few other formulas that will be useful in the future:

  var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
  var(X − Y) = var(X) + var(Y) − 2 cov(X, Y)
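A quick numerical check of item 6 above: entering the five equally likely points by hand in STATA, the estimated covariance between X and Y is exactly zero even though Y = X² holds exactly. The variable names x and y are chosen here for illustration.

clear
input x y
-2 4
-1 1
 0 0
 1 1
 2 4
end
corr x y, cov    // the off-diagonal element (the covariance) is zero
corr x y         // so the correlation is zero as well, despite y = x^2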

2.4 Basic Distributions

We try to estimate the values of unknown parameters. However, estimation is based on statistical distributions. There are several basic distributions that we will use in this course:

1. Normal Distribution
   • The "Bell Curve" that you may have seen in other classes.

   [Figure: a bell-shaped normal density f(x) centered at E[X], with the tails beyond E[X] − 2σ_x and E[X] + 2σ_x each shaded as Area = 0.025]

   • The Normal distribution is convenient because it is (i) symmetric and (ii) fully described by the mean and variance.
   • The probability that a single observation will be within two standard deviations of the mean is approximately 0.95.
   • The probability that a single observation will lie within 2.5 standard deviations of the mean is approximately 0.99.
   • If two or more random variables are normally distributed with identical means and variances, then any weighted sum of these variables is also normally distributed.
   • The Normal distribution is denoted by X ∼ N(µ_x, σ_x²), with

       Prob(X = x_i) = (1/√(2πσ²)) exp[ −(1/(2σ²)) (x_i − µ)² ]

2. Standard Normal: X ∼ N(0, 1)
   • It is convenient that the normal distribution is preserved in linear transformations: if X ∼ N(µ, σ²) then a + bX ∼ N(a + bµ, b²σ²).
   • One particularly useful transformation is to let a = −µ/σ and b = 1/σ; then

       Z = (X − µ)/σ ∼ N(0, 1)

     with a density function of

       φ(z) = (1/√(2π)) exp(−z²/2)

     Note: φ(z) is a density function, Φ(z) is a cumulative density function.

3. Chi-squared Distribution
   • A useful distribution for testing hypotheses related to the variances of random variables.
   • The sum of the squares of N independently distributed standard normal variables is distributed as chi-squared with N degrees of freedom.
   • As the degrees of freedom get larger, the chi-squared distribution approximates the normal.
   • Let z ∼ N(0, 1). If n values of z are drawn at random, squared, and summed, the resultant statistic has a chi-squared distribution (χ²) with n degrees of freedom: (z_1² + z_2² + · · · + z_n²) ∼ χ²(n).
   • Note: E[χ²(n)] = n and var(χ²(n)) = 2n.

4. t Distribution
   • Also known as Student's t Distribution.
   • Developed by William S. Gossett (1908), who worked for Guinness Brewing. He was not allowed to publish under his own name.
   • This is a useful distribution when the variance is unknown.
   • The t distribution is defined in terms of a standard normal variable and an independent χ² variable: z ∼ N(0, 1) and y ∼ χ²(n), where z and y are independently distributed (i.e., their covariance is zero). Then

       t = z√n / √y

     has the Student's t distribution with n degrees of freedom. The t distribution is symmetric around zero and asymptotically approaches the standard normal distribution.
   • We use the t distribution to test whether the mean of any random variable is equal to a particular number, even when the variance of the random variable is unknown.
   • As N → ∞ the t distribution becomes the standard normal distribution.

5. F Distribution
   • Often we want to test joint hypotheses involving two or more regression parameters. An example would be testing whether two slope parameters are both equal to zero at the same time.
   • The F distribution is defined in terms of two independent χ² variables. Chi-squared variables are based upon the variance.
   • Let y_1 and y_2 be independently distributed χ² variables with n_1 and n_2 degrees of freedom, respectively. Then the statistic

       F = (y_1/n_1) / (y_2/n_2)

     has the F distribution with (n_1, n_2) degrees of freedom. Note that there are n_1 degrees of freedom in the numerator and n_2 degrees of freedom in the denominator.
   • If we square the expression for t we obtain

       t² = (z²/1) / (y/n)

     where z² is the square of a standard normal variable, and thus is distributed as χ²(1). Note that t² is the ratio of two χ² variables and thus t² ∼ F(1, n).
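These distributions are built into STATA, which makes it easy to verify the relationships above numerically. The sketch below uses the standard inverse-distribution functions (invnormal, invttail, invFtail); the degrees of freedom are chosen arbitrarily for illustration.

display invttail(30, 0.025)      // 97.5th percentile of a t with 30 d.f., about 2.04
display invnormal(0.975)         // the standard normal counterpart, 1.96
display invFtail(1, 30, 0.05)    // 95th percentile of an F(1,30)
display invttail(30, 0.025)^2    // the squared t value equals the F(1,30) value: t^2 ~ F(1,n)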

2.5 Statistical Properties of Estimators

• Statistical analysis deals with sample estimates that come from a distribution of possible values. These estimates are generated by applying a particular sample of data to a function. The function is termed an estimator.
• When we generate estimates from estimators, we have several desirable properties in mind. Let θ be the actual value of the parameter we are going to estimate. Unfortunately, we don't know the real θ, otherwise we wouldn't need to use statistical analysis. Let θ̂ be the sample value that we obtain by cranking our data through the estimator.

(1) Unbiasedness: On average we want our estimate of the unknown parameter to equal the actual parameter value. Define bias as Bias(θ̂) = E[θ̂] − θ.
  – If Bias(θ̂) > 0 then θ̂ is biased upward.
  – If Bias(θ̂) < 0 then θ̂ is biased downward.
  – If Bias(θ̂) = 0 then θ̂ is called unbiased.
  – θ̂ is an unbiased estimator of θ if E[θ̂] = θ.

Graphically this looks like the following:

[Figure: the sampling distribution Pr(θ̂) of an unbiased estimator, centered at E[θ̂] = θ, and the sampling distribution of an estimator biased upward, centered at E[θ̂] to the right of θ]

(2) Efficiency: We want our estimators to have the least amount of variance around the true θ. Thus, θ̂ is efficient if, for a given sample size, var(θ̂) is less than var(θ̃), where θ̃ is any other unbiased estimator.

[Figure: two sampling distributions centered at E[θ̂] = θ; the efficient estimator θ̂ has a tighter density f(θ̂) than the inefficient estimator θ̃ with density f(θ̃)]

• Ideally we would like to have both unbiasedness and efficiency for our estimators. This may be difficult at times. Thus, we may consider relative efficiency and then trade off between unbiasedness and efficiency.
• Why would we trade off unbiasedness and efficiency? Perhaps we would rather be a little bit wrong, on average, but when we miss the actual value, we don't miss by much. This would correspond with a relatively small variance with some bias.
• When we face a tradeoff between bias and efficiency, we may focus on the Mean Squared Error and attempt to minimize it.
• Theorem: Assume that θ̂ is an estimator of θ; then the mean squared error (MSE) of θ̂ is

  MSE(θ̂) = E[(θ̂ − θ)²] = var(θ̂) + [Bias(θ̂)]²

• Proof:

  E[(θ̂ − θ)²] = E[((θ̂ − E[θ̂]) + (E[θ̂] − θ))²]        (add and subtract E[θ̂])
              = E[(θ̂ − E[θ̂])² + 2(E[θ̂] − θ)(θ̂ − E[θ̂]) + (E[θ̂] − θ)²]
              = E[(θ̂ − E[θ̂])²] + 2(E[θ̂] − θ)(E[θ̂] − E[θ̂]) + (E[θ̂] − θ)²
              = var(θ̂) + (Bias(θ̂))²

From the last line, it is possible to see that variance can decrease with an increase in bias, or vice versa, and the mean squared error will stay the same.
• We recognize that we will eventually be moving from relatively small samples to relatively large samples. Unbiasedness is a small sample property. In large samples we think of consistency.

(3) Consistency: As N → ∞, θ̂ → θ. Another way of defining consistency is that lim_{N→∞} var(θ̂) = 0 and the distribution of θ̂ collapses onto θ. Define a probability limit as

  plim(θ̂) = θ if, as N → ∞, Prob[|θ − θ̂| < φ] = 1,

where φ is an arbitrarily small constant. We claim that θ̂ is consistent if plim θ̂ = θ. Note: We would rather have consistency than unbiasedness.
• For this reason, it is often undesirable to throw away data that can be included in your sample. Depending upon the project at hand and the availability of data for other variables in the study, some observations may have to be discarded. However, this is not a good thing!! The more data, in general, the better off you are. Why?

2.6 Determining Unbiasedness

• If µ_x is the actual mean of X and X̄ = (1/N) Σ x_i is an estimator of the mean, is X̄ unbiased?

  E[X̄] = E[(1/N) Σ x_i] = (1/N) E[Σ x_i] = (1/N) Σ E[x_i] = (1/N) Σ µ = (1/N) Nµ = µ

• If σ_x² is the actual variance of X and s² = (1/N) Σ (x_i − X̄)² is an estimator of σ_x², is s² an unbiased estimator?
• The answer is "No," but why?
• A formal proof is beyond the scope of this course. However, the reason behind the bias of s² is an important concept to remember.
• Conceptually it has to do with the idea of Degrees of Freedom.
• We have N data points with which to calculate the variance. But to calculate the variance, we must calculate the mean.
• This places a "constraint" on the Nth data point so as to "peg" the mean.
• This constraint implies that the variance is based upon N − 1 unconstrained data points. How?
• Note: N X̄ = Σ x_i. Thus, the last data point is forced to line up with this condition. The other (N − 1) data points are unconstrained with respect to the sample variance.
• Theorem: An unbiased estimator of σ_x² is

  s² = (1/(N − 1)) Σ (x_i − X̄)²

Proof:

  E[s²] = E[(1/(N − 1)) Σ (x_i − X̄)²]
        = E[(1/(N − 1)) Σ ((x_i − µ) − (X̄ − µ))²]
        = (N/(N − 1)) σ_x² − (1/(N − 1)) σ_x² = σ_x²

A corollary:

  côv(X, Y) = (1/(N − 1)) Σ (x_i − X̄)(y_i − Ȳ)

• Note: In rather large samples, one can use 1/N instead of 1/(N − 1) and not affect the calculation of côv very much. However, be careful about doing this.
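A small Monte Carlo sketch can illustrate the point: with repeated samples of size N = 5 from a standard normal (true variance 1), the 1/(N − 1) estimator is centered on 1 while the 1/N estimator is centered below it. The program name varsim and the seed are arbitrary choices for this sketch; uniform() and invnormal() are the STATA 9-era functions for uniform draws and the inverse normal CDF, and simulate collects the returned results.

clear
set seed 12345
set obs 5                                   // a deliberately small sample, N = 5
capture program drop varsim
program define varsim, rclass
    capture drop x
    generate x = invnormal(uniform())       // standard normal draws, true variance = 1
    quietly summarize x
    return scalar s2  = r(Var)                     // 1/(N-1) divisor
    return scalar s2n = r(Var)*(r(N)-1)/r(N)       // 1/N divisor
end
simulate s2=r(s2) s2n=r(s2n), reps(2000) nodots: varsim
summarize s2 s2n    // the mean of s2 is close to 1; the mean of s2n is closer to 0.8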

2.7 Using STATA for basic statistics

2.7.1 Starting STATA

• STATA is the statistical package recommended for this course. Throughout the course I will refer to STATA commands and the class web site will offer various resources to help you learn STATA. • STATA is a very powerful statistics package that can handle many standard statistical issues as well as sophisticated and specialized econometric estimation techniques. • The key to working with STATA is practice. There are considerable on-line resources for STATA. My suggestion at the outset is to use Google if you are confused or have 35


problems. I would search by a particular keyword such as “simple regression” with the additional word “stata”. One of my favorite sites is www.ats.ucla.edu/stat/Stata/. I recommend bookmarking this site as it is full of examples and tutorials concerning STATA. • To begin using STATA, we first need to start the program. You can start STATA by double clicking on the STATA icon on your desktop (assuming you have installed STATA correctly). • When STATA starts up you should see a few windows: the main output window (labeled STATA Results), a command window (labeled STATA Command), a variables window (labeled Variables) and a review window (labeled Review). The review window is often useful because it keeps a history of the STATA commands you have entered during your session. These windows contain, at various times, different information that is helpful in analyzing data. • The output window is where STATA results are presented upon executing a command. One downside is that the output window is limited in how much it can display. One solution is to open a log file and save the output displayed in the Output window in a file stored on your hard drive.3 • At the very bottom of the main STATA program window you should see ‘‘C:\Stata9’ ’ or something similar. This is the current directory that STATA is using. You can change directories by typing cd directoryname in the command window, where directoryname contains the entire directory tree, e.g., cd c:\metrics\project1 • STATA can be operated by typing commands in the command window, by using the point-and-click menu system at the top of the main STATA page, or by using what is 3

You can open a log file by typing log using filename.txt in the command window. STATA will open a log file that will store all that is displayed in the output window (until the log file is closed).



called a batch file, which is a text file that contains a sequence of commands that you want STATA to execute. STATA batch files are denoted with a .do extension and are very handy when you have a lot of commands you wish to execute. For the moment, we will stick with entering commands in the command window.
• Before loading a data set into STATA, it is useful to change the working directory so that STATA output and data sets are located in a single place.
• If this is the very first time you have loaded STATA, you might want to create a new directory, perhaps called metrics. You can do this using the mkdir command, e.g., mkdir c:\metrics.
• Next, we want to redirect STATA to our metrics directory using the cd command, e.g., cd c:\metrics
• Next, we want to open a log file to capture the output that will be displayed in the output window. We do this using the log using <filename>.log, replace command, e.g., log using disasters.log, replace. If you use the .log extension then STATA saves the log file as a standard ASCII text file that can be read in MS Word and other programs.

2.7.2 Reading data into STATA

• The first step in using STATA (after starting the system) is to read your data into a STATA data set. STATA data can be in STATA format (.dta), SAS transport format (.stp), ASCII format (.dat), comma delimited format (.csv), and many other formats. I find it easy to enter and organize data in MS Excel with the first row of column headings containing variable names.



• Once the data have been entered and double checked for accuracy, it is possible to save the data in comma delimited format (.csv) and read the saved file into STATA using the insheet command. The insheet command will read the variable names in the first row. • An alternative to saving the file as a .csv file is to copy the entire data set from your MS Excel spreadsheet, including column headers, and use the data editor in STATA. When the data editor is displayed, it will look like a empty spreadsheet with rows and columns. Place the cursor in the upper left hand corner (the one, one cell) and paste your data into the STATA data editor. You should see your data entered in columns with the column headers from the Excel spreadsheet becoming the variable names in the STATA editor. • Remember that making mistakes in STATA is to be expected as you learn how to use the system. If your data are saved in Excel and you mess things up in STATA, simply close the data editor (selecting to keep the changes or not). If you wish to start over for whatever reason, enter the command clear in the command window. This will clear the existing data from memory and give you a chance to re-paste the data from Excel (or to do whatever else you wish to do). • If you are happy with your data after the transfer, it is advisable to save the data as a STATA data file. At the bottom of the STATA system window you will see the current directory that STATA is using. It is advisable to transfer the system to the appropriate directory using the cd dir command, where dir is the directory you wish to move to. • Once in the appropriate directory, use the save filename command, where filename is a name of your choice. I find it useful to append a 1,2,3, and so forth, to the end of 38


the filename as things progress in the project. This way you always have backups of what you have done previously.
• If you have saved the file before, STATA requires you to tell it that it is okay to overwrite the existing file. To do this, add the option replace at the end of the save command, i.e., save filename, replace.
• It is possible to read data that are located on the web (my web site or elsewhere) using the use command. If the file is already in STATA data format, regardless of its location, it can be read into STATA.

2.7.3 A Working Example: Student Spending and Voting Patterns

• Let's read in a data set describing higher education spending. The data are located in the STATA data file spending0203.dta at my web site; load it with the command use http://www.uta.edu/depken/ugrad/3318/spending0203.dta.
• The variables window displays the variable names, and the desc command provides us with a list of variable names, types (strings, floats, integers), and variable descriptions if entered (you can do this with the label var varname "description" command).
• The summarize command provides sample size, mean, standard deviation, minimum, and maximum for all of the variables, or a subset of variables if those names are typed immediately after the summarize command, e.g., summarize var1 var2 var3. The summarize command can be abbreviated with the command sum:

. sum students perstudent perlocal perstate perfederal bush

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    students |        51    945118.3     1118237      76166    6353667
  perstudent |        51    8023.667    1707.275       4838      12568
    perlocal |        51    41.17647    14.09781          2         86
    perstate |        51     49.4902    13.80489          0         90
  perfederal |        51    9.509804    3.126484          4         18
        bush |        51    .6078431    .4930895          0          1

• It is possible to obtain extended descriptive statistics with the option detail added to the summarize command, e.g., sum students, detail or sum students, d for short:

. sum students, d

                 Number of students in state
-------------------------------------------------------------
      Percentiles      Smallest
 1%        76166          76166
 5%        99978          88116
10%       130048          99978       Obs                  51
25%       248604         104225       Sum of Wgt.          51

50%       660782                      Mean           945118.3
                        Largest       Std. Dev.       1118237
75%      1014798        2539929
90%      1838285        2888233       Variance       1.25e+12
95%      2888233        4259823       Skewness       2.908509
99%      6353667        6353667       Kurtosis       13.13866

The extended information includes percentile cutoffs and the first four moments of the variable.

• It is possible to determine the correlation between a set of variables, or all variables, using the corr command:

. corr students perstudent bush
(obs=51)

             | students perstudent     bush
-------------+------------------------------
    students |   1.0000
  perstudent |   0.0088     1.0000
        bush |  -0.1508    -0.6786    1.0000

• It is possible to obtain the covariance matrix of a set of variables by adding the cov option to the end of a corr command, e.g., corr students perstudent bush, cov:

. corr students perstudent bush, c
(obs=51)

             | students perstu~t     bush
-------------+----------------------------
    students |  1.3e+12
  perstudent |  1.7e+07  2.9e+06
        bush | -83146.8 -571.293  .243137
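As a quick check on how these two reports relate, a correlation is just a covariance divided by the product of the two standard deviations. Using the covariance above and the standard deviations from the summarize output:

. display -571.293/(1707.275*.4930895)

returns roughly -.6786, reproducing the correlation between perstudent and bush reported by corr.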

• To generate scatter plots of the data use the scatter yvar xvar command:

[Figure: scatter plot of per-student spending (roughly $4,000 to $12,000 on the vertical axis) against the number of students in the state (0 to about 6,000,000 on the horizontal axis).]

The scatter plot suggests that there is a lot of variance in per-student spending in secondary education but there isn’t a clear relationship between student spending and the number of students in the state. • We could try to plot student spending against the percentage of spending that is financed by the Federal government:



[Figure: scatter plot of per-student spending (roughly $4,000 to $12,000 on the vertical axis) against the percentage of spending generated at the federal level (about 5 to 20 percent on the horizontal axis).]

In this graph there seems to be a general inverse relationship between per-pupil spending and federal support. Can you think of an economic rationale for why this would be the case?



3 The Simple Regression Model

3.1 Basic Notation and Assumptions

• To motivate the intuition behind econometric analysis, we will start out with the easiest model possible, which is sometimes called the Simple Regression Model.

• In this model there is one left-hand side variable and one right-hand side variable. We include a constant term and a stochastic error term ε.

• The model is written as y_i = α + βx_i + ε_i, where α and β are the true values. We don't know the real α and β; they live on the dark side of the moon with the number 2 and ∞.

• We wish to derive estimators for α and β which are just functions of what we know, i.e., the values of y_i and x_i. We use the estimators, combined with data, to calculate estimates.

• There are some basic assumptions we make:
  1. The relationship between y and x is linear in parameters (no higher powers on α and β).
  2. The x_i are non-stochastic, i.e., they are fixed in repeated samples.
  3. E[ε_i] = 0: this is guaranteed when we include an intercept term in the model.
  4. var(ε_i) = E[ε_i²] = σ², where σ² is a constant and the same for all i.
  5. E[ε_i ε_j] = 0 for i ≠ j, i.e., there is no covariance across observations.
  6. There are N > 2 observations on x and y.
  7. We will assume normality of ε, that is ε ∼ N(0, σ²). This last assumption is not required for the derivation of the estimators, but it is useful for statistical inference.

• Assumptions 3, 4, and 5 indicate that ε ∼ iid(0, σ²), which is the definition of "white" noise. The iid stands for independently and identically distributed.


• Remember that ε is the true error structure. We never see the real error structure; rather, we see ε̂_i = (y_i − ŷ_i), which is the fitted residual.

• Note that ŷ_i = α̂ + β̂x_i is the fitted or estimated value of y_i given the realization of x_i and the estimates of α and β.

• The constant variance assumption is known as homoscedasticity (its failure is heteroscedasticity).

• The no-covariance assumption is known as independence (its failure is (auto)correlation).

• Note: E[Xε] = X E[ε] = 0.

• Note: E[Σ ε_i] = Σ E[ε_i] = 0.

• What if the expected value of ε is not equal to zero? We can correct the model. Assume that E[ε] = γ ≠ 0, then transform the model as

  y_i = α + βx_i + ε_i
  y_i = α + βx_i + ε_i + γ − γ
  y_i = (α + γ) + βx_i + (ε_i − γ)
  y_i = α* + βx_i + ε*_i

where ε*_i = ε_i − γ, α* = α + γ, and E[ε*_i] = E[ε_i − γ] = E[ε_i] − γ = γ − γ = 0.

• Because we never know whether we have ε or ε*, and thus α or α*, we rarely place much economic interpretation on the intercept term.



3.2 Deriving the Estimators of the Simple Regression Model

• How do we derive the estimators for α and β?

• We could try to minimize the sum of the error terms. This may seem intuitive because the error terms represent how far away our predicted value is from the actual value of y. However, minimizing Σ ε_i would force us to a corner solution because the minimization would be unconstrained (raising α̂ without limit drives every residual, and hence their sum, toward −∞).

• An alternative is to minimize the sum of squared errors. This bounds the objective below (at zero), and we would seek to minimize

  min_{α,β} SSE = Σ_{i=1}^{N} ε_i²

but we know that ε_i = y_i − α − βx_i, so we substitute into our definition of SSE to obtain

  min_{α,β} SSE = Σ_{i=1}^{N} (y_i − α − βx_i)²

• To solve this, we take the first-order derivatives of the SSE with respect to our two unknown parameters, α and β, to obtain

  ∂SSE/∂α = −2 Σ_{i=1}^{N} (y_i − α − βx_i)
  ∂SSE/∂β = −2 Σ_{i=1}^{N} x_i (y_i − α − βx_i)

The first-order necessary conditions allow us to solve for a minimum or a maximum when they are set equal to zero and solved for the parameters of interest.

• However, we don't yet know whether we have a minimum or a maximum. To determine this, we look at the second-order conditions.


• For a minimum, the second-order conditions should be strictly positive; for a maximum, they should be strictly negative. The second-order effects are:

  ∂²SSE/∂α² = Σ_{i=1}^{N} 2 = 2N > 0
  ∂²SSE/∂β² = Σ_{i=1}^{N} 2x_i² > 0

Thus, the second-order conditions indicate that we do indeed find a minimum.

• Note that the two first-order conditions are two equations in two unknowns:

  ∂SSE/∂α = −2 Σ_{i=1}^{N} (y_i − α̂ − β̂x_i) = 0
  ∂SSE/∂β = −2 Σ_{i=1}^{N} x_i (y_i − α̂ − β̂x_i) = 0

where the "hats" indicate the estimators that will solve the two equations simultaneously.

• The two equations in two unknowns can be solved algebraically.

• Divide both equations by −2 to obtain

  Σ (y_i − α̂ − β̂x_i) = 0
  Σ x_i (y_i − α̂ − β̂x_i) = 0

• Looking at the first equation, we find

  0 = Σ y_i − N α̂ − β̂ Σ x_i
  α̂ = (1/N) Σ y_i − β̂ (1/N) Σ x_i
  α̂ = Ȳ − β̂ X̄

• We have solved for α̂ in terms of β̂ and the known sample values Ȳ and X̄. The intercept estimate depends upon the sample means of Y and X and the estimate of β.

• Note that when an intercept is included in the model, Σ ε̂_i = 0. How?

• To solve for β̂ we reduce the second first-order condition so that it is in terms of β̂ alone. We do this by substituting the equation for α̂, which is in terms of β̂, X̄, and Ȳ.

• This substitution yields:

  0 = Σ x_i y_i − α̂ Σ x_i − β̂ Σ x_i²
  Σ x_i y_i = α̂ Σ x_i + β̂ Σ x_i²
  Σ x_i y_i = (Ȳ − β̂ X̄) Σ x_i + β̂ Σ x_i²
  Σ x_i y_i = Ȳ Σ x_i − β̂ X̄ Σ x_i + β̂ Σ x_i²
  Σ x_i y_i − N X̄ Ȳ = β̂ ( Σ x_i² − N X̄² )         [using Σ x_i = N X̄]

  β̂ = ( Σ x_i y_i − N X̄ Ȳ ) / ( Σ x_i² − N X̄² )
     = Σ (x_i − X̄)(y_i − Ȳ) / Σ (x_i − X̄)²
     = cov(X, Y) / var(X)

• The estimator for β, denoted β̂, is a function of the data that we have, i.e., it is a function of the X_i's and Y_i's. The estimator itself doesn't change with the data, but


clearly the estimates will.

• Note that β̂ is actually the ratio of the covariance between X and Y to the variance of X: β̂ = cov(x, y)/var(x).
  1. If there is no variance in X then β̂ does not exist, or is undefined.
  2. If there is no covariance between X and Y, then β̂ = 0.
  3. If var(X) = ∞ then β̂ = 0.
  4. If cov(X, Y) = var(X) then β̂ = 1.

• It is important to remember that the estimator stays the same whereas estimates will change with different data samples. This point makes good sense because we have yet to define exactly what X and Y represent. If I am estimating a simple demand model, such that y is interpreted as price and x is interpreted as quantity, I will be able to calculate estimates using the estimators for α̂ and β̂.

• Yet, what if we estimate a different relationship, say y is income and x is education? There is no reason to expect that the resulting estimates will be the same as the demand model estimates, even though the estimator is exactly the same.

• However, if we did find the same estimates with two sets of data from entirely different contexts, what would that imply? Only that the ratios of cov(X, Y) to var(X) are the same.

• The reason for this is that the process of minimizing the sum of squared errors is not dependent upon a particular economic theory. Rather, it is a mathematical program that is the same regardless of the particular economic application.



3.3 Example: Wage Model

• A common claim is that education provides positive returns to future income. We investigate this using data gathered from the National Longitudinal Study. The data are available online as a STATA file, along with a sample STATA program file and a sample STATA log file. • Using a data sample of 3296 individuals from the U.S. National Longitudinal Study from 1987, we estimate the following model

  WAGE = α + β·SCHOOL + ε,

where WAGE is the average hourly wage and SCHOOL is the number of years of formal education.

• The descriptive statistics of the data are obtained using the summarize (or sum for short) command in STATA:

    Variable |      Obs        Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------------------
       exper |     3296    8.041869    2.290855           1          18
        male |     3296    .5239684     .499501           0           1
      school |     3296    11.63016    1.657114           3          16
        wage |     3296    5.816391    4.054694    .0765556    112.7919

• There are two ways to obtain the simple regression estimation results. First, we could ask STATA to provide a covariance matrix, which would report the variances of the dependent and independent variables as well as the covariance between the two. We could do this with the STATA command corr wage school, cov, which would yield:

. corr wage school, c
(obs=3296)

             |     wage   school
-------------+------------------
        wage |  16.4405
      school |  1.49278  2.74603


• From the results of the covariance matrix and the summarize output, it is possible to calculate α̂ and β̂.

• For β̂:

  β̂ = cov(wage, school) / var(school) = 1.49278 / 2.74603 = 0.5436

• For α̂ (using the sample means of wage and school):

  α̂ = 5.816391 − (0.5436)(11.63016)
  α̂ = −0.5057
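These hand calculations can be verified with STATA's display command, using the covariance-matrix and summarize values reported above:

. display 1.49278/2.74603
. display 5.816391 - (1.49278/2.74603)*11.63016

which return approximately .5436 and -.5059, essentially identical to the regression output shown below.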

• Because the estimators for the simple regression model are fixed, that is, they do not change, it is relatively easy to automate the calculation of the parameter estimates, their standard errors, and a number of other useful calculations. • This is accomplished using the STATA command reg where the syntax for the simple regression model is reg depvar indvar, e.g., reg wage school.



• The estimation results are:

. reg wage school

      Source |       SS       df       MS              Number of obs =    3296
-------------+------------------------------           F(  1,  3294) =  171.03
       Model |  2673.87763     1  2673.87763           Prob > F      =  0.0000
    Residual |  51497.7045  3294  15.6337901           R-squared     =  0.0494
-------------+------------------------------           Adj R-squared =  0.0491
       Total |  54171.5821  3295  16.4405408           Root MSE      =   3.954

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      school |   .5436139   .0415673    13.08   0.000     .4621135    .6251143
       _cons |  -.5059249   .4883156    -1.04   0.300    -1.463358    .4515078
------------------------------------------------------------------------------

• Notice that the two sets of parameter estimates are very similar to each other; the minute differences are due to rounding errors.

• The regression results include a lot of information, most of which seems confusing if you don't know what you are looking at. Let's ignore most of what is included in the table and focus, for the moment, on the bottom portion of the table.

• The first row contains column headings: wage is the dependent variable in the model, Coef. stands for "coefficient," Std. Err. stands for "standard error," t stands for a t-score, P>|t| indicates a p-value, and [95% Conf. Interval] indicates the upper and lower bounds of the 95% confidence interval for the parameter estimates.

• The next two rows contain the information for the two (or more) parameters that were estimated using the reg command. The second row indicates the parameter estimate, standard error, t-score, p-value, and confidence interval for the parameter on SCHOOL, which in the simple regression model is the estimate for β. The third row,


which is denoted _cons, contains the parameter information for the intercept term α.

• The results suggest that for the 3,296 individuals in the sample, every year of additional schooling yielded an average wage increase of approximately $0.54 per hour. While this might seem trivial, the impacts of formal education can be dramatic. Assuming a 2,000-hour work year, each additional year of schooling would return approximately $1,080 per year. Perhaps even more important, an individual who dropped out of school after the eighth grade would expect to make approximately $4,300 less per year than a high school graduate and approximately $8,600 less than a college graduate.

• One might scoff at this difference and think the roughly $1,000 per year of additional schooling really isn't that big. However, the sample of individuals was from 1987. Therefore, the parameter estimates obtained above are measured in 1987 dollars! How many of us remember 1987 and how much the dollar purchased back then?

• If we go to the web site eh.net we can calculate how much the 1987 dollars would be worth in 2004 dollars. The $4,300 would be worth approximately $7,148 in CPI-adjusted 2004 dollars, whereas the $8,600 would be worth approximately $14,296 in CPI-adjusted dollars.

• To take the results even further, the expected work life of a 25-year-old college graduate might be 40 years. If we considered the higher wages as a 40-year annuity, we could calculate the present value of the annuity. Assuming a 2% interest rate, the present value factor would be 27.35 and the present value of the annuity would be $195,536. In other words, the person who drops out of school essentially walks away from $195,000. The number is even greater relative to the college graduate: $391,073!!
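The annuity arithmetic can be reproduced in STATA (or on any calculator), using the 40-year horizon and 2% rate assumed above:

. display (1 - 1.02^(-40))/.02
. display 7148*(1 - 1.02^(-40))/.02

The first line returns the present value factor of about 27.36 and the second returns roughly $195,500, matching the figures above up to rounding.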


• From this, it is relatively easy to make a convincing economic argument for why people decide to go to school and, perhaps, why society might want to dissuade people from dropping out of school. • A final point. Many student athletes come out of school early to play professionally, especially in basketball and football. Assuming that the athlete’s wage would be statistically similar to the sample investigated here, the student athlete should drop out of school if the gain to doing so outweighs the cost. A junior in college might find it advantageous to leave school if the salary being offered to go pro outweighs the lost wages from not graduating.

3.4 Properties of the OLS Estimators

• We will focus on the estimator for β for the subsequent analysis.

• We claim the desirable properties of our estimators are unbiasedness, efficiency, and consistency. How do these properties relate to β̂ (and α̂)?

• What is the mean of β̂? An alternative way to ask the question is whether β̂ is unbiased.

• Each estimate of β is simply a "guess" as to what the real β actually is.

• Intuitively, an estimate of β is a draw from a distribution, but what is the nature of this distribution? We do not know much about the distribution of β, but we can try to glean some information from the distributional properties of ε. By looking at the distribution of β̂ we can try to infer information about the distribution of β.

• Is β̂ unbiased? In other words, is E[β̂] = β?

• Define y_i = (Y_i − Ȳ), x_i = (X_i − X̄), and e_i = (ε_i − ε̄). These are deviations from means. Then,

  β̂ = Σ x_i y_i / Σ x_i²

• Let c_i = x_i / Σ x_i², which is a constant because all of the X's are constant (non-stochastic).

• Then β̂ = Σ c_i y_i.

• Note: y_i = βx_i + e_i, so that

  β̂ = Σ c_i (βx_i + e_i)
  β̂ = β Σ c_i x_i + Σ c_i e_i
  E[β̂] = β Σ c_i x_i + Σ c_i E[e_i]

• But we know that E[e_i] = 0, and thus E[β̂] = β Σ c_i x_i.

• Note that

  Σ c_i x_i = Σ [ x_i / Σ x_i² ] x_i = (1 / Σ x_i²) Σ x_i² = 1

• Therefore, β̂ is unbiased because

  E[β̂] = β × 1 = β

• To derive the variance of β̂ we define

  var(β̂) = E[(β̂ − E[β̂])²] = E[(β̂ − β)²]



• Note that we know

  (β̂ − β) = β Σ c_i x_i + Σ c_i e_i − β
           = β(Σ c_i x_i − 1) + Σ c_i e_i
           = 0 + Σ c_i e_i

  ⇒ (β̂ − β)² = (Σ c_i e_i)²

• Thus,

  var(β̂) = E[(Σ c_i e_i)²] = E[(c_1 e_1)²] + E[(c_2 e_2)²] + · · · + E[(c_N e_N)²]

• By assumption we have E[e_i e_j] = 0 ∀ i ≠ j. Thus,

  var(β̂) = c_1² E[e_1²] + c_2² E[e_2²] + · · · + c_N² E[e_N²]

  Note that E[e_i²] = var(ε_i) = σ², so via the homoscedasticity assumption

  var(β̂) = c_1² σ² + c_2² σ² + · · · + c_N² σ² = σ² Σ c_i²

  and

  Σ c_i² = Σ ( x_i / Σ x_j² )² = Σ x_i² / (Σ x_j²)² = 1 / Σ x_i²

  ⇒ var(β̂) = σ² / Σ x_i²



• The variance of β̂, which is the second moment of the distribution of β̂, is

  var(β̂) = σ² / Σ x_i²,

which we can loosely write as σ² relative to the variation in X.

• Thus, β̂ ∼ (β, σ²/Σ x_i²). However, we cannot yet claim that β̂ is normally distributed.

• An unbiased estimator of σ²?

  s² = σ̂² = Σ ε̂_i² / (N − 2) = Σ (Y_i − α̂ − β̂X_i)² / (N − 2)

  Note: we divide by N − 2 because we lose one degree of freedom from α̂ and one degree of freedom from β̂.

• The estimator for the variance of β̂ is defined as

  s²_β̂ = s² / Σ x_i²

• This implies that the more variation there is in X, the tighter the distribution of β̂ around β.

• A useful number for hypothesis testing is the standard error of β̂. We define the standard error of β̂ as

  s_β̂ = sqrt( s²_β̂ )
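As a check that these formulas reproduce actual output, consider the wage example from Section 3.3: the regression there reports a residual mean square (s²) of 15.6337901, and the descriptive statistics give var(school) = 1.657114² ≈ 2.74603 with N = 3296, so Σ(x_i − X̄)² = 3295 × 2.74603. Then

. display sqrt(15.6337901/(3295*2.74603))

returns approximately .0416, which matches the standard error of .0415673 reported for the school coefficient.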

3.5 Hypotheses Testing

• We know that on average βˆ = β. But there are some other questions that are useful to ask.



1. Is βˆ statistically different from some theoretically based value, conditional upon the data we have. 2. What is the probability that βˆ lies in a confidence range that contains the actual β. • We construct hypotheses based upon statistical significance such that

P rob[βlow ≤ β ≤ βhigh ] = 1 − [statistical significance]

• Statistical significance is typically chosen at 0.05 or 0.01, but can vary with the application. • We test our hypothesis against what we term the null hypothesis. The null is a straw man that we hope to reject. • Note: We can only reject or fail to reject a null hypothesis. We cannot ”accept” a hypothesis. • An easy test (and perhaps the most common) is whether βˆ = 0 or our model Y = α + βX is trivial in the sense that there is no relationship between Y and X. • Note: Hypothesis testing is subject to errors. These errors can take two forms: 1. Type I error: Falsely reject the null. For example, let’s assume a null hypothesis is that a suspect is guilty, then a Type I error would then be exonerating a guilty person. On the other hand, if we have a null hypothesis that a suspect is innocent, the Type I error is that we convict the innocent person. The Type I error is typically denoted as α ∈ [0, 1], not to be confused with the intercept term α.



2. Type II error: Falsely fail to reject the null. Back to our examples from above. If we have the null hypothesis that a suspect is guilty, the Type II error is convicting the innocent person. If we have the null hypothesis that a suspect is innocent, the Type II error is failing to convict a guilty person. The Type II error is typically denoted as β ∈ [0, 1], not to be confused with the slope parameter. • There is some debate as to which “error” is more important or damaging. Most think that the Type I error is of greater concern, but this is not universal. • In the U.S. criminal justice system, a suspect is “innocent until proven guilty.” Thus, the null hypothesis is that you are innocent and the prosecutor has to have substantial evidence, “beyond reasonable doubt,” that the suspect is guilty. It seems that in the U.S., at least, society has decided that the Type I error or convicting the innocent person is of more concern than the Type II error of releasing the guilty person. • We base statistical tests of individual parameters on the Student’s t distribution. ˆ We do so by subtracting • To test with the t distribution, we must first standardize our β. away the theoretical value we are testing against, βT , and normalize by the standard ˆ error of β:

  (β̂ − β_T) / s_β̂ = t_{N−2}

where N − 2 is the degrees of freedom in the simple regression model.

• In general, we want to know the probability that our test statistic falls within a high and a low value.

• Thus, at the 0.05 significance level, we get

  Prob(−t_c < t_{N−2} < t_c) = 1 − α,


where α is the size of the Type I error, and t_c is a critical value used in the decision rule.

• Via substitution, and assuming the conventional value for α of 0.05, this can be rewritten as:

  Prob(−t_c < (β̂ − β_T)/s_β̂ < t_c) = 0.95
  Prob(−t_c·s_β̂ < β̂ − β_T < t_c·s_β̂) = 0.95
  Prob(−β̂ − t_c·s_β̂ < −β_T < −β̂ + t_c·s_β̂) = 0.95
  Prob(β̂ + t_c·s_β̂ > β_T > β̂ − t_c·s_β̂) = 0.95
  Prob(β̂ − t_c·s_β̂ < β_T < β̂ + t_c·s_β̂) = 0.95

• An example. Assume that N = 8, then N − 2 = 6. We look at the table in the back of the book and find that the critical value of the t-distribution at the 0.05 significance level and 6 degrees of freedom is t_c = 2.447. (In the Student's t table the column appears to be labeled by α/2; that is because we use a two-tailed test, and the two-tailed 0.05 critical value is 2.447.)

• Let β̂ = 0.50 and s_β̂ = 0.10, and let the null hypothesis be that β = 0. Then

  t_{N−2} = (β̂ − 0)/s_β̂ = (0.50 − 0)/0.10 = 5.00

• We compare our t-statistic of 5.00 to the critical value of 2.447. Because our t-statistic is greater than the critical value, we can reject the null hypothesis.

• Note: If the t-statistic had been less than 2.447 in absolute value we would fail to reject the null hypothesis (rather than "accepting" the null hypothesis).
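The same calculation applies to the wage regression in Section 3.3. Testing the null hypothesis that the coefficient on SCHOOL is zero,

. display (.5436139 - 0)/.0415673

returns approximately 13.08, the t-statistic reported by reg; with 3294 degrees of freedom the 5% two-tailed critical value is about 1.96, so the null of no relationship is easily rejected.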


• Note: The t-test does not test for theoretical validity, or relative importance - that is a job for the economist. Read both the McCloskey papers and the Gelman and Stern paper. • Note: As the N → ∞, t → ∞. Why?

3.6 Goodness of Fit Measures

• Part of the goal of the econometric model is to explain the variance of the dependent variable. We include right-hand side variable(s) to accomplish this.

• We seek a measure of how much of the variance we are actually able to explain.

• We know that var(Y) = [1/(N − 1)] Σ (Y_i − Ȳ)².

• Consider that the variation in Y can be some combination of the variation in X and the variation in ε:

  Y = βX + ε
  var(Y) = var(βX) + var(ε) + 2cov(βX, ε)
  var(Y) = var(βX) + var(ε)          [since cov(βX, ε) = 0 under our assumptions]
  var(Y)/var(Y) = var(βX)/var(Y) + var(ε)/var(Y)

• The variation of X (our independent variable) captures a portion of the variance of Y (the dependent variable), termed the explained variation. The variation of ε is the unexplained variation that arises from omitted variables, etc.

• Consider that Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ)


then

  Σ (Y_i − Ȳ)² = Σ (Y_i − Ŷ_i)² + 2 Σ (Y_i − Ŷ_i)(Ŷ_i − Ȳ) + Σ (Ŷ_i − Ȳ)²

But the middle term of this equation is equal to zero, so that

  Σ (Y_i − Ȳ)² = Σ (Y_i − Ŷ_i)² + Σ (Ŷ_i − Ȳ)²

which can be interpreted as Total Variance = Error (Residual) Variance + Regression (Explained) Variance, or Total Variance = Residual Variance + Explained Variance.

• Rewrite this decomposition of total variance as TSS = RSS + ESS, or Total Sum of Squares = Residual Sum of Squares + Explained Sum of Squares. Also note that RSS ≥ 0 and ESS ≥ 0, and thus TSS > 0.

• Using this formulation we can manipulate to find

  TSS = RSS + ESS
  TSS/TSS = RSS/TSS + ESS/TSS
  1 = RSS/TSS + ESS/TSS
  ESS/TSS = 1 − RSS/TSS

• We define R² = ESS/TSS = 1 − RSS/TSS, where R² is a measure of the goodness of fit of the regression.

• An alternative approach:



  Y = Xβ + ε
  var(Y) = var(βX) + var(ε) + 2cov(βX, ε)
  var(Y)/var(Y) = var(βX)/var(Y) + var(ε)/var(Y)
  TSS/TSS = ESS/TSS + RSS/TSS
  ESS/TSS = 1 − RSS/TSS

• Note that R² must fall between 0 and 1. This is because 0 ≤ RSS/TSS ≤ 1.

• Note: The R² measure is a percentage. Therefore, the R² is not a statistic.

• In the linear model, R² ∈ [0, 1], where 0 occurs if nothing is explained by the model, i.e., RSS = TSS, and 1 occurs if all variation is explained, i.e., RSS = 0.

• Note: In the wage model example presented above, the model's R² = 0.049. What does this imply about the explanatory power of the model as it was specified? Does this measure alter your conclusions about the value of the model?

• Note: We rarely see an R² of zero. Just as unlikely is an R² = 1. If the latter occurs, then you have successfully explained ALL of the variation in the dependent variable. This usually occurs when you try to estimate an equation that is actually a definition.

• An example might be estimating the national income accounting definition

  Y = C + I + G + (X − M) + ε

In this case, the only variation will be measurement error. If there is no measurement error in the right-hand side variables, R² will equal one.

• Note that the R² is really just a percentage. As it turns out, percentages are not statistics, notwithstanding popular opinion.

• The R² measure does not have any statistical interpretation. Thus, it is hard to put much credence in a "high" or "low" R².
  1. We normally find relatively low R² in cross-sectional studies.
  2. We normally find relatively high R² in time-series studies.

• The R² measure that we have calculated has a potential shortcoming: R² is guaranteed to (weakly) increase as you include more and more regressors on the right-hand side.

• Thus, if one only wanted to maximize R², one would include all sorts of variables on the right-hand side, even those that have no rational or economic basis for being included in the model.

• An alternative is to measure goodness of fit while accounting for the number of regressors on the right-hand side.

• This is termed the "adjusted R²", calculated as

  R̄² = 1 − [RSS/(N − k)] / [TSS/(N − 1)]

where k is the number of estimated parameters (including the intercept). Note that the residual has N − k degrees of freedom and the variance of Y has N − 1 degrees of freedom.

• Adjusted and unadjusted R² are related in the following manner:

  R̄² = 1 − [(N − 1)/(N − k)] (1 − R²)


      = (1 − k)/(N − k) + [(N − 1)/(N − k)] R²

• Note: While an extra variable on the right-hand side automatically makes R² increase, this is not necessarily the case with R̄². If R² does not improve enough to outweigh the first term, (1 − k)/(N − k), which is negative, then R̄² can actually decrease.

• That is, if the contribution of the extra variable increases R² sufficiently to outweigh the lost degree of freedom, R̄² will increase.

• Two other useful measures to compare goodness of fit across different numbers of right-hand side variables are:

  Schwarz Criterion:              SC  = ln( Σ ε̂_i² / N ) + (k ln N)/N

  Akaike Information Criterion:   AIC = ln( Σ ε̂_i² / N ) + 2k/N
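To tie these measures back to output we have already seen, the wage regression in Section 3.3 reported RSS = 51497.7045 and TSS = 54171.5821, with residual and total degrees of freedom of 3294 and 3295:

. display 1 - 51497.7045/54171.5821
. display 1 - (51497.7045/3294)/(54171.5821/3295)

The first line returns approximately 0.0494 and the second approximately 0.0491, matching the R-squared and Adj R-squared reported by reg.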

3.7 Example: The CAPM Model

• The CAPM (Capital Asset Pricing Model) model is a means of describing the expected returns on an asset as a function of the expected return on the market portfolio, both being compared to the risk-free rate of return. Various forms of the CAPM have been developed. • One popular version is ri = βrm + ²i where ri is the excess return on asset i and rm is the excess market return, both measured against some “risk free” asset, typically a U.S. treasury bill. Note that in this specification there is no intercept term. (What does this mean?) • An alternative is to include an intercept term in the specification.



• Using data for three stocks listed on the U.S. New York Stock Exchange: Bank of America, IBM, and AMR (the parent company of American Airlines). from Oct 2, 2000 through January 10, 2006 (data obtained from Yahoo.com). Risk-less returns are measured using the 3 month T-bill rate, obtained from the St. Louis Fed. The data are available at my website in the file capmdata.dta. • Once the raw data are read into STATA, it is necessary to calculate returns for each of the assets. This is accomplished by taking the difference in the log of today’s closing value from the log of yesterday’s closing value. • Logs are calculated by the gen command as gen lnvarname1 = log(varname1). • The differences in logs are easily calculated if STATA knows that the data are time series in nature and that there is an unique identifier in the data for the time period. As the data were naturally entered in chronological order, it is possible to use the internal counter n, which holds the observation number, as the unique identifier, e.g., gen timevar = n. In my data set, the timevar is the variable id. • After the time variable has been defined (either in STATA or in your Excel file or elsewhere), you can tell STATA that the data are time series, which variable indicates time, and what frequency the time is measuring (daily, monthly, quarterly, or annual), by using the tsset timevarname where timevarname is the name of the variable that indexes time. • Once the data are tsset it is possible to use built-in STATA time series commands to determine the returns. For example, to determine the one-period difference in log values, you could use the command gen dreturn = d.logclose, where the d. indicates first differences. We can take second differences as gen d2return = d2.logclose. We



could take first differences equivalently as gen dreturn = logclose - l.logclose, where the l. indicates the first lag (or previous time period).

• Once the returns have been calculated for each asset, it is necessary to normalize these returns by subtracting away the T-bill rate, e.g., gen ramrret = amrret - tbillrate.

• After all of this is done, which can actually be accomplished in about six lines of STATA command code, we are ready to estimate the naive CAPM model. For example, when estimating the CAPM for AMR, the command is reg ramrret rdowret, noc, where the noc option tells STATA to suppress the constant term.

• The results are as follows:

                        AMR     Bank of America      IBM
   Market Return       1.152         0.883          1.033
                      (0.044)       (0.024)        (0.015)
   R²                  0.336         0.500          0.790

   Standard errors in parentheses.

• What do these results tell us?

• The estimated slope coefficient indicates how sensitive a company's stock is to general market movements. This sensitivity is relatively low for Bank of America but fairly high for AMR: an excess return in the market of, say, 10% corresponds to expected excess returns on Bank of America and AMR of 8.8% and 11.5%, respectively.

• We can also directly test the hypothesis, of limited economic interest, that β = 1 for each of the stocks. This results in t-values of 3.41, −4.79, and 2.25, respectively, and we can reject the null hypothesis for all three stocks.
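Putting the steps above together, a sketch of the half-dozen lines of STATA code for one stock might look like the following. The raw closing-price names (amrclose, dowclose) are hypothetical stand-ins for whatever those variables are called in capmdata.dta; id, tbillrate, ramrret, and rdowret follow the names used in the text:

. use capmdata, clear
. tsset id
. gen lnamr = log(amrclose)
. gen lndow = log(dowclose)
. gen ramrret = d.lnamr - tbillrate
. gen rdowret = d.lndow - tbillrate
. reg ramrret rdowret, noc
. test rdowret = 1

The last line asks STATA to test β = 1 directly; the same test can be done by hand from the table above, e.g., for AMR (1.152 − 1)/0.044 ≈ 3.5, consistent with the reported t-value of 3.41 once unrounded estimates are used.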


• The CAPM implies that the only relevant variable in the regression is the excess return on the market portfolio; thus any other variable known to the investor when making his/her decision should have a zero coefficient. We can test this by including an intercept term and testing whether it is statistically different from zero. The results of this estimation are reported below:

                        AMR     Bank of America      IBM
   Constant            0.009        -0.007           0.001
                      (0.002)       (0.001)        (0.000)
   Market Return       1.388         0.6883          1.063
                      (0.072)       (0.038)        (0.023)
   R²                  0.222         0.193           0.605

   Standard errors in parentheses.

• The intercept terms are all statistically different from zero, although they are also rather small in some cases.

• It is possible to place an interesting economic interpretation on the R² in these models. Note that the CAPM relates excess returns of a stock to two elements: the excess return on the market and the "error term" which is idiosyncratic to the stock itself. Manipulating the CAPM model allows us to claim:

  Var(r_i) = β² Var(r_m) + Var(ε_i),

or that the variance in the excess returns of a stock is a combination of the variance in the market index and the variance in the idiosyncratic noise. In economic terms this says that the total risk of an asset equals market risk plus idiosyncratic risk. Market risk is rewarded in proportion to β: stocks with a higher β provide higher expected returns. Idiosyncratic risk is not "rewarded"; note that there is an understood '1' as the coefficient on Var(ε_i).

• As the R² measures the proportion of explained variation in total variation, the R² is an estimate of the relative market risk for each of the stocks. It is estimated that 22% of the risk (variance) of AMR stock is caused by the market as a whole, while 78% is idiosyncratic risk.

• Several variations on the CAPM have been developed, including issues such as the January effect, arbitrage pricing theory (APT), and various event studies.
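A compact way to see this claim (a hedged restatement, not in the original notes) is that the explained portion of the variance is the β²Var(r_m) term, so that, approximately,

  R² ≈ β̂² Var(r_m) / Var(r_i),

the share of total variance attributable to market risk; for AMR that share is the reported 22%.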



4 Matrix Algebra

• Let us return to our initial equation, which we will denote as

y = a + bX + ²

We are attempting to fit the data that we have collected to this linear equation. How much data do we need? • If we have only one observation on y and x, then the exercise is pretty silly. • If we have only two observations then the exercise is still silly because one and only one line can pass through two points. • If we have three or more observations then the exercise becomes a bit more interesting. • Thus, we will collect N observations on y and X indexed as yi and Xi . • Note that if we take the estimates of a and b, denoted a ˆ and ˆb that we can claim yi = a ˆ + ˆbXi + ²ˆi

which implies that for all N observations the equation must hold.

• We will use matrix algebra in order to compress the notation of our N observations and N equations.

• Def'n: Vector: an array of numbers, in either row or column format. The y_i's can be stacked to get the column vector y = (y_1, y_2, y_3, ..., y_N)', where y has dimension [N × 1]: N is the number of rows and 1 is the number of columns.

• We can also stack the X_i to obtain a similar column vector X = (X_1, X_2, X_3, ..., X_N)'.

• A matrix is an array of vectors. We can have a matrix of any dimension, just like a spreadsheet. We will use matrices to derive estimators in this class.

• Consider our story again: y_i = a + bX_i + ε_i. Stacking all N equations, we can compress the notation as

  (y_1, y_2, ..., y_N)' = (a, a, ..., a)' + (bX_1, bX_2, ..., bX_N)' + (ε_1, ε_2, ..., ε_N)'

• Note the three column vectors on the right-hand side of the equation. The equation can be rewritten as

  (y_1, y_2, ..., y_N)' = a(1, 1, ..., 1)' + b(X_1, X_2, ..., X_N)' + (ε_1, ε_2, ..., ε_N)'

• The column vector of 1's and the column vector of X_i's can be combined into one [N × 2] matrix X, whose first column is all ones and whose second column contains the X_i's.


• But what about a and b? They are combined in a column vector as 

 a  β=  b

• Then we can rewrite our equation as

y = Xβ + ²

• Here, X can be of any dimension and so can β, but the two matrices must conform to one another. The intuition is that for every column in X there must be an associated row (or parameter) in β. • With this in mind, that is y = Xβ + ² is where we are eventually going to end up, let us consider some basic matrix algebra using some simple matrices. Consider a matrix A:

a a12 a13  11   a21 a22 a23   A= . .  .    .  aN 1 aN 2 aN 3

. a1k . a2k .

.

. aN k

• A has a dimension of [N × k] or N rows and k columns. • A column vector: matrix of one column • A row vector: a matrix of one row.

71

          


• Symmetric Matrix: aik = aki e.g., 

 1 2  A=  2 1 • Diagonal Matrix: All aij = 0 ∀ i 6= j e.g., 

 1 0  A=  0 2 • Identity Matrix: A special kind of diagonal matrix where all aii = 1 and aij = 0 ∀ i 6= j. This matrix is denoted Ij where j is the number of rows and columns in the matrix, i.e., the identity matrix is square, e.g., 

 1 0 0     I3 =  0 1 0     0 0 1 • The identity matrix plays the same role as the number one in scalar math. • Im A = A and Iy = y. • Trace of a matrix: If A is square, the trace is denoted tr(A) and is the sum of the P diagonal elements, i.e., tr(A) = aii . 1. tr(cA) = c × tr(A) 2. tr(A0 ) = tr(A) 3. tr(Ik ) = k 4. tr(AB) = tr(BA) 72


• Rank of a matrix: The rank is the maximum number of linearly independent columns (or rows) in the matrix. • Triangular Matrix: All zeroes above or below the diagonal. 

 1 2 3     Upper Triangular :   0 1 2    0 0 1 

 1 0 0     Lower Triangular :  1 2 0     1 2 3 • Matrix equality: A = B if and only if aik = bik ∀ i, k. • Transpose of a Matrix: Denoted A0 or AT . The transpose of matrix A, A0 is such that aik = a0ki e.g,

 1 2 3   1 4 7    0    A = 2 5 8  A= 4 5 6         7 8 9 3 6 9 Note: If A is symmetric then A = A0 . Note: For any matrix A, (A0 )0 = A. • Matrix addition: A + B = [aik + bik ] = C 1. A + 0 = A 2. A − B = [aik − bik ] = C 3. A + B = B + A 4. (A + B) + C = A + (B + C) 73


5. (A + B)0 = A0 + B 0 • If

 1 2   5 6  A=  B=  3 4 7 8 then

 1+5 2+6   6 8  A+B = =  3+7 4+8 10 12 • Matrix Multiplication: uses the dot product or inner product of vectors. The inner product of two column vectors a and b is written as a0 b = a1 b1 + a2 b2 + a3 b3 + . . . + an bn

Note that the two column vectors must be of the same dimension, i.e., both have N rows. An example of the dot product: 

 1   4       b= 5  a= 2         3 6 Then the dot product becomes  · a0 b =

¸ 4     1 2 3   5  = 4 + 10 + 18 = 32   6

Note: a0 b = b0 a.

74


• For matrices C = AB, the elements of C, cik , are the inner products of the columns of A and B. • If A = [N × k] and B = [k × T ] then AB = C yields C = [N × T ] where cit is the inner product of row i of A and column k of B. • For multiplication, we need conformability, i.e., the number of rows in A must be the same as the number of columns in B. An example: 

 7 8    1 2 3    A=  and B =   9 10    4 5 6 11 12 then 

 7 8  1 2 3     AB =   9 10     4 5 6 11 12      7 + 18 + 33 8 + 20 + 36   58 64  AB =  =  139 154 28 + 45 + 66 32 + 50 + 72 Note: AB = [2 × 3][3 × 2] = [2 × 2] Note: In general AB 6= BA. • If C = AB then C 0 = (AB)0 = B 0 A0 • (ABC)0 = C 0 B 0 A0 • (AB)C = A(BC)

75


• Idempotent matrix: A matrix where A = A2 = A3 = · · ·. The identity matrix is an example of an idempotent matrix. • Vectors and matrices can be multiplied C = Ab where A = [N × k] and b = [k × 1]. For example, 

 a   1 2 3         A=  4 5 6  b= b      c 7 8 9 then



 1 2 3   a   a + 2b + 3c       b  =  4a + 5b + 6c C = Ab =  4 5 6         7a + 8b + 9c c 7 8 9

     

• Inverse Matrices: It is not possible to ”divide” in matrix algebra. The analogue to division is the inverse matrix. An inverse matrix of A, denoted A−1 , is such that AA−1 = A−1 A = I, where I is the identity matrix (the matrix analogue to one). • Not all matrices have inverses. For a matrix to have an inverse it must be square and non-singular. • Non-square and/or singular matrices do not have inverses. • The inverse of an inverse reproduces the original matrix (A−1 )−1 = A. • The inverse of a transpose equals the transpose of the inverse (A0 )−1 = (A−1 )0 . • The inverse of an upper (lower) triangular matrix is also an upper (lower) triangular matrix.

76


5 The Classical Model: Ordinary Least Squares

5.1 Introduction and Basic Assumptions

• We have developed the simple regression model in which we included only an intercept term and one right-hand side variable. • Oftentimes we would think that this is rather naive because we are not able to explain much of the variation in Y and we may have theoretical justification for including other variables. • We have hinted already at the idea that we would include more than one right-hand side variable in a model. Indeed, we often have any number of regressors on the right hand side. • To this end, we develop the OLS model with more than one rhs variable. To make the notation easier, we will use matrix notation. • We note that the econometric model must be linear in parameters:

  y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + · · · + β_k X_{ki} + ε_i

  Y = Xβ + ε

• We assume that y = [N × 1], X = [N × k] and includes a constant term (represented by a column of ones), β = [k × 1], and ε = [N × 1].

• We can only include observations with fully defined values for each Y and X.

• The Full Ideal Conditions:
  1. The model is linear in parameters.
  2. The explanatory variables (X) are fixed in repeated samples (non-stochastic).
  3. X has rank k, where k < N.
  4. The ε_i are independent and identically distributed.
  5. All error terms have a zero mean: E[ε_i] = 0.
  6. All error terms have a constant variance: E[εε'] = σ²I.
  7. Condition 6 implies that the error terms have no covariance: E[ε_i ε_j] = 0 ∀ i ≠ j.

• The linear estimator for β, denoted β̂, is found by minimizing the sum of squared errors over β, where

  SSE = Σ_{i=1}^{N} ε_i² = Σ_{i=1}^{N} (y_i − X_i β)²

or in matrix notation SSE = ε'ε = (y − Xβ)'(y − Xβ).

5.2 Aside: Matrix Differentiation

• We know that ²0 ² = y 0 y − 2β 0 X 0 y + β 0 X 0 X 0 β. • The second term is clearly linear in β since X 0 y is a k-element vector of known scalars, whereas the third term is a quadratic in β. • Looking at the linear term, one could write the term as f (β) = a0 β = a1 β1 + a2 β2 + · · · + ak βk = β 0 a where a = X 0 y. • Taking partial derivatives with respect to each of the βi and arranging the results in a column vector yields

78


   ∂(a β) ∂(β a)  = =  ∂β ∂β   0

0

a1 a2 a3 . . ak

   =a   

• For the linear term in the SSE, it immediately follows that ∂(2β 0 X 0 y) = 2X 0 y ∂β • The quadratic term can be rewritten as β 0 Aβ where the matrix A is of known constants, i.e., X 0 X. We can write this as  · f (β) =

β1 β2



¸  a11 a12 a13  β3   a21 a22 a23  a31 a32 a33

  β1     β   2    β3

= a11 β12 + a22 β22 + a33 β32 + 2a12 β1 β2 + 2a13 β1 β3 + 2a23 β2 β3

• The vector of partial derivatives is then written as 

  ∂f (β)  =  ∂β 

∂f ∂β1 ∂f ∂β2 ∂f ∂β3

  2(a11 β1 + a12 β2 + a13 β3 )    =  2(a β + a β + a β ) 12 1 22 2 23 3     2(a13 β1 + a23 β2 + a33 β3 )

• This result holds for any symmetric quadratic form, that is ∂(β 0 Aβ) = 2Aβ ∂β

79

   = 2Aβ  


for any symmetric A. From our SSE, we have A = X 0 X and substituting we obtain ∂(β 0 X 0 Xβ) = 2(X 0 X)β. ∂β

5.3 Derivation of the OLS Estimator

• The minimization of the SSE leads to the first-order necessary conditions: ∂SSE = −2X 0 y + 2X 0 Xβ = 0 ∂β • This is a matrix version of the simple regression model. There is one fonc for every parameter to be estimated. • We solve these k first order conditions by dividing by 2, taking X 0 y to the right-hand side and solving for β. • Unfortunately, we cannot divide when it comes to matrices, but we do have the matrix analogue to division, the inverse matrix. • Pre-multiply both sides of the equation by (X 0 X)−1 to obtain (X 0 X)−1 (X 0 X)β = (X 0 X)−1 X 0 y

• The first two matrices on the left hand side cancel each other out to become the identity matrix, a la A−1 A = I, which can be suppressed so that the estimator for β, denoted βˆ is βˆ = (X 0 X)−1 X 0 y • Note that the matrix-notation version of βˆ is very analogous to the scalar version 80


derived in the simple regression model. • (X 0 X)−1 is the matrix analogue to the denominator of the simple regression estimator ˆ P x2 . β, i • Likewise X 0 y is the matrix analogue to the numerator of the simple regression estimator ˆ P xi yi . β, • Remember that βˆ is a vector of estimated parameters, not a scalar. ˆ and var(β). ˆ • We look again at the first two moments of βˆ in matrix form: E[β] ˆ E[β] ˆ = E[(X 0 X)−1 X 0 y] where y = Xβ + ². Therefore, • The expectation of β: ˆ = E[(X 0 X)−1 X 0 (Xβ + ²)] E[β] = E[β + (X 0 X)−1 X 0 ²]: but β is a constant so, ˆ = β + (X 0 X)−1 X 0 E[²]: but E[²] = 0 so, E[β] ˆ = β+0 E[β] ˆ = β E[β]

ˆ is found by taking the E[(βˆ − β)(βˆ − β)0 ]. This leads to the following: • The cov(β) ˆ = E[(β + (X 0 X)−1 X 0 ² − β)(β + (X 0 X)−1 X 0 ² − β)0 ] cov(β) = E[((X 0 X)−1 X 0 ²)(X 0 X)−1 X 0 ²)0 ] = E[((X 0 X)−1 X 0 ²²0 X(X 0 X)−1 ] = (X 0 X)−1 X 0 E[²²0 ]X(X 0 X)−1 = (X 0 X)−1 X 0 σ 2 IX(X 0 X)−1 = σ 2 (X 0 X)−1 XX(X 0 X)−1 81


ˆ = σ 2 (X 0 X)−1 cov(β)

• How do we get an estimate of σ²?

• We use the fitted residuals ε̂ and adjust for the appropriate degrees of freedom:

  σ̂² = ε̂'ε̂ / (N − k)

where k is the number of right-hand side variables (including the constant term).

• Note: X'ε̂ = X'[Y − Xβ̂] = X'Y − X'Xβ̂ = X'Y − X'X(X'X)⁻¹X'Y = X'Y − X'Y = 0
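As an aside, β̂ = (X'X)⁻¹X'y can be computed directly with STATA's matrix commands. A minimal sketch using the wage example from Section 3.3 (assuming that data set is in memory):

. matrix accum XpX = school
. matrix vecaccum ypX = wage school
. matrix b = invsym(XpX)*ypX'
. matrix list b

Here matrix accum forms X'X (a constant term is included automatically), matrix vecaccum forms y'X, and invsym() inverts the symmetric cross-product matrix; the listed vector should reproduce the school coefficient and intercept from reg wage school.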

5.4 The Properties of the OLS Estimator

ˆ = β and cov(β) ˆ = σ 2 (X 0 X)−1 we move to prove the Gauss• Having shown that E[β] Markov Theorem. • The Gauss-Markov Theorem states that βˆ is BLUE or Best Linear Unbiased Estimator. Our estimator is the ”best” because it has the minimum variance of all linear unbiased estimators. • The proof is relatively straight forward. • Consider another linear estimator β˜ = C 0 y where C is some [N × k] matrix. ˜ = β it must be true that • For E[β] ˜ = E[C 0 y] = E[C 0 (Xβ + ²)] E[β] = E[C 0 Xβ + C 0 ²] = C 0 Xβ

82


˜ = β it must be true that C 0 X = I. Thus, for E[β] • Now consider the following lemmas: Lemma: β˜ = βˆ + [C 0 − (X 0 X)−1 X 0 ]y Proof: β˜ = C 0 y, thus β˜ = C 0 y + (X 0 X)−1 X 0 y − (X 0 X)−1 X 0 y = βˆ + [C 0 − (X 0 X)−1 X 0 ]y (1)

Lemma: β˜ = βˆ + [C 0 − (X 0 X)−1 X 0 ]² Proof: β˜ = βˆ + [C 0 − (X 0 X)−1 X 0 ]y, thus β˜ = βˆ + [C 0 − (X 0 X)−1 X 0 ][Xβ + ²] = βˆ + C 0 Xβ − β + [C 0 − (X 0 X)−1 X 0 ]² but C 0 X = I from before, so that = βˆ + [C 0 − (X 0 X)−1 X 0 ]²

• With these two lemmas we can continue to prove the Gauss-Markov theorem. We have ˆ ≤ cov(β). ˜ determined that both βˆ and β˜ are unbiased. Now, we must prove that cov(β)

˜ = E[(β˜ − E[β])( ˜ β˜ − E[β]) ˜ 0] cov(β) Now, take advantage of our lemmas and that β˜ is unbiased to obtain ˜ = E[(βˆ + [C 0 − (X 0 X)−1 X 0 ]² − β)(βˆ + [C 0 − (X 0 X)−1 X 0 ]² − β)0 ] cov(β) = E[((X 0 X)−1 X 0 ² + [·]²)((X 0 X)−1 X 0 ² + [·]²)0 ] : [·] = [C 0 − (X 0 X)−1 X 0 ] 83


= σ 2 (X 0 X)−1 + σ 2 [C 0 − (X 0 X)−1 X 0 ][C 0 − (X 0 X)−1 X 0 ]0

• The matrix [C 0 − (X 0 X)−1 X 0 ][C 0 − (X 0 X)−1 X 0 ]0 is non-negative semi-definite. This is the matrix analogue to saying greater than or equal to zero. ˆ ≤ cov(β) ˜ and βˆ is BLUE. • Thus, cov(β) • Is βˆ consistent? • Assume lim

N →∞

1 (X 0 X) = Qxx which is nonsingular N

• Theorem: plimβˆ = β. ˆ = β. • Proof: βˆ is asymptotically unbiased. That is, limN →∞ E[β] ˆ as Rewrite the cov(β) 2 ˆ = σ 2 (X 0 X)−1 = σ cov(β) N

µ

¶−1 1 0 (X X) N

Then ˆ = lim lim cov(β)

N →∞

σ 2 −1 Q =0 N xx

which implies that the covariance matrix of βˆ collapses to zero which then implies plimβˆ = β

5.5 Multiple Regression Example: The Price of Gasoline

• Some express concern that there might be price manipulation in the retail gasoline market. To see if this is true, monthly price, tax, and cost data were gath84


ered from the Energy Information Agency (www.eia.gov) and the Tax Foundation (www.taxfoundation.org).

• Here is a time plot of the retail and wholesale price of gasoline (U.S. average):

[Figure: time plot of gasprice and wprice, in cents per gallon (roughly 50 to 250 on the vertical axis), against the observation counter obs.]

• Here are the results of a multiple regression analysis:

. reg allgradesprice fedtax avestatetax wholesaleprice obs

      Source |       SS       df       MS              Number of obs =     264
-------------+------------------------------           F(  4,   259) = 8917.25
       Model |  408989.997     4  102247.499           Prob > F      =  0.0000
    Residual |  2969.76256   259  11.4662647           R-squared     =  0.9928
-------------+------------------------------           Adj R-squared =  0.9927
       Total |   411959.76   263  1566.38692           Root MSE      =  3.3862

---------------------------------------------------------------------
      gasprice |    Coef.   Std. Err.      t    P>|t|   [95% Conf. Int]
---------------+-----------------------------------------------------
        fedtax |    1.268       .159     7.94   0.000     .953   1.583
   avestatetax |     .725       .203     3.57   0.000     .325   1.125
 wholesaleprie |    1.091       .011    92.62   0.000    1.068   1.115
         trend |     .033       .009     3.62   0.000     .015    .051
         _cons |    5.281      2.698     1.96   0.051    -.031  10.594
---------------------------------------------------------------------


• The dependent variable is measured in pennies per gallon, as are all independent variables.

• The results suggest:
  1. For every penny in federal tax, the retail gasoline price increases by 1.268 pennies.
  2. For every penny in state sales tax, the price increases by only 0.725 cents.
  3. For every penny in wholesale price, the retail price increases by 1.091 pennies.
  4. The time trend, which advances by one unit for every month starting in January 1985, indicates that the average real price of gasoline increases by about 0.03 cents per gallon per month, everything else equal.
  5. The multiple regression results do not suggest a tremendous amount of pricing power on the part of retail outlets.
  6. The R² is very high; approximately 99.2% of the variation in retail gasoline prices is explained by the variables included in the model (although it should be noted that the data are time-series in nature and therefore a high R² is expected).
  7. To return to the conspiracy theory that prices are actively manipulated by retailers, the 95% confidence interval of the wholesale price parameter is [1.068, 1.115]. At the maximum, the historical pre-tax markup on marginal cost at the retail level is approximately 11.5%, which is consistent with the rest of the retail sector.
  8. One other conclusion is that while wholesale price increases are associated with retail price increases, it is also true that wholesale price decreases are associated with retail price decreases. Or is this conclusion too strong for the given estimation?

• What if we defined a dummy variable that took a value of one when the wholesale price of gasoline declined from one month to the next and included that as an addi-


tional regressor. If the retail market reacts symmetrically to increases and decreases in wholesale price changes, this dummy variable should have an insignificant parameter. . tsset obs time variable:

obs, 1 to 264

. gen wpdown = [wholesaleprice<l.wholesaleprice] . sum wpdown Variable | Obs Mean Std. Dev. Min Max -------------+----------------------------------------------wpdown | 264 .4659091 .4997839 0 1 . reg allgradesprice fedtax avestatetax wprice wpdown obs

-----------------------------------------------gasprice | Coef. Std. Err. t P>|t| -------------+---------------------------------fedtax | 1.328 .132 10.00 0.000 avestatetax | .741 .168 4.39 0.000 wprice | 1.101 .009 112.04 0.000 wpdown | 3.777 .348 10.84 0.000 obs | .027 .007 3.63 0.000 _cons | 2.245 2.258 0.99 0.321 ------------------------------------------------• The historical data suggest that the price of gasoline is 3.777 cents higher in months when the wholesale price declines. This suggests that there is an asymmetric effect of wholesale price changes on retail prices of gasoline.

5.6 Multiple Regression Example: Software Piracy and Economic Freedom

• Many in information providing industries are anxious about software piracy. Many policy suggestions have been made and the industry is pursuing legal remedies against 87


individuals.

• However, there might be economic influences on the prevalence of software piracy. Bezmen and Depken (2006, Economics Letters) looks at the impact of various socioeconomic factors on estimated software piracy rates in the United States from 1999, 2000, and 2001.

• Consider the simple regression model in which piracy is related to per capita income (measured in thousands):

. reg piracy lninc

Regression with robust standard errors                 Number of obs =     150
                                                       F(  1,   148) =   38.48
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2045
                                                       Root MSE      =  8.3097

------------------------------------------------------------------------------
             |               Robust
      Piracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       lninc |  -24.39808   3.932914    -6.20   0.000       -32.17   -16.62616
       _cons |   114.1777   13.55488     8.42   0.000     87.39163    140.9638
------------------------------------------------------------------------------

• As expected, states with greater income levels have lower levels of software piracy.

• What if we include other factors such as Economic Freedom (from the Fraser Institute), the level of taxation (from the Tax Foundation), unemployment (from the Bureau of Labor Statistics), and two dummy variables to control for year 2000 and year 2001:

. reg piracy sfindex statetax lninc unemp yr00 yr01

Regression with robust standard errors                 Number of obs =     150
                                                       F(  6,   143) =   18.87
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.2887
                                                       Root MSE      =   7.994

------------------------------------------------------------------------------
             |               Robust
      piracy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     sfindex |  -2.652959   1.331309    -1.99   0.048    -5.284546   -.0213714
    statetax |  -1.914663   .6571426    -2.91   0.004    -3.213632   -.6156949
       lninc |  -24.83959   3.248853    -7.65   0.000    -31.26157   -18.41761
       unemp |   .2433874   .7928428     0.31   0.759    -1.323819    1.810594
        yr00 |   1.568842   1.390505     1.13   0.261    -1.179758    4.317442
        yr01 |   3.611868   1.611408     2.24   0.027     .4266101    6.797126
       _cons |   150.7459   18.60083     8.10   0.000     113.9778     187.514
------------------------------------------------------------------------------

• The parameter estimate on income did not change very much.
  1. A 1% increase in per capita income corresponds with a reduction in the piracy rate of roughly 0.25 percentage points (the coefficient of about −24.8 is the effect of a one-unit change in log income).
  2. States with greater economic freedom tend to pirate less.
  3. States with greater taxation (which might proxy for enforcement efforts) tend to pirate less.
  4. States with greater unemployment do not pirate more.
  5. The parameter on yr01 suggests that piracy was greater in 2001 than in 2000 or 1999.

• The upshot: software piracy seems to be an inferior good.



6 Possible Problems in Regression

• As long as the full ideal conditions hold, the linear regression model has many desirable properties, although at times other estimators might serve better.

• However, there are possible problems that can arise with the classical model. They include:
  1. Omitted variables
  2. Measurement error
  3. Multicollinearity
  4. Specification errors
  5. Non-spherical error terms (heteroscedasticity and autocorrelation)
  6. Stochastic regressors

• We will approach these problems for the remainder of the semester. The first three are covered in this section and the rest will be covered during the remainder of the class.

6.1 Omitted Variables

• What happens to our model if we have left out some variables that should have been in the model? • Suppose that the real model should be

y = β1 X1 + β2 X2 + ²



but we actually run the regression y = β1 X1 + ²∗ where ²∗ = β2 X2 + ². • Now we find that E[²∗ ] = E[β2 X2 + ²] = E[β2 ]X2 + E[²] = β2 X2 + 0 6= 0 unless β2 = 0

• If E[ε*] = β_2 X_2 ≠ 0, then the estimates obtained via OLS do not have our desirable properties.

• We find that E[β̂_1] = β_1 + (X_1'X_1)⁻¹X_1'X_2 β_2 ≠ β_1 unless β_2 = 0.

• Proof:

  β̂_1 = (X_1'X_1)⁻¹X_1'y
       = (X_1'X_1)⁻¹X_1'(X_1β_1 + X_2β_2 + ε)
       = β_1 + (X_1'X_1)⁻¹X_1'X_2β_2 + (X_1'X_1)⁻¹X_1'ε

  Taking the expectation of both sides, we find

  E[β̂_1] = β_1 + (X_1'X_1)⁻¹X_1'X_2β_2



• Omitted variables causes a bias in βˆ1 . It is possible to determine the sign of the bias as: 1. If β2 > 0 and cov(X1 , X2 ) > 0 then β1 is biased upward. 2. If β2 < 0 and cov(X1 , X2 ) < 0 then β1 is biased upward. 3. If β2 > 0 and cov(X1 , X2 ) < 0 then β1 is biased downward. 4. If β2 < 0 and cov(X1 , X2 ) > 0 then β1 is biased downward. • Note: There is also a failure in efficiency and consistency.
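A small simulation makes the direction of the bias concrete. Everything below is a hypothetical sketch (made-up variables, coefficients, and seed), built so that β_2 > 0 and cov(X_1, X_2) > 0, the first case in the list above:

. clear
. set obs 500
. set seed 12345
. gen x1 = invnorm(uniform())
. gen x2 = x1 + invnorm(uniform())
. gen y = 1 + 2*x1 + 3*x2 + invnorm(uniform())
. reg y x1 x2
. reg y x1

The first regression recovers coefficients near 2 and 3; the second, which omits x2, pushes the coefficient on x1 up toward roughly 2 + 3·cov(x1, x2)/var(x1) = 5, an upward bias as predicted.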

6.2 Measurement Error

• At times we may have measurement error in one or more variables. Is this a problem? • Measurement error is only a problem if the error in measurement is correlated with ². • If the dependent variable is measured with error, then the resultant estimates are unbiased but not efficient. • If the independent variable(s) are measured with error but the measurement error is uncorrelated with ², then the parameter estimates are unbiased but are not consistent nor are they efficient (more on this later). • If the independent variable(s) are measured with error and the measurement error is correlated with ² then the parameter estimates are biased, inconsistent, and inefficient (all bad). • In general, it is better to have measurement error in the dependent variable. Standard errors of parameter estimates will tend to be larger, but we avoid bias.



6.3 Multicolinearity

• We mentioned in the Full Ideal Conditions that X needed to be of rank k, or that there are k independent column vectors in X. • Perfect colinearity occurs when X is [N × k] but has rank < k. • If X is not of full rank then (X 0 X) is not invertible and we have a problem: βˆ does not exist. • We find colinearity that is perfect implies that one independent variable is some linear combination of one or more other independent variables, e.g.,

X1 = δ1 X4 + δ2 X5 − δ3 X2

• Near colinearity occurs when the colinearity is very close, perhaps affected by a white noise term, e.g., X1 = δ1 X4 + δ2 X5 − δ3 X2 + ui where ui is a stochastic error term. • Perfect colinearity implies that the correlation between the variables is equal to one in absolute value. • Near colinearity implies that the correlation between the variables is approaching one in absolute value. • In perfect colinearity, βˆ does not exist. • In near collinearity, βˆ does exist but the standard errors are biased upwards. This implies that t-statistics are biased downwards and inference is problematic; perhaps Type II errors (fail to reject an incorrect null hypothesis). 93


• How to test for multicolinearity? There is no real statistical test, but there are some indicators:
1. High R2 but low t-statistics.
2. Look at the correlation between the variables.
3. Regress each independent variable on the remaining independent variables and look for a high R2 . (Why would this work?) An alternative is to look at variance inflation factors, or VIFs, calculated as V IFj = 1/(1 − Rj2 ), where V IFj > 10 is a rule-of-thumb for evidence that multicolinearity might be a problem.
• In STATA, VIFs can be obtained using the command vif after a reg command. For example: In 1906 a contest was held in which automobile manufacturers were given two gallons of gasoline and were challenged to travel as far as they could given their automobile’s configuration. The data were gathered from a report in the New York Times and yield the following relationship:

. reg distance hp cyl wgt

      Source |       SS       df       MS
-------------+------------------------------
       Model |  3578.84342     3  1192.94781
    Residual |  5427.07234    48  113.064007
-------------+------------------------------
       Total |  9005.91576    51  176.586584

Number of obs = 52
F(3, 48)      = 10.5
Prob > F      = 0.000
R-squared     = 0.397
Adj R-squared = 0.359
Root MSE      = 10.63

------------------------------------------------------------------------------
    distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          hp |  -.3858702   .1943494    -1.99   0.053    -.7766358     .004895
         cyl |   1.767679   1.865736     0.95   0.348    -1.983634     5.51899
         wgt |  -.0043504   .0013539    -3.21   0.002    -.0070726    -.001628
       _cons |   54.31428    5.95668     9.12   0.000     42.33757     66.2909
------------------------------------------------------------------------------

. vif

    Variable |      VIF       1/VIF
-------------+----------------------
          hp |     1.77    0.564054
         wgt |     1.73    0.578686
         cyl |     1.49    0.669438
-------------+----------------------
    Mean VIF |     1.66

Notice that the VIFs are all relatively low.
• How to fix the problem?
1. Drop the offending variables. The intuition here is that you are not able to influence the variance of Y when two or more independent variables are linear combinations of each other. This is true because var(X2 ) = var(aX1 ) = a2 var(X1 )

Therefore, you are not able to offer any ”new” variation in independent variables to explain the variation in the dependent variable. Another way to look at this is through what is called a Ballentine Venn Diagram

[Ballentine Venn diagram: overlapping circles representing var(Y ), var(X1 ), var(X2 ), and var(X3 ), with var(X3 ) lying entirely inside var(X1 ) and var(X2 ) only partially overlapping var(X1 ).]


Note that the variance of X3 is completely encompassed by var(X1 ) whereas the var(X2 ) is only partly in common with the var(X1 ). Here, X1 and X3 are perfectly collinear. However, the best variable to drop is X3 because X3 explains less of the variation in Y than X1 . How could you tell which variable is X1 and which is X3 ?
2. Add more data. The intuition here is that the collinearity may only exist within a small time period. If you are able to add more data, you may be able to move away from this spurious collinearity.
3. Transform the data (note that this is not scaling, which wouldn’t solve anything). A log transformation will sometimes help. Also, first-differencing time-series data may help. If the levels are highly correlated, oftentimes the differences are not.
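• Aside: indicator 3 above can be run by hand in STATA. A minimal sketch with hypothetical variable names (y, x1, x2, x3), not the gasoline data:

* Auxiliary regression of one regressor on the others
reg x1 x2 x3
display 1/(1 - e(r2))   // VIF for x1; values above 10 are the rule-of-thumb flag
* Built-in VIFs after the main regression
reg y x1 x2 x3
vif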

6.4 Specification Errors

• Specification errors can take multiple forms. The most common is the inclusion of an irrelevant variable. In this case we have included a variable that shouldn’t be in the model. • Why would an irrelevant variable be in the model? Usually it is from poor practice, either bad economic theory or a lack of economic theory (data mining). • Including an irrelevant variable does not, in and of itself, affect any of the desirable properties of the OLS parameter estimates. • However, including an irrelevant variable that is correlated with other right hand side variables can reduce the efficiency of the parameter estimates. In this case, the standard errors are biased upward and t-statistics are pushed down.



• Other specification errors focus on the functional form used to estimate the model. We might be estimating a relationship which is naturally non-linear using the linear model. If this is the case, it is unlikely that the linear model will provide unbiased estimates.
• Specification tests differ in their null and alternative hypotheses. Some have the null being ”specification correct” and others have the null being ”specification incorrect.”
• The Ramsey RESET test is an example where the null is that the specification is correct. The test is essentially an F-test of whether there is a difference between the R2 of a linear and a non-linear specification.
• If the linear model is correct then the non-linear components will be statistically insignificant. On the other hand, if the linear model is inappropriate then the non-linear components will be statistically significant.
• Be careful not to interpret the Ramsey RESET test as a test for omitted variables even though STATA and other packages will claim that it is (technically it is not).
• To implement the test
1. Estimate the linear (original) specification and obtain its R2 , call it RL2 .
2. Save the predicted values of y (in STATA use the predict command immediately after the reg command).
3. Generate the square and cube of the fitted values of y.
4. Include these non-linear terms in the original specification and obtain its R2 , call it RN2 .
5. Calculate an F-statistic as

F = (RN2 − RL2 ) / [(1 − RN2 )/(N − p)],


where N is the number of observations and p is the number of parameters in the non-linear model.
• An alternative is to include the squares and cubes of the independent variables (replacing steps 2 and 3).
• In STATA the Ramsey test is available by using the command ovtest [short for omitted variables test] immediately after a reg command. If you want to use the alternative approach, use the rhs option: ovtest, rhs
• A different test in STATA is the linktest. From the STATA help files: “linktest is based on the idea that if a regression is properly specified, one should not be able to find any additional independent variables that are significant except by chance.
• linktest creates two new variables, the variable of prediction, hat, and the variable of squared prediction, hatsq.
• The model is then refit using these two variables as predictors. hat should be significant since it is the predicted value.
• On the other hand, hatsq shouldn’t be significant, because if our model is specified correctly, the squared predictions should not have much explanatory power. That is, we wouldn’t expect hatsq to be a significant predictor if our model is specified correctly.
• So we will be looking at the p-value for hatsq.

. linktest

      Source |       SS       df       MS
-------------+------------------------------
       Model |  277705.911     2  138852.955
    Residual |  7736501.23   395  19586.0791
-------------+------------------------------
       Total |  8014207.14   397  20186.9197

Number of obs = 398
F(2, 395)     = 7.09
Prob > F      = 0.0009
R-squared     = 0.0347
Adj R-squared = 0.0298
Root MSE      = 139.95

-----------------------------------------------------------------api00 | Coef. Std. Err. t P>|t| [95% Conf. Interv] ---------+-------------------------------------------------------_hat | -11.05006 8.104639 -1.36 0.174 -26.98368 4.883562 _hatsq | .0093318 .0062724 1.49 0.138 -.0029996 .0216631 _cons | 3884.48 2617.695 1.48 0.139 -1261.877 9030.837 ------------------------------------------------------------------

• From the above linktest, the test of hatsq is not significant. This is to say that linktest has failed to reject the assumption that the model is specified correctly.”
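• Aside: the RESET steps above can also be implemented by hand. A minimal STATA sketch with hypothetical variable names (y, x1, x2), not the data used in the output above:

reg y x1 x2
predict yhat              // fitted values from the linear model
gen yhat2 = yhat^2
gen yhat3 = yhat^3
reg y x1 x2 yhat2 yhat3   // augmented model
test yhat2 yhat3          // F-test that the non-linear terms are jointly zero
* Built-in versions (re-run the original regression first)
reg y x1 x2
ovtest
linktest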

6.5 Example: Hedonic Models

• Hedonic modelling is a technique that attempts to measure the intrinsic value of product characteristics. For example, it is possible to determine the price of a car but how much of that price is attributable to horse power, number of doors, reputation, and so forth?
• Hedonic models have their theoretical basis in Lancaster (AER, 1966) and Rosen (1974, JPE).
• The methodology suggests that VALUE = f (characteristics) + ². Typically the f (·) is a linear combination of characteristics.
• We use data from Anglin and Gencay (1996, Journal of Applied Econometrics) describing 546 houses sold in July, August, and September of 1987 in Windsor, Canada.

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------


price | 546 68121.6 26702.67 25000 190000 lotsize | 546 5150.266 2168.159 1650 16200 bedrooms | 546 2.965201 .7373879 1 6 baths | 546 1.285714 .5021579 1 4 stories | 546 1.807692 .8682025 1 4 -------------+-------------------------------------------------------driveway | 546 .8589744 .3483672 0 1 recroom | 546 .1776557 .3825731 0 1 fullbase | 546 .3498168 .4773493 0 1 gashw | 546 .0457875 .2092157 0 1 airco | 546 .3168498 .465675 0 1 -------------+-------------------------------------------------------garagepl | 546 .6923077 .8613066 0 3 prefarea | 546 .2344322 .4240319 0 1 . gen llot = log(lotsize) . gen lprice = log(price) . reg lprice llot bedrooms baths airco Source | SS df MS -------------+-----------------------------Model | 42.790971 4 10.6977427 Residual | 32.6221992 541 .060299814 -------------+-----------------------------Total | 75.4131702 545 .138372789

Number of obs = 546
F(4, 541)     = 177.41
Prob > F      = 0.0000
R-squared     = 0.5674
Adj R-squared = 0.5642
Root MSE      = .24556

-------------------------------------------------------------------lprice | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+---------------------------------------------------------llot | .4004218 .0278122 14.40 0.000 .3457886 .455055 bedrooms | .0776997 .0154859 5.02 0.000 .0472798 .1081195 baths | .2158305 .0229961 9.39 0.000 .1706578 .2610031 airco | .2116745 .0237213 8.92 0.000 .1650775 .2582716 _cons | 7.093777 .231547 30.64 0.000 6.638935 7.548618 -------------------------------------------------------------------. ovtest Ramsey RESET test using powers of the fitted values of lprice Ho: model has no omitted variables 100


F(3, 538) = 0.56
Prob > F  = 0.6408

. linktest Source | SS df MS ---------+-----------------------------Model | 42.8049623 2 21.4024812 Residual | 32.6082079 543 .060051948 ---------+-----------------------------Total | 75.4131702 545 .138372789

Number of obs = 546
F(2, 543)     = 356.40
Prob > F      = 0.0000
R-squared     = 0.5676
Adj R-squared = 0.5660
Root MSE      = .24505

----------------------------------------------------------------lprice | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+------------------------------------------------------_hat |-.1747656 2.433807 -0.07 0.943 -4.955595 4.606064 _hatsq | .0528108 .1093971 0.48 0.629 -.1620827 .2677042 _cons | 6.528758 13.53062 0.48 0.630 -20.05002 33.10753 ----------------------------------------------------------------• These results are consistent with other studies of real estate hedonics. • Interpretation of the results: 1. A 10% increase in lot size yields a sample average 4% increase in price. 2. Each bedroom corresponds to a sample average 7.7% increase in price. 3. Each bathroom corresponds to a sample average 21.5% increase in price. 4. Air conditioning yields a 21.1% sample average increase in price. • The RESET test and the link test both suggest that there is not an omitted variables problem. Do we believe this? • A predicted (log) price for a house with 4 bedrooms, 1 bathroom, and a lot size of 5,000 ft2 and no air conditioning?

7.094 + 0.400 × ln(5000) + 0.077 × 4 + 0.216 × 1 + 0.211 × 0 = 11.028


Hence, the predicted price would be exp(11.028) = $61,574.30 Canadian dollars.
• We can also think about whether to build an additional bathroom or to add air conditioning. In the South we might lean towards the air conditioning, but in Windsor, Canada? Perhaps the extra bathroom is actually the better choice.
• Note: Local and county governments use hedonic models to determine the taxable value of properties. What is a potential shortcoming of using multiple regression models to determine taxable value? How do local and county governments attempt to mitigate these shortcomings?
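• Aside: the predicted price can be computed in STATA directly from the stored coefficients. A minimal sketch, to be run immediately after the reg lprice llot bedrooms baths airco command above:

display _b[_cons] + _b[llot]*log(5000) + _b[bedrooms]*4 + _b[baths]*1 + _b[airco]*0
display exp(_b[_cons] + _b[llot]*log(5000) + _b[bedrooms]*4 + _b[baths]*1)

The first line reproduces the predicted log price of roughly 11.028 and the second converts it to dollars.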



7 Functional Forms

7.1 Linear vs. Non-linear models

• We want models that are linear in parameters. Consider the following:

y = α + X1^β + ²

• Can we estimate this model via OLS? Let’s see.
• ²ˆ = y − αˆ − X1^βˆ
• Thus SSE = Σ(y − α − X1^β )2
• The first-order conditions are

∂SSE/∂α = −2Σ(y − αˆ − X1^βˆ ) = 0
∂SSE/∂β = −2Σ(y − αˆ − X1^βˆ )[∂(X1^β )/∂β] = 0

• From these two first-order conditions there is no unique closed-form solution for the two parameters of interest, αˆ and βˆ.
• There are numerous occasions in economic theory where the parameters of interest enter into the functional form in a nonlinear fashion. Examples include production and utility functions.



7.2 Log-Linear Functions

• Consider the Cobb-Douglas production function

Q = A L^β1 K^β2

where Q is output, L is labor, K is capital, A is a scale parameter, and the β’s are parameters of interest.
• Here, the β’s tell us something about the scale economies that are present in the production process.
• In its current form, the model is not stochastic, it is deterministic.
• We can make the model stochastic by simply adding an error term, i.e., Q = A L^β1 K^β2 + ²

• However, the model is not linear in parameters, and thus it is not possible to apply OLS to this model. • There are several ways to linearize nonlinear systems. One popular way is to take the logs of both sides of the deterministic equation:

lnQ = lnA + β1 lnL + β2 lnK

• Now the model is linear in the parameters, but is nonlinear in the dependent and independent variables.



• We can make this model stochastic by adding an error term.

lnQ = lnA + β1 lnL + β2 lnK + ²

• This functional form is known as a log-log form. • Note that if we take the antilog of both sides of the stochastic equation we get Q = ALβ1 K β2 exp(²)

• It is popular in many studies because of the unique economic interpretation that can be placed on the estimated β’s.
• Note that dlnQ = β1 dlnL, so that

dlnQ/dlnL = (dQ/Q)/(dL/L) = (dQ/dL)(L/Q)

The right-hand side of the differential is the functional form of an elasticity, or a ratio of percentage changes.
• The nice thing about the log-log form is that the slope parameters are directly interpretable as elasticities. The drawback is that the functional form assumes a constant elasticity, which may or may not be a good assumption in the end.
• In this example, β1 is the partial elasticity of output with respect to the labor input, holding capital constant. Also, β2 is the partial elasticity of output with respect to the capital input, holding labor constant.



• We can also measure Returns to Scale: β1 + β2 measures returns to scale, or the response of output to a proportional change in inputs.
1. s = β1 + β2 = 1 : Constant Returns to Scale
2. s = β1 + β2 < 1 : Decreasing Returns to Scale
3. s = β1 + β2 > 1 : Increasing Returns to Scale
• Example: Primary Metals Industry
– Using data originally constructed by Hildebrand and Liu (1957), and updated by Aigner, Lovell and Schmidt (1977), a production function investigating SIC 33, the primary metals industry, is estimated.
– The model is lnQ = α + β1 lnL + β2 lnK + ², where Q is measured as millions of dollars in value added.
– The results are as follows (25 observations, R-squared = 0.985):

Variable    Coefficient    Std. Error    t-statistic    P-value
α           -.394332       .338434       -1.16517       [.256]
Log K       .247478        .086867       2.84893        [.009]
Log L       1.34032        .138258       9.69433        [.000]

– We see that the output elasticity of capital is 0.24 and the output elasticity of labor is 1.34.
– We look at RTS and see that β1 + β2 = 1.58 > 1 so that the primary metals industry operates under increasing returns to scale.
– What can we say about the market structure of the primary metals industry?
– The intercept is −0.394, which is the average value of lnQ when lnK and lnL are zero. We take the antilog of α and obtain Q = $0.67 million.
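• Aside: the log-log production function and the returns-to-scale test are straightforward in STATA. A minimal sketch with hypothetical variable names (q, l, k), not the Hildebrand and Liu data:

gen lnq = log(q)
gen lnl = log(l)
gen lnk = log(k)
reg lnq lnl lnk
test lnl + lnk = 1    // Wald test of constant returns to scale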



• Example: Demand Functions
– It is possible to estimate demand as a log-linear model and thus to estimate the elasticity of demand.
– An example would be lnQ = β0 + β1 lnPown + β2 lnPother + ², where Q is coffee consumption per day, Pown is the price of coffee per pound and Pother is the price of tea per pound.
– Some results might look like:

Variable      Coefficient    t-statistic
β0            0.78           51.1
lnPown        -0.25          -5.12
lnPother      0.38           3.25

– The own price elasticity is -0.25 (elastic or inelastic?)
– Cross price elasticity is 0.38 (substitutes or complements?)
• Example: Demand Function with Income
– It is sometimes desirable to include income in the demand structure.
– Consider a housing model with lnQ = β0 + β1 lnP + β2 lnINCOME + ²
– Some results might look like:

Variable      Coefficient    t-statistic
β0            4.17           37.90
lnP           -0.25          14.70
lnINCOME      0.96           36.92

– Here the price elasticity of housing is -0.25 (elastic or inelastic?)
– Here the income elasticity of housing is 0.96 (normal or inferior good?)
– We can test whether income elasticity is statistically different from 1:

H0 : β2 = 1

t = (0.96 − 1)/0.026 = −1.54

– We look at the critical value at 0.05 significance - tc = 1.96. We fail to reject the null. – What does this mean in terms of the housing market?

7.3 Other Functional Forms

1. Linear model: Dependent variable and independent variables in levels.

y = β0 + β1 x1 + ²

Impact of x1 at the margin: β1 . Elasticity measure: β1 (x̄/ȳ)
2. Reciprocal Model: Here the independent variables enter the model as inverses of their levels.

y = β0 + β1 (1/x1 ) + ²

• In this model as Xi → ∞ then Yi → β0 . Depending on your application you would have different expectations on the magnitudes of β0 and β1 . • Individual consumption patterns are useful examples. Let Y be the expenditure on a good (say cars). Let X be the income of the individual. • Then we can draw some various pictures depending on how individuals treat the goods



[Two figures: expenditure on cars ($CAR) plotted against INCOME under the reciprocal model. In the left panel expenditure declines toward the asymptote β0 as income rises; in the right panel expenditure is zero until income reaches −β1 /β0 and then rises toward the asymptote β0 .]

• The picture on the left indicates that as income increases the dollars spent on cars decreases, indicating that in this sample cars are inferior goods. • The picture on the right indicates that there is a minimum amount of income required in order to initiate spending on a car, afterwards cars are normal goods and there is a maximum amount of money that individuals will spend on a car. • Impact of x1 at the margin: − xβ12 1

β1 • Elasticity measure: − xy

3. Log-log: Also called the double log or log-linear model. In this case, both the dependent and independent variables in logs.

lny = β0 + β1 lnx1 + ²

• This is a popular form for demand functions because of the unique economic interpretation of the slope coefficients. lnQi = β0 + β1 lnPi + β2 lnPix + ²

• Here β1 is the own-price elasticity and β2 is a cross-price elasticity.


• The sign of β2 indicates whether the two goods are complements, substitutes or independent of each other.
• Impact of x1 at the margin: β1 (y/x)
• Elasticity measure: β1
4. Log-lin: Also called the semi-log model.
• These models are useful for growth rates. Here, the dependent variable is in logs, all other variables are in levels.
• Examples include: projected government deficits; projected GDP growth; projected growth of outstanding consumer credit.
• To measure the rate of growth of outstanding consumer credit:

Yt = Y0 (1 + r)^t

where Y0 is the initial value of Y, r is the compound rate of growth over time, Yt is the value of Y at time t, and t is the number of time periods in the future.
• When we take logs we get

lnYt = lnY0 + t ln(1 + r)

If we let β0 = lnY0 and β1 = ln(1 + r) then we get

lnYt = β0 + β1 t + ²

lny = β0 + β1 x1 + ²



• We then get β0 = lnY0 , or Y0 = exp(β0 ), the value at the beginning of the sample.
• To determine the rate of growth:
βˆ1 = ln(1 + r)    (the instantaneous rate of growth)
exp(βˆ1 ) = 1 + r, so r = exp(βˆ1 ) − 1    (the compound rate of growth over the sample period)

• Impact of x1 at the margin: β1 y • Elasticity measure: β1 x 5. Lin-log: Dependent variable in levels, all other variables are in logs.

y = β0 + β1 lnx1 + ² • These models are useful when we look at how choice variables might influence other variables. • For example, the Fed cannot control GDP directly, but it is able to control the money supply. GDPt = β0 + β1 lnM St + ² where GDPt is GDP at time t and M St is the money supply at time t. • This model tells us what will happen to GDP with a percentage change in the money supply. • With data from the U.S. from 1973-1987 using GDP and M2 we find the following results: 111


GDPt = −16329.0 + 2584.8 lnMSt
       (−23.494)   (27.549)

where GDP and M2 are measured in billions of dollars and t-stats are reported in parentheses.
• We find from this that

∆GDP = 2584.8 × (∆MS/MS)

so if the money supply increases by 1% (that is, ∆MS/MS = 0.01), then ∆GDP ≈ $25.8 billion.

• Impact of x1 at the margin: β1 /x • Elasticity measure: β1 /y • A summary of function forms, marginal impacts and elasticity measures

Name          Functional Form        Slope = dY/dX       Elasticity = (dY/dX)(X/Y)
Linear        Y = β0 + β1 X          β1                  β1 (X/Y)
Log-Log       lnY = β0 + β1 lnX      β1 (Y/X)            β1
Log-Lin       lnY = β0 + β1 X        β1 Y                β1 X
Lin-Log       Y = β0 + β1 lnX        β1 (1/X)            β1 (1/Y)
Reciprocal    Y = β0 + β1 (1/X)      −β1 (1/X^2)         −β1 (1/(XY))

7.4 Testing for the Appropriate Functional Form

• It is possible to test between the linear and the log-log model. • The first step is to run a scatter plot of your raw data and see if it looks more linear or logarithmic. The problem with this eye-ball approach is that there is no statistical justification for your decision. We look for another way. • The MacKinnon, White and Davidson (MWD) test goes as follows:



• H0 : Linear model: Y is a linear function of the X’s.

yi = β0 + β1 xi + ²i

• Hα : Log-log model: Log of Y is a linear function of the log of X’s

lnyi = β0 + β1 lnxi + ²i

• The MWD test is performed with the following steps:
1. Estimate the linear form and obtain the fitted values of y, call them yhat (in STATA use the predict varname command immediately after the reg command).
2. Estimate the log-log form and obtain the fitted values of lny, call them lnyhat.
3. Create Z1i = ln(yhati ) − lnyhati
4. Regress y on the X’s and Z1 . Reject H0 if the coefficient on Z1 is significant using a t-test.
5. Generate Z2i = exp(lnyhati ) − yhati
6. Regress lny on the lnX’s and Z2 . Reject Hα if the coefficient on Z2 is significant using a t-test.
• The intuition behind the MWD test is that if the linear model is correct, the variable Z1 should not be significant in a regression of y, and if the log-log model is correct the variable Z2 should not be significant in a regression of lny.
• There is the possibility that the MWD test will fail to reject both nulls or reject both nulls.


• There is also the possibility that the predicted values of y will contain zeroes or negative values, in which case it is impossible to take logarithms.

7.5 The Polynomial Model

• These are models relating to cost, production and revenue functions. • Example: We look to estimate long-run average costs and output. • The LRAC curve is assumed to be U-shaped - which can be tested. • We capture the curvature of the LRAC by including a quadratic term LRAC = β0 + β1 Q + β2 Q2

• In a stochastic form: LRAC = β0 + β1 Q + β2 Q2 + ²
• We can estimate LRAC with OLS - but there may be theoretical problems - what might they be?
• Gather data for 86 S&L’s for 1975.
• Output is measured as total assets in billions of dollars.
• LRAC is measured as average operating expenses as a percentage of total assets.
• The estimates are LRAC = 2.38 − 0.615Q + 0.054Q2 .
• The minimum average cost is reached at $5.69 billion. How do we know?

dLRAC/dQ = −0.615 + 2(0.054)Q


Set equal to zero to get −0.615 + 0.108Q = 0, or Q = 0.615/0.108 = 5.69.
• We can also look at the performance of sports teams:
• Using MLB data from 1990-1997, Depken (2000) estimates Π = β0 + β1 WIN% + β2 (WIN%)2

• To maximize profit, we look for

∆Π/∆WIN% = 0

where winning percentage is one measure of quality.
• This implies that to maximize profits we should have

∆Π/∆WIN% = β1 + 2β2 (WIN%) = 0
(WIN%)∗ = −β1 /(2β2 )



• Regression analysis indicates the following results in professional baseball:

Dep. Var.        β1          β2          Q∗
Attendance       -672.87     1129.90     0.297
Gate Revenue     -22.52      84.94       0.129
Media Revenue    106.84      -97.40      0.548
Total Revenue    141.57      32.8        2.158
Profit           76.9523     -67.6805    0.544

• Note how the optimal quality changes depending upon the variable being maximized.
• Given data from the 1990’s, it is not possible to maximize revenues because to do so would require winning 215% of a team’s games.
• To maximize attendance, team quality need be only 0.297; however, to maximize profits team quality needs to be 0.544.
• Does this make sense? The optimal quality is to be barely above five hundred?
• In many ways this does make sense. First, such quality may not be prohibitively expensive. Fans may attend more because the team is competitive. If the team is too good or too bad, fewer people go to the game because the outcome is more certain and there is no fun in that.
• Recalling how we optimized profits earlier, we can do the same for win percentage as a function of player costs:

∆WIN%/∆PLAYC = β1 + 2β2 (PLAYC) = 0
(PLAYC)∗ = −β1 /(2β2 )


• Regression analysis indicates the following results in professional baseball and football for 1990-1997:

Sport    Variable    β1          β2         (PLAYC)∗
MLB      WINS        -22801.1    134.501    84.76
NFL      WINS        -8.26E6     389125     10.61

• Does this make sense? Minimum efficient scale (minimum average cost) occurs in baseball with 85 wins (a .524 winning percentage) and at 11 wins in football (a .687 winning percentage)

7.6 Example: Optimal Store Size

• There are concerns that big box stores such as Home Depot and Walmart pose a threat to the small-town feel of many communities. Large stores are sometimes unattractive and are difficult to deal with if they are abandoned. On the other hand, big box stores offer attractive features to consumers, especially the ability to minimize transaction costs by having lots of different things for sale in the same location. Anyone with kids appreciates the ability to stop once. • What is the optimal size of a big-box store from the point of view of revenue maximization? To address this question, I used a Google “hack” to track down some data that could be used in this situation. Specifically, I entered the following into a Google search window "store"+"sales"+"squarefootage"filetype:xls The second link offered is a spreadsheet with data describing Home Depot. How convenient!!



• I grabbed annual data on average sales per store and average square footage per store from 1989 through 2002 (the sample is admittedly small in this example).

• First plot the data to see if there is any curvature in the data:

[Scatter plot: average sales per store (roughly 500 to 900) against average square footage per store (roughly 90 to 110).]

• Create a quadratic value of square footage and run a multiple regression model of average sales against square footage and its quadratic.
• Here are the descriptive statistics. Sales are measured in thousands of dollars per year, square footage is measured in thousands of square feet:

. sum sales sqft sqft2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       sales |        14       756.5    110.3175        515        876
        sqft |        14    102.2857    6.684409         88        109
       sqft2 |        14    10503.86    1328.572       7744      11881



• Here are the regression results and the calculation of the optimal store size (for Home Depot at least) . reg sales sqft sqft2 Source | SS df MS -------------+-----------------------------Model | 147433.063 2 73716.5314 Residual | 10776.4371 11 979.676104 -------------+-----------------------------Total | 158209.5 13 12169.9615

Number of obs = 14
F(2, 11)      = 75.25
Prob > F      = 0.0000
R-squared     = 0.9319
Adj R-squared = 0.9195
Root MSE      = 31.3

------------------------------------------------------------------------------
       sales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        sqft |   115.1812   46.64647     2.47   0.031     12.51303    217.8494
       sqft2 |  -.5007673   .2346912    -2.13   0.056    -1.017319    .0157845
       _cons |  -5764.906   2308.097    -2.50   0.030    -10844.99    -684.819
------------------------------------------------------------------------------

. nlcom _b[sqft]/(-2*_b[sqft2]) _nl_1:

_b[sqft]/(-2*_b[sqft2])

--------------------------------------------------------------sales | Coef. Std. Err. t P>|t| [95% Conf. Interval] ------+-------------------------------------------------------_nl_1 | 115.0047 7.45517 15.43 0.000 98.59602 131.4135 ---------------------------------------------------------------

• The results suggest that every square foot of store space yields an average of $115 in annual revenue. However, there are diminishing returns to store space. • The nlcom command tells STATA that you wish to calculate a non-linear combination of parameter estimates. While you could do this with a hand calculator, the nlcom



command is useful because it will calculate standard errors, t-statistics, and provide confidence intervals for the non-linear combinations. • The results suggest that sales are maximized when Home Depot builds a 115,000 sq. ft. building. However, the upper and lower bounds of the 95% confidence interval suggest that Home Depot could build as small as 98,000 sq. ft. or as large as 131,000 sq. ft. • What do these results imply about attempts to limit the (re)location of big box retailers?



8 Dummy Variables and Functional Forms

8.1 Dummy Variables

• We have implicitly assumed that our data are continuous in nature. • Yet, many times data are inherently discrete NOT continuous. • What if we wanted to look at how GDP has changed over time? • There is a good argument that during war years GDP should behave differently than during peacetime. • We can control for these ”war years” with a ”dummy variable” • A DV is a dichotomous variable, which typically takes on a value of zero if a condition is not met and one if it is met. • Thus, for our GDP model we might have something like

Yt = β0 D1t + β1 D2t + β2 Xt + ²t

• Let D1t = 1 when there is a war and 0 otherwise. • Let D2t = 1 when there is no war, and 0 otherwise. • Note that this model has no specific intercept term. • If you regressed this model with an intercept term, the estimation would break down because of multicolinearity. This is because D1t + D2t = 1 and the constant term’s ”variable” is also 1.



• We could run the model with only one dummy variable and use an intercept term:

Yt = β0 + β1 D2t + β2 Xt + ²t

• Here, we could run OLS on this model with no problems. • What do dummy variables do for us? • The dummy variables actually act as a shift parameter for the mean of one group. They allow us to differentiate across different groups. Why? • We know that there are various functional forms we can use in econometrics. But they all have to be linear in parameters. This linearity is constrained because there are no ways to introduce discontinuities in the model. • Dummy variables allow us to introduce discontinuities into a linear system. • Consider the model with the intercept included. If the D = 0 then the intercept for the model is simply β0 . If the D = 1 then the intercept becomes β0 + β1 . • Thus, whenever the condition is satisfied, such that D = 1, the intercept of our regression line shifts. • Of course, depending on whether β1 is greater than, less than or equal to zero will determine which way the model will shift. Consider the β1 > 0 and β2 > 0, then



[Figure: Y plotted against X with two parallel lines of slope β2 ; the intercept is β0 when D = 0 and β0 + β1 when D = 1.]

• Here, the intercept shift allows the same marginal influence of X on GDP (Y ) but recognizes that during peace-time GDP is that much greater.
• One must be careful in interpreting intercept shifts by looking at the exact definition of the dummy variables.
• One must also be careful of the “dummy variable trap.”
• As mentioned earlier, one cannot estimate a collinear model (because the rank of X < k). Thus, if all of the dummy variables included in your model will always sum to one, then there can be no formal intercept in the econometric model.
• A good way to avoid the ”trap” is to always include an intercept term.
• Then, we include j − 1 dummy variables for our qualitative variables where j is the number of possible categories.
• For example: If sex is our qualitative variable, then we include SEX = 0 for male and 1 for female.
• However, if we are looking at highest education then we may have several categories, e.g., grade school, high-school, some college, undergraduate degree, master’s degree,


doctorate. Here, j = 6 but we only include 5 dummy variables. • That category omitted is the “reference” category - which all other categories are compared against. • Thus, if income was the dependent variable and level of education dummy variables are included, then we might expect positive parameter estimates if grade-school is the reference category. On the other hand, if graduate school was the reference category, we might expect negative parameter estimates. Be careful in the interpretation of dummy variable coefficients!!
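• Aside: one way to generate the education dummies in STATA and omit a reference category. A minimal sketch with hypothetical variable names (income, exper, and educ coded 1 = grade school, ..., 6 = doctorate):

tab educ, gen(ed)           // creates the dummies ed1-ed6
reg income ed2-ed6 exper    // ed1 (grade school) is the omitted reference category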

8.2 Interaction Terms

• A potential pitfall in the use of dummy variables can be given in an example taken from labor economics. • Consider the hypothesis that there is wage discrimination against females. One may wish to estimate a wage equation that would control for various individual attributes such as productivity, motivation, teamwork, etc. At the same time, one would want to include a dummy variable that would control for the sex of a particular worker. • This is a common approach in labor models. Thus, an example of a wage model would be Wi = β0 + β1 SEXi + β2 EDUi + αZi + ²i where EDU is the education (typically in years), and Zi is a set of variables thought to influence wages such as the years of tenure at a particular job, the age of the worker and the experience of the worker.



• Oftentimes, the explanatory variables are highly correlated with each other. This is a potential problem.
• Nonetheless, let’s say that we estimate a model and find that β1 < 0. Some would claim that this is evidence of discrimination, but is it really?
• The estimated equation only states that, given the sample used, women start out at a lower wage. In this model, the returns to education, tenure, age and experience are assumed the same for men and women.
• Thus, a picture of this in the EDU space would look like:

[Figure: wage against EDUCATION with two parallel lines of slope ∆W/∆EDU = β2 ; the male line starts at intercept β0 and the female line at β0 + β1 .]

• Here, the implication is that women start at a lower wage (as indicated by the lower intercept), i.e., that β1 < 0 but that the returns to education are the same for men and women alike. • This might be true in some areas, not in others, e.g., economics. • Perhaps we think that women and men are actually rewarded differently for their education.



• To accommodate this possibility, we then interact our SEX dummy variable with EDU to obtain:

Wi = β0 + β1 SEXi + β2 EDUi + γ0 (EDUi × SEXi ) + αZi + ²i

• In this model, the intercepts are allowed to shift as well as the slope parameter on EDU.
• If the observation is of a male, then we see that ∆Wi /∆EDUi = β2
• Whereas if the observation is of a female, then we see that ∆Wi /∆EDUi = β2 + γ0
• When the slopes differ across groups, then we can claim a difference in the returns to education across the two sexes.
• If the parameter γ0 is insignificant, then it would imply that there is no difference in the marginal effect of education on wages across the sexes. This is NOT the same as saying that there is a wage differential, however.
• What if we think that the returns to education are not exactly linear, but that there may be some second-order effect of education on wages? This would seem a rather straightforward idea.
• We could accommodate this non-linearity by including an interaction of education with



itself to obtain: Wi = β0 + β1 SEXi + β2 EDUi + β3 EDUi2 + γ0 (EDUi × SEXi ) + γ1 (EDUi2 × SEXi ) + αZi + ²i

• This model allows there to be a second-order effect (probably negative) of education and that it may differ across sexes. • A picture of this may look like

[Figure: wage (Wi ) against EDUCATION with two concave, parallel curves. The male curve starts at intercept β0 with slope β2 + 2β3 EDUi ; the female curve starts at β0 + β1 with slope (β2 + γ0 ) + 2EDUi (β3 + γ1 ).]

• In this case, there is a difference in the intercept terms, but the γ parameter would be insignificant or equal. This is reflected in the parallel course of the two curves. • Note that there would be a ”glass ceiling” in this graph even though the returns to education were the same on the margin but that the returns differ because of the starting values of the wages (as reflected in the intercepts).



• Consider an alternative picture, however.

[Figure: the alternative case. The female curve starts at the lower intercept β0 + β1 but has the steeper slope (β2 + γ0 ) + 2EDUi (β3 + γ1 ), eventually crossing the male curve, which starts at β0 with slope β2 + 2β3 EDUi .]

• Here, we see that β1 < 0 so that women start out at a lower wage. We also see that the returns to education on the margin are greater for women (γ1 > 0) than men but that at some level of education, women will begin to earn more than men. The question then is how much education does a woman need to overcome the initial disparity? • Thus, the use of dummy variables and interaction terms can allow us to test all sorts of extra hypotheses. • These pictures do not necessarily represent general conclusions. Different data samples will reveal different results.
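• Aside: a minimal STATA sketch of the wage model with an interaction term, using hypothetical variable names (wage, sex coded 1 for female, edu):

gen sexedu = sex*edu
reg wage sex edu sexedu
lincom edu + sexedu     // return to education for women (beta2 + gamma0)

The coefficient on sexedu is the estimate of γ0 ; its t-test is the test of whether the returns to education differ across the sexes.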

8.3 Time Trends

• In certain applications we recognize that there may be a time trend in the data.
• A good example is prices or GDP which tend to grow over time. We would like to control for these time trends.
• One may think to use dummy variables to control for different time periods. However, this could be problematic.


• If you have T time periods in your sample and T observations, then to treat each time period as different would require T − 1 dummy variables, plus an intercept term. At this point, you have exhausted all the degrees of freedom available in the sample.
• An alternative approach is to create what is called a ”trend variable”.
• A trend variable is a monotonically increasing variable, typically just the time index.
• Trend variables enter into an equation as a separate explanatory variable, e.g.,

yt = β0 + β1 Xt + β2 TIME

where TIME = 1, 2, . . . , T.
• Note: there is a potential problem with time trends in log-log or lin-log models.

• If one takes the log of time, you obtain the following vis-a-vis a linear time trend:

[Figure: a linear trend and ln(trend) plotted against the time index from 0 to 100; the log of the trend rises quickly at first and then flattens out relative to the linear trend.]


• Thus, when you take the log of time, you may be discounting the effect of time on your dependent variable. • Note: The log of zero does not exist; one should be careful in how you set up your time trend.



8.4 Example: The Taft-Hartley Act of 1947

• Here is a graph of annual work stoppages from 1916-2001:

[Figure: strikes per year (0 to 5,000) plotted against year, 1920-2000; the series falls sharply in the late 1940s.]

• It is very apparent that something dramatic happened around the mid 1940s - work stoppages declined significantly. • In fact, in 1947 the Taft-Hartley Act- was passed. From Infoplease.com “the act qualified or amended much of the National Labor Relations (Wagner) Act of 1935, the federal law regulating labor relations of enterprisers engaged in interstate commerce, and it nullified parts of the Federal Anti-Injunction (Norris-LaGuardia) Act of 1932. The act established control of labor disputes on a new basis by enlarging the National Labor Relations Board and providing that the union or the employer must, before terminating a collective-bargaining agreement, serve notice on the other party and on a government mediation service.”
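• Aside: the dummy variables used below can be generated before the regressions are run. A minimal STATA sketch, assuming the annual data contain a variable named year:

gen pre47  = (year < 1947)     // equals 1 for years before Taft-Hartley
gen post47 = (year >= 1947)    // equals 1 for 1947 and later
gen time   = year - 1929       // a simple trend variable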

131


• Multiple regression results using 73 years of data from 1929-2001: . sum strikes realgdp unemp minwage time if e(sample) Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------strikes | 73 829.2466 1330.081 17 4985 realgdp | 73 3627.236 2473.242 603.3 9214.54 unemp | 73 7.306849 5.179692 1.2 25.2 minwage | 73 1.846575 1.621499 0 5.15 time | 73 36 21.21713 0 72 . reg strikes realgdp unemp time minwage Source | SS df MS -------------+-----------------------------Model | 54920903.3 4 13730225.8 Residual | 72455474.2 68 1065521.68 -------------+-----------------------------Total | 127376378 72 1769116.36

Number of obs = 73
F(4, 68)      = 12.89
Prob > F      = 0.0000
R-squared     = 0.4312
Adj R-squared = 0.3977
Root MSE      = 1032.2

--------------------------------------------------------------------strikes | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+----------------------------------------------------------realgdp | .530194 .3803215 1.39 0.168 -.2287257 1.289114 unemp |-27.50805 29.85787 -0.92 0.360 -87.0885 32.07241 time |-106.8932 25.38961 -4.21 0.000 -157.5574 -56.22906 minwage | 84.74099 515.9023 0.16 0.870 -944.726 1114.208 _cons | 2798.781 577.0888 4.85 0.000 1647.218 3950.344 --------------------------------------------------------------------. reg strikes realgdp unemp time minwage pre47 Source | SS df MS -------------+-----------------------------Model | 106626680 5 21325336 Residual | 20749697.4 67 309696.976 -------------+-----------------------------Total | 127376378 72 1769116.36

Number of obs = 73
F(5, 67)      = 68.86
Prob > F      = 0.0000
R-squared     = 0.8371
Adj R-squared = 0.8249
Root MSE      = 556.5

------------------------------------------------------------------------------
     strikes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     realgdp |  -.8260032   .2303428    -3.59   0.001    -1.285769    -.366237


unemp |-100.4215 17.05749 -5.89 0.000 -134.4684 -66.37464 time | 112.2559 21.79502 5.15 0.000 68.75289 155.759 minwage | 24.52896 278.1735 0.09 0.930 -530.7077 579.7656 pre47 | 4519.136 349.7473 12.92 0.000 3821.037 5217.234 _cons | -641.696 409.5055 -1.57 0.122 -1459.072 175.6804 ---------------------------------------------------------------------

• If we do not include the pre47 dummy variable the constant term is overstated and the impacts of unemployment and real gross domestic product are muted (insignificant). • If we include the pre47 dummy variable, the constant term during the pre-1947 period is equal to . lincom _b[pre47]+_b[_cons] ( 1)

pre47 + _cons = 0

------------------------------------------------------------------------------
     strikes |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    3877.44   649.3468     5.97   0.000     2581.338    5173.542
------------------------------------------------------------------------------

whereas after 1947, the constant term is -641 but not significantly different from zero.
• Notice that after we include the Pre47 dummy variable, REALGDP is negatively related to work stoppages as is UNEMP. Both of these results make sense: with greater income and higher unemployment workers are less likely to go on strike.
• What if we include a post-1947 dummy instead?

. reg strikes realgdp unemp time minwage post47

      Source |       SS       df       MS
-------------+------------------------------
       Model |   106626680     5    21325336
    Residual |  20749697.4    67  309696.976
-------------+------------------------------
       Total |   127376378    72  1769116.36

Number of obs = 73
F(5, 67)      = 68.86
Prob > F      = 0.0000
R-squared     = 0.8371
Adj R-squared = 0.8249
Root MSE      = 556.5

--------------------------------------------------------------------strikes | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+----------------------------------------------------------realgdp | -.8260032 .2303428 -3.59 0.001 -1.285769 -.366237 unemp | -100.4215 17.05749 -5.89 0.000 -134.4684 -66.37464 time | 112.2559 21.79502 5.15 0.000 68.75289 155.759 minwage | 24.52896 278.1735 0.09 0.930 -530.7077 579.7656 post47 | -4519.136 349.7473 -12.92 0.000 -5217.234 -3821.037 _cons | 3877.44 322.1265 12.04 0.000 3234.473 4520.407 ---------------------------------------------------------------------

• Notice that the marginal impacts of REALGDP and UNEMP have not changed. The only thing that has changed is the intercept term. Now, after 1947, we find an intercept term exactly equal to the _cons from the specification that included pre47:

. lincom _b[post47]+_b[_cons]

 ( 1)

post47 + _cons = 0

-----------------------------------------------------------------strikes | Coef. Std. Err. t P>|t| [95% Conf. Interval] ---------+-------------------------------------------------------(1) |-641.696 739.3518 -0.87 0.389 -2117.448 834.0564 ------------------------------------------------------------------

134


9 Hypothesis Testing

9.1 Tests on a Single Parameter

• These are the most common hypothesis tests. We test whether a single parameter is statistically different from a hypothetical value. We have already addressed this type of test through the t-test.
• Much of our analysis is useful to test economic theories. These theories have to be tested against some alternative.
• Thus, we look to test our empirical findings against THEORETICAL values.
• Typically, the null hypothesis is that an estimated parameter is equal to zero.
• This is tested with a t-test calculated as

t = (βˆi − 0)/σi

where σi is the square root of the (i, i) element of the covariance matrix of βˆ, σ2 (X′X)−1 .
• We compare the t-statistic with the appropriate critical value. If we use a two-tailed test, we seek whether the t-statistic falls within the upper and lower bounds of a confidence range, depending upon what level of significance (α) we choose. If this is the case then we fail to reject the null hypothesis; otherwise we reject the null in favor of the alternative.



[Figure: a two-tailed t-test. The area 1 − α lies between the critical values −tc and tc , with rejection regions of area α/2 in each tail.]

• Some limitations on the t-test:
1. A t-test does not test theoretical validity: for example, a relationship between rainfall in the Andes and the Dow Jones Index.
2. A t-test does not measure relative importance:

sales = 300 + 10.0(TVAD) + 200(RADIOAD)

where tTVAD = 10 but tRADIOAD = 8.0. According to the parameter estimates, radio advertising is more effective than television advertising, but the t-stats point to TV being “more” statistically significant. However, realize that here we are testing against the null hypothesis that the individual coefficients are equal to zero, not that the two coefficients are equal to each other.
3. A t-test is only valid for the sample at hand, not the entire population. Why?

t = βˆ/σβˆ

but as N → ∞ we know that var(βˆ) → 0 because βˆ is consistent. Therefore, the standard error of βˆ → 0 and thus |t| → ∞. Upshot: large samples tend to lead to high t-stats. This can introduce the potential for Type I errors. Conversely, small samples can lead to Type II errors.
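• Aside: the t-statistic and its critical value can be computed by hand in STATA after any regression. A minimal sketch with hypothetical variable names:

reg y x1 x2
display (_b[x1] - 0)/_se[x1]          // t-statistic against a null of zero
display invttail(e(df_r), 0.025)      // two-tailed 5% critical value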


9.2 Comparing Two Parameters

• As in the advertising model above, at times we may want to compare two estimated parameters for equality. This can be accomplished using a modified t-test or the more general F-test.
• The modified t-test is calculated as

t = (βˆ1 − βˆ2 ) / SE(βˆ1 − βˆ2 )

where SE(βˆ1 − βˆ2 ) = √[ var(βˆ1 ) + var(βˆ2 ) − 2cov(βˆ1 , βˆ2 ) ]. The t-test is distributed with N − k − 1 degrees of freedom, where k does not include the intercept term.
• One can use the F-test for equality. The F-test is based upon a “restricted” and an “unrestricted” model.
• The restricted model basically imposes the restriction that the two (or more) parameters are equal to each other.
• The unrestricted model allows all parameters to be freely estimated.

Unrestricted:

y = β0 + β1 X1 + β2 X2 + ²

Restricted (H0 : β1 = β2 ):

y = β0 + β1 (X1 + X2 ) + ²

• The F-test can be generated using the R2 of the restricted and unrestricted models:

F = [(R2UR − R2R )/(1 − R2UR )] × [(N − kUR )/m] ∼ F[m, N−k]

where m is the number of restrictions.


• Alternatively, the F-test can be calculated using the SSE from the restricted and unrestricted models.
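• Aside: in STATA, both versions of the comparison are one line after the regression. A minimal sketch with hypothetical variable names:

reg y x1 x2
test x1 = x2              // F-test of H0: beta1 = beta2
lincom _b[x1] - _b[x2]    // the modified t-test of the same hypothesis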

9.3 Tests Involving Multiple Parameters

9.3.1 Statistical Significance of the Model

• We can test whether parameters are jointly equal to some theoretical value. • We set up the following hypotheses to test whether we have a viable model

H0 : β1 = 0; β2 = 0; · · · ; βk = 0
Hα : at least one of the slope parameters is nonzero.

• Note: the alternative hypothesis only says that one of the parameters is non-zero, but not which one is non-zero.
• This test entails k − 1 different parameters jointly equalling zero. Notice that the intercept term is not included in the test.
• How to generate a test statistic? We use the F-test.
• The F-test is based upon a ”restricted” and an ”unrestricted” model.
• The restricted model is estimated having imposed the restrictions.
• In general, the F-test is calculated as

F = [(SSER − SSEUR )/m] / [SSEUR /(N − kUR )] ∼ F[m, N−kUR]

which can be rewritten as

F = [(SSER − SSEUR )/SSEUR ] × [(N − kUR )/m] ∼ F[m, N−kUR]

where m is the number of restricted parameters, SSE is the sum of squared errors reported by most stat packages, and kUR includes the intercept term.
• The F-statistic can be related to the R2 as

F[k−1, N−k] = [R2 /(k − 1)] / [(1 − R2 )/(N − k)]

where k includes the intercept term.
• The intuition is that the restricted model will always have a large sum of squared errors relative to the unrestricted model. However, if the restrictions cannot be rejected, the restricted SSE will not be that much different than the SSE from the unrestricted model. If, on the other hand, the difference is dramatic then we would reject the null that the parameters are jointly equal to zero.

9.3.2 The Significance of Advertising

• Consider a model that relates advertising and price to total revenue: T Rt = β0 + β1 Pt + β2 ADVt + β3 ADVt2 + ²t

• Here, T R is measured in thousands, P in dollars and ADV in thousands. • Suppose we have data for a firm for 78 weeks. We estimate the model and find



Variable     Estimate    Std. Error
Intercept    110.46      3.74
Price        -10.198     1.582
ADV          3.361       0.422
ADV2         -0.0268     0.0159

• How would we test whether advertising has an impact on total revenues? We test whether β2 or β3 is nonzero.
• The elements of the test:
1. The joint null hypothesis is H0 : β2 = 0; β3 = 0.
2. The alternative hypothesis is Hα : β2 ≠ 0 and/or β3 ≠ 0.
3. The unrestricted model is TRt = β0 + β1 Pt + β2 ADVt + β3 ADVt2 + ²t
4. The restricted model is TRt = β0 + β1 Pt + ²t
5. The test statistic is

F = [(SSER − SSEUR )/m] / [SSEUR /(N − k)]

where m = 2, N = 78, k = 4. Suppose SSEUR = 2592.301 and SSER = 20907.331.
6. If the joint null is “true” then F ∼ F[m,N−k] . The critical value from the F[2,74] distribution is 3.120 at the α = 0.05 significance level.



7. The value of the F-statistic is 261.41 > Fc , and thus we reject the null hypothesis. At least one of the parameters is nonzero.
• How about the optimal level of advertising?
• The marginal impact of advertising is

∆TRt /∆ADVt = β2 + 2β3 ADVt

• This is the expected marginal revenue of advertising. Neoclassical firm theory claims that MB = MC, therefore we need to know the marginal cost of the advertising plus the cost of preparing the additional products sold due to effective advertising. If we ignore the extra production costs, advertising expenditures should be increased to the point where the marginal benefit of $1 of advertising equals $1, or where

β2 + 2β3 ADVt = 1

• Using the least squares estimates we can estimate the optimal level of advertising from

3.361 + 2(−0.0268)ADVt = 1

• Solving, ADVt = 44.0485, or our firm should spend $44,048.50 per week.
• Suppose the firm’s management, based on experience in other locations, thinks that our estimate is too high and that the optimal level of advertising is really $40,000.
• We can test this conjecture using either a t or F test. The null we wish to test is H0 : β2 + 2β3 (40) = 1 against the alternative Hα : β2 + 2β3 (40) ≠ 1

• The t-test is

t = [(β2 + 80β3) − 1] / se(β2 + 80β3)

• The tricky part here is calculating the denominator:

var̂(β2 + 80β3) = var̂(β2) + 80² var̂(β3) + 2(80) cov̂(β2, β3) = 0.76366

• We would get these elements from the variance-covariance matrix of β̂ using the vce command after a reg y x1 x2 x3 command in Stata.

• Therefore, t = (1.221 − 1)/√0.76366 = 0.252

• We compare this to the critical value tc for 74 degrees of freedom, which is approximately 1.993. Therefore, we cannot reject the null hypothesis that the optimal level of advertising is $40,000 per week.

• Alternatively, we could use the F-test. The restricted model incorporates the linear restriction β2 = 1 − 80β3, therefore

TR_t = β0 + β1 P_t + (1 − 80β3) ADV_t + β3 ADV_t² + ε_t

or

TR_t − ADV_t = β0 + β1 P_t + β3 (ADV_t² − 80 ADV_t) + ε_t

• Estimating this model with OLS yields SSE_R = 2594.533.

• The F-stat then is F = 0.0637 ∼ F[1,74], where Fc = 3.970. Again, we cannot reject the null hypothesis. [We can obtain the F critical value in Stata with display invFtail(n1, n2, α).]

• Note, F = 0.0637 = t² = (0.252)². This is true for all "single equality" tests.
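A minimal sketch of how these tests could be run in Stata, assuming the regression has already been fit and using hypothetical variable names price, adv, and advsq for P, ADV, and ADV²:

* Hypothetical variable names; not the exact commands from the notes.
* reg tr price adv advsq
test adv + 80*advsq = 1        // F version of H0: beta2 + 2*beta3*(40) = 1
lincom adv + 80*advsq - 1      // t version: estimate and std. error of the combination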


• Suppose that management chooses $40,000 as the optimal level of advertising. Management now believes that at P = $2, total revenue will be $175,000. Thus, in our unrestricted model the claim is

E[TR_t] = β0 + β1(2) + β2(40) + β3(40)² = 175

• Are these two conjectures compatible with the evidence from the sample?

• Formulate the two joint hypotheses

H0 : β2 + 2β3(40) = 1;   β0 + 2β1 + 40β2 + 1600β3 = 175

• The alternative hypothesis is that at least one of management's conjectures is not true.

• To construct the restricted model, we have to substitute both hypotheses into the equation. Perhaps you should try this on your own.

• When we calculate the F-statistic we note that the number of restrictions is 2, therefore F ∼ F[2, 74].

• Assume the calculated F-statistic is 1.75 whereas the critical value is 3.120. Since F < Fc we cannot reject the null hypothesis.
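Both conjectures can also be tested jointly with a single test command; a sketch with the same hypothetical variable names:

* Joint F-test of the two conjectures (hypothetical variable names).
test (adv + 80*advsq = 1) (_cons + 2*price + 40*adv + 1600*advsq = 175)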

9.4 Example: University Library Staff and School Size

• In this example we relate the number of people on the library staffs of Carnegie I ranked schools in the United States to measures of school size. It is postulated that the number of librarians a school hires is a function of the number of full-time graduate students, the number of full-time undergraduate students, the number of full-time faculty, the number of branch libraries on campus, and whether the school is public (1 = Yes).

• The data were obtained from the American Library Association for the 1997-1998 academic year and are available in the file libstaff98.dta.

LIBSTAFF_i = β0 + β1 FTGRAD_i + β2 FTUNDER_i + β3 FACULTY_i + β4 LIBRARIES_i + β5 PUBLIC_i + ε_i

• Preliminary results:

. reg libstaff ftgrad ftunder faculty libraries public if carnegie<20&faculty>10

      Source |       SS       df       MS              Number of obs =     227
-------------+------------------------------           F(  5,   221) =  246.04
       Model |  4232868.23     5  846573.646           Prob > F      =  0.0000
    Residual |  760406.448   221  3440.75316           R-squared     =  0.8477
-------------+------------------------------           Adj R-squared =  0.8443
       Total |  4993274.68   226  22094.1357           Root MSE      =  58.658

------------------------------------------------------------------------------
    libstaff |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      ftgrad |   .0209542   .0030921     6.78   0.000     .0148604    .0270481
     ftunder |   .0003817   .0013401     0.28   0.776    -.0022594    .0030228
     faculty |   .1392262   .0320396     4.35   0.000     .0760839    .2023685
   libraries |    7.90013   .6859075    11.52   0.000     6.548373    9.251887
      public |  -21.09207   10.44973    -2.02   0.045    -41.68594   -.4981952
       _cons |   16.03017   9.129241     1.76   0.080     -1.96134    34.02168
------------------------------------------------------------------------------

• First we test whether the parameter for the number of full time graduate students is equal to zero. In STATA we use the test command:

. test ftgrad=0

 ( 1)  ftgrad = 0

       F(  1,   221) =   45.92
            Prob > F =    0.0000

• Notice that the F-statistic of 45.92 is t = 6.78 squared.

• If we wish to simultaneously test the null that the parameters on both full-time graduate and undergraduate students are equal to zero, we use the test command with the accum option:

. test ftunder=0, accum

 ( 1)  ftgrad = 0
 ( 2)  ftunder = 0

       F(  2,   221) =   23.08
            Prob > F =    0.0000

• If we wish to test that all five slope parameters are simultaneously equal to zero:

. test public=0, accum

 ( 1)  ftgrad = 0
 ( 2)  ftunder = 0
 ( 3)  faculty = 0
 ( 4)  libraries = 0
 ( 5)  public = 0

       F(  5,   221) =  246.04
            Prob > F =    0.0000

which is the same F-statistic we obtained from the Stata regression output.


• Suppose the ALA suggests that the minimum staffing for a branch library should be nine full-time equivalent employees (full-time and part-time employees). We can test that hypothesis and also test simultaneously that the parameters on graduate and undergraduate students are the same:

. test libraries=9

 ( 1)  libraries = 9

       F(  1,   221) =    2.57
            Prob > F =    0.1102

. test ftgrad=ftunder

 ( 1)  ftgrad - ftunder = 0

       F(  1,   221) =   40.60
            Prob > F =    0.0000

. test libraries=9, accum

 ( 1)  ftgrad - ftunder = 0
 ( 2)  libraries = 9

       F(  2,   221) =   20.91
            Prob > F =    0.0000

• If we want to test the hypothesis that the parameters on graduate and undergraduate students are the same, we could also do this by hand.

• First, we impose the restriction that β1 = β2 to obtain the following restricted model:

LIBSTAFF_i = β0 + β1 FTGRAD_i + β1 FTUNDER_i + β3 FACULTY_i + β4 LIBRARIES_i + β5 PUBLIC_i + ε_i

LIBSTAFF_i = β0 + β1 (FTGRAD_i + FTUNDER_i) + β3 FACULTY_i + β4 LIBRARIES_i + β5 PUBLIC_i + ε_i

• We need to define a new variable FTGRADUNDER_i = FTGRAD_i + FTUNDER_i


. gen ftgradunder = ftgrad + ftunder

. reg libstaff ftgradunder libraries faculty public if carnegie<20&faculty>10

      Source |       SS       df       MS              Number of obs =     227
-------------+------------------------------           F(  4,   222) =  252.38
       Model |  4093158.32     4  1023289.58           Prob > F      =  0.0000
    Residual |  900116.355   222  4054.57818           R-squared     =  0.8197
-------------+------------------------------           Adj R-squared =  0.8165
       Total |  4993274.68   226  22094.1357           Root MSE      =  63.676

------------------------------------------------------------------------------
    libstaff |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 ftgradunder |   .0030047   .0013844     2.17   0.031     .0002763     .005733
   libraries |   9.687925   .6794421    14.26   0.000     8.348944    11.02691
     faculty |   .1750195   .0342416     5.11   0.000     .1075393    .2424997
      public |    -48.584   10.33164    -4.70   0.000    -68.94465   -28.22335
       _cons |   25.56349   9.776182     2.61   0.010     6.297498    44.82949
------------------------------------------------------------------------------

• The first regression is the unrestricted model. The second regression is the restricted model, using FTGRADUNDER rather than the two variables FTGRAD and FTUNDER.

• We can then calculate the F-statistic by hand, recognizing that we have one restriction and that the unrestricted model has 221 degrees of freedom.

• We obtain the sum of squared errors from both models (note that the SSE of the restricted model is greater than the SSE of the unrestricted model):

. disp ((900116.355-760406.448)/1)/((760406.448)/221)
40.604455

. disp invFtail(1,221,.01)
6.75


• Notice that the F = 40.60 is the same value we obtained using the test command above.

9.4.1 Example: A Firm Production Function

• At times, we do not have really nice and neat restrictions. Theory may instead predict that a linear combination of estimated parameters should equal some theoretical value.

• For this problem, we must come up with a more general approach to testing our restrictions. Although it is possible to use the previously mentioned methods, there is a much more convenient methodology.

• To motivate the point, let us look at a Cobb-Douglas production function

Q_i = A L_i^β1 K_i^β2 W_i^β3 exp(ε_i)

where Q is production, L is labor, K is capital and W is weather. A is a scale parameter, and the β's are parameters of interest. We assume that ε satisfies the full ideal conditions.

• We linearize this model by taking the log of both sides to obtain

lnQ_i = lnA + β1 lnL_i + β2 lnK_i + β3 lnW_i + ε_i

• We could estimate this model using OLS and interpret the parameters appropriately.

• However, economic theory predicts that certain restrictions should apply to the β's under different conditions.

• For instance, under constant returns to scale it must be true that β1 + β2 + β3 = 1, and with no setup bundles, lnA = 0 ⇒ A = 1.


• How would we incorporate these two restrictions?

• One approach is to transform the data and estimate a restricted model.

• To incorporate the CRS restriction, write β3 = 1 − β1 − β2 and substitute into the original model to obtain:

lnQ_i = lnA + β1 lnL_i + β2 lnK_i + (1 − β1 − β2) lnW_i + ε_i

• Also impose the restriction lnA = 0 and reorganize the model by collecting terms:

lnQ_i − lnW_i = 0 + β1 (lnL_i − lnW_i) + β2 (lnK_i − lnW_i) + ε_i

A hedged Stata sketch of this transformation appears below.
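A minimal sketch of the transformation approach in Stata, assuming a dataset with hypothetical variables q, l, k, and w; the last two lines show cnsreg as an alternative that imposes the CRS restriction directly:

* Sketch only; variable names q, l, k, w are assumptions, not from the notes.
gen lnq = ln(q)
gen lnl = ln(l)
gen lnk = ln(k)
gen lnw = ln(w)
gen lnqstar = lnq - lnw
gen lnlstar = lnl - lnw
gen lnkstar = lnk - lnw
* Restricted model (CRS and lnA = 0): no constant term
reg lnqstar lnlstar lnkstar, noconstant
* Alternative: impose only the CRS restriction via constrained least squares
constraint 1 lnl + lnk + lnw = 1
cnsreg lnq lnl lnk lnw, constraints(1)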

• To run OLS on this model, redefine the dependent and independent variables as lnQ* = lnQ_i − lnW_i, lnL* = lnL_i − lnW_i, and lnK* = lnK_i − lnW_i.

• Running OLS would then yield β̂1 and β̂2.

• Using an F-test will determine whether the restrictions hold.

• An alternative is to set up Constrained Least Squares or Restricted Least Squares.

• Write the linear restrictions (as many as desired or called for by theory) as

Rβ = r

where R is [q × k] and r is [q × 1], and q is the number of linearly independent restrictions called for by theory.

• In this example, the restrictions would be written as:

For the CRS restriction:   R = [ 0  1  1  1 ],   r = 1

For both restrictions:     R = [ 1  0  0  0 ]     r = [ 0 ]
                               [ 0  1  1  1 ],        [ 1 ]

• Restricted least squares uses the following Lagrangian:

L = (y − Xβ)′(y − Xβ) + 2λ′(Rβ − r)

where λ is a [q × 1] vector of Lagrange multipliers.

• To minimize the Lagrangian, calculate the following first-order necessary conditions:

∂L/∂β = 2X′(y − Xβ̂c) + 2R′λ̂ = 0
∂L/∂λ = 2(Rβ̂c − r) = 0

• Expand the first condition and solve for β̂c, substituting β̂ = (X′X)⁻¹X′y, which is the unrestricted parameter vector.

• This yields

β̂c = β̂ + (X′X)⁻¹R′λ̂

• Pre-multiply both sides by R and impose the restriction Rβ̂c = r to get

λ̂ = −[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r)

• Now substitute for λ̂ in the constrained equation to obtain:

β̂c = β̂ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r)

• Take the expectation of β̂c to obtain

E[β̂c] = E[β̂] − E[(X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r)]

• If the restrictions hold then

E[β̂c] = E[β̂] = β

• Furthermore

cov(β̂c) = σ²[(X′X)⁻¹ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹R(X′X)⁻¹]

• If the restrictions hold then

cov(β̂c) ≤ cov(β̂)

and the restrictions lead to a more efficient estimator.

• The upshot so far is that failing to invoke appropriate restrictions leads to a less efficient OLS estimator.

• If we falsely impose the restrictions we lose unbiasedness and consistency, and possibly efficiency.

• To test whether the restrictions hold statistically we test H0 : Rβ = r.

• If ε ∼ N(0, σ²I) then β̂ ∼ N(β, σ²(X′X)⁻¹) and Rβ̂ ∼ N(Rβ, σ²R(X′X)⁻¹R′).


• One possible test statistic:

∆SSE/σ² = (Rβ̂ − r)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r)/σ² ∼ χ²(q)

• However, σ² is unknown, so we cannot use this χ² statistic directly; thus we move to the F-test.

• Wald Test:

ξw = (1/(q σ̂²)) (Rβ̂ − r)′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − r) ∼ F[q, N−k]

where q is the number of restrictions modeled, and σ̂² comes from the unrestricted model.

Note: If q = 1 and r = 0 then R is [1 × k] with the jth element equal to one and all other elements equal to zero. The test statistic then collapses to

ξw = β̂j² / (σ̂² [(X′X)⁻¹]jj) ∼ (t_{N−k})²
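As a sketch, the Wald (F) version of this test is what Stata's test command computes after the unrestricted regression; variable names below are the hypothetical ones from the earlier Cobb-Douglas sketch:

* Wald/F test of both restrictions after the unrestricted regression.
reg lnq lnl lnk lnw
test (lnl + lnk + lnw = 1) (_cons = 0)    // H0: CRS and lnA = 0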

9.5 Some other tests

1. Likelihood Ratio Test: Similar to the Wald test but comes from the Maximum Likelihood Principle. Let L(y|θ̂c) and L(y|θ̂u) be the log-likelihood values under the null and alternative hypotheses, respectively. Then the Likelihood Ratio test is

ξLR = −2[L(y|θ̂c) − L(y|θ̂u)] ∼ χ²(q)

2. Lagrange Multiplier Test: Use the Lagrangian to develop the test statistic. Let

L = L(y|θ) − λ′h(θ)

where h(θ) is a differentiable vector of restrictions. Then,

∂L/∂θ = ∂L(y|θ)/∂θ |θ̂c − [∂h′(θ)/∂θ |θ̂c] λ̂ = 0
∂L/∂λ = h(θ̂c) = 0
⇒ ∂L(y|θ)/∂θ |θ̂c = [∂h′(θ)/∂θ |θ̂c] λ̂

The LM test is defined as

ξLM = − [∂L(y|θ)/∂θ′ |θ̂c] [E(∂²L(y|θ)/∂θ∂θ′) |θ̂c]⁻¹ [∂L(y|θ)/∂θ |θ̂c] ∼ χ²(q)
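A hedged sketch of the likelihood ratio test in Stata for a maximum likelihood estimator (the model and variable names are hypothetical):

* LR test comparing a restricted to an unrestricted ML model.
logit y x1 x2 x3
estimates store unrestricted
logit y x1
estimates store restricted
lrtest unrestricted restricted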

9.6 Test for Structural Break

• We gather data that we think are generated by the same Data Generating Process, but this is not guaranteed.

• Suppose our model is y = Xβ + ε, which can be written as

y1 = x1β1 + ε1
y2 = x2β2 + ε2

where x1 is [N1 × k] and x2 is [N2 × k] and N1 + N2 = N.

• We can test for different regressions with the hypotheses H0 : β1 = β2 vs. Hα : β1 ≠ β2.



• We perform this test by stacking the two equations to obtain:

[ y1 ]   [ x1   0 ] [ β1 ]   [ ε1 ]
[ y2 ] = [ 0   x2 ] [ β2 ] + [ ε2 ]

• H0 : β1 = β2 is equivalent to the restriction

[ Ik   −Ik ] [ β1 ]
             [ β2 ] = 0

on the stacked model.

• The test for a structural break is therefore equivalent to testing whether the two parameter vectors satisfy β1 = β2, i.e., whether the data can be combined into the same regression.

• To test H0 we calculate

[ (ε̂′ε̂ − (ε̂1′ε̂1 + ε̂2′ε̂2)) / (ε̂1′ε̂1 + ε̂2′ε̂2) ] × [(N − 2k)/k] ∼ F[k, N−2k]

where ε̂′ε̂ is the sum of squared residuals from OLS on the entire data sample, ε̂i′ε̂i is the SSE from OLS on the ith subsample, and k is the number of right-hand-side variables in the pooled model (including the constant term).

• If N1 − k < 0 or N2 − k < 0 then we need to alter the test statistic accordingly. In that case we obtain the (predictive) Chow test as

[ (ε̂′ε̂ − ε̂1′ε̂1) / ε̂1′ε̂1 ] × [(N1 − k)/(N − N1)] ∼ F[N−N1, N1−k]

• We can go back to the data on work stoppages to show how to implement a Chow test



in STATA. The basic idea is to create a dummy variable that takes a value of zero for “period 1” and a value of one for “period 2.” Create new variables that interact this dummy variable with the regressors in the original model. • We then estimate the fully interacted dummy variable model. • We then use the test command with the accum option, which tells STATA that we want to accumulate the tests previous. • Here is the STATA output from this procedure for the work stoppage data: . gen rgdp47 = rgdp*post47 (48 missing values generated) . gen unemp47=unemp*post47 (19 missing values generated) . gen time47 = time*post47 . reg ws rgdp unemp time post47 rgdp47 unemp47 time47 Source | SS df MS ---------+--------------------------Model | 119499904 7 17071414.9 Residual |7876473.41 65 121176.514 ---------+--------------------------Total |127376378 72 1769116.36

Number of obs F( 7, 65) Prob > F R-squared Adj R-squared Root MSE

= 73 = 140.88 = 0.0000 = 0.9382 = 0.9315 = 348.1

------------------------------------------------------------------ws | Coef. Std. Err. t P>|t| [95% Conf. Interval] --------+---------------------------------------------------------rgdp | -.9652389 1.090466 -0.89 0.379 -3.143051 1.212573 unemp | -34.36876 28.89167 -1.19 0.239 -92.06941 23.3319 time | 286.5985 50.3445 5.69 0.000 186.0536 387.1434 post47 | -1517.194 1143.837 -1.33 0.189 -3801.593 767.2054 rgdp47 | .8427357 1.100846 0.77 0.447 -1.355806 3.041277 unemp47 | -5.785106 52.64424 -0.11 0.913 -110.9229 99.35271 time47 | -275.35 54.99114 -5.01 0.000 -385.1749 -165.5251 _cons | 2000.585 1120.819 1.78 0.079 -237.8463 4239.015 . test post47 =0 155


 ( 1)  post47 = 0

       F(  1,    65) =    1.76
            Prob > F =    0.1893

. test rgdp47=0, accum

 ( 1)  post47 = 0
 ( 2)  rgdp47 = 0

       F(  2,    65) =    1.64
            Prob > F =    0.2028

. test unemp47=0, accum

 ( 1)  post47 = 0
 ( 2)  rgdp47 = 0
 ( 3)  unemp47 = 0

       F(  3,    65) =    1.25
            Prob > F =    0.2990

. test time47=0, accum

 ( 1)  post47 = 0
 ( 2)  rgdp47 = 0
 ( 3)  unemp47 = 0
 ( 4)  time47 = 0

       F(  4,    65) =  133.29
            Prob > F =    0.0000

• Note that the last test rejects the null hypothesis that the interaction terms are jointly equal to zero. In other words, the data should not be pooled.



• We can go do it the old fashioned way (note that I have not reported the parameter estimates): . reg ws rgdp unemp time if post47==0 Source | SS df MS -------------+-----------------------------Model | 31678039.3 3 10559346.4 Residual | 7641704.72 14 545836.051 -------------+-----------------------------Total | 39319744 17 2312926.12

Number of obs F( 3, 14) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

18 19.35 0.0000 0.8057 0.7640 738.81

. reg ws rgdp unemp time if post47==1 Source | SS df MS -------------+-----------------------------Model | 809567.416 3 269855.805 Residual | 234768.693 51 4603.3077 -------------+-----------------------------Total | 1044336.11 54 19339.5576

Number of obs F( 3, 51) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

55 58.62 0.0000 0.7752 0.7620 67.848

. reg ws rgdp unemp time Source | SS df MS -------------+-----------------------------Model | 54892154.9 3 18297385 Residual | 72484222.7 69 1050495.98 -------------+-----------------------------Total | 127376378 72 1769116.36

Number of obs F( 3, 69) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

73 17.42 0.0000 0.4309 0.4062 1024.9

• The first two regressions are for the pre-1947 and post-1947 subsamples, respectively. The last regression is for the pooled sample.

• We can calculate the Chow test using the information given:

[ (72484222.7 − (234768.693 + 7641704.72)) / (234768.693 + 7641704.72) ] × [(73 − 8)/4] ∼ F[4, 73−8]

or 133.29264, exactly equal to the accumulated F-test statistic.
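A sketch of the same by-hand Chow calculation using Stata's stored results rather than typing the sums of squared residuals; the commands assume the work-stoppage variables above:

* Chow test by hand using e(rss) and e(N) from each regression.
quietly reg ws rgdp unemp time
scalar rss_p = e(rss)
quietly reg ws rgdp unemp time if post47==0
scalar rss_1 = e(rss)
scalar n1 = e(N)
quietly reg ws rgdp unemp time if post47==1
scalar rss_2 = e(rss)
scalar n2 = e(N)
scalar k = 4                                   // 3 regressors plus the constant
scalar fstat = ((rss_p - (rss_1 + rss_2))/k) / ((rss_1 + rss_2)/(n1 + n2 - 2*k))
display fstat
display invFtail(k, n1 + n2 - 2*k, .05)        // 5% critical value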



10 Generalized Least Squares

10.1 Motivation

• Thus far we have worked with the OLS model under the full ideal conditions. The FIC led to estimators that have the desirable properties of unbiasedness, consistency and efficiency.

• What happens if the full ideal conditions are not met?

• We already know what happens when E[ε] ≠ 0: the intercept term is adjusted accordingly.

• What if cov(ε) = σ²Ω where Ω ≠ I?

• There are plenty of different structures for Ω. Two common structures are

    [  1    ω12  ...  ω1N ]              [ ω11   0   ...   0  ]
Ω = [ ω21    1   ...  ω2N ]    and   Ω = [  0   ω22  ...   0  ]
    [  .     .    .    .  ]              [  .    .    .    .  ]
    [ ωN1   ...  ...   1  ]              [  0    0   ...  ωNN ]

10.2 Implications for the OLS Estimator

• If Ω ≠ I, what are the statistical implications for the OLS estimator? β̂ is still unbiased and consistent.

• Suppose that

lim_{N→∞} X′Ω⁻¹X/N   is finite and non-singular.

• Theorem: E[β̂] = β.

Proof: E[β̂] = E[β] + E[(X′X)⁻¹X′ε] = β



• Theorem: plim β̂ = β.

Proof:

plim β̂ = β + [lim_{N→∞} (X′X/N)]⁻¹ plim (X′ε/N)

but X′ε/N has zero mean and covariance σ²X′ΩX/N², since

E[(X′ε/N)(X′ε/N)′] = E[X′εε′X/N²] = X′σ²ΩX/N² = σ² X′ΩX/N²

Thus, if lim_{N→∞} [X′ΩX/N] is finite and lim_{N→∞} [X′ΩX/N²] = 0, then

plim (X′ε/N) = 0   and thus   plim β̂ = β

• But is β̂ still efficient if cov(ε) = σ²Ω?

• Theorem: cov(β̂) = σ²(X′X)⁻¹X′ΩX(X′X)⁻¹.

Proof:

cov(β̂) = E[(β̂ − β)(β̂ − β)′]
        = E[(X′X)⁻¹X′εε′X(X′X)⁻¹]
        = (X′X)⁻¹X′σ²ΩX(X′X)⁻¹
cov(β̂) = σ²(X′X)⁻¹X′ΩX(X′X)⁻¹


• The upshot of all of this is that if we don't account for Ω, then the usual OLS covariance estimate is wrong:

cov(β̂ | Ω = I) ⋚ cov(β̂ | Ω ≠ I)

that is, the reported covariance matrix may be larger or smaller than the true one.

• Note: s² = SSE/(N − k) is biased and inconsistent.

• Generalized Least Squares is a fix for this problem.

10.3 Generalized Least Squares

• Lemma: There exists a nonsingular matrix V such that V′V = Ω⁻¹.

Proof: Since Ω is a covariance matrix and is nonsingular, it is positive definite. Thus Ω⁻¹ is positive definite and there is a matrix V such that V′V = Ω⁻¹. (A formal proof is beyond the scope of this class.)

• Proposition: Suppose that y = Xβ + ε satisfies the full ideal conditions except that cov(ε) = σ²Ω where Ω ≠ I. Suppose also that

lim_{N→∞} X′Ω⁻¹X/N   is finite and non-singular.

Let V be a matrix such that V′V = Ω⁻¹. Then the transformed equation

Vy = VXβ + Vε

does satisfy the full ideal conditions.

• Proof: Since V is nonsingular and non-stochastic, VX is non-stochastic and of full rank if X is of full rank. Also, note that

lim_{N→∞} (VX)′(VX)/N = lim_{N→∞} X′Ω⁻¹X/N

is finite and nonsingular by assumption.

• Consider the transformed residual Vε:

E[Vε] = V E[ε] = 0
E[(Vε)(Vε)′] = E[Vεε′V′] = σ²VΩV′

but we know that VΩV′ = I, so that E[Vεε′V′] = σ²I.

• Note: VΩV′ = I because VΩV′ = V(V′V)⁻¹V′ = V(V)⁻¹(V′)⁻¹V′ = I.

• If ε ∼ N(0, σ²Ω) then Vε ∼ N(0, σ²I).

• Aitken Theorem: The BLUE of β when ε ∼ N(0, σ²Ω) is

β̃ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y

where β̃ is the GLS estimator of β.

• Proof: Consider the transformed model Vy = VXβ + Vε, which does satisfy the full ideal conditions. Substituting VX for X in the original β̂ we obtain

β̃ = [(VX)′(VX)]⁻¹(VX)′Vy
   = [X′V′VX]⁻¹(X′V′Vy)
   = [X′Ω⁻¹X]⁻¹X′Ω⁻¹y

• β̃ is the OLS estimator on the transformed equation. At this point, all our previous OLS results are reinstated because the transformed model satisfies the full ideal conditions.

• Note that β̃ is unbiased:

E[β̃] = E[(X′Ω⁻¹X)⁻¹X′Ω⁻¹y]
      = E[(X′Ω⁻¹X)⁻¹X′Ω⁻¹(Xβ + ε)]
      = E[(X′Ω⁻¹X)⁻¹X′Ω⁻¹Xβ + (X′Ω⁻¹X)⁻¹X′Ω⁻¹ε]
E[β̃] = β + 0

• The cov(β̃) = σ²(X′Ω⁻¹X)⁻¹ because

cov(β̃) = E[(β̃ − β)(β̃ − β)′]
        = E[((X′Ω⁻¹X)⁻¹X′Ω⁻¹ε)((X′Ω⁻¹X)⁻¹X′Ω⁻¹ε)′]
        = E[(X′Ω⁻¹X)⁻¹X′Ω⁻¹εε′Ω⁻¹X(X′Ω⁻¹X)⁻¹]
        = (X′Ω⁻¹X)⁻¹X′Ω⁻¹σ²ΩΩ⁻¹X(X′Ω⁻¹X)⁻¹
        = σ²(X′Ω⁻¹X)⁻¹X′Ω⁻¹X(X′Ω⁻¹X)⁻¹
cov(β̃) = σ²(X′Ω⁻¹X)⁻¹
• Claim: An unbiased, consistent, and efficient estimator of σ² is

σ̃² = ε̃′Ω⁻¹ε̃ / (N − k)

where ε̃ = y − Xβ̃.

• Note: So far we have acted as if we know what Ω actually is. But that is not always the case, perhaps never!

• Thus, we need some way of dealing with the problem when Ω is unknown.

10.4 Feasible Generalized Least Squares

• We don't normally know Ω. Thus, we have to look for an estimator of Ω instead.

• Note: Ω has ½N(N + 1) unique elements. Thus, if we tried to estimate each element of Ω we would run out of degrees of freedom, which are only (N − k). We have to find another way to estimate Ω.

• Note: Ω̂ is a consistent estimator of Ω if plim Ω̂_ij = Ω_ij for all i and j.

• Let Ω̂ be an estimator of Ω. The FGLS estimator of β is then

β̃_FGLS = (X′Ω̂⁻¹X)⁻¹X′Ω̂⁻¹y

Proof: Same as before.

• Note: Just because Ω̂ is consistent does not imply that β̃_FGLS is consistent.

• Proposition: A set of sufficient conditions for β̃_FGLS to be consistent is that

plim X′Ω̂⁻¹X/N   is finite and nonsingular

and

plim X′Ω̂⁻¹ε/N = 0

Proof: Write the FGLS estimator as

β̃_FGLS = β + (X′Ω̂⁻¹X)⁻¹X′Ω̂⁻¹ε

hence

plim β̃_FGLS = β + [plim X′Ω̂⁻¹X/N]⁻¹ [plim X′Ω̂⁻¹ε/N] = β

11 Heteroscedasticity

11.1 Motivation

• We have developed a methodology to handle error structures that do not satisfy the full ideal conditions.

• We first turn to the idea of heteroscedasticity.

• Heteroscedasticity occurs when the variances of the individual ε_i are not all equal, i.e., var(ε_i) ≠ var(ε_j) for some i, j.

• Sometimes a picture is easier to understand. In this case, I took data from Major League Baseball in 2004. I regressed player salary (in thousands) against the player's total bases (hits + doubles + 2×triples + 3×homeruns), total homeruns, and total walks. I then obtained the fitted residuals using the predict res if e(sample), res command and then plotted the fitted residuals against total bases (tb) using scatter res tb if e(sample):

[Figure: fitted residuals plotted against total bases (tb)]

• Heteroscedasticity is often exhibited in cross-sectional data, because in a cross-sectional


analysis there are likely plenty of variables that are not measured or cannot be measured at all.

• For example, if we are investigating the wages of a cross-section of workers, we would not necessarily expect homoscedastic errors. Why?

• In our wage model, we may not be able to adequately control for such things as preferences, motivation, talent, skill, etc. These unmeasurable qualities and quantities will cause the fitted errors to differ in variance.

• The consequences for OLS:

1. E[β̂] = β
2. var(β̂) = σ²(X′X)⁻¹X′ΩX(X′X)⁻¹
3. plim β̂ = β

• Other examples: Higher income levels lead to higher variances in expenditures. Higher tuition rates or academic standards should see less variation in the quality of applications over time.

• Note: Under heteroscedasticity, β̂ is inefficient relative to β̃_GLS.

• Note: Under heteroscedasticity, the estimator s² is biased.

• Typically the reported standard errors are biased upwards so that t-stats are pushed down. This, in turn, can lead to Type II errors: falsely failing to reject the null hypothesis.


11.2 Generalized Least Squares

• Consider the linear model

y = Xβ + ε

where

cov(ε) = Ω = diag(σ1², σ2², ..., σN²)

• The GLS estimator is

β̃ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y

where, because Ω is diagonal,

Ω⁻¹ = diag(1/σ1², 1/σ2², ..., 1/σN²)

• Transform the original model by Ω^(−1/2) to obtain

Ω^(−1/2) y = Ω^(−1/2) Xβ + Ω^(−1/2) ε

where

Ω^(−1/2) = diag(1/σ1, 1/σ2, ..., 1/σN)

• The Nth observation then becomes

yN/σN = (XN/σN)β + εN/σN

which is an example of "Weighted Least Squares."

• Note that

var(εN/σN) = E[(εN/σN)²] = (1/σN²) E[εN²] = σN²/σN² = 1

which holds for all observations.

• Therefore, the transformation does satisfy the full ideal conditions.

• Note: To estimate Ω⁻¹ we would need to estimate N + k parameters, which exhausts our N degrees of freedom. We thus try to reparameterize the problem so that Ω = f(θ).

11.3 Weighted Least Squares

• It might be possible to express the heteroscedasticity as a deterministic function of one or more explanatory variables.

• Ex: lnw_i = β0 + β1 AGE_i + β2 AGE_i² + ε_i. If

σ²Ω = σ² diag(age1², age2², ..., ageN²)

• Then, the Ω^(−1/2) transformation would entail

lnw_i/age_i = β0/age_i + β1 + β2 age_i + ε_i/age_i

where var(ε_i) = σi² = σ²Z_i and var(ε_i/√Z_i) = Z_iσ²/Z_i = σ².

• Note: Be careful when running this model. The researcher must keep straight which parameters are which.

• Note: In this transformation, the original intercept β0 becomes the coefficient on 1/age_i; the constant term of the transformed regression estimates β1, not β0.

• How to implement this transformation?

1. First, transform the data appropriately. Thus,
   – y_i ⇒ y_i/age_i = y*
   – The intercept's "variable" is one, thus we need to define a new variable int = 1/age_i
   – The first rhs variable is age_i, and thus it is wiped out in the transformation (it becomes the constant term in the actual regression command)
   – The second rhs variable is age_i², and thus it is transformed to age_i for use in the actual regression command.

2. After transforming the data, you would estimate the following OLS model:

y* = γ0 + γ1 INT_i + γ2 age_i

where γ0 is actually β1, γ1 is actually β0 and γ2 is actually β2.

3. An alternative is to use the weight option to the reg command. If x1 is the variable thought to cause the heteroscedasticity, then weighting the regression by 1/x1² estimates weighted least squares without messing up the variable names or their order (a hedged sketch appears below).

4. One other alternative is the vwls command.

• Note: Unfortunately, there is no way to test whether σi² = σ²age_i² or any other possible deterministic function is the correct one. Thus, it is often tough to find the appropriate weights to place on the data.
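The weight-option approach that item 3 refers to can be sketched as follows; the variable names are hypothetical and the weight expression assumes var(ε_i) is proportional to x1²:

* WLS via analytic weights: weight each observation by 1/x1^2.
reg y x1 x2 x3 [aweight=1/(x1^2)]
* The vwls command in item 4 requires a variable holding the standard
* deviation of y for each observation, e.g.:
* vwls y x1 x2 x3, sd(sdvar)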

• We know that cov(β̂) = σ²(X′X)⁻¹X′ΩX(X′X)⁻¹, but is it possible to obtain a consistent estimate of this matrix?

11.4 Adjustment of Standard Errors

• White (1980) shows that X′ΩX can be estimated consistently, even though Ω alone cannot be.

• He claims that

Ω̂ = diag(ε̂1², ε̂2², ..., ε̂N²)     (2)

where the ε̂i are OLS residuals.

• Then it is true that cov̂(β̂) = (X′X)⁻¹X′Ω̂X(X′X)⁻¹ (the σ² is buried in Ω̂) and

plim (1/N)(X′Ω̂X − X′ΩX) = 0

11.5 A Specific Example of Heteroscedasticity

• Sometimes heteroscedasticity is a natural outcome of the way economic data are gathered.

• Many times econometricians obtain data from government agencies. However, these agencies often report only aggregated figures and not data for individual firms or people.

• For example, consider a standardized exam and parents' income. We propose that

q_ij = β0 + β1 x_ij + ε_ij

for i = 1, 2, ..., n_j and j = 1, 2, ..., m, where q_ij is the test score of a particular student from the jth school and x_ij is the parents' income. There are m schools with n_j students in each. The aggregated error term will turn out to be heteroscedastic.

• As a result of privacy laws we do not observe individual scores or income. Rather, we observe the average score for each school,

Q_j = (Σ_i q_ij)/n_j,

and the average income of parents in school j,

X_j = (Σ_i x_ij)/n_j.

• Note that the schools are of different sizes. Therefore, it will be necessary to weight the observations.

• The model we are really using is

Q_j = β0 + β1 X_j + U_j

for j = 1, 2, ..., m.

• Note: The error in the aggregated model is actually

U_j = (Σ_i ε_ij)/n_j

• Because the denominator is different for each school, the variance of each U_j will differ. While OLS will still be unbiased, it will not be efficient. What can we do to make our estimates of β0 and β1 as good as possible?


• We begin by finding the error variance for the aggregated model:

Var(U_j) = Var[(ε_1j + ε_2j + · · · + ε_njj)/n_j]
         = (1/n_j²) Var[Σ_i ε_ij]
         = (1/n_j²) Σ_i σ²
⇒ Var(U_j) = σ²/n_j

• Using these results we can construct the entire error covariance matrix:

E[UU′] = diag(σ²/n_1, σ²/n_2, ..., σ²/n_m)

• We must correct for the different variances on the diagonal. If we properly weight the observations, we can get a scalar diagonal covariance matrix. Define

P⁻¹ = diag(√n_1, √n_2, ..., √n_m)

• Now, by transforming the data as follows

Q* = P⁻¹Q,   X* = P⁻¹X,   and   U* = P⁻¹U

we arrive at the following conclusion regarding the error term:

E[U*U*′] = σ²I_m

• Therefore β̃ = (X*′X*)⁻¹X*′Q* is BLUE.
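In the grouped-data case the weights are known (the group sizes), so a hedged sketch in Stata is simply a weighted regression, assuming hypothetical variables Q, X, and nj:

* Grouped data: analytic weights equal to the number of students per school.
reg Q X [aweight=nj]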

11.6 Tests for heteroscedasticity

There are several tests for heteroscedasticity. However, many of them are limited in their applicability because you have to have some a priori idea of where the heteroscedasticity is coming from. There is an alternative test that avoids these pitfalls.

11.6.1 Park Test

• Here the test is that var(ε_i) = σ²Z_i. How to implement the test?

1. Estimate y = Xβ + ε and obtain ε̂i² = (y_i − X_iβ̂)².

2. Estimate ln ε̂i² = δ0 + δ1 lnZ_i + u_i.

3. Perform a t-test on δ1 under the null of homoscedasticity with respect to Z_i against the alternative that heteroscedasticity is present. A hedged Stata sketch follows.
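A hedged sketch of the Park test in Stata, with hypothetical variable names y, x1, x2, and z:

* Park test: regress the log of squared residuals on ln(Z).
reg y x1 x2
predict ehat, resid
gen lne2 = ln(ehat^2)
gen lnz = ln(z)
reg lne2 lnz          // a significant slope suggests heteroscedasticity related to Z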

11.6.2 Goldfeld-Quandt Test

• This is an exact test and assumes that the σi² are monotonic functions of some explanatory variable, Z_i. How to implement the test?

1. Order the observations by Z_i, in either ascending or descending order. Note that in this case there can only be one variable in Z.

2. Throw out the middle observations (say 5 to 10, or up to 1/5 of them, depending on your sample size).

3. Split the sample into the first N1 observations (with N1 > k) and the last N2 observations (with N2 > k).

4. Perform OLS of the original model y = Xβ + ε on the two subsamples, obtaining SSE1 and SSE2, where SSEi is the sum of squared errors in each subsample.

5. Calculate

S1² = SSE1/(N1 − k)   and   S2² = SSE2/(N2 − k)

6. The test statistic is then

S2²/S1² ∼ F[N2−k, N1−k]

under the null hypothesis that S1² = S2², versus the alternative hypothesis that Z causes the heteroscedasticity.

• Note: The reason that we split the sample up is that the test is best when comparing the highest values of Z and the lowest values of Z. At this point, we have polarized the data as much as possible within Z, so that a comparison of the variances across these two samples is more practical.

• Note that this test is only valid for a particular Z. Thus, if you fail to reject the null hypothesis, this is not the same as saying that there is no heteroscedasticity. Failing to reject the null only says that the particular Z variable that you have chosen to test against isn't causing the heteroscedasticity.

• Other problems may be more complicated. What if you think the heteroscedasticity is being caused by more than one variable? The Goldfeld-Quandt test is only useful if one variable causes the heteroscedasticity.

• In STATA the Goldfeld-Quandt test is not available as a canned routine.

11.6.3 Breusch-Pagan Test

• Breusch-Pagan Test: This test is useful when we think that there is more than one variable driving the heteroscedasticity.

1. Assume that σi² = g(Z_iα), where g is an arbitrary function, independent of i, Z_i are exogenous variables thought to affect σi², and α is a vector of parameters to be estimated.

2. The test boils down to whether or not α = 0, i.e., whether there is any statistical relationship between Z and σi².

3. A Lagrange Multiplier test amounts to an F-test on

ε̂i² = Z_iα + u_i

4. The LM test is obtained from the SST and SSE from OLS applied to

ε̂i²/σ̃² = Z_iα + u_i   where   σ̃² = (1/N) Σ_i ε̂i²

and the test statistic is

(1/2)(SST − SSE) ∼ χ²_{j−1}

where j is the number of variables in Z.

• In STATA the command bpagan or hettest implements this test; a short sketch follows.
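A short sketch of the canned version (hettest is the older name; estat hettest is the current syntax), with hypothetical variable names:

* Breusch-Pagan / Cook-Weisberg test after a regression.
reg y x1 x2 x3
estat hettest             // default uses the fitted values
estat hettest x1 x3       // test against specific Z variables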



11.6.4 White Test

• This is the test that is most robust when you have no idea where the heteroscedasticity is coming from.

1. We estimate the original model and obtain ε̂i².

2. Then we regress ε̂i² on the unique elements of X ⊗ X′, where ⊗ denotes the Kronecker product. Note: the Kronecker product is defined as

X ⊗ X′ = [ X11X′  ...  X1NX′ ]
         [   .           .   ]
         [ XN1X′  ...  XNNX′ ]

3. The short-hand way of doing this is to regress ε̂i² on a constant term, the square of each right-hand-side variable, and the unique interaction terms of the rhs variables. For example, if the model regressed is y = β0 + β1X1 + β2X2 + β3X3 + ε, then we regress ε̂i² on an intercept term, X1, X2, X3, X1², X2², X3², X1X2, X1X3, X2X3.

4. We obtain the R² from this secondary regression, and the test statistic, under the null hypothesis that there is no heteroscedasticity, is calculated as

N R² ∼ χ²_{p−1}

where p is the number of distinct elements in X ⊗ X′.

5. Note: If p > N then step three is not possible. We have to make the test parsimonious, so we drop some of the variables. Which ones? Typically drop some of the interaction terms.

• In STATA the command whitetst or imtest, white implements this test; a by-hand sketch follows.
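A hedged sketch of White's test computed by hand and then with the canned command, using hypothetical variable names y, x1, and x2:

* White's test by hand for a model with two regressors.
reg y x1 x2
predict ehat, resid
gen ehat2 = ehat^2
gen x1sq  = x1^2
gen x2sq  = x2^2
gen x1x2  = x1*x2
reg ehat2 x1 x2 x1sq x2sq x1x2
display "N*R2 = " e(N)*e(r2) ", p-value = " chi2tail(5, e(N)*e(r2))
* Canned version:
quietly reg y x1 x2
estat imtest, white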

11.7 General Corrections

• We have weighted least squares as a possible correction for heteroscedasticity.

• However, the goal is to get back to the full ideal conditions. Therefore, if we don't know the exact form of the heteroscedasticity, it is not practical to implement weighted least squares.

• Luckily, White (1980) has provided us with a correction. It simply adjusts the standard errors of the β̂i's and does not try to re-estimate the model.

• The White heteroscedastic-consistent covariance matrix of β̂ is given as

Asymptotic cov̂(β̂) = (1/N) [(1/N)X′X]⁻¹ [(1/N) Σ_{i=1}^{N} ε̂i² X_iX_i′] [(1/N)X′X]⁻¹
                   = N (X′X)⁻¹ S0 (X′X)⁻¹,   where S0 = (1/N) Σ ε̂i² X_iX_i′

• Note that there is no downside to this correction. If var(ε_i) = σ², then the middle term is (1/N) Σ σ² = σ² and cov̂(β̂) = cov(β̂).

• In STATA robust standard errors are available with the robust or r option after a standard reg command.



11.8

Example: Major League Baseball Attendance 1991-2003

• In this example we investigate the influences on professional baseball team attendance from 1991-2003 (the data are available in baseballdemand.dta. We include as explanatory variables the log of real ticket prices, the log of city per capita income, the log of city population, the once lagged team winning percentage, the number of other professional baseball franchises in the city, the annualized city unemployment rate, and the percentage of stadium capacity used on average, and year dummy variables. . reg lnatt lnrtix lnrinc lagwin lnpop subs unemp capuse yr91-yr02 Source | SS df MS -------------+-----------------------------Model | 32.0326262 19 1.68592769 Residual | 6.33010085 321 .01971994 -------------+-----------------------------Total | 38.362727 340 .11283155

Number of obs F( 19, 321) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

341 85.49 0.0000 0.8350 0.8252 .14043

-----------------------------------------------------------------------------lnatt | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------lnrtix | -.2782798 .049354 -5.64 0.000 -.375378 -.1811815 lnrinc | .1411211 .0201078 7.02 0.000 .1015615 .1806808 lagwin | .0006778 .0001231 5.50 0.000 .0004355 .00092 lnpop | .1183839 .0196911 6.01 0.000 .0796439 .1571239 subs | -.0998553 .028712 -3.48 0.001 -.1563428 -.0433678 unemp | -.0048731 .0087013 -0.56 0.576 -.0219919 .0122456 capuse | .0149096 .0005339 27.92 0.000 .0138592 .01596 yr91 | -.0239419 .0446612 -0.54 0.592 -.1118075 .0639236 yr92 | -.0321874 .0461611 -0.70 0.486 -.1230039 .058629 yr93 | .0095513 .0451907 0.21 0.833 -.0793562 .0984587 yr94 | -.0657123 .0412653 -1.59 0.112 -.1468969 .0154723 yr95 | -.1099128 .0415233 -2.65 0.009 -.191605 -.0282207 yr96 | -.0564592 .0418679 -1.35 0.178 -.1388295 .025911 yr97 | -.0417006 .0419565 -0.99 0.321 -.1242451 .0408438 yr98 | -.0187739 .0425066 -0.44 0.659 -.1024006 .0648529 yr99 | .0023851 .0422016 0.06 0.955 -.0806416 .0854117 yr00 | .0004738 .0420967 0.01 0.991 -.0823466 .0832941 178


yr01 | -.0047924 .0398704 -0.12 0.904 -.0832326 .0736479 yr02 | -.0072476 .037566 -0.19 0.847 -.0811543 .0666591 _cons | 11.89959 .3217043 36.99 0.000 11.26667 12.5325 ------------------------------------------------------------------------------

• All parameter estimates take the expected sign and are significant except for the level of unemployment. • We suspect there is heteroscedasticity in the data because of differences in market size and, perhaps, other influences. We plot the residuals from this regression against the

log of city population:

[Figure: fitted residuals plotted against lnpop]

• It looks like there might be some heteroscedasticity in the data with respect to population but it isn’t extreme. • We can plot the residuals against other independent variables, such as capuse:

[Figure: fitted residuals plotted against capuse]

• Here there seems to be a bit more obvious heteroscedasticity. • We next move to more formal testing. First, the Goldfeld-Quandt test for heteroscedasticity with respect to population: . reg lnatt lnrtix lnrinc lagwin lnpop subs unemp capuse yr91-yr02 Source | SS df MS -------------+-----------------------------Model | 32.0326262 19 1.68592769 Residual | 6.33010085 321 .01971994 -------------+-----------------------------Total | 38.362727 340 .11283155

Number of obs F( 19, 321) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

341 85.49 0.0000 0.8350 0.8252 .14043

. sort lnpop . reg lnatt lnrtix lnrinc lagwin lnpop subs unemp capuse yr91-yr02 if _n<124 Source | SS df MS -------------+-----------------------------Model | 10.8164034 18 .6009113 Residual | 1.46879382 81 .018133257 -------------+-----------------------------Total | 12.2851972 99 .124092901 180

Number of obs F( 18, 81) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

100 33.14 0.0000 0.8804 0.8539 .13466


. reg lnatt lnrtix lnrinc lagwin lnpop subs unemp capuse yr91-yr02 if _n>288 Source | SS df MS -------------+-----------------------------Model | 5.99326998 19 .315435262 Residual | .723368494 80 .009042106 -------------+-----------------------------Total | 6.71663847 99 .067844833

Number of obs F( 19, 80) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

100 34.89 0.0000 0.8923 0.8667 .09509

. disp (1.46/(100-18))/(0.72/(100-19)) 2.00 . disp invFtail(82,81,.05) 1.443

• The GQ test statistic is then

F = [1.46/(100 − 18)] / [0.72/(100 − 19)] = 2.00 > Fc = 1.44

• In this case we would reject the hypothesis of homoscedasticity with respect to the log of population, even though it didn't look like there was much heteroscedasticity in the graph.

• We can move to the Breusch-Pagan test, which allows for more general sources of heteroscedasticity:

. hettest lnpop

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: lnpop

         chi2(1)      =     0.91
         Prob > chi2  =   0.3402

. hettest capuse

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: capuse

         chi2(1)      =     4.53
         Prob > chi2  =   0.0334

• These tests suggest that heteroscedasticity might be a problem, most especially with respect to capacity usage. • We can also implement White’s Test using the command whitetst or the command imtest, white: . whitetst White’s general test statistic :

196.8705

Chi-sq(130)

. imtest, white White’s test for Ho: homoskedasticity against Ha: unrestricted heteroskedasticity chi2(130) Prob > chi2

= =

196.87 0.0001

Cameron & Trivedi’s decomposition of IM-test --------------------------------------------------Source | chi2 df p ---------------------+----------------------------Heteroskedasticity | 196.87 130 0.0001 Skewness | 28.61 19 0.0724 Kurtosis | 0.54 1 0.4638 ---------------------+----------------------------Total | 226.02 150 0.0001 --------------------------------------------------182

P-value =

1.4e-04


• The results of the tests looking at specific sources of heteroscedasticity don’t fare as well as the general tests. However, it seems likely that heteroscedasticity of some form exists in the data. • One correction is to apply a specific form of heteroscedasticity in a WLS approach. Stata has numerous ways to implement weighted least squares. One command is the reghv command which is a user-added command which can be installed using the ssc install regh if your computer is connected to the Internet. . reghv lnatt lntix lnrinc lagwin lnpop subs unemp capuse yr91-yr02,var(lnrinc capuse Multiplicative heteroscedastic regression Estimator: mle Model chi2(21) = 624.565

Number of obs

=

341

Prob > chi2 = 0.000 Log Likelihood = 200.932 Pseudo R2 = 2.8045 VWLS R2 = 0.8467 -----------------------------------------------------------------------------lnatt | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------lp_mean | lntix | -.2625372 .0474059 -5.54 0.000 -.355451 -.1696234 lnrinc | .1421367 .0192399 7.39 0.000 .1044271 .1798462 lagwin | .0006463 .0001175 5.50 0.000 .000416 .0008766 lnpop | .1184888 .0177219 6.69 0.000 .0837546 .153223 subs | -.0974033 .0262315 -3.71 0.000 -.1488161 -.0459905 unemp | -.0010126 .0078463 -0.13 0.897 -.0163911 .0143658 capuse | .0156725 .0005196 30.16 0.000 .014654 .016691 yr91 | -.0659296 .0480046 -1.37 0.170 -.1600169 .0281577 yr92 | -.0697102 .0484363 -1.44 0.150 -.1646436 .0252232 yr93 | -.0174509 .0477851 -0.37 0.715 -.1111081 .0762062 yr94 | -.0944295 .04127 -2.29 0.022 -.1753172 -.0135419 yr95 | -.1354816 .0413096 -3.28 0.001 -.216447 -.0545162 yr96 | -.0765455 .0419235 -1.83 0.068 -.158714 .0056229 yr97 | -.0608344 .0420279 -1.45 0.148 -.1432076 .0215388 yr98 | -.0445053 .0422915 -1.05 0.293 -.1273951 .0383845 yr99 | -.0150456 .0413368 -0.36 0.716 -.0960641 .065973 yr00 | -.0201181 .0411384 -0.49 0.625 -.1007479 .0605116 183


yr01 | -.0141837 .0392728 -0.36 0.718 -.091157 .0627895 yr02 | -.0228795 .036702 -0.62 0.533 -.0948141 .0490551 _cons | 11.81567 .2925199 40.39 0.000 11.24234 12.389 -------------+---------------------------------------------------------------lp_lnvar | lnrinc | .1697198 .0813125 2.09 0.037 .0103502 .3290894 capuse | .0123682 .0037211 3.32 0.001 .0050751 .0196614 _cons | -5.327524 .3930449 -13.55 0.000 -6.097878 -4.55717 ------------------------------------------------------------------------------

• In this case our inferences are not affected by the correction for heteroscedasticity. The bottom panel reports the estimates for the heteroscedasticity function. Teams in host cities with greater income have greater variation in attendance. Those teams that use a higher percentage of their stadium also enjoy higher variance in attendance (does this make sense?)If these variables were not related to the heteroscedasticity, their estimates would be insignificant (jointly) and the only significant variable would be the constant term. (why?) • We could implement Huber-White corrected standard errors, using the robust option to the standard reg command. . reg lnatt lntix lnrinc lagwin lnpop subs unemp capuse yr91-yr02,r Linear regression

Number of obs F( 19, 321) Prob > F R-squared Root MSE

= = = = =

341 77.32 0.0000 0.8350 .14043

-----------------------------------------------------------------------------| Robust lnatt | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------lntix | -.2782798 .0543087 -5.12 0.000 -.3851256 -.1714339 184


lnrinc | .1411211 .019711 7.16 0.000 .1023421 .1799002 lagwin | .0006778 .0001096 6.19 0.000 .0004622 .0008933 lnpop | .1183839 .0187513 6.31 0.000 .081493 .1552748 subs | -.0998553 .0292461 -3.41 0.001 -.1573936 -.0423171 unemp | -.0048731 .0092985 -0.52 0.601 -.0231668 .0134205 capuse | .0149096 .0006208 24.02 0.000 .0136883 .0161309 yr91 | -.0880087 .0503316 -1.75 0.081 -.1870302 .0110129 yr92 | -.089933 .0485685 -1.85 0.065 -.1854859 .0056198 yr93 | -.0418247 .0518327 -0.81 0.420 -.1437993 .0601499 yr94 | -.1112307 .0456196 -2.44 0.015 -.2009819 -.0214796 yr95 | -.1497853 .0440048 -3.40 0.001 -.2363594 -.0632111 yr96 | -.0911239 .0416693 -2.19 0.029 -.1731032 -.0091445 yr97 | -.0717777 .0421388 -1.70 0.089 -.1546808 .0111255 yr98 | -.0457763 .041863 -1.09 0.275 -.1281367 .0365841 yr99 | -.0206079 .0403115 -0.51 0.610 -.0999161 .0587003 yr00 | -.0165278 .0415649 -0.40 0.691 -.0983018 .0652462 yr01 | -.0151941 .0374446 -0.41 0.685 -.0888619 .0584738 yr02 | -.0128002 .0380569 -0.34 0.737 -.0876725 .0620722 _cons | 11.91659 .3159352 37.72 0.000 11.29502 12.53816 ------------------------------------------------------------------------------

• Notice that in this case there is no change in parameter estimates from the original regression model and unemployment is still insignificant. • Borrowed from Mike Barrow from Sussex University: Practical advice Although we have given a reasonably formal treatment, the practical solution is often somewhat ad hoc. The problem often arises because the scale of a variable varies enormously within the sample. For example, US population is much greater than that of Honduras. If the error variance were related to population in some way then heteroscedasticity would be a serious problem. Estimating the equation in per capita terms would improve matters. US GDP for example is much greater than that of



Honduras, but GDP per head is not so different in scale. This is reasonably intuitive and also good econometric practice. Variables which differ dramatically in scale can also cause rounding errors, even for computers, so it makes sense to ensure all variables are of similar size (it’s OK if some are ten times larger than others, but if some are 10,000 larger than others you might have problems). Taking logs is another way in which heteroscedasticity can be removed, since this transformation reduces the variation in the variables. These simple, practical measures will probably cure most heteroscedasticity problems.



12 Autocorrelation

12.1 Motivation

• Autocorrelation occurs when something that happens today has an impact on what happens tomorrow, and perhaps even further into the future.

• This is a phenomenon mainly found in time-series applications.

• Note: Autocorrelation can only run from the past into the present, not from the future backwards.

• It is typically found in financial data, macro data, and sometimes in wage data.

• Autocorrelation occurs when cov(ε_i, ε_j) ≠ 0 for some i ≠ j.

12.2 AR(1) Errors

• AR(1) errors occur when yi = Xi β + ²i and

²i = ρ²i−1 + ui where ρ is the autocorrelation coefficient, |ρ| < 1 and ui ∼ N (0, σu2 ). • Note: In general we can have AR(p) errors which implies p lagged terms in the error structure, i.e., ²i = ρ1 ²i−1 + ρ2 ²i−2 + · · · + ρp ²i−p • Note: We will need |ρ| < 1 for stability and stationarity. If |ρ| < 1 happens to fail then we have the following problems: 1. ρ = 0: No serial correlation present

187


2. ρ > 1: The process explodes 3. ρ = 1: The process follows a random walk 4. ρ = −1: The process is oscillatory 5. ρ < −1: The process explodes in an oscillatory fashion • The consequences for OLS: βˆ is unbiased and consistent but no longer efficient and usual statistical inference is rendered invalid. • Lemma: ²i =

∞ X

ρj ui−j

j=0

• Proof:

²i = ρ²i−1 + ui ²i−1 = ρ²i−2 + ui−1 ²i−2 = ρ²i−3 + ui−2

Thus, via substitution we obtain

²i−1 = ρ²i−2 + ui−1 = ρ(ρ²i−3 + ui−2 ) + ui−1 = ρ2 ²i−3 + ρui−2 + ui−1 and ²i = ρ(ρ2 ²i−3 + ρui−2 + ui−1 ) + ui = ρ3 ²i−3 + ρ2 ui−2 + ρui−1 + ui

188


If we continue to substitute for ²i−k we get

²i =

∞ X

ρj ui−j

j=0

• Note the expectation of ²i is ∞ X E[²i ] = E[ ρj ui−j ] j=0

=

∞ X

ρj E[ui−j ]

j=0

=

∞ X

ρj 0 = 0

j=0

• The variance of ² is var(²i ) = E[²2i ] = E[(ui + ρui−1 + ρ2 ui−2 + · · ·)2 ] = E[u2i + ρui−1 ui + ρ2 u2i−1 + ρ4 u2i−2 + · · ·] var(²i ) = σu2 + ρ2 σu2 + ρ4 σu2 + · · ·

• Note: E[ui uj ] = 0 for all i 6= j via the white noise assumption. Therefore, all terms ρN where N is odd are wiped out. This is not the same as E[²i , ²j ] = 0. • Therefore, the var(²i ) is var(²i ) = σu2 + ρ2 σu2 + ρ4 σu2 + · · · = σu2 + ρ2 (var(²i−1 ))

189


But, assuming homoscedasticity, var(²i ) = var(²i−1 ) so that var(²i ) = σu2 + ρ2 (var(²i−1 )) = σu2 + ρ2 (var(²i )) var(²i ) =

σu2 ≡ σ2 1 − ρ2

• Note: This is why we need |ρ| < 1 for stability in the process. • If |ρ| > 1 then the denominator is negative and the var(²i ) cannot be negative. • What about the covariance across different observations?

cov(²i ²i−1 ) = E[²i ²j ] = E[(ρ²i−1 + ui )²i−1 ) = E[ρ²i−1 ²i−1 + ui ²i−1 ]: but Ui and ²i−1 are independent, so cov(²i ²i−1 ) = ρvar(²i−1 ) + 0: but var(²i ) = cov(²i ²i−1 ) =

2 σu , 1−ρ2

so

ρ σ2 1 − ρ2 u

• In general cov(²i ²i−j ) = E[²i ²i−j ] =

190

ρj−i 2 σ 1 − ρ2 u


• which implies that 

       2 σu  2  σ Ω= 1 − ρ2       

2

N −1

· ρ

1

ρ

ρ

ρ

1

·

·

· ρN −2

ρ2

·

1

·

·

·

·

·

·

·

·

·

·

·

·

·

·

·

·

·

·

1

ρN −1 ρN −2

ρ

3

              

• We note the correlation between ²i and ²i−1 . cov(²i , ²i−1 )

corr(²i , ²i−1 ) = p

var(²i )var(²i−1 )

=

ρ σ2 1−ρ2 u 2 σu 1−ρ2

where ρ is the correlation coefficient. • Note: If we know Ω then we can apply our previous results of GLS for an easy fix. • However, we rarely know the actual structure of Ω. • At this point the following results hold 1. The OLS estimate of s2 is biased but consistent 2. s2 is usually biased downward because we usually find ρ > 0 in economic data. • This implies that σ 2 (X 0 X)−1 tends to be less than σ 2 (X 0 X)−1 X 0 ΩX(X 0 X)−1 if ρ > 0 and the variables of X are positively correlated over time. • This implies that t-statistics are over-stated and we may introduce Type I errors in our inferences. 191


• How do we know if we have Autocorrelation or not?

12.3 Tests for Autocorrelation

1. Plot residuals (ˆ²i ) against time. 2. Plot residuals (ˆ²i ) against ²ˆi−1 3. The Runs Test • Take the sign of each residual and write them out as such (++++)

(——-) (++++)

(4)

(7)

(4)

(-)

(+)

(—)

(++++++)

(1)

(1)

(3)

(6)

• Let a ”run” be an uninterrupted sequence of the same sign and let the ”length” be the number of elements in a run. • Here we have 7 runs: 4 plus, 7 minus, 4 plus, 1 minus, 1 plus, 3 minus, 6 plus. • Then to complete the test let N = n1 + n2

Total Observations

n1

Number of positive residuals

n2

Number of negative residuals

k

Number of runs

• Let H0 : Errors are Random and Hα : Errors are Correlated • At the 0.05 significance level, we fail to reject the null hypothesis if

E[k] − 1.96σk ≤ k ≤ E[k] + 1.96σk

192


where E[k] =

2n1 n2 2n1 n2 (2n1 n2 − n1 − n2 ) ; σk2 = n1 + n2 (n1 + n2 )2 (n1 + n2 − 1)

• Here we have n1 = 15, n2 = 11 and k = 7 thus E[k] = 13.69 σk2 = 5.93 and σk = 2.43 • Thus our confidence interval is written as

[13.69 ± (1.96)(2.43)] = [8.92, 18.45]

However, k = 7, so we reject the null hypothesis that the errors are truly random. In STATA, after a reg command, calculate the fitted residuals and use the runtest command, e.g.:

reg y x1 x2 x3
predict res, resid
runtest res

4. Durbin-Watson Test

• The Durbin-Watson test is a very popular test for AR(1) error terms.

• Assumptions: (a) the regression has a constant term; (b) no lagged dependent variables; (c) no missing values; (d) an AR(1) error structure.

• The null hypothesis is that ρ = 0, i.e., that there is no serial correlation.

• The test statistic is calculated as

d = [Σ_{t=2}^{N} (ε̂_t − ε̂_{t−1})²] / [Σ_{t=1}^{N} ε̂_t²]


which is equivalent to 

       0  ²ˆ Aˆ² where A =   0 ²ˆ ²ˆ      

1

−1

0

·

·

−1

2

0

−1

2

0 ·

·

·

·

·

·

·

·

· 2

·

·

·

· 1

−1 0 ·

·

0   0     0    ·    −1    1

• An equivalent test is d = 2(1 − ρˆ) where ρˆ comes from ²ˆt = ρˆ²t−1 + ut . • Note that −1 ≤ ρ ≤ 1 so that d ∈ [0, 4] where (a) d = 0 indicates perfect positive serial correlation (b) d = 4 indicates perfect negative serial correlation (c) d = 2 indicates no serial correlation. • Some statistical packages report the Durbin-Watson statistic for every regression command. Be careful to only use the DW statistic when it makes sense. • a rule of thumb for the DW test: a statistic very close to 2, either above or below, suggests that serial correlation is not a major problem. • There is a potential problem with the DW test, however. The DW test has three regions: We can reject the null, we can fail to reject the null, or we may have an inconclusive result. • The reason for the ambiguity is that the DW statistic does not follow a standard distribution. The distribution of the statistic depends on the ²ˆt , which are dependent upon the Xt0 s in the model. Further, each application of the test has a different number of degrees of freedom. 194


• To implement the Durbin-Watson test (a) Calculate the DW statistic (b) Using N , the number of observations, and k the number of rhs variables (excluding the intercept) determine the upper and lower bounds of the DW statistic. • Let H0 : No positive correlation (ρ ≤ 0) and H0∗ : Positive autocorrelation (ρ > 0) • Then if d < DWL

Reject H0 : Evidence of positive correlation

DWL < d < DWU

We have an inconclusive result.

DWU < d < 4 − DWU

Fail to reject H0 or H0∗

4 − DWU < d < 4 − DWL

We have an inconclusive result.

4 − DWU < d < 4

We reject H0∗ : Evidence of negative correlation

• For example, let N = 25, k = 3 then DWL = 0.906 and DWU = 1.409. If d = 1.78 then d > DWU but d < 4 − DWU and we fail to reject the null. • Graphically this looks like

... ... .. ... ... ... ... ... ... ... ...

0

Reject H0 Positive Correlation

... ... .. ... ... ... ... ... ... ... ...

Inconclusive Zone

DWL

... ... .. ... ... ... ... ... ... ... ...

Fail to Reject H0 or H0∗

DWU

2

... ... .. ... ... ... ... ... ... ... ...

Inconclusive Zone

4 − DWU

... ... .. ... ... ... ... ... ... ... ...

Reject H0∗ Negative Correlation

4 − DWL

4

• The range of inconclusiveness is a problem. Some programs, such as TSP, will generate a P -value for the DW calculated. If not, then non-parametric tests may be useful at this point.

195


• Let's look a little closer at our DW statistic:

DW = \frac{\sum_{i=2}^{N}\hat{\epsilon}_i^2 - 2\sum_{i=2}^{N}\hat{\epsilon}_i\hat{\epsilon}_{i-1} + \sum_{i=2}^{N}\hat{\epsilon}_{i-1}^2}{\hat{\epsilon}'\hat{\epsilon}} = \frac{\hat{\epsilon}'\hat{\epsilon} - 2\sum_{i=2}^{N}\hat{\epsilon}_i\hat{\epsilon}_{i-1} + \hat{\epsilon}'\hat{\epsilon} - \hat{\epsilon}_1^2 - \hat{\epsilon}_N^2}{\hat{\epsilon}'\hat{\epsilon}}

why? Note the following:

\sum_{i=2}^{N}\hat{\epsilon}_i^2 = \hat{\epsilon}_2^2 + \hat{\epsilon}_3^2 + \cdots + \hat{\epsilon}_N^2
\sum_{i=2}^{N}\hat{\epsilon}_{i-1}^2 = \hat{\epsilon}_1^2 + \hat{\epsilon}_2^2 + \cdots + \hat{\epsilon}_{N-1}^2
\hat{\epsilon}'\hat{\epsilon} = \hat{\epsilon}_1^2 + \hat{\epsilon}_2^2 + \cdots + \hat{\epsilon}_N^2

Therefore we have simply added and subtracted \hat{\epsilon}_1^2 and \hat{\epsilon}_N^2. Therefore,

DW = \frac{2\hat{\epsilon}'\hat{\epsilon} - 2\sum_{i=2}^{N}\hat{\epsilon}_i\hat{\epsilon}_{i-1} - \hat{\epsilon}_1^2 - \hat{\epsilon}_N^2}{\hat{\epsilon}'\hat{\epsilon}} = 2 - \frac{2\sum_{i=2}^{N}(\hat{\rho}\hat{\epsilon}_{i-1} + u_i)\hat{\epsilon}_{i-1} + [\hat{\epsilon}_1^2 + \hat{\epsilon}_N^2]}{\hat{\epsilon}'\hat{\epsilon}}

then DW = 2 − 2γ_1 ρ̂ − γ_2 where

\gamma_1 = \frac{\sum_{i=2}^{N}\hat{\epsilon}_{i-1}^2}{\hat{\epsilon}'\hat{\epsilon}} \qquad \text{and} \qquad \gamma_2 = \frac{\hat{\epsilon}_1^2 + \hat{\epsilon}_N^2}{\hat{\epsilon}'\hat{\epsilon}}

• Note that as N → ∞, γ_1 → 1 and γ_2 → 0, so that DW → 2 − 2ρ̂.
• Under H0: ρ = 0, and thus DW = 2.
• Note: We can calculate ρ̂ as ρ̂ = 1 − 0.5 DW.

5. Durbin's h-Test

• The Durbin-Watson test assumes that X is non-stochastic. This may not always be the case, e.g., if we include lagged dependent variables on the right-hand side.


• Durbin offers an alternative test in this case.
• Under the null hypothesis that ρ = 0 the test becomes

h = \left(1 - \frac{d}{2}\right)\sqrt{\frac{N}{1 - N\,\widehat{var}(\hat{\alpha})}}

where α is the coefficient on the lagged dependent variable.
• Note: If N\,\widehat{var}(\hat{\alpha}) > 1 then we have a problem, because we cannot take the square root of a negative number.
• Durbin's h statistic is approximately distributed as a normal with unit variance.

6. Wald Test

• It can be shown that

\sqrt{N}(\hat{\rho} - \rho) \xrightarrow{d} N(0,\, 1 - \rho^2)

so that the test statistic

W = \frac{\hat{\rho}}{\sqrt{(1 - \hat{\rho}^2)/N}} \xrightarrow{d} N(0, 1)

7. Breusch-Godfrey Test

• This is basically a Lagrange Multiplier test of H0: no autocorrelation versus Ha: the errors are AR(p).
• Regress ε̂_i on X_i, ε̂_{i-1}, ..., ε̂_{i-p} and obtain N R² ∼ χ²_p, where p is the number of lagged values that contribute to the correlation.
• The intuition behind this test is rather straightforward. We know that X'ε̂ = 0, so any R² > 0 in the auxiliary regression must be caused by correlation between the current and the lagged residuals. A minimal Stata sketch of this test follows.
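• The following sketch (hypothetical variable names; the built-in command requires Stata 9 or later) uses Stata's own version of the test and then computes the LM statistic by hand:

tsset t
reg y x1 x2
estat bgodfrey, lags(2)
* the same statistic by hand, via the auxiliary regression
predict e, resid
quietly reg e x1 x2 L.e L2.e
display "N*R2 = " e(N)*e(r2) ", p-value = " chi2tail(2, e(N)*e(r2))

Here p = 2 lags are used purely for illustration; the by-hand value should be close to (but not identical with) the built-in statistic, since Stata handles the missing initial lags slightly differently.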


8. Box-Pierce Test

• This is also called the Q-test. It is calculated as

Q = N \sum_{i=1}^{L} r_i^2 \qquad \text{where} \qquad r_i = \frac{\sum_{j=i+1}^{N}\hat{\epsilon}_j\hat{\epsilon}_{j-i}}{\sum_{j=1}^{N}\hat{\epsilon}_j^2}

and Q ∼ χ²_L, where L is the number of lags in the correlation.
• A criticism of this approach is how to choose L.
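• In Stata, the closest built-in implementations are the portmanteau test reported by wntestq (which uses the Ljung-Box refinement of the Box-Pierce Q) and the Q statistics printed by corrgram; a minimal sketch with hypothetical names:

tsset t
quietly reg y x1 x2
predict e, resid
wntestq e, lags(12)
corrgram e, lags(12)

Both commands face the same practical question raised above: how many lags L to include.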

12.4 Correcting an AR(1) Process

• One way to fix the problem is to get the error term of the estimated equation to satisfy the full ideal conditions. One way to do this might be through substitution.
• Consider the model we estimate is y_t = β_0 + β_1 X_t + ε_t where ε_t = ρε_{t-1} + u_t and u_t ∼ (0, σ_u²).
• It is possible to rewrite the original model as

y_t = β_0 + β_1 X_t + ρε_{t-1} + u_t, but ε_{t-1} = y_{t-1} − β_0 − β_1 X_{t-1}, thus
y_t = β_0 + β_1 X_t + ρ(y_{t-1} − β_0 − β_1 X_{t-1}) + u_t   (via substitution)
y_t − ρy_{t-1} = β_0(1 − ρ) + β_1(X_t − ρX_{t-1}) + u_t   (via gathering terms)
⇒ y_t^* = β_0^* + β_1 X_t^* + u_t

• We can estimate the transformed model, which satisfies the full ideal conditions as long as ut satisfies the full ideal conditions.
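• If ρ were known, the transformation is easy to carry out by hand in Stata; a minimal sketch with hypothetical names (and an assumed value of ρ purely for illustration):

tsset t
scalar rho = 0.5
gen ystar = y - rho*L.y
gen xstar = x - rho*L.x
reg ystar xstar

The estimated constant in the transformed regression corresponds to β_0(1 − ρ) = β_0^*, and the first observation is lost to the lag.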



• One downside is the loss of the first observation, which can be a considerable sacrifice in degrees of freedom. For instance, if our sample size were 30 observations, this transformation would cost us approximately 3% of the sample.
• An alternative would be to implement GLS if we know Ω, i.e., we know ρ, such that

\tilde{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y

where

\Omega = \frac{1}{1-\rho^2}\begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \cdots & \rho^{N-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{N-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \cdots & 1 \end{bmatrix}

• Note that for GLS we seek Ω^{-1/2} such that Ω^{-1/2}Ω(Ω^{-1/2})' = I and transform the model. Thus we estimate Ω^{-1/2}y = Ω^{-1/2}Xβ + Ω^{-1/2}ε where

\Omega^{-1/2} = \begin{bmatrix} \sqrt{1-\rho^2} & 0 & 0 & \cdots & 0 \\ -\rho & 1 & 0 & \cdots & 0 \\ 0 & -\rho & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -\rho & 1 \end{bmatrix}

• This is known as the Prais-Winsten (1954) Transformation Matrix.
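• In Stata, both versions of this FGLS estimator are available through the prais command; a minimal sketch with hypothetical names:

tsset t
prais y x           /* Prais-Winsten: keeps the transformed first observation */
prais y x, corc     /* Cochrane-Orcutt: drops the first observation */

Worked examples with real data appear later in this chapter.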



• This implies that the transformed model is

1st observation: \sqrt{1-\rho^2}\,y_1 = \sqrt{1-\rho^2}\,X_1\beta + \sqrt{1-\rho^2}\,\epsilon_1
Other N − 1 obs.: (y_i - \rho y_{i-1}) = (X_i - \rho X_{i-1})\beta + u_i, \quad \text{where } u_i = \epsilon_i - \rho\epsilon_{i-1}

• Thus,

cov(\tilde{\beta}) = \sigma_u^2 (X'\Omega^{-1}X)^{-1} \qquad \text{and} \qquad \tilde{\sigma}^2 = \frac{1}{N}(y - X\tilde{\beta})'\Omega^{-1}(y - X\tilde{\beta})

12.4.1 What if ρ is unknown?

• We seek a consistent estimator of ρ so as to run Feasible GLS.
• Methods of estimating ρ:

1. Cochrane-Orcutt (1949): Throw out the first observation. We assume an AR(1) process, which implies ε_i = ρε_{i-1} + u_i. So, we run OLS on ε̂_i = ρε̂_{i-1} + u_i and obtain

\hat{\rho} = \frac{\sum_{i=2}^{N}\hat{\epsilon}_i\hat{\epsilon}_{i-1}}{\sum_{i=2}^{N}\hat{\epsilon}_{i-1}^2}

which is the OLS estimator of ρ. Note: ρ̂ is a biased estimator of ρ, but it is consistent, and that is all we really need. With ρ̂ in hand we can go to an FGLS procedure.
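A minimal Stata sketch of this first step, using hypothetical names:

tsset t
quietly reg y x
predict e, resid
reg e L.e, noconstant
display "rho-hat = " _b[L.e]

The displayed coefficient is then plugged into the quasi-differencing transformation (or handed to prais, which automates the iteration).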


2. Durbin's Method (1960): After substituting for ε_i we see that

y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \rho\epsilon_{i-1} + u_i
    = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik} + \rho(y_{i-1} - \beta_0 - \beta_1 X_{i-1,1} - \cdots - \beta_k X_{i-1,k}) + u_i

So, we run OLS on

y_i = \rho y_{i-1} + (1 - \rho)\beta_0 + \beta_1 X_{i1} - \rho\beta_1 X_{i-1,1} + \cdots + \beta_k X_{ik} - \rho\beta_k X_{i-1,k} + u_i

From this we obtain ρ̂, which is the coefficient on y_{i-1}. This parameter estimate is biased but consistent. Note: When k is large, we may have a problem with degrees of freedom. To preserve degrees of freedom we must have N > 2k + 1 observations to employ this method; in small samples it may not be feasible.

3. Newey-West Covariance Matrix: We can correct the covariance matrix of β̂ much like we did in the case of heteroscedasticity. This extension of White (1980) was offered by Newey and West. We seek a consistent estimator of X'ΩX, which then leads to

\widehat{cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1}\,\widehat{X'\Omega X}\,(X'X)^{-1}

where

\widehat{X'\Omega X} = \frac{1}{N}\sum_{i=1}^{N}\hat{\epsilon}_i^2 X_i X_i' + \frac{1}{N}\sum_{i=1}^{L}\sum_{j=i+1}^{N}\omega_i\,\hat{\epsilon}_j\hat{\epsilon}_{j-i}\,(X_j X_{j-i}' + X_{j-i} X_j')

and ω_i = 1 − i/(L + 1). A possible problem in this approach is determining L, i.e., how far back into the past to go to correct the covariance matrix for autocorrelation.
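• In Stata the corresponding command is newey, which requires the data to be tsset and an explicit lag choice; a minimal sketch with hypothetical names:

tsset t
newey y x1 x2, lag(4)

The lag(#) option is the L above; the parameter estimates are identical to OLS, only the standard errors change.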

12.5 Large Sample Fix

• We sometimes use this method because simulation studies have shown that a more efficient estimator may be obtained by including lagged dependent variables.
• Include lagged dependent variables until the autocorrelation disappears. We know when this happens because the estimated coefficient on the kth lag will be insignificant.
• Problem: Estimates are biased, but they are consistent. Be careful!
• This approach is useful in time-series studies with lots of data, where you are safely within the "large sample" world.

12.6 Forecasting in the AR(1) Environment

• Having estimated β̃_GLS, we know that β̃_GLS is BLUE when cov(ε) = σ²Ω with Ω ≠ I.
• With an AR(1) process, we know that tomorrow's output is dependent upon today's output and today's random error.
• We estimate y_t = X_tβ + ε_t where ε_t = ρε_{t-1} + u_t.


• The forecast becomes

yt+1 = Xt+1 β + ²t+1 = Xt+1 β + ρ²t + ut+1

• To finish the forecast, we need ρ̂ from our previous estimation techniques, and then we recognize that ε̃_t = y_t − X_tβ̃ from GLS estimation. We assume that u_{t+1} has a zero mean.
• Then we see that ŷ_{t+1} = X_{t+1}β̃ + ρ̂ε̃_t.
• What if X_{t+1} doesn't exist? This occurs when we try to perform out-of-sample forecasting. Perhaps we use X_tβ̃?
• In general we find that ŷ_{t+s} = X_{t+s}β̃ + ρ̂^s ε̃_t. A minimal Stata sketch of the one-step-ahead case follows.
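• The following sketch (hypothetical variable names; the value used for X_{t+1} is simply the last observed x, purely for illustration) builds the one-step-ahead forecast from a prais estimation:

tsset t
prais y x
predict yhat                 /* fitted values X_t * beta-tilde */
gen etilde = y - yhat        /* GLS residuals */
display "one-step forecast = " _b[_cons] + _b[x]*x[_N] + e(rho)*etilde[_N]

Here e(rho) is the estimated ρ stored by prais, and etilde[_N] is the last in-sample residual ε̃_t.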

12.7 Example: Gasoline Retail Prices

• In this example we look at the relationship between the U.S. average retail price of gasoline and the wholesale price of gasoline from January 1985 through February 2006, using the Stata data file gasprices.dta.
• As an initial step, we plot the two series over time and notice a highly correlated set of series:



[Figure: time-series plot of allgradesprice and wprice against obs]

• A simple OLS regression model produces: . reg allgradesprice wprice Source | SS df MS Number of obs = 254 -------------+-----------------------------F( 1, 252) = 3467.83 Model | 279156.17 1 279156.17 Prob > F = 0.0000 Residual | 20285.6879 252 80.4987614 R-squared = 0.9323 -------------+-----------------------------Adj R-squared = 0.9320 Total | 299441.858 253 1183.56466 Root MSE = 8.9721 -----------------------------------------------------------------------------allgradesp~e | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------wprice | 1.219083 .0207016 58.89 0.000 1.178313 1.259853 _cons | 31.98693 1.715235 18.65 0.000 28.60891 35.36495

• The results suggest that for every penny in wholesale price, there is a 1.21 penny increase in the average retail price of gasoline. The constant term suggests that, on average, there is approximately a 32 cent difference between retail and wholesale prices, comprised of profits and state and federal taxes.
• A Durbin-Watson statistic calculated after the regression yields

. dwstat
Durbin-Watson d-statistic(2, 254) = .1905724

. disp 1 - .19057/2
.904715

• The DW statistic suggests that the data suffer from significant autocorrelation. Backing out an estimate via ρ̂ = 1 − d/2 suggests that ρ ≈ 0.904.

• Here is a picture of the fitted residuals against time:

[Figure: fitted residuals (roughly −20 to 20) plotted against obs]

• Here are robust-regression results:



. reg allgradesprice wprice, r

Regression with robust standard errors Number of obs= 254 F( 1, 252) = 5951.12 Prob > F = 0.0000 R-squared = 0.9323 Root MSE = 8.9721 -----------------------------------------------------------------------------| Robust allgradesp~e | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------wprice | 1.219083 .0158028 77.14 0.000 1.18796 1.250205 _cons | 31.98693 1.502928 21.28 0.000 29.02703 34.94683 • The robust regression results suggest that the naive OLS over-states the variance in the parameter estimate on wprice, but the positive value of ρ suggests the opposite is likely true. • Various “fixes” are possible. First, Newey-West standard errors: . newey allgradesprice wprice, lag(1) Regression with Newey-West standard errors Number of obs = 254 maximum lag: 1 F(1,252) = 3558.42 Prob > F = 0.0000 ---------------------------------------------------------------------| Newey-West allgrades | Coef. Std. Err. t P>|t| [95% Conf. Interval] ----------+----------------------------------------------------------wprice | 1.219083 .0204364 59.65 0.000 1.178835 1.259331 _cons | 31.98693 2.023802 15.81 0.000 28.00121 35.97265 • The Newey-West corrected standard errors, assuming AR(1) errors, are significantly higher than the robust OLS standard errors but are only slightly lower than those in naive OLS. 206


• Prais-Winsten using Cochrane-Orcutt transformation (note: the first observation is lost): Cochrane-Orcutt AR(1) regression -- iterated estimates Source | SS df MS Number of obs = 253 -------------+-----------------------------F( 1, 251) = 606.24 Model | 5740.73875 1 5740.73875 Prob > F = 0.0000 Residual | 2376.81962 251 9.46940088 R-squared = 0.7072 -------------+-----------------------------Adj R-squared = 0.7060 Total | 8117.55837 252 32.2125332 Root MSE = 3.0772 -----------------------------------------------------------------------------allgradesp~e | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------wprice | .8133207 .0330323 24.62 0.000 .7482648 .8783765 _cons | 75.27718 12.57415 5.99 0.000 50.5129 100.0415 -------------+---------------------------------------------------------------rho | .9840736 -----------------------------------------------------------------------------Durbin-Watson statistic (original) 0.190572 Durbin-Watson statistic (transformed) 2.065375

• Prais-Winsten transformation which includes the first observation: Prais-Winsten AR(1) regression -- iterated estimates Source | SS df MS Number of obs = 254 -------------+-----------------------------F( 1, 252) = 639.29 Model | 6070.42888 1 6070.42888 Prob > F = 0.0000 Residual | 2392.88989 252 9.4955948 R-squared = 0.7173 -------------+-----------------------------Adj R-squared = 0.7161 Total | 8463.31877 253 33.4518529 Root MSE = 3.0815 -----------------------------------------------------------------------------allgradesp~e | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------wprice | .8160378 .0330426 24.70 0.000 .7509629 .8811126 _cons | 66.01763 9.451889 6.98 0.000 47.40287 84.63239 -------------+---------------------------------------------------------------rho | .9819798 207


-----------------------------------------------------------------------------Durbin-Watson statistic (original) 0.190572 Durbin-Watson statistic (transformed) 2.052344

• Notice that both Prais-Winsten results reduce the parameter on WPRICE and the increases the standard error. The t-statistic drops, although the qualitative result doesn’t change. • In both cases, the DW stat on the transformed data is nearly two, indicating zero autocorrelation. • We can try the “large sample fix” by going back to the original model and including the once-lagged dependent variable: . reg allgradesprice wprice l.allgradesprice Source | SS df MS Number of obs = 253 -------------+-----------------------------F( 2, 250) = 8709.80 Model | 294898.633 2 147449.316 Prob > F = 0.0000 Residual | 4232.28346 250 16.9291339 R-squared = 0.9859 -------------+-----------------------------Adj R-squared = 0.9857 Total | 299130.916 252 1187.02744 Root MSE = 4.1145 --------------------------------------------------------------------------allgradesp~e | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+------------------------------------------------------------wprice | .4171651 .0278594 14.97 0.000 .3622 .4720 allgradesp~e | L1 | .6860807 .0224148 30.61 0.000 .6419348 .7302265 _cons | 7.668692 1.1199 6.85 0.000 5.463052 9.874333 . durbina Durbin’s alternative test for autocorrelation --------------------------------------------------------------------------lags(p) | chi2 df Prob > chi2 -------------+------------------------------------------------------------1 | 134.410 1 0.0000 208


--------------------------------------------------------------------------H0: no serial correlation • The large sample fix suggests a smaller parameter estimate on WPRICE, the standard error is larger and the t-statistic is much lower than the original OLS model. . reg allgradesprice wprice l(1/3).allgradesprice Source | SS df MS Number of obs = 251 -------------+-----------------------------F( 4, 246) = 5172.42 Model | 295015.56 4 73753.8899 Prob > F = 0.0000 Residual | 3507.73067 246 14.2590677 R-squared = 0.9882 -------------+-----------------------------Adj R-squared = 0.9881 Total | 298523.29 250 1194.09316 Root MSE = 3.7761 -----------------------------------------------------------------------------allgradesp~e | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------wprice | .375143 .0268634 13.96 0.000 .3222313 .4280547 allgradesp~e L1 | .994551 .0556423 17.87 0.000 .8849549 1.104147 L2 | -.5186459 .0761108 -6.81 0.000 -.668558 -.3687339 L3 | .243287 .0463684 5.25 0.000 .1519572 .3346167 _cons | 6.778409 1.063547 6.37 0.000 4.68359 8.873228 • In this case, we included three lagged values of the dependent variable. Note that they are all significant. If we include four or more lags, the fourth (and higher) lags are insignificant. Notice that the marginal effect of wholesale price on retail price is dampened when we include the lagged values of retail price.

12.8 Example: Presidential approval ratings

• In this example we investigate how various political/macroeconomic variables relate to the percentage of people who answer, “I don’t know” to the Gallup poll question “How is the president doing in his job?” The data are posted at the course website and were borrowed from Christopher Gelpi at Duke University.


• Our first step is to take a crack at the standard OLS model: . reg dontknow newpres unemployment eleyear inflation Source | SS df MS -------------+-----------------------------Model | 1096.28234 4 274.070585 Residual | 1660.23101 167 9.94150307 -------------+-----------------------------Total | 2756.51335 171 16.1199611

Number of obs F( 4, 167) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

172 27.57 0.0000 0.3977 0.3833 3.153

-----------------------------------------------------------------------------dontknow | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------newpres | 4.876033 .6340469 7.69 0.000 3.624252 6.127813 unemployment | -.6698213 .1519708 -4.41 0.000 -.9698529 -.3697897 eleyear | -1.94002 .5602409 -3.46 0.001 -3.046087 -.8339526 inflation | .1646866 .0770819 2.14 0.034 .0125061 .3168672 _cons | 16.12475 .9076888 17.76 0.000 14.33273 17.91678

• Things look pretty good. All variables are statistically significant and take reasonable values and signs. We wonder if there is autocorrelation in the data, as the data are time series. If there is autocorrelation it is possible that the standard errors are biased downwards, the t-stats are biased upwards, and Type I errors are possible (falsely rejecting the null hypothesis). We grab the fitted residuals from the above regression: . predict e1, resid. We then plot the residuals using scatter and tsline (twoway tsline e1 || scatter e1 yearq).
• It's not readily apparent, but the data look to be AR(1) with positive autocorrelation. How do we know? A positive residual tends to be followed by another positive residual and a negative residual tends to be followed by another negative residual.
• We can plot out the partial autocorrelations: . pac e1



• We see that the first lag is the most important, the other lags (4, 27, 32) are also important statistically, but perhaps not economically/politically. • Can we test for AR(1) process in a more statistically valid way? How about the Runs test? Use the STATA command runtest and give the command the error term defined above, e1.

. runtest e1
N(e1 <= -.3066953718662262) = 86
N(e1 >  -.3066953718662262) = 86
obs = 172
N(runs) = 54
z = -5.05
Prob>|z| = 0

• Looks like error terms are not distributed randomly (p-value is small). The threshold(0) option tells STATA to create a new run when e1 crosses zero.

. runtest e1, threshold(0)
N(e1 <= 0) = 97
N(e1 > 0) = 75
obs = 172
N(runs) = 56
z = -4.6
Prob>|z| = 0

• It still looks like the error terms are not distributed randomly. • We move next to the Durbin-Watson test

. dwstat
Durbin-Watson d-statistic(5, 172) = 1.016382


• The results suggest there is positive autocorrelation (DW stat is less than 2). We can test this more directly by regressing the current error term on the previous period’s error term

. reg e1 l.e1 Source | SS df MS -------------+-----------------------------Model | 386.705902 1 386.705902 Residual | 1243.1 169 7.35562129 -------------+-----------------------------Total | 1629.8059 170 9.58709353

Number of obs F( 1, 169) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

171 52.57 0.0000 0.2373 0.2328 2.7121

-----------------------------------------------------------------------------e1 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------e1 | L1 | .4827093 .066574 7.25 0.000 .3512854 .6141331 _cons | -.0343568 .2074016 -0.17 0.869 -.4437884 .3750747

• The l.e1 variable tells STATA to use the once-lagged value of e1. In the results, notice the L1 tag for e1 - the parameter estimate suggests positive autocorrelation with rho close to 0.48. • Just for giggles, we find that 2*(1-rho) is ”close” to the reported DW stat . disp 2*(1-_b[l.e1]) 1.0345815

• We try the AR(1) estimation without the constant term: . reg e1 l.e1,noc Source | SS df MS -------------+-----------------------------212

Number of obs = F( 1, 170) =

171 52.87


Model | 386.680946 1 386.680946 Residual | 1243.30184 170 7.31354026 -------------+-----------------------------Total | 1629.98279 171 9.5320631

Prob > F R-squared Adj R-squared Root MSE

= = = =

0.0000 0.2372 0.2327 2.7044

-----------------------------------------------------------------------------e1 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------e1 | L1 | .4826932 .0663833 7.27 0.000 .3516515 .6137348

• Now, we try the AR(2) estimation to see if there is a second-order process: . reg e1 l.e1 l2.e1,noc Source | SS df MS -------------+-----------------------------Model | 364.907208 2 182.453604 Residual | 1225.9515 168 7.29733038 -------------+-----------------------------Total | 1590.85871 170 9.35799242

Number of obs F( 2, 168) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

170 25.00 0.0000 0.2294 0.2202 2.7014

-----------------------------------------------------------------------------e1 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------e1 | L1 | .4986584 .0766115 6.51 0.000 .3474132 .6499036 L2 | -.0572775 .0759676 -0.75 0.452 -.2072515 .0926966

• It doesn't look like there is an AR(2) process. Let's test for AR(2) with the Breusch-Godfrey test: . reg e1 l.e1 l2.e1 newpres unemployment eleyear inflation Source | SS df MS -------------+-----------------------------Model | 373.257912 6 62.209652 Residual | 1216.78801 163 7.46495711

Number of obs F( 6, 163) Prob > F R-squared

= = = =

170 8.33 0.0000 0.2347


-------------+-----------------------------Total | 1590.04592 169 9.40855575

Adj R-squared = Root MSE =

0.2066 2.7322

-----------------------------------------------------------------------------e1 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------e1 | L1 | .5015542 .0775565 6.47 0.000 .3484092 .6546992 L2 | -.0400182 .0795952 -0.50 0.616 -.1971888 .1171524 newpres | -.5280386 .5741936 -0.92 0.359 -1.661855 .6057781 unemployment | .0092944

.1327569

0.07

eleyear

|

.1199647

.4869477 0.25

inflation

|

.0300826

.0680796

0.944 0.806

0.44

-.2528507 -.8415742

.2714395 1.081504

0.659 -.104349 .1645142

_cons | -.1698233 .7882084 -0.22 0.830 -1.726239 1.386592 -----------------------------------------------------------------------------. test l.e1 l2.e1 ( 1) ( 2)

L.e1 = 0 L2.e1 = 0 F(

2, 163) = Prob > F =

24.80 0.0000

• Notice that we reject the null hypothesis that the once and twice lagged error terms are jointly equal to zero. The t-stat on the twice lagged error term is not different from zero, therefore it looks like the error process is AR(1).
• An AR(1) process has been well confirmed. What do we do to "correct" the original AR(1)-plagued OLS model?
1. We can estimate using the Cochrane-Orcutt approach



. prais dontknow newpres unemployment eleyear inflation, corc Cochrane-Orcutt AR(1) regression -- iterated estimates Source | SS df MS -------------+-----------------------------Model | 715.863671 4 178.965918 Residual | 1230.2446 166 7.41111202 -------------+-----------------------------Total | 1946.10827 170 11.4476957

Number of obs F( 4, 166) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

171 24.15 0.0000 0.3678 0.3526 2.7223

-----------------------------------------------------------------------------dontknow | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------newpres | 5.248822 .7448114 7.05 0.000 3.778298 6.719347 unemployment | -.6211503 .2441047 -2.54 0.012 -1.1031 -.1392004 eleyear | -2.630427 .6519536 -4.03 0.000 -3.917617 -1.343238 inflation | .1577092 .1247614 1.26 0.208 -.0886145 .4040329 _cons | 15.91768 1.474651 10.79 0.000 13.00619 18.82917 -------------+---------------------------------------------------------------rho | .5022053 -----------------------------------------------------------------------------Durbin-Watson statistic (original) 1.016382 Durbin-Watson statistic (transformed) 1.941491

• Now, inflation is insignificant - autocorrelation led to Type I error? Notice the new DW statistic is very close to 2, suggesting no AR(2) process. 2. We can estimate using Prais-Winsten transformation . prais dontknow newpres unemployment eleyear inflation Prais-Winsten AR(1) regression -- iterated estimates Source | SS df MS -------------+-----------------------------Model | 760.152634 4 190.038159 Residual | 1250.04815 167 7.48531828 -------------+-----------------------------215

Number of obs F( 4, 167) Prob > F R-squared Adj R-squared

= = = = =

172 25.39 0.0000 0.3781 0.3633


Total |

2010.20079

171

11.7555602

Root MSE

=

2.7359

-----------------------------------------------------------------------------dontknow | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------newpres | 5.209751 .749466 6.95 0.000 3.730103 6.6894 unemployment | -.5766546 .2457891 -2.35 0.020 -1.061909 -.0914003 eleyear | -2.707991 .6550137 -4.13 0.000 -4.001165 -1.414816 inflation | .1100198 .1229761 0.89 0.372 -.1327684 .3528079 _cons | 15.98344 1.493388 10.70 0.000 13.03509 18.9318 -------------+---------------------------------------------------------------rho | .5069912 -----------------------------------------------------------------------------Durbin-Watson statistic (original) 1.016382 Durbin-Watson statistic (transformed) 1.918532

• Once again, the inflation variable is not significant. 3. We could use Newey-West standard errors (similar to White, 1980) . newey dontknow newpres unemployment eleyear inflation, lag(1) Regression with Newey-West standard errors F( 4, 167) = 11.66

Number of obs

=

Prob > F

=

172 maximum 0.0000

-----------------------------------------------------------------------------| Newey-West dontknow | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------newpres | 4.876033 1.06021 4.60 0.000 2.78289 6.969175 unemployment | -.6698213 .1502249 -4.46 0.000 -.9664059 -.3732366 eleyear | -1.94002 .5212213 -3.72 0.000 -2.969052 -.9109878 inflation | .1646866 .083663 1.97 0.051 -.0004868 .3298601 _cons | 16.12475 .9251126 17.43 0.000 14.29833 17.95117

• Here, inflation is still significant and positive. 4. We could use REG with robust standard errors: 216


. reg dontknow newpres unemployment eleyear inflation, r Regression with robust standard errors

Number of obs F( 4, 167) Prob > F R-squared Root MSE

= = = = =

172 16.01 0.0000 0.3977 3.153

-----------------------------------------------------------------------------| Robust dontknow | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------newpres | 4.876033 .9489814 5.14 0.000 3.002486 6.749579 unemployment | -.6698213 .1257882 -5.32 0.000 -.9181613 -.4214812 eleyear | -1.94002 .419051 -4.63 0.000 -2.76734 -1.1127 inflation | .1646866 .0702247 2.35 0.020 .0260441 .3033292 _cons | 16.12475 .7825265 20.61 0.000 14.57983 17.66967

• Here, inflation is still significant, although the standard error of inflation is a bit smaller than with Newey-West standard errors.
• Which to use? It depends.
1. The Cochrane-Orcutt approach assumes a constant rho over the entire sample period, transforms the data, and drops the first observation.
2. The Prais-Winsten approach transforms the data (essentially a weighted least squares approach) and alters both the parameter estimates and their standard errors.
3. Newey-West standard errors do not adjust the parameter estimates but do alter the standard errors; however, NW requires an accurate number of lags to be specified (although here that doesn't seem to be a problem).
4. Robust standard errors are perhaps the most flexible option - the correction might allow for heteroscedasticity as well as autocorrelation, something that NW and other approaches do not allow. However, the robust White/Sandwich standard errors are not guaranteed to accurately control for the first-order autocorrelation.
5. In this case, I would lean towards the Newey-West standard errors.



13 Stochastic Regressors

• One of the full ideal conditions is that X is fixed in repeated samples. What if X is a random variable?
• If X is random then we can lose unbiasedness, consistency, and efficiency.
• However, it is not always the case that this will be true.
• Assume y = Xβ + ε satisfies the FIC except that X is stochastic, i.e., X_i = x_i + w_i.
• If X is independent of ε then β̂ and s² retain all their desirable properties. All standard tests are available and accurate.
• If X and ε are contemporaneously uncorrelated then all small sample properties are lost. However, asymptotic properties will still hold and asymptotic tests are still valid. X and ε are contemporaneously uncorrelated if

(i) cov(X_{ik}, \epsilon_i) = 0
(ii) plim \frac{1}{N}\sum_{i=1}^{N} X_{ik}\epsilon_i = 0

• Theorem: If X is a random variable and contemporaneously uncorrelated with ε, then β̂_OLS is biased but remains consistent.
• Proof: β̂ = β + (X'X)^{-1}X'ε. Therefore,

E[\hat{\beta}] = \beta + E_X\left(E_{\epsilon|X}\left[(X'X)^{-1}X'\epsilon\right]\right) \neq \beta

because E[ε_i|X_i] = 0 but E[ε|X] ≠ 0.



• Examples of this include models with lagged dependent variables on the right-hand side.
• If X and ε are contemporaneously correlated, all small and large sample properties are lost. At this point we need to find a general fix. X and ε are contemporaneously correlated if

(i) cov(X_{ik}, \epsilon_i) \neq 0
(ii) plim \frac{1}{N}\sum_{i=1}^{N} X_{ik}\epsilon_i \neq 0

13.1 Instrumental Variables

• A general fix we can employ is known as Instrumental Variables.
• This approach is useful in models that suffer from unobservable or imperfectly measured variables. Examples would include ability, utility, ambition, or political attitudes.
• A good example is Friedman's Permanent Income Hypothesis:
– Assume that permanent consumption is unobserved but is proportional to permanent income, i.e., C_p = βY_p.
– Actual observed income is a sum of permanent income and transitory income, i.e., Y = Y_p + Y_t.
– Actual consumption is a combination of permanent consumption and transitory consumption, i.e., C = C_p + C_t.
– So we have possible problems here because we cannot observe or measure some of the variables.
– Note: Unobservable variables are a distinctly different problem from measurement error or errors in variables.
– A solution to the problem of unmeasurable variables is known as Instrumental Variables.
– If we know that X is the real variable of interest, but we measure X* = X + u where cov(X, ε) = 0 but cov(X*, ε) ≠ 0, then we look for another variable Z such that cov(Z, X*) ≠ 0 but cov(Z, ε) = 0.
– If such a variable Z exists then we have a shot at getting back to the full ideal conditions.
• So, the upshot of instrumental variables is to find a set of variables Z that are correlated with X but not with ε.
• Consider the normal equation from the first-order necessary condition for minimizing the SSE:

X'y = X'X\beta + X'\epsilon

• Now, divide both sides by N and take the plim of both sides to obtain

plim\left(\frac{X'y}{N}\right) = plim\left(\frac{X'X}{N}\right)\beta + plim\left(\frac{X'\epsilon}{N}\right)


Let

Q_{xy} = plim\left(\frac{X'y}{N}\right) \qquad \text{and} \qquad Q_{xx} = plim\left(\frac{X'X}{N}\right)

then β = Q_{xx}^{-1}Q_{xy}.
• But, if plim(X'ε)/N ≠ 0 then β̂ is not consistent.
• If we have N observations on k variables in Z which are uncorrelated with ε, i.e.,

plim\left(\frac{Z'\epsilon}{N}\right) = 0

but are correlated with X such that

plim\left(\frac{Z'X}{N}\right) = Q_{zx} \neq 0

then we can transform the regression model to obtain

Z'y = Z'X\beta + Z'\epsilon

• If we divide both sides by N and take plims of both sides then we obtain

plim\left(\frac{Z'y}{N}\right) = plim\left(\frac{Z'X}{N}\right)\beta + plim\left(\frac{Z'\epsilon}{N}\right)
Q_{zy} = Q_{zx}\beta + 0
\beta = Q_{zx}^{-1}Q_{zy}


• How is this the case? Note that β̂_IV can be derived as

\hat{\beta}_{IV} = \left[(Z'X)'(Z'X)\right]^{-1}(Z'X)'(Z'y)

but we know that Z = [N × k] and X = [N × k], such that Z'X = [k × k].
• Therefore we can rewrite β̂_IV as

\hat{\beta}_{IV} = (Z'X)^{-1}(X'Z)^{-1}(Z'X)'(Z'y) = (Z'X)^{-1}Z'y

• Theorem: β̂_IV is consistent when β̂_IV = (Z'X)^{-1}Z'y.
• Proof:

plim\,\hat{\beta}_{IV} = plim\left(\frac{Z'X}{N}\right)^{-1}\left(\frac{Z'y}{N}\right)
 = plim\left(\frac{Z'X}{N}\right)^{-1}\left(\frac{Z'X\beta}{N} + \frac{Z'\epsilon}{N}\right)
 = \beta + plim\left(\frac{Z'X}{N}\right)^{-1} plim\left(\frac{Z'\epsilon}{N}\right)
 = \beta + Q_{zx}^{-1}\cdot 0 = \beta

• Definition: Z is an instrumental variable matrix for X if

(i) plim\left(\frac{Z'X}{N}\right) = Q_{zx} is finite and non-singular
(ii) \frac{1}{\sqrt{N}}Z'\epsilon \xrightarrow{d} N(0, \psi) asymptotically



• Theorem: If Z is an IV matrix for X, then the IV estimator β̂_IV = (Z'X)^{-1}Z'y is consistent and the asymptotic distribution of \sqrt{N}(\hat{\beta}_{IV} - \beta) is N(0, Q_{zx}^{-1}\psi Q_{zx}^{-1}).
• Proof: We know that β̂_IV is consistent from our previous work. Note:

\sqrt{N}(\hat{\beta}_{IV} - \beta) = \left(\frac{1}{N}Z'X\right)^{-1}\left(\frac{Z'\epsilon}{\sqrt{N}}\right)

which is asymptotically distributed as Q_{zx}^{-1}\left(\frac{Z'\epsilon}{\sqrt{N}}\right), and

\frac{Z'\epsilon}{\sqrt{N}} \xrightarrow{d} N(0, \psi)

Thus, we know that

E\left[Q_{zx}^{-1}\left(\frac{Z'\epsilon}{\sqrt{N}}\right)\right] = 0

and

E\left[\left(Q_{zx}^{-1}\frac{Z'\epsilon}{\sqrt{N}}\right)\left(Q_{zx}^{-1}\frac{Z'\epsilon}{\sqrt{N}}\right)'\right] = Q_{zx}^{-1}\psi Q_{zx}^{-1}

This implies that

\sqrt{N}(\hat{\beta}_{IV} - \beta) \sim N(0, Q_{zx}^{-1}\psi Q_{zx}^{-1})

• If E[εε'] = σ²Ω we can move to a GLS procedure, i.e., Z'y = Z'Xβ + Z'ε, where var(Z'ε) = Z'σ²ΩZ = σ²Z'ΩZ.
• Note the dimensions: Z = [N × k], y = [N × 1], X = [N × k], ε = [N × 1], so that

Z'X = [k × N][N × k] = [k × k]
Z'y = [k × N][N × 1] = [k × 1]
Z'\epsilon = [k × N][N × 1] = [k × 1]

• So the transformed model has only k observations, but we have k parameters to estimate. Thus we have exhausted all our degrees of freedom and we have yet to address the problems in Z'ε.
• Thus, the application of GLS, in this case, ignores the error structure such that

\tilde{\beta} = \left[(Z'X)'(\sigma^2 Z'\Omega Z)^{-1}(Z'X)\right]^{-1}(Z'X)'(\sigma^2 Z'\Omega Z)^{-1}Z'y

• Note that Z = [N × k] and Ω = [N × N], so Z'ΩZ = [k × N][N × N][N × k] = [k × k].
• This implies that

\tilde{\beta} = (Z'X)^{-1}(\sigma^2 Z'\Omega Z)(X'Z)^{-1}(X'Z)(\sigma^2 Z'\Omega Z)^{-1}Z'y = (Z'X)^{-1}Z'y = \hat{\beta}_{IV}

• Note: β̂_IV is the same whether var(ε) = σ²I or σ²Ω, but the asymptotic variances are different, as shown above.


• When

var(\epsilon) = \sigma^2 I \;\Rightarrow\; \psi = \sigma^2\, plim\left(\frac{Z'Z}{N}\right) \qquad \text{and} \qquad var(\epsilon) = \sigma^2\Omega \;\Rightarrow\; \psi = \sigma^2\, plim\left(\frac{Z'\Omega Z}{N}\right)

13.2 How to test for randomness in X?

• Assume that y_i = β_0 + β_1 X_i + ε_i, but we think that X_i is correlated with ε.
• Then we have a null hypothesis of H0: plim \frac{1}{N}\sum (X_i - \bar{X})\epsilon_i = 0.
• We can use a variant of Hausman's (1978) Specification Test, which tests whether β̂ is significantly different from β̂_IV. If the null is true then both are consistent estimators of β, but β̂ is efficient relative to β̂_IV via the Gauss-Markov Theorem. If the alternative is true, then β̂_IV is consistent while β̂ is not.
• We can test with

m = \frac{(\hat{\beta}_{i,IV} - \hat{\beta}_i)^2}{\widehat{var}(\hat{\beta}_{i,IV}) - \widehat{var}(\hat{\beta}_i)}

where β̂_i is the parameter under suspicion.
• We reject the null if m > χ²_{(1)} = 3.84 (the 5% critical value).
• For more than one parameter under suspicion, we generalize to

\hat{\beta} = (X'X)^{-1}X'y \quad \text{with} \quad cov(\hat{\beta}) = \sigma^2(X'X)^{-1}
\hat{\beta}_{IV} = (Z'X)^{-1}Z'y \quad \text{with} \quad cov(\hat{\beta}_{IV}) = \sigma^2\left(X'Z(Z'Z)^{-1}Z'X\right)^{-1}

• Then the test statistic becomes

m = [\hat{\beta}_{IV} - \hat{\beta}]'\left[\widehat{cov}(\hat{\beta}_{IV}) - \widehat{cov}(\hat{\beta})\right]^{-1}[\hat{\beta}_{IV} - \hat{\beta}]

where m ∼ χ²_J and J is the number of rhs variables.

13.3 Example: Wages for Married Women

• This example is borrowed from Chapter 15 of Woolridge’s book. • The data describe the wages of 428 women and are in file fmwww.bc.edu/ec-p/data/ wooldridge/MROZ . use http://fmwww.bc.edu/ec-p/data/wooldridge/MROZ . reg lwage educ Source | SS df MS -------------+-----------------------------Model | 26.3264237 1 26.3264237 Residual | 197.001028 426 .462443727 -------------+-----------------------------Total | 223.327451 427 .523015108

Number of obs F( 1, 426) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

428 56.93 0.0000 0.1179 0.1158 .68003

-----------------------------------------------------------------------------lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------educ | .1086487 .0143998 7.55 0.000 .0803451 .1369523 _cons | -.1851969 .1852259 -1.00 0.318 -.5492674 .1788735 ------------------------------------------------------------------------------

• In this model, each year of education correlates with a 10% increase in wages, each year of experience corresponds with a 4% increase in wages, wages increase at a decreasing rate.


• However, we might think education is a stochastic regressor. The data set has the level of education for father. • We can use father’s education as an instrument for education using the ivreg command. • In the ivreg command, we indicate the stochastic variable and it’s instrument with a pair of parentheses. ivreg lwage (educ = fatheduc ) Instrumental variables (2SLS) regression Source | SS df MS -------------+-----------------------------Model | 20.8673618 1 20.8673618 Residual | 202.460089 426 .475258426 -------------+-----------------------------Total | 223.327451 427 .523015108

Number of obs F( 1, 426) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

428 2.84 0.0929 0.0934 0.0913 .68939

-----------------------------------------------------------------------------lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------educ | .0591735 .0351418 1.68 0.093 -.0098994 .1282463 _cons | .4411035 .4461018 0.99 0.323 -.4357311 1.317938 -----------------------------------------------------------------------------Instrumented: educ Instruments: fatheduc -----------------------------------------------------------------------------• Notice that the parameter on education has fallen from 0.108 to 0.059, a decline of forty-five percent!!

13.4 Example: Cigarette smoking and birth weight

• This example was also borrowed from Wooldridge, Chapter 15.
• It is hypothesized that smoking cigarettes during pregnancy is correlated with reduced birth weight, which can have other negative ramifications for the health of the baby.


• The data set fmwww.bc.edu/ec-p/data/wooldridge/BWGHT contains data that can be used to test this hypothesis. • First, we estimate a standard OLS model: . reg lbwght packs Source | SS df MS -------------+-----------------------------Model | .997779279 1 .997779279 Residual | 49.4225453 1386 .035658402 -------------+-----------------------------Total | 50.4203246 1387 .036352073

Number of obs F( 1, 1386) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

1388 27.98 0.0000 0.0198 0.0191 .18883

-----------------------------------------------------------------------------lbwght | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------packs | -.089813 .0169786 -5.29 0.000 -.1231196 -.0565064 _cons | 4.769404 .0053694 888.26 0.000 4.758871 4.779937 ------------------------------------------------------------------------------

• The results suggest that for every pack-per-day smoked, birth weight falls by approximately 9%. • There is some concern that variables such as family income, the price of cigarettes, and perhaps the mother’s family environment might influence both the number of cigarettes smoked and the birth weight of the baby. • One approach is to use the price of cigarettes as an instrument for cigarettes smoked: • This can be done using the ivreg command. The “first” option tells Stata that you want it to report the “first-stage” regression, which is a quick and dirty way to tell if your chosen instrument has a strong correlation with the endogenous variable (remember that is one of the conditions required for a variable to be a valid instrument). 229


. ivreg lbwght (packs = cigprice ), first First-stage regressions ----------------------Source | SS df MS -------------+-----------------------------Model | .011648626 1 .011648626 Residual | 123.684481 1386 .089238442 -------------+-----------------------------Total | 123.696129 1387 .089182501

Number of obs F( 1, 1386) Prob > F R-squared Adj R-squared Root MSE

= 1388 = 0.13 = 0.7179 = 0.0001 = -0.0006 = .29873

-----------------------------------------------------------------------------packs | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------cigprice | .0002829 .000783 0.36 0.718 -.0012531 .0018188 _cons | .0674257 .1025384 0.66 0.511 -.1337215 .2685728 ------------------------------------------------------------------------------

Instrumental variables (2SLS) regression Source | SS df MS -------------+-----------------------------Model | -1171.28083 1 -1171.28083 Residual | 1221.70115 1386 .881458263 -------------+-----------------------------Total | 50.4203246 1387 .036352073

Number of obs F( 1, 1386) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

1388 0.12 0.7312 . . .93886

-----------------------------------------------------------------------------lbwght | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------packs | 2.988674 8.698884 0.34 0.731 -14.07573 20.05307 _cons | 4.448137 .9081547 4.90 0.000 2.66663 6.229643 -----------------------------------------------------------------------------Instrumented: packs Instruments: cigprice -----------------------------------------------------------------------------• Now the number of packs of cigarettes smoked has not statistically significant relationship with the log of birth weight. 230


• What happened? The bias in the original OLS estimates might have been sufficient to flip the sign and to cause a Type I error in our inference. • However, it is also possible that we don’t have the correct instrument. As we can see from the first-stage regression, there isn’t a strong relationship between cigarette prices and the packs of cigarettes smoked. Therefore, the instrumental variables estimator is not providing a lot of information on which to estimate βpacks . • Perhaps we can use another instrument. The data set contains the mother’s and father’s education. Perhaps using father’s education as an instrument for smoking might be equally valid as the price of cigarettes? .. ivreg lbwght (packs=fatheduc) ,first First-stage regressions ----------------------Source | SS df MS -------------+-----------------------------Model | 2.72968287 1 2.72968287 Residual | 82.2318924 1190 .069102431 -------------+-----------------------------Total | 84.9615752 1191 .071336335

Number of obs F( 1, 1190) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

1192 39.50 0.0000 0.0321 0.0313 .26287

-----------------------------------------------------------------------------packs | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------fatheduc | -.0174342 .0027739 -6.29 0.000 -.0228765 -.0119919 _cons | .3182725 .0373615 8.52 0.000 .2449707 .3915743 ------------------------------------------------------------------------------

Instrumental variables (2SLS) regression Source | SS df MS -------------+-----------------------------Model | -1.94152266 1 -1.94152266 231

Number of obs = F( 1, 1190) = Prob > F =

1192 6.34 0.0120


Residual | 44.0155526 1190 .036987859 -------------+-----------------------------Total | 42.07403 1191 .035326641

R-squared = Adj R-squared = Root MSE =

. . .19232

-----------------------------------------------------------------------------lbwght | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------packs | -.2930019 .1164055 -2.52 0.012 -.5213847 -.064619 _cons | 4.793352 .0116993 409.71 0.000 4.770399 4.816306 -----------------------------------------------------------------------------Instrumented: packs Instruments: fatheduc ------------------------------------------------------------------------------

• In the first stage regression we see a much stronger, and intuitively appealing, relationship between the instrument and the suspected stochastic regressor. We have also obtained a negative and statistically significant relationship in the instrumental variables estimation. • However, the estimated parameter on packs smoked is now -0.29, suggesting that every pack smoked per day reduces birth weight by 29%, which is much greater than the original estimate and perhaps more sensible than the result obtained using the price of cigarettes as the instrumental variable. • One problem with the current model is that there is no way to test whether the instrumental variable we have selected is actually valid. The only rough test we have is that obtained in the first-stage regression. We will address this in the last section of the class. • The last thing considered in this example is whether there is a problem with the variable “packs.” We can do this with the hausman name1 name2 command. • The Hausman command calculates the Hausman test statistic and provides a p-value for the test. The null hypothesis is that there is no problem with the suspect variables, 232


i.e. OLS is both consistent and efficient while instrumental variables is only consistent. However, if there is a problem with the suspect variables, OLS is no longer consistent but instrumental variables will be consistent. • From the Stata help files: To use hausman, one has to perform the following steps. 1. obtain an estimator that is consistent whether or not the hypothesis is true; 2. store the estimation results under a name-consistent using estimates store; 3. obtain an estimator that is efficient (and consistent) under the hypothesis that you are testing, but inconsistent otherwise; 4. store the estimation results under a name-efficient using estimates store; 5. use hausman to perform the test hausman name-consistent name-efficient • In other words, store instrumental variables results to “name-consistent” and store OLS results to “name-efficient”. . quiet reg lbwght packs . estimates store bols . quiet ivreg lbwght (packs=fatheduc) . estimates store biv . hausman biv bols ---- Coefficients ---| (b) (B) (b-B) sqrt(diag(V_b-V_B)) | biv bols Difference S.E. -------------+---------------------------------------------------------------packs | -.2930019 -.089813 -.2031889 .1151606 -----------------------------------------------------------------------------b = consistent under Ho and Ha; obtained from ivreg B = inconsistent under Ha, efficient under Ho; obtained from ivreg



Test:

Ho:

difference in coefficients not systematic chi2(1) = (b-B)’[(V_b-V_B)^(-1)](b-B) = 3.11 Prob>chi2 = 0.0777

. disp (-0.293+0.089)^2/(0.116*0.116 + 0.0169*0.0169) 3.0284661

• In this case the Hausman test suggests that there is a potential problem with the variable packs. We can reject the null hypothesis at the 8% level, which given the evidence from the regression models that the potential bias (if it exists) is more than two times the original parameter estimate from OLS, I might err on the side of caution and use the instrumental variables estimates.



14 Seemingly Unrelated Regressions

14.1 Several Examples

• At times we have several equations that may be estimated simultaneously. We start out with systems of equations that are related only through the error structure then we move on to more sophisticated models. • Some examples of systems of stochastic equations: 1. Grunfeld-Boot Investment Model Assume that Iit = β0 + β1 Fit + β2 Cit + ²it where i = 1, . . . , m is the number of different firms in the sample and t = 1, . . . , T are the time-series observations within each firm. 2. Note: For future reference we can write out the system of equations as

I1t = β0 + β1 F1t + β2 C1t + ²1t I2t = β0 + β1 F2t + β2 C2t + ²2t I3t = β0 + β1 F3t + β2 C3t + ²3t

Here, I is the level of gross investment, F is the market value of the firm in the previous year and C is the value of the stock of plants and equipment. It is thought that F and I reflect (anticipated) profit and the expected amount of investment required. There are some factors that are in common with all the ²it such as the general health of the economy and other industry specific factors.



3. CAPM Model Here, this famous (but relatively naive) model relates the return of a particular financial asset to the return in the market via

rit − rf t = α + βi (rmt − rf t ) = ²it

where rit is the return on a particular security, rf t is the risk free return, rmt is the market return and βi is the security’s risk measure. Obviously the ²it are correlated across different securities. 4. Output Production Function can be shown to be related to factor demands, i.e.,

X_m = f_m(Y, p) \quad \text{where } f_m = \frac{\partial f}{\partial X_m}

or in a stochastic framework

X_m = f_m(Y, p; \theta) + \epsilon_m

Here the disturbances are related, also there are common parameters to each equation. Thus, they should be estimated together.

14.2 The SUR approach

• All of the examples listed in the previous section are grouped in what is called Seemingly Unrelated Regressions.
• The Seemingly Unrelated nature of the equations is indicative that the connection between the different equations in the system will be in the error structure.


• These models were initially proposed by Zellner (1962). • Consider:

y 1 = x1 β 1 + ² 1 y 2 = x2 β 2 + ² 2

where ε_1 and ε_2 are related somehow.
• Other examples include: competitiveness of two "firms" in two types of industries (Depken (1999)), investment equations of two firms in a market (Zellner (1962)).
• The covariance matrix of this system is now

E[\epsilon\epsilon'] = E\left[\begin{pmatrix}\epsilon_1\\ \epsilon_2\end{pmatrix}\begin{pmatrix}\epsilon_1' & \epsilon_2'\end{pmatrix}\right] = E\begin{bmatrix}\epsilon_1\epsilon_1' & \epsilon_1\epsilon_2' \\ \epsilon_2\epsilon_1' & \epsilon_2\epsilon_2'\end{bmatrix} = \Omega

• Note: ε_1 = [N × 1], ε_2 = [N × 1], ε_1ε_1' = [N × N], ε_2ε_2' = [N × N], ⇒ Ω = [JN × JN], where J is the number of equations in the system.
• Thus, the covariance matrix becomes

\Omega = \begin{bmatrix}\sigma_{11}I_N & \sigma_{12}I_N \\ \sigma_{21}I_N & \sigma_{22}I_N\end{bmatrix}

where σ_{ij} is the covariance between the ith and jth equations.


• Note: σ21 = σ12 may equal zero, i.e., the two equations are not related at all. However, if σ12 6= 0 then the two equations are related somehow. We seek to take advantage of this fact. • We can generalize the two equation system to M equations, i.e.,

y1 = X1 β1 + ²1 y2 = X2 β2 + ²2 .. . = yM = XM βM + ²M

• Assume that each equation has N observations and k regressors.
• Then the system can be written in stacked form as

y_* = \begin{bmatrix}y_1\\ y_2\\ \vdots\\ y_M\end{bmatrix} \quad X_* = \begin{bmatrix}X_1 & 0 & \cdots & 0\\ 0 & X_2 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & \cdots & 0 & X_M\end{bmatrix} \quad \beta_* = \begin{bmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_M\end{bmatrix} \quad \epsilon_* = \begin{bmatrix}\epsilon_1\\ \epsilon_2\\ \vdots\\ \epsilon_M\end{bmatrix}

• This can be written as y_* = X_*\beta_* + \epsilon_*.


• The standard full ideal conditions for SUR models:
1. E[ε_*] = 0
2. E[ε_*ε_*'] = Ω where Ω = Σ ⊗ I_N and Σ = [σ_{ij}] for all i, j = 1, ..., M
3. X_* is nonstochastic, (X_*'Ω^{-1}X_*)^{-1} is nonsingular, and

plim\left(\frac{X_*'\Omega^{-1}X_*}{N}\right)^{-1} = Q_* \text{ is finite and nonsingular}

4. X_i is [N × k] for all i = 1, ..., M
• Note: The covariance matrix Ω is written as

\Omega = \Sigma \otimes I_N = \begin{bmatrix}\sigma_{11}I_N & \sigma_{12}I_N & \cdots & \sigma_{1M}I_N\\ \vdots & & \ddots & \vdots\\ \sigma_{M1}I_N & \cdots & & \sigma_{MM}I_N\end{bmatrix}
\qquad \text{and} \qquad
\Sigma = \begin{bmatrix}\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1M}\\ \vdots & & \ddots & \vdots\\ \sigma_{M1} & \cdots & & \sigma_{MM}\end{bmatrix}

• Theorem: Under the standard conditions for SUR models, the BLUE for β∗ is β˜∗ = (X∗0 Ω−1 X∗ )−1 X∗0 Ω−1 y∗ if Ω is known



and cov(β̃_*) = (X_*'Ω^{-1}X_*)^{-1}.
• Proof: The proof follows from our previous work with Generalized Least Squares.
• Note: β̃_* is more efficient than β̂_* from OLS, where

\hat{\beta}_* = (X_*'X_*)^{-1}X_*'y_* = \begin{bmatrix}X_1'X_1 & 0 & \cdots & 0\\ 0 & X_2'X_2 & \cdots & 0\\ \vdots & & \ddots & \vdots\\ 0 & \cdots & 0 & X_M'X_M\end{bmatrix}^{-1}\begin{bmatrix}X_1'y_1\\ X_2'y_2\\ \vdots\\ X_M'y_M\end{bmatrix} = \begin{bmatrix}\hat{\beta}_1\\ \hat{\beta}_2\\ \vdots\\ \hat{\beta}_M\end{bmatrix}

• Theorem: If σ_{ij} = 0 for all i ≠ j, then β̃_* = β̂_* and OLS is fully efficient.
• Proof: Consider β̃_* = (X_*'Ω^{-1}X_*)^{-1}X_*'Ω^{-1}y_* and let σ^{ij} denote the (i, j) element of Σ^{-1}, so that

\tilde{\beta}_* = \begin{bmatrix}\sigma^{11}X_1'X_1 & \sigma^{12}X_1'X_2 & \cdots & \sigma^{1M}X_1'X_M\\ \vdots & & \ddots & \vdots\\ \sigma^{M1}X_M'X_1 & \cdots & & \sigma^{MM}X_M'X_M\end{bmatrix}^{-1}\begin{bmatrix}\sum_{j=1}^{M}\sigma^{1j}X_1'y_j\\ \vdots\\ \sum_{j=1}^{M}\sigma^{Mj}X_M'y_j\end{bmatrix}

• If σ^{ij} = 0 for all i ≠ j, then

\tilde{\beta}_* = \begin{bmatrix}(X_1'X_1)^{-1}X_1'y_1\\ \vdots\\ (X_M'X_M)^{-1}X_M'y_M\end{bmatrix} = \begin{bmatrix}\hat{\beta}_1\\ \vdots\\ \hat{\beta}_M\end{bmatrix}

• Note: The M equations are linked only through their error terms, thus the SUR terminology. • Theorem: If X1 = X2 = · · · = XM then β˜∗ = βˆ∗ . • Proof: Homework question. • Note: X1 = X2 does not necessarily have to imply that X1 ≡ X2 . • Some upshots: 1. If σij = 0 then there is no payoff to the GLS procedure. 2. If Xi ∈ Xj then there is no payoff to the GLS procedure. 3. In general as the correlation between ²i and ²j increases the more efficiency gain there is in GLS. 4. The less the correlation between Xi and Xj then the greater the gain in GLS. • If Ω is unknown then we have to move to an FGLS procedure. • Fortunately, Zellner (1962) gives us a solution of

s_{ij} = \frac{1}{N}\hat{\epsilon}_i'\hat{\epsilon}_j \quad \text{for all } i, j

where ε̂_i comes from equation-by-equation OLS. Then

\hat{\Sigma} = \begin{bmatrix}s_{11} & \cdots & s_{1M}\\ \vdots & \ddots & \vdots\\ s_{M1} & \cdots & s_{MM}\end{bmatrix} \qquad \text{and} \qquad \hat{\Omega} = \hat{\Sigma} \otimes I

then

\hat{\tilde{\beta}}_* = (X_*'\hat{\Omega}^{-1}X_*)^{-1}X_*'\hat{\Omega}^{-1}y_*

• Under the standard conditions for SUR models:
1. β̂̃_* is consistent and normally distributed.
2. If ε_* is normal, then β̂̃_* is asymptotically efficient.
3. In small samples we have unbiasedness (Kakwani (1967)); however, the other properties are unknown.
sM 1 . . sM M then b −1 X∗ )−1 X 0 Ω b −1 y∗ βˆ˜∗ = (X∗0 Ω ∗ • Under the standard conditions for SUR models, 1. βˆ˜∗ is consistent and normally distributed. 2. If ²∗ is normal, then βˆ˜∗ is asymptotically efficient. 3. In small samples we have unbiasedness (Kakwani (1967)). However the other properties are unknown.

14.3 Goodness of Fit and Hypothesis Testing

• How to measure goodness of fit in an SUR model? The traditional R² doesn't make much sense.
• An alternative R² is

R_*^2 = \frac{\hat{\epsilon}_*'\hat{\Omega}^{-1}\hat{\epsilon}_*}{\sum_{i=1}^{M}\sum_{j=1}^{M}\hat{\sigma}^{ij}\left[\sum_t (y_{it})^2\right]}

where ε̂_* comes from OLS.
• Problems with this measure: it distorts the variance across equations, and it is still tough to interpret the goodness of fit on a system level.


• If y_* = X_*β_* + ε_*, we can test whether β_*' = [β_1 β_2] is characterized by β_1 = β_2, i.e., whether there is a common parameter vector across all equations.
• Some models may contain equations that are related enough so that a common parameter vector may be reasonable.
• If β_1 = β_2 = γ, then the number of parameters to be estimated is reduced and we still have 2N observations.
• We could test this with the Chow test from earlier.
• Otherwise, we can test the linear restriction β_1 = β_2, or β_1 − β_2 = 0.
• Recall: Testing linear restrictions calls for Rβ = r, where in this case

R = \begin{bmatrix}I & -I\end{bmatrix}, \quad \beta = \begin{bmatrix}\beta_1\\ \beta_2\end{bmatrix}, \quad r = 0 \qquad \text{so that} \qquad R\beta = r \;\Rightarrow\; \beta_1 - \beta_2 = 0

• We can estimate the restricted model

y = \begin{bmatrix}y_1\\ y_2\end{bmatrix} = \begin{bmatrix}X_1\\ X_2\end{bmatrix}\gamma + \begin{bmatrix}\epsilon_1\\ \epsilon_2\end{bmatrix} = Z\gamma + \epsilon

• From this formulation we obtain

\hat{\gamma} = (Z'Z)^{-1}Z'y, \qquad \tilde{\gamma} = (Z'\Omega^{-1}Z)^{-1}Z'\Omega^{-1}y, \qquad \hat{\tilde{\gamma}} = (Z'\hat{\Omega}^{-1}Z)^{-1}Z'\hat{\Omega}^{-1}y

• To test the restrictions, we proceed as we did before, i.e.,

q = \frac{(R\hat{\tilde{\beta}})'\left[R(X'\hat{\Omega}^{-1}X)^{-1}R'\right]^{-1}(R\hat{\tilde{\beta}})}{J} \sim F_{[J,\,MN-k]}

where J is the number of restrictions.
• An alternative test statistic is

Jq = (R\hat{\tilde{\beta}})'\left[R(X'\hat{\Omega}^{-1}X)^{-1}R'\right]^{-1}(R\hat{\tilde{\beta}}) \sim \chi^2_J

A minimal Stata sketch of this kind of cross-equation test follows.
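• After sureg, such a cross-equation restriction can be examined with Stata's test command; a minimal sketch with hypothetical names (equations are referred to by their dependent variables):

sureg (y1 x1 x2) (y2 x1 x2)
test [y1]x1 = [y2]x1, notest
test [y1]x2 = [y2]x2, accumulate

The accumulated test is the Wald (chi-squared) version of the restriction that the slope vectors are equal across the two equations.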

• Note: The former statistic may be better in small samples because the latter depends upon asymptotic theory.
• How do we determine if we have contemporaneous correlation?
• Contemporaneous correlation is possible. We usually assume away correlation across equations through time, though this possibility can be modeled.

244


• Recall that

 σ11 I σ12 I  Ω=Σ⊗I =  σ21 I σ22 I • Then we can use σ ˆij =

(yi − Xi βˆi )0 (yj − Xj βˆj ) N

where βˆi and βˆj are the OLS βˆ from each separate equation. • Note: We use N instead of N − k because k may be ambiguous if the k are not equal across all equations. As N → ∞ then the difference becomes negligible. • Thus, σ ˆij are biased but still consistent. • This implies

 ²ˆ01 ²ˆ1 IN

²ˆ01 ²ˆ2 IN

σ ˆ11 I σ ˆ12 I  1   b= Ω =    N ²ˆ0 ²ˆ I 0 σ ˆ21 I σ ˆ22 I ² ˆ ² ˆ I 2 1 N 2 2 N b is a diagonal matrix, i.e., all of this is worthless we look at H0 : σ12 = 0 • To test if Ω which is tested as N r2 ∼ χ21 where r2 =

2 σ ˆ12 2 2 [ˆ σ11 ×σ ˆ22 ]

• If we reject the null, then σ12 6= 0 and our specification of the covariance matrix is appropriate. • A generalized test has been developed by Breush and Pagan (1980) based upon the Lagrange Multiplier Test, i.e.,

λLM = N

M M −1 X X i=1 j=1

245

2 rij ∼ χ2M (M −1) 2


where σ ˆij2 r = 2 2 [ˆ σii × σ ˆjj ] 2

• Note: The SUR model is not applicable in a stochastic regressors environment. To accommodate this problem, we have to move to a more sophisticated methodology.

14.4

Unbalanced SUR

• What if we have an unequal number of observations in the different equations? • Then we have 



 y1   X1 0   β1   ²1  y∗ = X∗ β∗ + ²∗ ⇒   +  = ²2 β2 0 X2 y2 where equation 1 has N observations and equation 2 has N + n observations. • Now we see that

y∗ = [2N + n × 1] x∗ = [2N + n × k1 + k2 ] β∗ = [k1 + k2 × 1] ²∗ = [2N + n × 1]    σ11 σ12  and Σ =   σ21 σ22 but Ω is now different.

246


• Ω is now written as 

0  σ11 IN σ21 IN  Ω= 0  σ21 IN σ22 IN  0 0 σ22 In

   6= Σ ⊗ In  

• Ω is now a partitioned matrix which combines two sub-matrices: our original Ω matrix and the additional n observations in equation 2.
• Note: the off-diagonal zeroes arise because there cannot be any correlation between the n extra observations in equation 2 and non-existent observations in equation 1.
• Let

X2 = [ X2* ; X2° ]   and   y2 = [ y2* ; y2° ]

where * indicates the first N observations and ° indicates the extra n observations.
• Then GLS becomes

β̃∗ = [ β̃1 ; β̃2 ]
   = [ σ^{11} X1'X1        σ^{12} X1'X2*
       σ^{21} X2*'X1       σ^{22} X2*'X2* + (1/σ22) X2°'X2° ]^{-1}
     [ σ^{11} X1'y1  + σ^{12} X1'y2*
       σ^{21} X2*'y1 + σ^{22} X2*'y2* + (1/σ22) X2°'y2° ]

where the σ^{ij} denote the elements of Σ^{-1}.

• We can choose to ignore the extra n observations. However, there may be some gain in small samples from using the extra observations.
• Regardless, we still get a consistent estimate Σ̂ of Σ.



14.5 Example: Production Function and Cost

• A particular example can be seen using production theory:
• Let Y = f(X); then cost minimization implies the factor demand equations

xi = xi(Y, p)

• From this we can derive total cost to be

C = Σ_{i=1}^{M} Pi·xi(Y, p) = C(Y, p)

• With CRS (Constant Returns to Scale) Technology, we know that

C = Y·c(p)
C/Y = c(p), where c(p) is average cost

• To minimize cost we use Shephard’s Lemma, i.e.,

x*i = ∂C(Y, p)/∂Pi = Y·∂c(p)/∂Pi

• Let

si = ∂ln C(Y, P)/∂ln Pi = Pi·xi / C

be the cost share for each input.



• Usually we seek the elasticities of factor substitution and the own-price elasticities of demand:

θij = c·(∂²c/∂Pi∂Pj) / [ (∂c/∂Pi)(∂c/∂Pj) ]

are the elasticities of factor substitution, and γii = si·θii are the own-price elasticities.
• With the cost function and the cost shares we have M + 1 equations.
• A useful cost function is the translog function

ln C = β0 + Σ_{i=1}^{M} (∂ln C/∂ln Pi)·ln Pi + (1/2) Σ_{i=1}^{M} Σ_{j=1}^{M} (∂²ln C/∂ln Pi ∂ln Pj)·ln Pi·ln Pj

• Treat the derivatives as constants and impose symmetry on the cross-price derivatives, i.e.,

ln C = β0 + β1 ln P1 + · · · + βm ln Pm + δ11(½(ln P1)²) + δ12(ln P1 ln P2) + · · · + δmm(½(ln Pm)²)

and the cost shares

s1 = ∂ln C/∂ln P1 = β1 + δ11 ln P1 + δ12 ln P2 + · · · + δ1m ln Pm
⋮
sm = ∂ln C/∂ln Pm = βm + δm1 ln P1 + δm2 ln P2 + · · · + δmm ln Pm

• And cost shares must sum to one, which requires

1 = β1 + β2 + · · · + βm

Σ_{i=1}^{M} δij = 0

Σ_{j=1}^{M} δij = 0

• Because the M cost shares sum to one, we must drop one of the share equations. Typically, we normalize the prices by dividing all M − 1 prices by the Mth price.
• For the translog cost function, the elasticities of substitution are easy to compute once the parameters have been estimated:

θmn = (δmn + sm·sn) / (sm·sn)

θmm = [ δmm + sm(sm − 1) ] / sm²

These will differ at each data point, so it is common to compute the elasticities at the sample means.
• Berndt and Wood (1975) look at the translog in a four-factor model with capital (K), labor (L), energy (E) and materials (M) as the inputs to U.S. manufacturing.
• The three factor shares used to estimate the model are

sK = βK + δKK log(PK/PM) + δKL log(PL/PM) + δKE log(PE/PM)
sL = βL + δKL log(PK/PM) + δLL log(PL/PM) + δLE log(PE/PM)
sE = βE + δKE log(PK/PM) + δLE log(PL/PM) + δEE log(PE/PM)
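As a rough sketch of how a share system like this might be estimated, the commands below feed the three share equations to sureg with cross-equation symmetry constraints. All variable names (sk, sl, se for the shares; lpk, lpl, lpe for log(PK/PM), log(PL/PM), log(PE/PM)) are hypothetical; the Berndt and Wood data are not distributed with these notes.

. * Hedged sketch: translog cost-share system by SUR with symmetry imposed.
. constraint 1 [sk]lpl = [sl]lpk      // delta_KL appears in both the K and L equations
. constraint 2 [sk]lpe = [se]lpk      // delta_KE
. constraint 3 [sl]lpe = [se]lpl      // delta_LE
. sureg (sk lpk lpl lpe) (sl lpk lpl lpe) (se lpk lpl lpe), constraints(1 2 3)

Dropping the materials share and normalizing prices by PM reflects the adding-up and normalization discussed above; the parameters of the omitted M equation can be recovered afterward from those restrictions.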

• Berndt and Wood’s data cover 1947 through 1971 and yield the following estimates:



Parameter   Estimate     Std. Error   Parameter   Estimate     Std. Error
βK           0.05690     0.00134      δKM         -0.0189      0.00971
βL           0.2534      0.00210      δLL          0.07542     0.00676
βE           0.0444      0.00085      δLE         -0.00476     0.00234
βM           0.6542      0.00330      δLM         -0.07061     0.01059
δKK          0.02951     0.00580      δEE          0.01838     0.00499
δKL         -0.000055    0.00385      δEM         -0.00299     0.00799
δKE         -0.01066     0.00339      δMM          0.09237     0.02247

• The estimated elasticities are

                                       Capital    Labor      Energy     Materials
Cost Shares for 1959
  Fitted Share                         0.05643    0.27451    0.04391    0.62515
  Actual Share                         0.06185    0.27303    0.04563    0.61948
Implied Elasticities of Substitution
  Capital                              -7.783
  Labor                                 0.9908    -1.643
  Energy                               -3.230      0.6021    -12.19
  Materials                             0.4581     0.5896      0.8834   -0.3623
Implied Own-Price Elasticities (sm·θmm)
                                       -0.4392    -0.4210     -0.5353   -0.2265

14.6 Example: Automobile characteristics and price

• This example comes from the STATA help file for the sureg command:

. use http://www.stata-press.com/data/r8/auto.dta
(1978 Automobile Data)



. sureg (price foreign weight length) (mpg foreign weight) (displ foreign weight)

Seemingly unrelated regression
----------------------------------------------------------------------
Equation             Obs  Parms        RMSE    "R-sq"       chi2        P
----------------------------------------------------------------------
price                 74      3    1967.769    0.5488      89.74   0.0000
mpg                   74      2    3.337283    0.6627     145.39   0.0000
displacement          74      2    39.60002    0.8115     318.62   0.0000
----------------------------------------------------------------------

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
price        |
     foreign |    3575.26   621.7961     5.75   0.000     2356.562    4793.958
      weight |   5.691462   .9205043     6.18   0.000     3.887307    7.495618
      length |  -88.27114    31.4167    -2.81   0.005    -149.8467   -26.69554
       _cons |   4506.212   3588.044     1.26   0.209    -2526.225    11538.65
-------------+----------------------------------------------------------------
mpg          |
     foreign |  -1.650029   1.053958    -1.57   0.117    -3.715748    .4156902
      weight |  -.0065879   .0006241   -10.56   0.000     -.007811   -.0053647
       _cons |    41.6797   2.121197    19.65   0.000     37.52223    45.83717
-------------+----------------------------------------------------------------
displacement |
     foreign |   -25.6127   12.50621    -2.05   0.041    -50.12441   -1.100984
      weight |   .0967549   .0074051    13.07   0.000     .0822411    .1112686
       _cons |  -87.23548   25.17001    -3.47   0.001    -136.5678   -37.90317
------------------------------------------------------------------------------

Correlation matrix of residuals:

                 price      mpg   displacement
       price    1.0000
         mpg   -0.0220   1.0000
displacement    0.1765   0.0229   1.0000

Breusch-Pagan test of independence: chi2(3) = 2.379, Pr = 0.4976


• The corr option requests STATA to report the correlation matrix between the error terms of the various equations. Notice that price and MPG are negatively correlated, displacement (engine size) and price are positively correlated, and displacement and MPG are positively correlated.
• We can compare the SUR estimates with equation-by-equation OLS estimates:

. reg price foreign weight length
------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   3573.092    639.328     5.59   0.000     2297.992    4848.191
      weight |   5.774712   .9594168     6.02   0.000     3.861215    7.688208
      length |  -91.37083   32.82833    -2.78   0.007    -156.8449   -25.89679
       _cons |   4838.021    3742.01     1.29   0.200    -2625.183    12301.22
------------------------------------------------------------------------------

. reg mpg foreign weight
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |  -1.650029   1.075994    -1.53   0.130      -3.7955    .4954422
      weight |  -.0065879   .0006371   -10.34   0.000    -.0078583   -.0053175
       _cons |    41.6797   2.165547    19.25   0.000     37.36172    45.99768
------------------------------------------------------------------------------

. reg displ foreign weight
------------------------------------------------------------------------------
displacement |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   -25.6127   12.76769    -2.01   0.049    -51.07074   -.1546505
      weight |   .0967549   .0075599    12.80   0.000     .0816807     .111829
       _cons |  -87.23548   25.69627    -3.39   0.001    -138.4724   -35.99858
------------------------------------------------------------------------------

• The parameter estimates for the second two equations are the same in equation-by-equation OLS as in the SUREG. The parameter estimates for the price equation are a bit different, reflecting the additional information that is available when the two other equations are included. Notice, as well, that the standard errors for the parameters in the price equation are smaller in the SUREG output than in the single-equation OLS.



15 Simultaneous Equations

• Simultaneous equations describe a situation where we have two or more equations which are related by left-hand side variables showing up on the right-hand side of other equations, i.e., there is more than one endogenous variable in the model.
• This occurs more often than one would think. In fact, most economic models have more than one endogenous variable, e.g.,

1. Cost and revenue at profit maximization
2. Quantity and price (supply and demand)
3. Keynesian macro model
4. Advertising and sales
5. Interest rates and savings

• Take the Keynesian macro model, for example. The assumptions are

1. Ct is a stable function of income, Yt
2. Expenditure equals income and is comprised of consumption and investment
3. Investment is independent of income and also of ε

• The assumptions imply that

Ct = β + γYt : Behavioral equation
Yt = Ct + It : Identity



• To bring this into a stochastic framework we rewrite the model as

Ct = β + γYt + εt
Yt = Ct + It
ε ∼ N(0, σ²)

• How do we estimate γ, which is interpreted as the marginal propensity to consume? Let β = 0 and then, via OLS, we have

γ̂ = Σ CtYt / Σ Yt² = Σ (γYt + εt)Yt / Σ Yt²

γ̂ = γ + Σ εtYt / Σ Yt²

• Note that

E[γ̂] = γ + E[ Σ Ytεt / Σ Yt² ]   but   E[ Σ Ytεt / Σ Yt² ] ≠ E[Σ Ytεt] / E[Σ Yt²]

• Looking at plims we find

plim γ̂ = γ + plim( (1/T) Σ Ytεt ) / plim( (1/T) Σ Yt² )

• However, from our model we know that

Ct = γYt + εt
Yt = Ct + It   or   Ct = Yt − It
⇒ Yt − It = γYt + εt
⇒ (1 − γ)Yt = It + εt



⇒ Yt = It/(1 − γ) + εt/(1 − γ)

• This implies that

plim (1/T) Σ Ytεt = plim (1/T) Σ [ (1 − γ)^{-1}It + (1 − γ)^{-1}εt ]εt
                  = (1 − γ)^{-1}[ plim Σ Itεt/T + plim Σ εtεt/T ]

and we know the following:

plim Σ Itεt/T = 0, via the independence of It and εt
plim Σ εtεt/T = σ²

which implies that

plim (1/T) Σ Ytεt = σ²/(1 − γ)

• If plim (1/T) Σ Yt² = Qyy is finite, then

plim γ̂ = γ + [ σ²/(1 − γ) ] / Qyy > γ
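To see this upward bias numerically, the short simulation below draws data from the model with β = 0 and a true γ of 0.6 and then runs OLS of Ct on Yt. It is only an illustrative sketch; the parameter values, seed, and variable names are invented for the example.

. * Simulated illustration of simultaneity bias (hypothetical values).
. clear
. set obs 500
. set seed 12345
. gen I = rnormal(100, 10)       // exogenous investment
. gen e = rnormal(0, 5)          // structural disturbance
. gen Y = (I + e)/(1 - 0.6)      // reduced form: Y = (I + e)/(1 - gamma)
. gen C = 0.6*Y + e              // consumption, true gamma = 0.6
. regress C Y                    // OLS slope exceeds 0.6 on average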

• The upshot of all of this is that γ̂ is inconsistent and is also biased upward.
• This is a general result when we have right-hand side endogenous variables in the model. The direct least squares approach is inappropriate.
• We seek to make this a bit better. We can respecify the model in the following way:



• Let

Ct = β + γYt + εt
Ct = β + γ(Ct + It) + εt
Ct = β/(1 − γ) + γIt/(1 − γ) + εt/(1 − γ)

• This last equation can be rewritten as

Ct = π10 + π11 It + vt

where

π10 = β/(1 − γ)
π11 = γ/(1 − γ)
vt = εt/(1 − γ)

• We could substitute for Ct in our model to obtain

Yt = β + γYt + εt + It
Yt = β/(1 − γ) + It/(1 − γ) + εt/(1 − γ)

or Yt = π20 + π21 It + vt



where

π20 = β/(1 − γ)
π21 = 1/(1 − γ)
vt = εt/(1 − γ)

• Note:

vt ∼ N( 0, σ²/(1 − γ)² )

• This reformulated model is known as a reduced form model. It is “reduced” in the sense that the jointly determined variables have been expressed only in terms of exogenous variables.
• Our two equation system in structural form was

Ct = β + γYt + εt
Yt = Ct + It

• Now, in reduced form our model is

Ct = π10 + π11 It + vt
Yt = π20 + π21 It + vt

• The structural form comes from economic theory. The reduced form comes from a specific statistical framework.
• Note: One reduced form model can be derived from any number of structural models. This is known as the simultaneity problem.
• The reduced form does satisfy the FIC so that OLS can be run to obtain Π̂.
• Note that Π̂ doesn’t tell us much. Thus, reduced form parameters are not directly interpreted in economic theory. This is very important to remember!!
• All is not lost, however; we simply have to reverse engineer β and γ.
• We reverse-engineer the structural parameters using Indirect Least Squares. We can obtain Π̂ from the reduced form and then recognize that

γ̃ = π̂11/π̂21 = [ γ̃/(1 − γ̃) ] / [ 1/(1 − γ̃) ]

β̃0 = π̂20/π̂21 = [ β̃/(1 − γ̃) ] / [ 1/(1 − γ̃) ]

β̃1 = π̂10/π̂21 = [ β̃/(1 − γ̃) ] / [ 1/(1 − γ̃) ]

π ˆ20 ´ β˜0 = = ³ 1 π ˆ21 1−˜ γ ³ ˜ ´ β 1−˜ γ π ˆ 10 ´ β˜1 = = ³ 1 π ˆ21 1−˜ γ

• Here γ˜ and β˜ are the indirect least-squares estimates. • These estimates are not unbiased but they are consistent. • Note: We have just enough information from the economic theory to obtain γ˜ but β˜ can be estimated in two different ways, this indicates the simultaneity/identification problem, discussed below. • What if we have It = I1t + I2t where I1t is planned investment and I2t is government expenditure?

260


• In this instance we would not have enough information to obtain unique indirect least squares estimates. This is known as the identification problem. The upshot is that we need economic theory to tell us what to do.
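For the just-identified model above, the indirect least-squares computation is easy to carry out. The sketch below assumes consumption, income, and investment series named C, Y, and I are already in memory; the names are hypothetical stand-ins.

. * Hedged sketch: indirect least squares for the Keynesian model.
. regress C I                      // reduced form for consumption: pi_10 + pi_11*I
. scalar pi11 = _b[I]
. regress Y I                      // reduced form for income: pi_20 + pi_21*I
. scalar pi21 = _b[I]
. scalar gamma_ils = pi11/pi21     // pi_11/pi_21 = gamma
. display "ILS estimate of gamma (the MPC): " gamma_ils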

15.1 A Generalization

• In general we have three types of variables:

1. Endogenous Variables: Those variables that are determined within the model (e.g., Yt and Ct).
2. Predetermined Variables: Those variables which are endogenous to the model but have been determined already at time t (e.g., Yt−1 and Ct−1).
3. Exogenous Variables: Those variables determined outside of the model, which in essence recognizes the limitations of economic theory.

• Reduced form: Endogenous variables in terms of predetermined variables.
• Final form: Endogenous variables in terms of purely exogenous variables.
• Let the following notation be defined:

G                       Number of equations in the model = number of endogenous variables
K                       Number of predetermined variables
Yt· = [Yt1 · · · YtG]   [1 × G] vector of endogenous variables
Xt· = [Xt1 · · · XtK]   [1 × K] vector of predetermined variables
Γ                       [G × G] matrix of endogenous variable coefficients
∆                       [K × G] matrix of predetermined variable coefficients
εt·                     [1 × G] vector of structural disturbance terms



• Then for observation t,

Yt·Γ + Xt·∆ + εt· = 0

• For all T observations,

Y = [ Y1· ; Y2· ; … ; YT· ] = [ Y11 · · · Y1G ; … ; YT1 · · · YTG ]   (T × G)

X = [ X1· ; X2· ; … ; XT· ] = [ X11 · · · X1K ; … ; XT1 · · · XTK ]   (T × K)

• All of this allows us to rewrite the structural model as

Y Γ + X∆ + ² = 0

• We need some assumptions to place on our structural model:

1. plim (1/T) ε'ε = Σ, or εit iid N(0, Σ)
2. plim (1/T) X'X is finite and nonsingular
3. Γ is non-singular

• Then, from the structural form we try to solve for Y, which are our endogenous variables:

YΓ + X∆ + ε = 0
YΓ = −X∆ − ε
YΓΓ^{-1} = −X∆Γ^{-1} − εΓ^{-1}
Y = −X∆Γ^{-1} − εΓ^{-1}

• Let Π = −∆Γ−1 and V = −²Γ−1 to get

Y = XΠ + V

where Π is [K × G] and V is [T × G].
• We proceed with a specific example: the Supply and Demand model.

– The traditional supply and demand model has two equations and two unknowns:

Qdt = a + b·Pdt + εdt
Qst = α + β·Pst + γ·Wt + εst

– At equilibrium we have Qdt = Qst and Pdt = Pst, assuming that there are no other distortions in the market.
– Thus, our structural model is

Qdt = a + b·Pdt + εdt
Qst = α + β·Pst + γ·Wt + εst



which can be rewritten as

[ Qt  Pt ][ −1  −1 ; b  β ] + [ 1  Wt ][ a  α ; 0  γ ] + [ εdt  εst ] = 0

– We need Γ^{-1} so as to solve for Y.

15.1.1 Inverting a 2x2 Matrix

– For a [2 × 2] matrix, the inverse can be derived relatively easily: if

A = [ a11  a12 ; a21  a22 ]

then

A^{-1} = ( 1/(a11·a22 − a12·a21) ) [ a22  −a12 ; −a21  a11 ]

– So, we can solve for Γ^{-1} to obtain

Γ^{-1} = ( 1/(b − β) ) [ β  1 ; −b  −1 ]

– With Γ^{-1} in hand, we can look at the rest of the system:

Π = −∆Γ^{-1} = −( 1/(b − β) ) [ a  α ; 0  γ ][ β  1 ; −b  −1 ]
  = −( 1/(b − β) ) [ aβ − αb   a − α ; −bγ   −γ ]
  = ( 1/(b − β) ) [ αb − aβ   α − a ; bγ   γ ]

We also see that Vt = −εt·Γ^{-1} is actually

Vt = −[ εdt  εst ]( 1/(b − β) ) [ β  1 ; −b  −1 ]

which can be rewritten as

Vt = ( 1/(b − β) ) [ bεst − βεdt   εst − εdt ]

– Our reduced form becomes

[ Qt  Pt ] = [ 1  Wt ] [ (αb − aβ)/(b − β)   (α − a)/(b − β) ; bγ/(b − β)   γ/(b − β) ] + [ (bεst − βεdt)/(b − β)   (εst − εdt)/(b − β) ]

– Remember that reduced form estimates have no direct economic interpretation.
– We obtain consistent estimates of Π as

Π̂ = (X'X)^{-1}X'Y = Π + (X'X)^{-1}X'V

15.2 The Identification Problem

– Identification: Knowledge of reduced form parameters does not guarantee the ability to obtain unique estimates for structural (behavioral) parameters.



– Note: Consideration of identification comes from consideration of estimation.

1. Unidentified equation: It is not possible to obtain values for all structural parameters.
2. Exactly (Just) Identified: It is possible to obtain unique values for each structural parameter.
3. Over Identified: It is possible to obtain more than one value for one or more structural parameters. (This is useful for hypothesis testing.)

– On a system level, some equations may be identified and others may not.
– Consider the following examples:

1. If we have Qd = a + bPd + εt and Qs = α + βPs + ut, neither equation is identified. There are infinite supply and demand curves that would intersect at a given P and Q equilibrium.
2. If we have Qd = a + b1Pd + b2Y + εt and Qs = α + βPs + ut, now supply is identified but demand is not. Supply is identified because income has been excluded from the supply model, which theoretically is justified. However, demand is not identified.
3. If we have Qd = a + b1Pd + b2Y + εt and Qs = α + β1Ps + β2Wt + ut, now supply is identified because income is excluded from the supply model, and demand is identified because weather (W) has been excluded from demand.
4. If we have Qd = a + b1Pd + b2Y + b3Ct + εt and Qs = α + β1Ps + β2Wt + ut, now supply is overidentified; there are two variables omitted from the supply model. These two excluded exogenous variables cause the supply curve to be “over” identified.


– Simple rules for identification (not exhaustive):
– The order condition is the easiest rule for identification. However, it is not always guaranteed to work, so be careful. Let G be the number of equations in the model, g the number of endogenous variables in a given equation, K the number of predetermined variables in the model, and k the number of predetermined variables in the given equation.
– With G equations, an equation is identified if it excludes at least G − 1 endogenous and predetermined variables. If exclusions = G − 1 the equation is just identified. If exclusions > G − 1 the equation is over identified.
– An equivalent statement is that K − k ≥ G − 1.
– Note: Indirect Least Squares is only appropriate in a just identified system; otherwise we will have more than one possible value for one or more of the structural parameters.

15.3 Estimation

– At times there will be restrictions placed on the reduced-form parameters with knowledge of the structural form. For example, if we know that

γ21 = π22/π12   and   γ21 = π23/π13

then

π22/π23 = π12/π13


– But there is nothing in the least squares estimation that restricts π̂22/π̂23 = π̂12/π̂13, which in general will not hold.
– In this case, ILS will not lead to a unique estimator of γ21 because there are two ways to estimate γ21.
– To use all the information in the data we need another approach.
– Let

Y1t = Y2t γ12 + X1t β11 + X2t β12 + X3t β13 + ε1t
Y2t = Y1t γ21 + X1t β21 + 0 + 0 + X4t β24 + ε2t

– Let

ε = [ ε1 ; ε2 ] ∼ N( [ 0 ; 0 ], [ σ11·IT   σ12·IT ; σ21·IT   σ22·IT ] )

– We can rewrite these equations as:

Y1 = [ Y2  X1  X2  X3 ][ γ12 ; β11 ; β12 ; β13 ] + ε1 = Z1δ1 + ε1



Y2 = [ Y1  X1  X4 ][ γ21 ; β21 ; β24 ] + ε2 = Z2δ2 + ε2

– To find δ̂1 we could use δ̂1 = (Z1'Z1)^{-1}Z1'Y1, and then we would have

E[δ̂1] = E[(Z1'Z1)^{-1}Z1'(Z1δ1 + ε1)] = δ1 + E[(Z1'Z1)^{-1}Z1'ε1] ≠ δ1

– This holds because Z1 contains Y2 (likewise Z2 contains Y1 ) and Y2 is stochastic. – We try to implement an IV approach: We look for a set of instruments that are highly correlated with Y2 but not with ²1 . – We can consider all exogenous variables in the model which should be correlated with Y2 but, by definition should be uncorrelated with ².

X = [ X1 X2 X3 X4 ]

– Then, by premultiplying by X' we get a transformed model

X'Y1 = X'Z1δ1 + X'ε1

which leads to

plim X'Y1/T = plim (X'Z1/T)·δ1 + plim X'ε1/T

⇒ Q_XY1 = Q_XZ1 δ1 + 0

δ1 = Q_XZ1^{-1} Q_XY1

– Using sample values we would then obtain

δ̂1 = (X'Z1)^{-1}X'Y1

and

cov(δ̂1) = σ̃²(X'Z1)^{-1}X'X(Z1'X)^{-1}   where   σ̃² = (Y1 − Z1δ̂1)'(Y1 − Z1δ̂1) / T
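In STATA, an instrumental-variables estimator of this first equation can be obtained with ivreg, using the excluded exogenous variable X4 as the instrument for Y2. The command below is only a sketch using the abstract variable names of the system above, which would be replaced by real data in practice.

. * Hedged sketch: IV estimation of Y1 = Y2*g12 + X1*b11 + X2*b12 + X3*b13 + e1.
. ivreg Y1 X1 X2 X3 (Y2 = X4)      // X4 is excluded from equation 1 and serves as the instrument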

– If we have just or overidentified equations we can move to what is called Two Stage Least Squares. – Recall that our model can be written as

Y1 = Z1δ1 + ε1
Y2 = Z2δ2 + ε2

where the second equation is overidentified because two (2) exogenous variables are omitted. – Now, note that X 0 Z2 is no longer square, and thus cannot be inverted. In this case, the IV approach doesn’t work, how do we get around this?



– From the reduced form model we obtain

Y1 = X1π11 + X2π12 + X3π13 + X4π14 + V1 = XΠ1 + V1

and Y1 = XΠ̂1 + V̂1, where Π̂1 is an OLS estimate of Π1.

¸ Yˆ1 + Vˆ1 X1 X4

– We can now rewrite the second equation of the system as   γ21  X4 ]   β21  β24

Y2 = [ Yˆ1 + Vˆ1 X1

or

 Y2 = [ Yˆ1 X1

 γ21  X4 ]   β21  β24

    + ²2  

    + V1 γ21 + ²2  

Y2 = Zˆ2 δ2 + u2 where u2 = Vˆ1 γ21 + ²2 ˆ 1 dependends upon Y1 , but this – Note: Zˆ2 is correlated a bit with ²2 because Π problem tends to disappear as T → ∞. – Now, OLS can be run to obtain δˆ2 = [Zˆ20 Zˆ2 ]−1 Zˆ20 Y2

271


where δˆ2 is consistent and cov(δ2 ) = σ ˆ22 (Zˆ20 Zˆ2 )−1 where σ ˆ22

(Y2 − Z2 δˆ2 )0 (Y2 − Z2 δˆ2 ) = T

– Note: We use Z2 not Zˆ2 to calculate the cov(δˆ2 ). – This outlines the two-stage least squares (2SLS) approach. – 1st Stage: Regress each right-hand side endogenous variable on all predetermined variables, and obtain fitted values:

Yi = XΠi + Vi ˆ i = (X 0 X)−1 X 0 Yi Π ˆ i. and obtain fitted values Yˆi = X Π – 2nd Stage: Perform OLS on each separate equation substituting Yˆj for all j right-hand side endogenous variables. – This is also called a “Limited Information” approach because it estimates each equation in the system separately. There is a Three-Stage Least Squares (3SLS) or a “Full Information” approach which combines the 2SLS approach with the intuition of SUR models, taking advantage of any residual covariance across the equations in the system. – The full information methodology leads to efficiency gains by treating the system as a whole, not merely equation by equation. However, it is also more prone to 272


specification errors and is often more computationally intensive (although the latter is becoming less and less of a problem). – Ex: Supply and demand model is a limited information approach. However, wondering what the effect of taxation is on supply and demand decisions might require a full information approach, especially if there is little variation in the tax rates. – The full information approach is more appropriate when

E[²i ²j ] = σij I = Ω

– The three-stage least squares approach calls for 1. Estimate fitted values 2. Estimate the elements of Ω 3. Estimate the system as a whole just like in an SUR approach

15.4

Example: Math and Science Test Scores

• This example is lifted, basically verbatim, from the UCLA tutorial pages at www.gseis. ucla.edu/courses/ed231c/notes3/instrumental.html. Using the hsb2 dataset, consider the correlation between science and math. use http://www.gseis.ucla.edu/courses/data/hsb2 corr science math (obs=200) | science math -------------+-----------------science | 1.0000 math | 0.6307 1.0000 273


• The correlation of 0.63 may be satisfyingly large but it is also somewhat misleading. It would be tempting to interpret the correlation as reflecting the relationship between a measure of ability in science and ability in math. The problem is that both the science and math tests are standardized written tests so that general academic skills and intelligence are likely to influence the results of both, leading to an inflated correlation. In addition to the inflated correlation there is a more subtle problem that can arise when you try to use these test scores in a regression analysis. • Consider the following model: science = f (constant, matth, f emale) + e • Because portions of the variability of both science and math are jointly determined by general academic skills and intelligence there is a strong likelihood that there will be a correlation between math and the error (residuals) in the model. This correlation violates one of the basic assumptions of independence in OLS regression. Using reading and writing scores as indicators of general academic skills and intelligence, we can check out this possibility with the following commands. regress math female read write Source | SS df MS -------------+-----------------------------Model | 9176.66954 3 3058.88985 Residual | 8289.12546 196 42.2914564 -------------+-----------------------------Total | 17465.795 199 87.7678141

Number of obs F( 3, 196) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

200 72.33 0.0000 0.5254 0.5181 6.5032

-----------------------------------------------------------------------------math | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------female | -2.023984 .991051 -2.04 0.042 -3.978476 -.0694912 read | .385395 .0581257 6.63 0.000 .270763 .500027 write | .3888326 .0649587 5.99 0.000 .2607249 .5169402 _cons | 13.09825 2.80151 4.68 0.000 7.573277 18.62322 -----------------------------------------------------------------------------. predict resmath, resid . regress science math female resmath Source |

SS

df 274

MS

Number of obs =

200


-------------+-----------------------------Model | 10284.6648 3 3428.22159 Residual | 9222.83523 196 47.0552818 -------------+-----------------------------Total | 19507.50 199 98.0276382

F( 3, 196) Prob > F R-squared Adj R-squared Root MSE

= = = = =

72.86 0.0000 0.5272 0.5200 6.8597

-----------------------------------------------------------------------------science | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------math | 1.007845 .0716667 14.06 0.000 .8665077 1.149182 female | -1.978643 .9748578 -2.03 0.044 -3.9012 -.0560859 resmath | -.7255874 .103985 -6.98 0.000 -.9306604 -.5205144 _cons | -.1296216 3.861937 -0.03 0.973 -7.745907 7.486664 ------------------------------------------------------------------------------

• The significant resmath coefficient indicates that there is a problem with using math as a predictor of science. In a traditional linear regression model the response variable is considered to be endogenous and the predictors to be exogenous. • An endogenous variable is a variable whose variation is explained by either exogenous variables or other endogenous variables in the model. Exogenous variables are variables whose variability is determined by variables outside of the model. • When one, or more, of the predictor variables is endogenous we encounter the problem of the variable being correlated with the error (residual). The test of resmath (above) can be considered to be a test of the endogeneity of math but is more specifically a test as to whether the OLS estimates in the model are consistent. • The ivreg command (or two-stage least squares; 2SLS) is designed to used in situations in which predictors are endogenous. In essence, ivreg simultaneously estimates two equations, math = f (constant1, read, write, f emale) + e1 and science = f (constant2, math∗, f emale) + e2

275


• Now we have the situation in which read, write and female are exogenous and are instruments used to predict math, which is treated as an endogenous variable. In the second equation above math* is used to indicate that it is the instrumented form of the variable math that is being used. The ivreg command for our example looks like this, ivreg science female (math = read write) Instrumental variables (2SLS) regression Source | SS df MS -------------+-----------------------------Model | 5920.63012 2 2960.31506 Residual | 13586.8699 197 68.9688827 -------------+-----------------------------Total | 19507.50 199 98.0276382

Number of obs F( 2, 197) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

200 69.77 0.0000 0.3035 0.2964 8.3048

-----------------------------------------------------------------------------science | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------math | 1.007845 .0867641 11.62 0.000 .836739 1.17895 female | -1.978643 1.180222 -1.68 0.095 -4.306134 .3488478 _cons | -.1296216 4.675495 -0.03 0.978 -9.350068 9.090824 -----------------------------------------------------------------------------Instrumented: math Instruments: female read write -----------------------------------------------------------------------------. predict p1 . estimates store ivreg • Next, we can use Stata’s hausman command to test whether the differences between the ivreg and OLS estimates are large enough to suggest that the OLS estimates are not consistent. regress science math female Source | SS df MS -------------+-----------------------------Model | 7993.54995 2 3996.77498 Residual | 11513.95 197 58.4464469 -------------+-----------------------------276

Number of obs F( 2, 197) Prob > F R-squared Adj R-squared

= = = = =

200 68.38 0.0000 0.4098 0.4038


Total |

19507.50

199

98.0276382

Root MSE

=

7.645

-----------------------------------------------------------------------------science | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------math | .6631901 .0578724 11.46 0.000 .549061 .7773191 female | -2.168396 1.086043 -2.00 0.047 -4.310159 -.026633 _cons | 18.11813 3.167133 5.72 0.000 11.8723 24.36397 -----------------------------------------------------------------------------. predict p2 .

hausman ivreg . , constant sigmamore

---- Coefficients ---| (b) (B) (b-B) sqrt(diag(V_b-V_B)) | ivreg . Difference S.E. -------------+---------------------------------------------------------------math | 1.007845 .6631901 .3446546 .0550478 female | -1.978643 -2.168396 .1897529 .0303071 _cons | -.1296216 18.11813 -18.24776 2.914507 -----------------------------------------------------------------------------b = consistent under Ho and Ha; obtained from ivreg B = inconsistent under Ha, efficient under Ho; obtained from regress Test:

Ho:

difference in coefficients not systematic chi2(1) = (b-B)’[(V_b-V_B)^(-1)](b-B) = 39.20 Prob>chi2 = 0.0000

• Sure enough, the there is a significant (chi-square = 39.2, df = 1, p = 0.0000) difference between the ivreg and OLS coefficients, indicating clearly that OLS is an inconsistent estimator in this equation. The conclusion is that the reason for the inconsistent estimates is due to the endogeneity of math. • The R2 for the OLS model is much higher than the R2 for the ivreg model but this is due to the fact that both science and math are correlation with the exogenous variable 277


read and write. • If we wanted to represent this model graphically, it would look something like this

with squares for the exogenous variables and circles for the endogenous variables. • Let’s look at the variable science and the two predicted values, p1 from the ivreg model and p2 from the OLS model. . summarize science p1 p2 Variable | Obs Mean Std. Dev. Min Max -------------+----------------------------------------------------science | 200 51.85 9.900891 26 74 p1 | 200 51.85 9.522247 31.15061 75.45872 p2 | 200 51.85 6.33787 37.83501 67.85739 corr science p1 p2 (obs=200) | science p1 p2 -------------+--------------------------science | 1.0000 278


p1 | p2 |

0.6387 0.6401

1.0000 0.9977

1.0000

• Finally, let’s how close we can come to the ivreg results doing our own two-stage regression. . regress math read write female Source | SS df MS -------------+-----------------------------Model | 9176.66954 3 3058.88985 Residual | 8289.12546 196 42.2914564 -------------+-----------------------------Total | 17465.795 199 87.7678141

Number of obs F( 3, 196) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

200 72.33 0.0000 0.5254 0.5181 6.5032

-----------------------------------------------------------------------------math | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------read | .385395 .0581257 6.63 0.000 .270763 .500027 write | .3888326 .0649587 5.99 0.000 .2607249 .5169402 female | -2.023984 .991051 -2.04 0.042 -3.978476 -.0694912 _cons | 13.09825 2.80151 4.68 0.000 7.573277 18.62322 -----------------------------------------------------------------------------. predict pmath . regress science pmath female Source | SS df MS -------------+-----------------------------Model | 9624.27766 2 4812.13883 Residual | 9883.22234 197 50.1686413 -------------+-----------------------------Total | 19507.50 199 98.0276382

Number of obs F( 2, 197) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

200 95.92 0.0000 0.4934 0.4882 7.083

-----------------------------------------------------------------------------science | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------pmath | 1.007845 .0739996 13.62 0.000 .8619116 1.153778 female | -1.978643 1.006591 -1.97 0.051 -3.963721 .0064348 _cons | -.1296241 3.987652 -0.03 0.974 -7.993588 7.73434 -----------------------------------------------------------------------------279


/* this is the ivreg results from above */ Instrumental variables (2SLS) regression Source | SS df MS -------------+-----------------------------Model | 5920.63012 2 2960.31506 Residual | 13586.8699 197 68.9688827 -------------+-----------------------------Total | 19507.50 199 98.0276382

Number of obs F( 2, 197) Prob > F R-squared Adj R-squared Root MSE

= = = = = =

200 69.77 0.0000 0.3035 0.2964 8.3048

-----------------------------------------------------------------------------science | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------math | 1.007845 .0867641 11.62 0.000 .836739 1.17895 female | -1.978643 1.180222 -1.68 0.095 -4.306134 .3488478 _cons | -.1296216 4.675495 -0.03 0.978 -9.350068 9.090824 -----------------------------------------------------------------------------Instrumented: math Instruments: female read write ------------------------------------------------------------------------------

• In the first regression, we regressed the endogenous predictor on the three exogenous variables. In the second regression, we used the predicted math (pmath) as the instrumented variable in our model. Note that the coefficients in the second regression and the ivreg are the same, but that the standard errors are different. • One final note, it is also possible to estimate this system of equations using three-stage least squares (3SLS). Stata’s reg3 command can perform either 2SLS (equivalent to ivreg) or 3SLS and clearly illustrates the two equation nature of the problem. . reg3 (science = math female)(math = read write female), 2sls Two-stage least-squares regression ---------------------------------------------------------------------Equation Obs Parms RMSE "R-sq" F-Stat P ---------------------------------------------------------------------science 200 2 8.304751 0.3035 69.7726 0.0000 math 6.503188 0.5254 72.32879 0.0000 280


--------------------------------------------------------------------------------------------------------------------------------------------------| Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------science | math | 1.007845 .0867641 11.62 0.000 .8372648 1.178424 female | -1.978643 1.180222 -1.68 0.094 -4.298981 .3416951 _cons | -.1296216 4.675495 -0.03 0.978 -9.321732 9.062489 -------------+---------------------------------------------------------------math | read | .385395 .0581257 6.63 0.000 .2711189 .4996712 write | .3888326 .0649587 5.99 0.000 .2611226 .5165425 female | -2.023984 .991051 -2.04 0.042 -3.972408 -.075559 _cons | 13.09825 2.80151 4.68 0.000 7.590429 18.60607 -----------------------------------------------------------------------------Endogenous variables: science math Exogenous variables: female read write -----------------------------------------------------------------------------reg3 (science = math female)(math = read write female) Three-stage least squares regression ---------------------------------------------------------------------Equation Obs Parms RMSE "R-sq" chi2 P ---------------------------------------------------------------------science 200 2 8.24223 0.3035 141.6703 0.0000 math 6.438234 0.5253 221.4518 0.0000 --------------------------------------------------------------------------------------------------------------------------------------------------| Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------science | math | 1.007845 .0861109 11.70 0.000 .8390704 1.176619 female | -1.978643 1.171337 -1.69 0.091 -4.274421 .3171349 _cons | -.1296216 4.640297 -0.03 0.978 -9.224436 8.965192 -------------+---------------------------------------------------------------math | read | .3772331 .0496322 7.60 0.000 .2799557 .4745104 write | .3981663 .0550155 7.24 0.000 .290338 .5059947 female | -2.078337 .9617418 -2.16 0.031 -3.963316 -.1933579 _cons | 13.06158 2.770268 4.71 0.000 7.631958 18.49121 281


-----------------------------------------------------------------------------Endogenous variables: science math Exogenous variables: female read write ------------------------------------------------------------------------------

282


16 Additional Topics: Typically taught in more detail in ECON5329

16.1 Limited Dependent Variables

• These models are useful when the econometric model involves behavioral choices:

1. One votes yes or no in an election
2. One rides a bus or a subway on a commute
3. Either you have a job or you do not
4. Either you purchase a product or do not

• When explanatory variables are dichotomous, we can use a dummy variable approach. However, dependent variables that are dichotomous introduce some complexities.
• While a choice is being modeled, say to vote on a baseball stadium or not, we may think some attributes/characteristics influence the decision.
• Perhaps income is important in voting. Higher incomes may correlate with a greater desire to vote yes on a stadium referendum.
• Perhaps love of baseball may also influence voting.
• While income may be relatively easy to measure, love of baseball may be more difficult.
• Binary choice models thus seek to determine the likelihood of a choice occurring, although the exact prediction may not be perfect or even desirable.
• There are several possible model specifications; we will discuss a few.


16.1.1 Linear Probability Model

• Let Yi = α + βXi + εi where Xi = income or some other characteristic.
• The values of Yi are

Yi = 1 if vote yes
Yi = 0 if vote no

• Taking expectations: E[Yi] = α + βXi
• Let Pi = Prob(Yi = 1) and 1 − Pi = Prob(Yi = 0).
• Then we see that E[Yi] = 1 × Pi + 0 × (1 − Pi).
• So the slope, β, is the change in the probability due to a unit change in the right-hand side variable.

Pi = α + βXi   when 0 < α + βXi < 1
Pi = 1         when α + βXi ≥ 1
Pi = 0         when α + βXi ≤ 0

• The error structure is then

εi = 1 − α − βXi   when Yi = 1, with Prob = Pi
εi = −α − βXi      when Yi = 0, with Prob = 1 − Pi



• Note that

E[εi] = (1 − α − βXi)Pi + (−α − βXi)(1 − Pi) = 0

• Solving for Pi we obtain

Pi = α + βXi
1 − Pi = 1 − α − βXi

• Looking at the variance of εi we see that

E[εi²] = (1 − α − βXi)²Pi + (−α − βXi)²(1 − Pi)
       = (1 − Pi)²Pi + Pi²(1 − Pi)
       = (1 − 2Pi + Pi²)Pi + Pi² − Pi³
       = Pi − 2Pi² + Pi³ + Pi² − Pi³

σi² = Pi − Pi² = Pi(1 − Pi)

• Note that the error structure is heteroscedastic.
• Note that as Pi → .5 the variance tends to increase.
• Note that as Pi → 0 or 1 the variance tends to decrease.
• It is appropriate to use GLS in which the weights are the inverse of the OLS estimated variances.
• Thus, we would have σ̂i² = Ŷi(1 − Ŷi) where Ŷi = α̂ + β̂Xi.
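A bare-bones version of this two-step weighted (FGLS) linear probability model in STATA might look like the sketch below. The variable names voteyes and income are hypothetical, and the weighting step assumes every fitted probability lies strictly between 0 and 1.

. * Hedged sketch: two-step weighted linear probability model.
. regress voteyes income               // step 1: OLS linear probability model
. predict phat                         // fitted probabilities Y-hat
. gen w = 1/(phat*(1-phat))            // inverse of the estimated error variance
. regress voteyes income [aweight=w]   // step 2: weighted least squares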



• But, note that there is no guarantee that Ŷi ∈ [0, 1], yielding inefficient predictions. What does it mean if the predicted probability is greater than one or less than zero?
• An alternative is to move to a specific construct of ε so as to ensure that predicted values of Y fall on the unit interval.
• We would want the alternative to maintain consistency with intuition; that is, greater X would lead to a greater probability of the choice being made, or vice-versa.
• We could accommodate this with a cumulative density function, which by definition is on the unit interval.
• Thus, let Pi = F(α + βXi) = F(Zi)
• There are plenty of cumulative density functions that can and have been used. We will discuss two, the Normal and the Logistic.

16.1.2 Probit Model

• A picture of the distribution (figure omitted).
• The normal distribution is used in the Probit Model.
• Let Zi = α + βXi, but we don’t observe the actual Z. We see a proxy for Z embodied in whether a choice is made or not.
• Let Zi* be a critical cutoff of Z that determines which choice is made, for example whether or not an individual votes for a stadium.
• Because Z is a reflection of desire for a stadium, we don’t know exactly what value of Z on the unit interval is required to motivate an individual to vote yes or no.


• Assume if Zi ≤ Zi* then the individual votes no, and if Zi > Zi* then the individual votes yes.
• Note that the critical cutoff point varies by individual.
• Thus, under the normality assumption we have

Pi = F(Zi) = (1/√(2π)) ∫_{−∞}^{Zi} e^{−s²/2} ds

where s is a random variable distributed with a zero mean and a constant variance.
• Then, we can write Zi = F^{-1}(Pi) = α + βXi.
• However, to estimate this model we need a nonlinear technique; typically we use maximum likelihood. Maximum likelihood techniques are numerical in nature.
• Compared to the linear model, the slopes of the probit model are greater in the middle range of the values. Unfortunately, probit results are not that much different than the linear probability model and are often inappropriate for economic data.

16.1.3 Logit Model

• A picture of the distribution (figure omitted).
• This model is based upon the logistic distribution

Pi = F(Zi) = F(α + βXi) = 1/(1 + e^{−Zi}) = 1/(1 + e^{−(α+βXi)})

• Note that e is approximately equal to 2.718.


• To estimate this model we manipulate the equation above as

(1 + e^{−Zi})Pi = 1
e^{−Zi} = (1 − Pi)/Pi
e^{Zi} = Pi/(1 − Pi)
Zi = log( Pi/(1 − Pi) ) = log(Pi) − log(1 − Pi) = α + βXi

• Let the likelihood function be the chance that all the observations occur simultaneously, or

L = Prob(Y1, Y2, . . . , YN) = Prob(Y1) × Prob(Y2) × · · · × Prob(YN)

• Further, let

L = P1 × P2 × · · · × Pm × (1 − Pm+1) × · · · × (1 − PN) = Π_{i=1}^{m} Pi · Π_{j=m+1}^{N} (1 − Pj)

where the first m observations have Y = 1 and the last N − m observations have Y = 0.
• Then, to optimize, we take ∂lnL/∂α and ∂lnL/∂β and use numerical techniques to solve the system.
• Note that the slope parameters are no longer directly interpretable.
• If we estimate Zi = α + βXi as a logit model, then β on its own is not a very interesting parameter.
• Note that β = ∂Zi/∂X, but that doesn’t tell us much about the change in the probability of a choice being made.



• Thus, we recognize that

β = Δ log( Pi/(1 − Pi) ) / ΔX

• Recognizing that Δln X ≈ ΔX/X and log(X/Y) = log(X) − log(Y), then

Δ log( Pi/(1 − Pi) ) ≈ ( 1/Pi + 1/(1 − Pi) )·ΔP

• Then, if ΔX = 1 we have ΔPi ≈ β[Pi(1 − Pi)].
• Note that Pi does change with Xi, but in a nonlinear fashion.
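As a practical sketch (not part of the original notes), the commands below fit a probit and a logit to a hypothetical binary outcome voteyes with regressor income, and then use the β·P(1 − P) rule just described to translate the logit slope into an approximate probability change at the mean.

. * Hedged sketch: probit and logit for a binary choice.
. probit voteyes income
. logit voteyes income
. predict p, pr                        // fitted probabilities from the logit
. summarize p
. scalar pbar = r(mean)
. display "Approx. change in P for a unit change in income: " _b[income]*pbar*(1-pbar)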

16.2 Panel Data

• When we have data similar to that used in an SUR model, that combines time-series and cross sections:

Yit = α + βXit + ²it for i = 1 . . . N, t = 1 . . . T

• Two basic approaches: a fixed effects and a random effects model.

16.2.1 Fixed Effects Model

• In this approach we try to wash the data of time and group specific effects. The resultant parameter estimates are then thought to be constant across all groups and time.



• Write the model as

Yit = α + βXit + Σ_{i=2}^{N} γi·Wit + Σ_{j=2}^{T} δj·Zij + εit

• Where Wit = 1 for the ith individual and zero otherwise.
• Where Zij = 1 for the jth time period and zero otherwise.
• In essence, we have a set of dummy variables which shift the intercept terms for each group and time period.
• Note that there are a lot of these intercepts, (N + T − 2), and they eat up a lot of degrees of freedom.
• A test of the fixed effects model versus the pooled OLS model would be an F-test of

F[N+T−2, NT−N−T] = [ (SSE1 − SSE2)/(N + T − 2) ] / [ SSE2/(NT − N − T) ]

where SSE1 comes from the pooled OLS regression and SSE2 comes from the fixed effects OLS.
• In other words, this is a joint test that all the time and group specific effects are equal to zero at the same time.
• Having both time and group specific intercepts eats up a lot of degrees of freedom.
• Often we just include the group specific effects.
• With lots of groups, it is sometimes of interest to explain these group specific effects with other explanatory variables.



• This does introduce the problem of generated regressors, e.g.,

γi = f (Zi ; θ) + ui

16.2.2 Random Effects Model

• One of the possible problems of the fixed effects model is that there is an inherent lack of knowledge about what is driving the different fixed effects. Perhaps it is better to let the error structure reflect these changes, especially if they are unmeasurable random variables. • In the Random Effects Model we pool the data and let the error structures be correlated over time and groups. The model is written as

Yit = α + βXit + εit
εit = ui + vt + wit
ui ∼ N(0, σu²),   vt ∼ N(0, σv²),   wit ∼ N(0, σw²)

• We assume that the different components of the error structure are independent of each other so that var(εit) = σu² + σv² + σw².
• The intuition behind the random effects model is that the time and group specific effects are actually random errors. These random errors are fleshed out of the error structure.


• If σu² and σv² both equal zero, then pooled OLS is appropriate.
• Estimation is a two-stage approach, in effect weighting each observation by the inverse of its variance.
• First stage: Pooled OLS; obtain fitted values.
• Second stage: Weight each observation by the inverse of its OLS variance and reestimate via GLS.
• Which one to use, R.E. or F.E.? Typically we use a variant of the Hausman (Econometrica, 1978) specification test. Most stat packages will spit this test out.
• We can calculate the test as

H = (β̂RE − β̂FE)'[ Var(β̂FE) − Var(β̂RE) ]^{-1}(β̂RE − β̂FE)

where high values of H favor the fixed effects model.
• We can also compare the random effects model to the pooled classical OLS model with the following Breusch-Pagan Lagrange multiplier statistic:

LM = [ NT / (2(T − 1)) ] · [ Σi (Σt ε̂it)² / (Σi Σt ε̂it²) − 1 ]²

where high values of LM support the random effects model over the classical model.
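In STATA these comparisons are usually run with xtreg and the built-in hausman and xttest0 commands. The sketch below assumes a panel identified by id and year with outcome y and regressor x; all names are hypothetical.

. * Hedged sketch: fixed vs. random effects with Hausman and Breusch-Pagan tests.
. xtset id year
. xtreg y x, fe            // fixed effects; reports the F test that all group effects are zero
. estimates store fe
. xtreg y x, re            // random effects (GLS)
. estimates store re
. hausman fe re            // large chi2 favors the fixed effects model
. xttest0                  // Breusch-Pagan LM test of random effects vs. pooled OLS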



17 Short Review

• Basic Statistics
  – Mean, Variance, Covariance.
  – Desirable properties of our estimators.
• Simple Regression Model
  – Basic intuition, β̂, E[β̂], cov(β̂)
  – Hypothesis testing, R² and adjusted R²
• Matrix Algebra Review
• Classical Model
  – Full Ideal Conditions, β̂, E[β̂], cov(β̂)
  – Gauss-Markov Theorem
• Problems in Regression
  – Omitted variables, multicollinearity
• Functional Forms
  – Types of regressions, interpretations of parameters
  – How to test for functional form
• Variable Types
  – Dummy variables, how to use them, interpretation
  – Interaction terms, time trends
• Hypothesis Testing
  – t-test, F-test, joint and single parameter tests
  – Linear restrictions, restricted least squares, Wald test, Lagrange Multiplier test, Log likelihood Ratio test


• Generalized Least Squares
  – Intuition behind GLS and the reasons it is necessary
  – Derivation of β̃, etc.
• Heteroscedasticity
  – Definition, how to control for it, how to test for it
• Autocorrelation
  – Definition, how to control for it, how to test for it
• Stochastic Regressors
  – What they do to OLS, how to remedy the problem (instrumental variables)
• Seemingly Unrelated Regressions
  – Examples, intuition behind the approach, tests for appropriateness
• Simultaneous Equations
  – Intuition, why the simultaneous-equations approach is necessary vis-a-vis OLS
  – Reduced versus structural models
  – Indirect Least Squares, appropriateness
  – Identification vs. simultaneity
  – Solving for reduced-form parameters, interpretation
  – Estimation: ILS, Limited vs. Full information approaches


