Econometrics
Badi H. Baltagi
Econometrics
Fourth Edition
123
Professor Badi H. Baltagi Syracuse University Center for Policy Research 426 Eggers Hall Syracuse, NY 132441020 USA bbaltagi@maxwell.syr.edu
ISBN 9783540765158
eISBN 9783540765165
DOI 10.1007/9783540765165 Library of Congress Control Number: 2007939803 c 2008 SpringerVerlag Berlin Heidelberg This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Production: LETEX Jelonek, Schmidt & Vรถckler GbR, Leipzig Coverdesign: WMX Design GmbH, Heidelberg Printed on acidfree paper 987654321 springer.com
To My Wife Phyllis
Preface This book is intended for a ﬁrst year graduate course in econometrics. Courses requiring matrix algebra as a prerequisite to econometrics can start with Chapter 7. Chapter 2 has a quick refresher on some of the required background needed from statistics for the proper understanding of the material in this book. For an advanced undergraduate/masters class not requiring matrix algebra, one can structure a course based on Chapter 1; Section 2.6 on descriptive statistics; Chapters 36; Section 11.1 on simultaneous equations; and Chapter 14 on timeseries analysis. This book teaches some of the basic econometric methods and the underlying assumptions behind them. Estimation, hypotheses testing and prediction are three recurrent themes in this book. Some uses of econometric methods include (i) empirical testing of economic theory, whether it is the permanent income consumption theory or purchasing power parity, (ii) forecasting, whether it is GNP or unemployment in the U.S. economy or future sales in the computer industry. (iii) Estimation of price elasticities of demand, or returns to scale in production. More importantly, econometric methods can be used to simulate the effect of policy changes like a tax increase on gasoline consumption, or a ban on advertising on cigarette consumption. It is left to the reader to choose among the available econometric/statistical software to use, like EViews, SAS, STATA, TSP, SHAZAM, Microﬁt, PcGive, LIMDEP, and RATS, to mention a few. The empirical illustrations in the book utilize a variety of these software packages. Of course, these packages have different advantages and disadvantages. However, for the basic coverage in this book, these differences may be minor and more a matter of what software the reader is familiar or comfortable with. In most cases, I encourage my students to use more than one of these packages and to verify these results using simple programming languages like GAUSS, OX, R and MATLAB. This book is not meant to be encyclopedic. I did not attempt the coverage of Bayesian econometrics simply because it is not my comparative advantage. The reader should consult Koop (2003) for a more recent treatment of the subject. Nonparametrics and semiparametrics are popular methods in today’s econometrics, yet they are not covered in this book to keep the technical diﬃculty at a low level. These are a must for a followup course in econometrics, see Li and Racine (2007). Also, for a more rigorous treatment of asymptotic theory, see White (1984). Despite these limitations, the topics covered in this book are basic and necessary in the training of every economist. In fact, it is but a ‘stepping stone’, a ‘sample of the good stuff’ the reader will ﬁnd in this young, energetic and ever evolving ﬁeld. I hope you will share my enthusiasm and optimism in the importance of the tools you will learn when you are through reading this book. Hopefully, it will encourage you to consult the suggested readings on this subject that are referenced at the end of each chapter. In his inaugural lecture at the University of Birmingham, entitled “Econometrics: A View from the Toolroom,” Peter C.B. Phillips (1977) concluded: “the toolroom may lack the glamour of economics as a practical art in government or business, but it is every bit as important. For the tools (econometricians) fashion provide the key to improvements in our quantitative information concerning matters of economic policy.”
VIII
Preface
As a student of econometrics, I have beneﬁted from reading Johnston (1984), Kmenta (1986), Theil (1971), Klein (1974), Maddala (1977), and Judge, et al. (1985), to mention a few. As a teacher of undergraduate econometrics, I have learned from Kelejian and Oates (1989), Wallace and Silver (1988), Maddala (1992), Kennedy (1992), Wooldridge (2003) and Stock and Watson (2003). As a teacher of graduate econometrics courses, Greene (1993), Judge, et al. (1985), Fomby, Hill and Johnson (1984) and Davidson and MacKinnon (1993) have been my regular companions. The inﬂuence of these books will be evident in the pages that follow. At the end of each chapter I direct the reader to some of the classic references as well as further suggested readings. This book strikes a balance between a rigorous approach that proves theorems and a completely empirical approach where no theorems are proved. Some of the strengths of this book lie in presenting some diﬃcult material in a simple, yet rigorous manner. For example, Chapter 12 on pooling timeseries of crosssection data is drawn from the author’s area of expertise in econometrics and the intent here is to make this material more accessible to the general readership of econometrics. The exercises contain theoretical problems that should supplement the understanding of the material in each chapter. Some of these exercises are drawn from the Problems and Solutions series of Econometric Theory (reprinted with permission of Cambridge University Press). In addition, the book has a set of empirical illustrations demonstrating some of the basic results learned in each chapter. Data sets from published articles are provided for the empirical exercises. These exercises are solved using several econometric software packages and are available in the Solution Manual. This book is by no means an applied econometrics text, and the reader should consult Berndt’s (1991) textbook for an excellent treatment of this subject. Instructors and students are encouraged to get other data sets from the internet or journals that provide backup data sets to published articles. The Journal of Applied Econometrics and the Journal of Business and Economic Statistics are two such journals. In fact, the Journal of Applied Econometrics has a replication section for which I am serving as an editor. In my econometrics course, I require my students to replicate an empirical paper. Many students ﬁnd this experience rewarding in terms of giving them hands on application of econometric methods that prepare them for doing their own empirical work. I would like to thank my teachers Lawrence R. Klein, Roberto S. Mariano and Robert Shiller who introduced me to this ﬁeld; James M. Griﬃn who provided some data sets, empirical exercises and helpful comments, and many colleagues who had direct and indirect inﬂuence on the contents of this book including G.S. Maddala, Jan Kmenta, Peter Schmidt, Cheng Hsiao, Tom Wansbeek, Walter Kr¨ amer, Maxwell King, Peter C.B. Phillips, Alberto Holly, Essie Maasoumi, Aris Spanos, Farshid Vahid, Heather Anderson, Arnold Zellner and Bryan Brown. Also, I would like to thank my students WeiWen Xiong, MingJang Weng, Kiseok Nam, Dong Li and Gustavo Sanchez who read parts of this book and solved several of the exercises. Werner M¨ uller and Martina Bihn at Springer for their prompt and professional editorial help. I have also beneﬁted from my visits to the University of Arizona, University of California SanDiego, Monash University, the University of Zurich, the Institute of Advanced Studies in Vienna, and the University of Dortmund, Germany. A special thanks to my wife Phyllis whose help and support were essential to completing this book.
Preface
IX
References Berndt, E.R. (1991), The Practice of Econometrics: Classic and Contemporary (AddisonWesley: Reading, MA). Davidson, R. and J.G. MacKinnon (1993), Estimation and Inference In Econometrics (Oxford University Press: Oxford, MA). Fomby, T.B., R.C. Hill and S.R. Johnson (1984), Advanced Econometric Methods (SpringerVerlag: New York). Greene, W.H. (1993), Econometric Analysis (Macmillan: New York ). Johnston, J. (1984), Econometric Methods , 3rd. Ed., (McGrawHill: New York). Judge, G.G., W.E. Griﬃths, R.C. Hill, H. L¨ utkepohl and T.C. Lee (1985), The Theory and Practice of Econometrics , 2nd Ed., (John Wiley: New York). Kelejian, H. and W. Oates (1989), Introduction to Econometrics: Principles and Applications , 2nd Ed., (Harper and Row: New York). Kennedy, P. (1992), A Guide to Econometrics (The MIT Press: Cambridge, MA). Klein, L.R. (1974), A Textbook of Econometrics (PrenticeHall: New Jersey). Kmenta, J. (1986), Elements of Econometrics , 2nd Ed., (Macmillan: New York). Koop, G. (2003), Bayesian Econometrics, (Wiley: New York). Li, Q. and J.S. Racine (2007), Nonparametric Econometrics, (Princeton University Press: New Jersey). Maddala, G.S. (1977), Econometrics (McGrawHill: New York). Maddala, G.S. (1992), Introduction to Econometrics (Macmillan: New York). Phillips, P.C.B. (1977), “Econometrics: A View From the Toolroom,” Inaugural Lecture, University of Birmingham, Birmingham, England. Stock, J.H. and M.W. Watson (2003), Introduction to Econometrics , (AddisonWesley: New York). Theil, H. (1971), Principles of Econometrics (John Wiley: New York). Wallace, T.D. and L. Silver (1988), Econometrics: An Introduction (AddisonWesley: New York). White, H. (1984), Asymptotic Theory for Econometrics (Academic Press: Florida). Wooldridge, J.M. (2003), Introductory Econometrics , (SouthWestern: Ohio).
Data The data sets used in this text can be downloaded from the Springer website in Germany. The address is: http://www.springer.com/9783540765158. Please select the link “Samples & Supplements” from the righthand column.
Table of Contents Preface
VII
Table of Contents
XI
Part I
1
1 What Is Econometrics? 1.1 Introduction . . . . . . . 1.2 A Brief History . . . . . 1.3 Critiques of Econometrics 1.4 Looking Ahead . . . . . . Notes . . . . . . . . . . . . . . References . . . . . . . . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
3 . 3 . 5 . 7 . 8 . 9 . 10
2 Basic Statistical Concepts 2.1 Introduction . . . . . . . 2.2 Methods of Estimation . 2.3 Properties of Estimators 2.4 Hypothesis Testing . . . 2.5 ConďŹ dence Intervals . . . 2.6 Descriptive Statistics . . Notes . . . . . . . . . . . . . . Problems . . . . . . . . . . . . References . . . . . . . . . . . . Appendix . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
13 13 13 16 21 30 31 36 36 42 42
. . . . . . . . Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
49 49 50 55 56 57 58 60 60 63 64 67 71 72
3 Simple Linear Regression 3.1 Introduction . . . . . . . . . . . . . . . . . 3.2 Least Squares Estimation and the Classical 3.3 Statistical Properties of Least Squares . . . 3.4 Estimation of Ďƒ 2 . . . . . . . . . . . . . . . 3.5 Maximum Likelihood Estimation . . . . . . 3.6 A Measure of Fit . . . . . . . . . . . . . . 3.7 Prediction . . . . . . . . . . . . . . . . . . 3.8 Residual Analysis . . . . . . . . . . . . . . 3.9 Numerical Example . . . . . . . . . . . . . 3.10 Empirical Example . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . Appendix . . . . . . . . . . . . . . . . . . . . . .
4 Multiple Regression Analysis 73 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
XII
Table of Contents
4.2 Least Squares Estimation . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Residual Interpretation of Multiple Regression Estimates . . . . . . 4.4 Overspeciﬁcation and Underspeciﬁcation of the Regression Equation 4.5 RSquared versus RBarSquared . . . . . . . . . . . . . . . . . . . 4.6 Testing Linear Restrictions . . . . . . . . . . . . . . . . . . . . . . . 4.7 Dummy Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
73 75 76 78 78 81 85 85 91 92
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
95 95 95 96 98 98 109 119 120 126
6 Distributed Lags and Dynamic Models 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Inﬁnite Distributed Lag . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.1 Adaptive Expectations Model (AEM) . . . . . . . . . . . . . . . 6.2.2 Partial Adjustment Model (PAM) . . . . . . . . . . . . . . . . . 6.3 Estimation and Testing of Dynamic Models with Serial Correlation . . 6.3.1 A Lagged Dependent Variable Model with AR(1) Disturbances 6.3.2 A Lagged Dependent Variable Model with MA(1) Disturbances 6.4 Autoregressive Distributed Lag . . . . . . . . . . . . . . . . . . . . . . . Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
129 129 135 136 137 137 138 140 141 142 142 144
5 Violations of the Classical Assumptions 5.1 Introduction . . . . . . . . . . . . . . . 5.2 The Zero Mean Assumption . . . . . . 5.3 Stochastic Explanatory Variables . . . 5.4 Normality of the Disturbances . . . . . 5.5 Heteroskedasticity . . . . . . . . . . . . 5.6 Autocorrelation . . . . . . . . . . . . . Notes . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
Part II 7 The 7.1 7.2 7.3 7.4 7.5 7.6 7.7
147 General Linear Model: The Basics Introduction . . . . . . . . . . . . . . . . . . . . . . Least Squares Estimation . . . . . . . . . . . . . . . Partitioned Regression and the FrischWaughLovell Maximum Likelihood Estimation . . . . . . . . . . . Prediction . . . . . . . . . . . . . . . . . . . . . . . Conﬁdence Intervals and Test of Hypotheses . . . . Joint Conﬁdence Intervals and Test of Hypotheses .
. . . . . . . . . . . . Theorem . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
149 149 149 152 154 157 158 158
Table of Contents
7.8 Restricted MLE and Restricted Least Squares . 7.9 Likelihood Ratio, Wald and Lagrange Multiplier Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . Appendix . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
159 160 165 165 170 171
8 Regression Diagnostics and Specification Tests 8.1 Inﬂuential Observations . . . . . . . . . . . . . . . . . . . . 8.2 Recursive Residuals . . . . . . . . . . . . . . . . . . . . . . 8.3 Speciﬁcation Tests . . . . . . . . . . . . . . . . . . . . . . . 8.4 Nonlinear Least Squares and the GaussNewton Regression 8.5 Testing Linear versus LogLinear Functional Form . . . . . Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
177 177 185 194 204 212 214 214 218
9 Generalized Least Squares 9.1 Introduction . . . . . . . . . . . . . . . . 9.2 Generalized Least Squares . . . . . . . . 9.3 Special Forms of Ω . . . . . . . . . . . . 9.4 Maximum Likelihood Estimation . . . . . 9.5 Test of Hypotheses . . . . . . . . . . . . 9.6 Prediction . . . . . . . . . . . . . . . . . 9.7 Unknown Ω . . . . . . . . . . . . . . . . 9.8 The W, LR and LM Statistics Revisited 9.9 Spatial Error Correlation . . . . . . . . . Note . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
221 221 221 223 224 224 225 225 226 228 229 230 234
10 Seemingly Unrelated Regressions 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 Feasible GLS Estimation . . . . . . . . . . . . . . . . . . . . 10.3 Testing Diagonality of the VarianceCovariance Matrix . . . 10.4 Seemingly Unrelated Regressions with Unequal Observations 10.5 Empirical Example . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
237 237 239 242 242 244 245 249
11 Simultaneous Equations Model 11.1 Introduction . . . . . . . . . . . . . . . 11.1.1 Simultaneous Bias . . . . . . . . 11.1.2 The Identiﬁcation Problem . . . 11.2 Single Equation Estimation: TwoStage 11.2.1 Spatial Lag Dependence . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
253 253 253 256 259 266
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . Tests . . . . . . . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . Least Squares . . . . . . . .
. . . . . . . . . . . .
. . . . .
. . . . . .
XIII
. . . . . . . . . . . .
. . . . .
. . . . . . . . . . . .
. . . . .
. . . . .
XIV
Table of Contents
11.3 System Estimation: ThreeStage Least Squares 11.4 Test for OverIdentiﬁcation Restrictions . . . . 11.5 Hausman’s Speciﬁcation Test . . . . . . . . . . 11.6 Empirical Example: Crime in North Carolina . Notes . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . Appendix . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
267 269 271 273 277 277 287 289
12 Pooling TimeSeries of CrossSection Data 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2 The Error Components Model . . . . . . . . . . . . . . . . . 12.2.1 The Fixed Effects Model . . . . . . . . . . . . . . . . 12.2.2 The Random Effects Model . . . . . . . . . . . . . . 12.2.3 Maximum Likelihood Estimation . . . . . . . . . . . 12.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.4 Empirical Example . . . . . . . . . . . . . . . . . . . . . . . 12.5 Testing in a Pooled Model . . . . . . . . . . . . . . . . . . . 12.6 Dynamic Panel Data Models . . . . . . . . . . . . . . . . . . 12.6.1 Empirical Illustration . . . . . . . . . . . . . . . . . . 12.7 Program Evaluation and DifferenceinDifferences Estimator 12.7.1 The DifferenceinDifferences Estimator . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
295 295 295 296 298 302 303 303 307 311 314 316 317 317 320
. . . . . . . . . . . . . . . . . . .
323 . 323 . 323 . 324 . 326 . 331 . 332 . 334 . 334 . 335 . 339 . 339 . 340 . 341 . 344 . 345 . 347 . 347 . 351 . 353
13 Limited Dependent Variables 13.1 Introduction . . . . . . . . . . . . . . . . 13.2 The Linear Probability Model . . . . . . 13.3 Functional Form: Logit and Probit . . . . 13.4 Grouped Data . . . . . . . . . . . . . . . 13.5 Individual Data: Probit and Logit . . . . 13.6 The Binary Response Model Regression . 13.7 Asymptotic Variances for Predictions and 13.8 Goodness of Fit Measures . . . . . . . . . 13.9 Empirical Examples . . . . . . . . . . . . 13.10 Multinomial Choice Models . . . . . . . . 13.10.1 Ordered Response Models . . . . 13.10.2 Unordered Response Models . . . 13.11 The Censored Regression Model . . . . . 13.12 The Truncated Regression Model . . . . 13.13 Sample Selectivity . . . . . . . . . . . . . Notes . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . Appendix . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marginal Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
Table of Contents
14 TimeSeries Analysis 14.1 Introduction . . . . . . . . . . . . . . . . . . . 14.2 Stationarity . . . . . . . . . . . . . . . . . . . 14.3 The Box and Jenkins Method . . . . . . . . . 14.4 Vector Autoregression . . . . . . . . . . . . . . 14.5 Unit Roots . . . . . . . . . . . . . . . . . . . . 14.6 Trend Stationary versus Difference Stationary 14.7 Cointegration . . . . . . . . . . . . . . . . . . 14.8 Autoregressive Conditional Heteroskedasticity Note . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
XV
. . . . . . . . . . .
. . . . . . . . . . .
355 355 355 356 360 361 365 366 368 371 371 375
Appendix
379
List of Figures
385
List of Tables
387
Index
389
Part I
CHAPTER 1
What Is Econometrics? 1.1
Introduction
What is econometrics? A few deﬁnitions are given below: The method of econometric research aims, essentially, at a conjunction of economic theory and actual measurements, using the theory and technique of statistical inference as a bridge pier. Trygve Haavelmo (1944)
Econometrics may be deﬁned as the quantitative analysis of actual economic phenomena based on the concurrent development of theory and observation, related by appropriate methods of inference. Samuelson, Koopmans and Stone (1954)
Econometrics is concerned with the systematic study of economic phenomena using observed data. Aris Spanos (1986)
Broadly speaking, econometrics aims to give empirical content to economic relations for testing economic theories, forecasting, decision making, and for ex post decision/policy evaluation. J. Geweke, J. Horowitz, and M.H. Pesaran (2007)
For other deﬁnitions of econometrics, see Tintner (1953). An econometrician has to be a competent mathematician and statistician who is an economist by training. Fundamental knowledge of mathematics, statistics and economic theory are a necessary prerequisite for this ﬁeld. As Ragnar Frisch (1933) explains in the ﬁrst issue of Econometrica, it is the uniﬁcation of statistics, economic theory and mathematics that constitutes econometrics. Each view point, by itself is necessary but not suﬃcient for a real understanding of quantitative relations in modern economic life. Ragnar Frisch is credited with coining the term ‘econometrics’ and he is one of the founders of the Econometrics Society, see Christ (1983). Econometrics aims at giving empirical content to economic relationships. The three key ingredients are economic theory, economic data, and statistical methods. Neither ‘theory without measurement’, nor ‘measurement without theory’ are suﬃcient for explaining economic phenomena. It is as Frisch emphasized their union that is the key for success in the future development of econometrics. Lawrence R. Klein, the 1980 recipient of the Nobel Prize in economics “for the creation of econometric models and their application to the analysis of economic ﬂuctuations and economic policies,”1 has always emphasized the integration of economic theory, statistical methods and practical economics. The exciting thing about econometrics is its concern for verifying or refuting economic laws, such as purchasing power parity, the life cycle hypothesis, the quantity theory of money, etc. These economic laws or hypotheses are testable with economic data. In fact, David F. Hendry (1980) emphasized this function of econometrics:
4
Chapter 1: What Is Econometrics?
The three golden rules of econometrics are test, test and test; that all three rules are broken regularly in empirical applications is fortunately easily remedied. Rigorously tested models, which adequately described the available data, encompassed previous ﬁndings and were derived from well based theories would enhance any claim to be scientiﬁc. Econometrics also provides quantitative estimates of price and income elasticities of demand, estimates of returns to scale in production, technical eﬃciency, the velocity of money, etc. It also provides predictions about future interest rates, unemployment, or GNP growth. Lawrence Klein (1971) emphasized this last function of econometrics: Econometrics had its origin in the recognition of empirical regularities and the systematic attempt to generalize these regularities into “laws” of economics. In a broad sense, the use of such “laws” is to make predictions   about what might have or what will come to pass. Econometrics should give a base for economic prediction beyond experience if it is to be useful. In this broad sense it may be called the science of economic prediction. Econometrics, while based on scientiﬁc principles, still retains a certain element of art. According to Malinvaud (1966), the art in econometrics is trying to ﬁnd the right set of assumptions which are suﬃciently speciﬁc, yet realistic to enable us to take the best possible advantage of the available data. Data in economics are not generated under ideal experimental conditions as in a physics laboratory. This data cannot be replicated and is most likely measured with error. In some cases, the available data are proxies for variables that are either not observed or cannot be measured. Many published empirical studies ﬁnd that economic data may not have enough variation to discriminate between two competing economic theories. Manski (1995, p. 8) argues that Social scientists and policymakers alike seem driven to draw sharp conclusions, even when these can be generated only by imposing much stronger assumptions than can be defended. We need to develop a greater tolerance for ambiguity. We must face up to the fact that we cannot answer all of the questions that we ask. To some, the “art” element in econometrics has left a number of distinguished economists doubtful of the power of econometrics to yield sharp predictions. In his presidential address to the American Economic Association, Wassily Leontief (1971, pp. 23) characterized econometrics work as: an attempt to compensate for the glaring weakness of the data base available to us by the widest possible use of more and more sophisticated techniques. Alongside the mounting pile of elaborate theoretical models we see a fast growing stock of equally intricate statistical tools. These are intended to stretch to the limit the meager supply of facts. Most of the time the data collected are not ideal for the economic question at hand because they were posed to answer legal requirements or comply to regulatory agencies. Griliches (1986, p.1466) describes the situation as follows:
1.2
A Brief History
5
Econometricians have an ambivilant attitude towards economic data. At one level, the ‘data’ are the world that we want to explain, the basic facts that economists purport to elucidate. At the other level, they are the source of all our trouble. Their imperfections make our job difficult and often impossible... We tend to forget that these imperfections are what gives us our legitimacy in the ﬁrst place... Given that it is the ‘badness’ of the data that provides us with our living, perhaps it is not all that surprising that we have shown little interest in improving it, in getting involved in the grubby task of designing and collecting original data sets of our own. Most of our work is on ‘found’ data, data that have been collected by somebody else, often for quite diﬀerent purposes. Even though economists are increasingly getting involved in collecting their data and measuring variables more accurately and despite the increase in data sets and data storage and computational accuracy, some of the warnings given by Griliches (1986, p. 1468) are still valid today: The encounters between econometricians and data are frustrating and ultimately unsatisfactory both because econometricians want too much from the data and hence tend to be dissappointed by the answers, and because the data are incomplete and imperfect. In part it is our fault, the appetite grows with eating. As we get larger samples, we keep adding variables and expanding our models, until on the margin, we come back to the same insigniﬁcance levels.
1.2
A Brief History
For a brief review of the origins of econometrics before World War II and its development in the 19401970 period, see Klein (1971). Klein gives an interesting account of the pioneering works of Moore (1914) on economic cycles, Working (1927) on demand curves, Cobb and Douglas (1928) on the theory of production, Schultz (1938) on the theory and measurement of demand, and Tinbergen (1939) on business cycles. As Klein (1971, p. 415) adds: The works of these men mark the beginnings of formal econometrics. Their analysis was systematic, based on the joint foundations of statistical and economic theory, and they were aiming at meaningful substantive goals  to measure demand elasticity, marginal productivity and the degree of macroeconomic stability. The story of the early progress in estimating economic relationships in the U.S. is given in Christ (1985). The modern era of econometrics, as we know it today, started in the 1940’s. Klein (1971) attributes the formulation of the econometrics problem in terms of the theory of statistical inference to Haavelmo (1943, 1944) and Mann and Wald (1943). This work was extended later by T.C. Koopmans, J. Marschak, L. Hurwicz, T.W. Anderson and others at the Cowles Commission in the late 1940’s and early 1950’s, see Koopmans (1950). Klein (1971, p. 416) adds: At this time econometrics and mathematical economics had to ﬁght for academic recognition. In retrospect, it is evident that they were growing disciplines and becoming increasingly attractive to the new generation of economic students after World War II, but only a few of the largest and most advanced universities oﬀered formal work in these subjects. The mathematization of economics was strongly resisted.
6
Chapter 1: What Is Econometrics?
This resistance is a thing of the past, with econometrics being an integral part of economics, taught and practiced worldwide. Econometrica, the oﬃcial journal of the Econometric Society is one of the leading journals in economics, and today the Econometric Society boast a large membership worldwide. Today, it is hard to read any professional article in leading economics and econometrics journals without seeing mathematical equations. Students of economics and econometrics have to be proﬁcient in mathematics to comprehend this research. In an Econometric Theory interview, professor J. D. Sargan of the London School of Economics looks back at his own career in econometrics and makes the following observations: “... econometric theorists have really got to be much more professional statistical theorists than they had to be when I started out in econometrics in 1948... Of course this means that the starting econometrician hoping to do a Ph.D. in this ﬁeld is also ﬁnding it more diﬃcult to digest the literature as a prerequisite for his own study, and perhaps we need to attract students of an increasing degree of mathematical and statistical sophistication into our ﬁeld as time goes by,” see Phillips (1985, pp. 134135). This is also echoed by another giant in the ﬁeld, professor T.W. Anderson of Stanford, who said in an Econometric Theory interview: “These days econometricians are very highly trained in mathematics and statistics; much more so than statisticians are trained in economics; and I think that there will be more crossfertilization, more joint activity,” see Phillips (1986, p. 280). Research at the Cowles Commission was responsible for providing formal solutions to the problems of identiﬁcation and estimation of the simultaneous equations model, see Christ (1985).2 Two important monographs summarizing much of the work of the Cowles Commission at Chicago, are Koopmans and Marschak (1950) and Koopmans and Hood (1953).3 The creation of large data banks of economic statistics, advances in computing, and the general acceptance of Keynesian theory, were responsible for a great ﬂurry of activity in econometrics. Macroeconometric modelling started to ﬂourish beyond the pioneering macro models of Klein (1950) and Klein and Goldberger (1955). For the story of the founding of Econometrica and the Econometric Society, see Christ (1983). Suggested readings on the history of econometrics are Pesaran (1987), Epstein (1987) and Morgan (1990). In the conclusion of her book on The History of Econometric Ideas, Morgan (1990; p. 264) explains: In the ﬁrst half of the twentieth century, econometricians found themselves carrying out a wide range of tasks: from the precise mathematical formulation of economic theories to the development tasks needed to build an econometric model; from the application of statistical methods in data preperation to the measurement and testing of models. Of necessity, econometricians were deeply involved in the creative development of both mathematical economic theory and statistical theory and techniques. Between the 1920s and the 1940s, the tools of mathematics and statistics were indeed used in a productive and complementary union to forge the essential ideas of the econometric approach. But the changing nature of the econometric enterprise in the 1940s caused a return to the division of labour favoured in the late nineteenth century, with mathematical economists working on theory building and econometricians concerned with statistical work. By the 1950s the founding ideal of econometrics, the union of mathematical and statistical economics into a truly synthetic economics, had collapsed.
1.3
Critiques of Econometrics
7
In modern day usage, econometrics have become the application of statistical methods to economics, like biometrics and psychometrics. Although, the ideals of Frisch still live on in Econometrica and the Econometric Society, Maddala (1999) argues that: “In recent years the issues of Econometrica have had only a couple of papers in econometrics (statistical methods in economics) and the rest are all on game theory and mathematical economics. If you look at the list of fellows of the Econometric Society, you ﬁnd one or two econometricians and the rest are game theorists and mathematical economists.” This may be a little exagerated but it does summarize the rift between modern day econometrics and mathematical economics. For a recent world wide ranking of econometricians as well as academic institutions in the ﬁeld of econometrics, see Baltagi (2007).
1.3
Critiques of Econometrics
Econometrics has its critics. Interestingly, John Maynard Keynes (1940, p. 156) had the following to say about Jan Tinbergen’s (1939) pioneering work: No one could be more frank, more painstaking, more free of subjective bias or parti pris than Professor Tinbergen. There is no one, therefore, so far as human qualities go, whom it would be safer to trust with black magic. That there is anyone I would trust with it at the present stage or that this brand of statistical alchemy is ripe to become a branch of science, I am not yet persuaded. But Newton, Boyle and Locke all played with alchemy. So let him continue.4 In 1969, Jan Tinbergen shared the ﬁrst Nobel Prize in economics with Ragnar Frisch. Recent well cited critiques of econometrics include the Lucas (1976) critique which is based on the Rational Expectations Hypothesis (REH). As Pesaran (1990, p. 17) puts it: The message of the REH for econometrics was clear. By postulating that economic agents form their expectations endogenously on the basis of the true model of the economy and a correct understanding of the processes generating exogenous variables of the model, including government policy, the REH raised serious doubts about the invariance of the structural parameters of the mainstream macroeconometric models in face of changes in government policy. Responses to this critique include Pesaran (1987). Other lively debates among econometricians include Ed Leamer’s (1983) article entitled “Let’s Take the Con Out of Econometrics,” and the response by McAleer, Pagan and Volker (1985). Rather than leave the reader with criticisms of econometrics especially before we embark on the journey to learn the tools of the trade, we conclude this section with the following quote from Pesaran (1990, pp. 2526): There is no doubt that econometrics is subject to important limitations, which stem largely from the incompleteness of the economic theory and the nonexperimental nature of economic data. But these limitations should not distract us from recognizing the fundamental role that econometrics has come to play in the development of economics as a scientiﬁc discipline. It may not be possible conclusively to reject economic theories by means of econometric methods, but it does not mean that
8
Chapter 1: What Is Econometrics?
nothing useful can be learned from attempts at testing particular formulations of a given theory against (possible) rival alternatives. Similarly, the fact that econometric modelling is inevitably subject to the problem of speciﬁcation searches does not mean that the whole activity is pointless. Econometric models are important tools for forecasting and policy analysis, and it is unlikely that they will be discarded in the future. The challenge is to recognize their limitations and to work towards turning them into more reliable and eﬀective tools. There seem to be no viable alternatives.
1.4
Looking Ahead
Econometrics have experienced phenomenal growth in the past 50 years. There are ﬁve volumes of the Handbook of Econometrics running to 3833 pages. Most of it dealing with post 1960’s research. A lot of the recent growth reﬂects the rapid advances in computing technology. The broad availability of micro data bases is a major advance which facilitated the growth of panel data methods (see Chapter 12) and microeconometric methods especially on sample selection and discrete choice (see Chapter 13) and that also lead to the award of the Nobel Prize in Economics to James Heckman and Daniel McFadden in 2000. The explosion in research in time series econometrics which lead to the development of ARCH and GARCH and cointegration (see Chapter 14) which also lead to the award of the Nobel Prize in Economics to Clive Granger and Robert Engle in 2003. It is a different world than it was 30 years ago. The computing facilities changed dramatically. The increasing accessibility of cheap and powerful computing facilities are helping to make the latest econometric methods more readily available to applied researchers. Today, there is hardly a ﬁeld in economics which has not been intensive in its use of econometrics in empirical work. Pagan (1987, p. 81) observed that the work of econometric theorists over the period 19661986 have become part of the process of economic investigation and the training of economists. Based on this criterion, he declares econometrics as an “outstanding success.” He adds that: The judging of achievement inevitably involves contrast and comparison. Over a period of twenty years this would be best done by interviewing a timetravelling economist displaced from 1966 to 1986. I came into econometrics just after the beginning of this period, so have some appreciation for what has occurred. But because I have seen the events gradually unfolding, the eﬀects upon me are not as dramatic. Nevertheless, let me try to be a timetraveller and comment on the perceptions of a 1966’er landing in 1986. My ﬁrst impression must be of the large number of people who have enough econometric and computer skills to formulate, estimate and simulate highly complex and nonlinear models. Someone who could do the equivalent tasks in 1966 was well on the way to a Chair. My next impression would be of the widespread use and purchase of econometric services in the academic, government, and private sectors. Quantiﬁcation is now the norm rather than the exception. A third impression, gleaned from a sounding of the job market, would be a persistent tendency towards an excess demand for welltrained econometricians. The economist in me would have to acknowledge that the market judges the products of the discipline as a success.
Notes
9
The challenge for the 21st century is to narrow the gap between theory and practice. Many feel that this gap has been widening with theoretical research growing more and more abstract and highly mathematical without an application in sight or a motivation for practical use. Heckman (2001) argues that econometrics is useful only if it helps economists conduct and interpret empirical research on economic data. He warns that the gap between econometric theory and empirical practice has grown over the past two decades. Theoretical econometrics becoming more closely tied to mathematical statistics. Although he ﬁnds nothing wrong, and much potential value, in using methods and ideas from other ﬁelds to improve empirical work in economics, he does warn of the risks involved in uncritically adopting the methods and mind set of the statisticians: Econometric methods uncritically adapted from statistics are not useful in many research activities pursued by economists. A theoremproof format is poorly suited for analyzing economic data, which requires skills of synthesis, interpretation and empirical investigation. Command of statistical methods is only a part, and sometimes a very small part, of what is required to do ﬁrst class empirical research. In an Econometric Theory interview with Jan Tinbergen, Magnus and Morgan (1987, p.117) describe Tinbergen as one of the founding fathers of econometrics, publishing in the ﬁeld from 1927 until the early 1950s. They add: “Tinbergen’s approach to economics has always been a practical one. This was highly appropriate for the new ﬁeld of econometrics, and enabled him to make important contributions to conceptual and theoretical issues, but always in the context of a relevant economic problem.” The founding fathers of econometrics have always had the practitioner in sight. This is a far cry from many theoretical econometricians who refrain from applied work. The recent entry by Geweke, Horowitz, and Pesaran (2007) in the The New Palgrave Dictionary provides the following recommendations for the future: Econometric theory and practice seek to provide information required for informed decisionmaking in public and private economic policy. This process is limited not only by the adequacy of econometrics, but also by the development of economic theory and the adequacy of data and other information. Eﬀective progress, in the future as in the past, will come from simultaneous improvements in econometrics, economic theory, and data. Research that speciﬁcally addresses the eﬀectiveness of the interface between any two of these three in improving policy — to say nothing of all of them — necessarily transcends traditional subdisciplinary boundaries within economics. But it is precisely these combinations that hold the greatest promise for the social contribution of academic economics.
Notes 1. See the interview of Professor L.R. Klein by Mariano (1987). Econometric Theory publishes interviews with some of the giants in the ﬁeld. These interviews offer a wonderful glimpse at the life and work of these giants. 2. Simultaneous equations model is an integral part of econometrics and is studied in Chapter 11.
10
Chapter 1: What Is Econometrics? 3. Tjalling Koopmans was the joint recipient of the Nobel Prize in Economics in 1975. In addition to his work on the identiﬁcation and estimation of simultaneous equations models, he received the Nobel Prize for his work in optimization and economic theory. 4. I encountered this attack by Keynes on Tinbergen in the inaugural lecture that Peter C.B. Phillips (1977) gave at the University of Birmingham entitled “Econometrics: A View From the Toolroom,” and David F. Hendry’s (1980) article entitled “Econometrics  Alchemy or Science?”
References Baltagi, B.H. (2007), “Worldwide Econometrics Rankings: 19892005,” Econometric Theory, 23: 9521012. Christ, C.F. (1983), “The Founding of the Econometric Society and Econometrica,” Econometrica, 51: 36. Christ, C.F. (1985), “Early Progress in Estimating Quantiative Economic Relations in America,” American Economic Review, 12: 3952. Cobb, C.W. and P.H. Douglas (1928), “A Theory of Production,” American Economic Review, Supplement 18: 139165. Epstein, R.J. (1987), A History of Econometrics (NorthHolland: Amsterdam). Frisch, R. (1933), “Editorial,” Econometrica, 1: 114. Geweke, J., J. Horowitz, and M. H.Pesaran (2007), “Econometrics: A Bird’s Eye View,” forthcoming in The New Palgrave Dictionary, Second Edition. Griliches, Z. (1986), “Economic Data Issues,” in Z. Griliches and M.D. Intriligator (eds), Handbook of Econometrics Vol. III (North Holland: Amsterdam). Haavelmo, T. (1943), “The Statistical Implications of a System of Simultaneous Equations,” Econometrica, 11: 112. Haavelmo, T. (1944), “The Probability Approach in Econometrics,” Econometrica, Supplement to Volume 12: 1118. Heckman, J.J. (2001), “Econometrics and Empirical Economics,” Journal of Econometrics, 100: 35. Hendry, D.F. (1980), “Econometrics  Alchemy or Science?” Economica, 47: 387406. Keynes, J.M. (1940), “On Method of Statistical Research: Comment,” Economic Journal, 50: 154156. Klein, L.R. (1971), “Whither Econometrics?” Journal of the American Statistical Association, 66: 415421. Klein, L.R. (1950), Economic Fluctuations in the United States 19211941, Cowles Commission Monograph, No. 11 (John Wiley: New York). Klein, L.R. and A.S. Goldberger (1955), An Econometric Model of the United States 19291952 (NorthHolland: Amsterdam). Koopmans, T.C. (1950), ed., Statistical Inference in Dynamic Economic Models (John Wiley: New York). Koopmans, T.C. and W.C. Hood (1953), Studies in Econometric Method (John Wiley: New York). Koopmans, T.C. and J. Marschak (1950), eds., Statistical Inference in Dynamic Economic Models (John Wiley: New York).
References
11
Leamer, E.E. (1983), “Lets Take the Con Out of Econometrics,” American Economic Review, 73: 3143. Leontief, W. (1971), “Theoretical Assumptions and Nonobserved Facts,” American Economic Review, 61: 17. Lucas, R.E. (1976), “Econometric Policy Evaluation: A Critique,” in K. Brunner and A.M. Meltzer, eds., The Phillips Curve and Labor Markets, Carnegie Rochester Conferences on Public Policy, 1: 1946. Maddala, G.S. (1999), “Econometrics in the 21st Century,” in C.R. Rao and R. Szekeley, eds., Statistics for the 21 st Century (Marcel Dekker: New York). Magnus , J.R. and M.S. Morgan (1987), “The ET Interview: Professor J. Tinbergen,” Econometric Theory, 3: 117142. Malinvaud, E. (1966), Statistical Methods of Econometrics (NorthHolland: Amsterdam). Manski, C.F. (1995), Identiﬁcation Problems in the Social Sciences (Harvard University Press: Cambridge). Mann, H.B. and A. Wald (1943), “On the Statistical Treatment of Linear Stochastic Difference Equations,” Econometrica, 11: 173220. Mariano, R.S. (1987), “The ET Interview: Professor L.R. Klein,” Econometric Theory, 3: 409460. McAleer, M., A.R. Pagan and P.A. Volker (1985), “What Will Take The Con Out of Econometrics,” American Economic Review, 75: 293307. Moore, H.L. (1914), Economic Cycles: Their Law and Cause (Macmillan: New York). Morgan, M. (1990), The History of Econometric Ideas (Cambridge University Press: Cambridge, MA). Pagan, A. (1987), “Twenty Years After: Econometrics, 19661986,” paper presented at CORE’s 20th Anniversary Conference, LouvainlaNeuve. Pesaran, M.H. (1987), The Limits to Rational Expectations (Basil Blackwell: Oxford, MA). Pesaran, M.H. (1990), “Econometrics,” in J. Eatwell, M. Milgate and P. Newman; The New Palgrave: Econometrics (W.W. Norton and Company: New York). Phillips, P.C.B. (1977), “Econometrics: A View From the Toolroom,” Inaugural Lecture, University of Birmingham, Birmingham, England. Phillips, P.C.B. (1985), “ET Interviews: Professor J. D. Sargan,” Econometric Theory, 1: 119139. Phillips, P.C.B. (1986), “The ET Interview: Professor T. W. Anderson,” Econometric Theory, 2: 249288. Samuelson, P.A., T.C. Koopmans and J.R.N. Stone (1954), “Report of the Evaluative Committee for Econometrica,” Econometrica, 22: 141146. Schultz, H. (1938), The Theory and Measurement of Demand (University of Chicago Press: Chicago, IL). Spanos, A. (1986), Statistical Foundations of Econometric Modelling (Cambridge University Press: Cambridge, MA). Tinbergen, J. (1939), Statistical Testing of Business Cycle Theories, Vol. II: Business Cycles in the USA, 19191932 (League of Nations: Geneva). Tintner, G. (1953), “The Deﬁnition of Econometrics,” Econometrica, 21: 3140. Working, E.J. (1927), “What Do Statistical ‘Demand Curves’ Show?” Quarterly Journal of Economics, 41: 212235.
CHAPTER 2
Basic Statistical Concepts 2.1
Introduction
One chapter cannot possibly review what one learned in one or two prerequisite courses in statistics. This is an econometrics book, and it is imperative that the student have taken at least one solid course in statistics. The concepts of a random variable, whether discrete or continuous, and the associated probability function or probability density function (p.d.f.) are assumed known. Similarly, the reader should know the following statistical terms: Cumulative distribution function, marginal, conditional and joint p.d.f.’s. The reader should be comfortable with computing mathematical expectations, and familiar with the concepts of independence, Bayes Theorem and several continuous and discrete probability distributions. These distributions include: the Bernoulli, Binomial, Poisson, Geometric, Uniform, Normal, Gamma, Chisquared (χ2 ), Exponential, Beta, t and F distributions. Section 2.2 reviews two methods of estimation, while section 2.3 reviews the properties of the resulting estimators. Section 2.4 gives a brief review of test of hypotheses, while section 2.5 discusses the meaning of conﬁdence intervals. These sections are fundamental background for this book, and the reader should make sure that he or she is familiar with these concepts. Also, be sure to solve the exercises at the end of this chapter.
2.2
Methods of Estimation
Consider a Normal distribution with mean μ and variance σ 2 . This is the important “Gaussian” distribution which is symmetric and bellshaped and completely determined by its measure of centrality, its mean μ and its measure of dispersion, its variance σ 2 . μ and σ 2 are called the population parameters. Draw a random sample X1 , . . . , Xn independent and identically ¯ and σ 2 by distributed (IID) from this population. We usually estimate μ by μ =X s2 =
n
i=1 (Xi
¯ 2 /(n − 1). − X)
¯ = sample average of incomes of For example, μ = mean income of a household in Houston. X 100 households randomly interviewed in Houston. This estimator of μ could have been obtained by either of the following two methods of estimation:
(i) Method of Moments Simply stated, this method of estimation uses the following rule: Keep equating population moments to their sample counterpart until you have estimated all the population parameters.
14
Chapter 2: Basic Statistical Concepts
Sample
Population
n ¯ Xi /n = X i=1 n 2 i=1 Xi /n .. . n r i=1 Xi /n
E(X) = μ E(X 2 ) = μ2 + σ 2 .. . E(X r )
The normal density is completely identiﬁed by μ and σ 2 , hence only the ﬁrst 2 equations are needed ¯ μ =X and μ 2 + σ 2 = ni=1 Xi2 /n Substituting the ﬁrst equation in the second one obtains ¯ 2 = n (Xi − X) ¯ 2 /n σ 2 = ni=1 Xi2 /n − X i=1
(ii) Maximum Likelihood Estimation (MLE)
For a random sample of size n from the Normal distribution Xi ∼ N (μ, σ 2 ), we have √ − ∞ < Xi < +∞ fi (Xi ; μ, σ 2 ) = (1/σ 2π) exp −(Xi − μ)2 /2σ 2
Since X1 , . . . , Xn are independent and identically distributed, the joint probability density function is given as the product of the marginal probability density functions: n fi (Xi ; μ, σ 2 ) = (1/2πσ 2 )n/2 exp − ni=1 (Xi − μ)2 /2σ 2 (2.1) f (X1 , . . . , Xn ; μ, σ 2 ) = i=1
Usually, we observe only one sample of n households which could have been generated by any pair of (μ, σ 2 ) with −∞ < μ < +∞ and σ 2 > 0. For each pair, say (μ0 , σ 20 ), f (X1 , . . . , Xn ; μ0 , σ 20 ) denotes the probability (or likelihood) of obtaining that sample. By varying (μ, σ 2 ) we get different probabilities of obtaining this sample. Intuitively, we choose the values of μ and σ 2 that maximize the probability of obtaining this sample. Mathematically, we treat f (X1 , . . . , Xn ; μ, σ 2 ) as L(μ, σ 2 ) and we call it the likelihood function. Maximizing L(μ, σ 2 ) with respect to μ and σ 2 , one gets the ﬁrstorder conditions of maximization: (∂L/∂μ) = 0
(∂L/∂σ 2 ) = 0
and
Equivalently, we can maximize logL(μ, σ 2 ) rather than L(μ, σ 2 ) and still get the same answer. Usually, the latter monotonic transformation of the likelihood is easier to maximize and the ﬁrstorder conditions become (∂logL/∂μ) = 0
and
(∂logL/∂σ 2 ) = 0
For the Normal distribution example, we get logL(μ; σ 2 ) = −(n/2)log σ 2 − (n/2)log 2π − (1/2σ 2 ) ∂logL(μ; σ 2 )/∂μ = (1/σ 2 )
n
i=1 (Xi
n
i=1 (Xi
i=1 (Xi
¯ − μ) = 0 ⇒ μ M LE = X
∂logL(μ; σ 2 )/∂σ 2 = −(n/2)(1/σ 2 ) + ⇒σ 2M LE =
n
−μ M LE )2 /n =
n
i=1 (Xi
n
i=1 (Xi
− μ)2 /2σ 4 = 0 ¯ 2 /n − X)
− μ)2
2.2
Methods of Estimation
15
Note that the moments estimators and the maximum likelihood estimators are the same for the Normal distribution example. In general, the two methods need not necessarily give the same estimators. Also, note that the moments estimators will always have the same estimating equations, for example, the ﬁrst two equations are always E(X) = μ ≡
n
i=1 Xi /n
¯ =X
and
E(X 2 ) = μ2 + σ 2 ≡
n
2 i=1 Xi /n.
For a speciﬁc distribution, we need only substitute the relationship between the population moments and the parameters of that distribution. Again, the number of equations needed depends upon the number of parameters of the underlying distribution. For e.g., the exponential distribution has one parameter and needs only one equation whereas the gamma distribution has two parameters and needs two equations. Finally, note that the maximum likelihood technique is heavily reliant on the form of the underlying distribution, but it has desirable properties when it exists. These properties will be discussed in the next section. So far we have dealt with the Normal distribution to illustrate the two methods of estimation. We now apply these methods to the Bernoulli distribution and leave other distributions applications to the exercises. We urge the student to practice on these exercises. Bernoulli Example: In various cases in real life the outcome of an event is binary, a worker may join the labor force or may not. A criminal may return to crime after parole or may not. A television off the assembly line may be defective or not. A coin tossed comes up head or tail, and so on. In this case θ = Pr[Head] and 1 − θ = Pr[Tail] with 0 < θ < 1 and this can be represented by the discrete probability function f (X; θ) = θX (1 − θ)1−X =0
X = 0, 1 elsewhere
The Normal distribution is a continuous distribution since it takes values for all X over the real line. The Bernoulli distribution is discrete, because it is deﬁned only at integer values for X. Note that P [X = 1] = f (1; θ) = θ and P [X = 0] = f (0; θ) = 1 − θ for all values of 0 < θ < 1. A random sample of size n drawn from this distribution will have a joint probability function n n L(θ) = f (X1 , . . . , Xn ; θ) = θ i=1 Xi (1 − θ)n− i=1 Xi with Xi = 0, 1 for i = 1, . . . , n. Therefore, logL(θ) = ( ni=1 Xi )logθ + (n − ni=1 Xi )log(1 − θ) n (n − ni=1 Xi ) ∂logL(θ) i=1 Xi = − ∂θ θ (1 − θ)
Solving this ﬁrstorder condition for θ, one gets ( ni=1 Xi )(1 − θ) − θ(n − ni=1 Xi ) = 0
which reduces to
¯ θM LE = ni=1 Xi /n = X.
This is the frequency of heads in n tosses of a coin.
16
Chapter 2: Basic Statistical Concepts
For the method of moments, we need E(X) =
1
X=0 Xf (X, θ)
= 1.f (1, θ) + 0.f (0, θ) = f (1, θ) = θ
¯ to get ¯ Once again, the MLE and the method of moments yield and this is equated to X θ = X. the same estimator. Note that only one parameter θ characterizes this Bernoulli distribution and one does not need to equate second or higher population moments to their sample values.
2.3
Properties of Estimators
(i) Unbiasedness μ is said to be unbiased for μ if and only if E( μ) = μ ¯ we have E(X) ¯ = n E(Xi )/n = μ and X ¯ is unbiased for μ. No distributional For μ = X, i=1 assumption is needed as long as the Xi ’s are distributed with the same mean μ. Unbiasedness means that “on the average” our estimator is on target. Let us explain this last statement. If we repeat our drawing of a random sample of 100 households, say 200 times, then we get 200 ¯ Some of these X ¯ ’s will be above μ some below μ, but their average should be very close X’s. to μ. Since in real life situations, we observe only one random sample, there is little consolation ¯ is far from μ. But the larger n is the smaller is the dispersion of this X, ¯ since if our observed X 2 ¯ ¯ var(X) = σ /n and the lesser is the likelihood of this X to be very far from μ. This leads us to the concept of eﬃciency. (ii) Efficiency For two unbiased estimators, we compare their eﬃciencies by the ratio of their variances. We say ¯ 2 = X, that the one with lower variance is more eﬃcient. For example, taking μ 1 = X1 versus μ both estimators are unbiased but var( μ1 ) = σ 2 whereas, var( μ2 ) = σ 2 /n and {the relative eﬃciency of μ 1 with respect to μ 2 } = var( μ2 )/var( μ1 ) = 1/n, see Figure 2.1. To compare all unbiased estimators, we ﬁnd the one with minimum variance. Such an estimator if it exists is called the MVU (minimum variance unbiased estimator). A lower bound for the variance of any unbiased estimator μ of μ, is known in the statistical literature as the Cram´erRao lower bound, and is given by var( μ) ≥ 1/n{E(∂logf (X; μ))/∂μ}2 = −1/{nE(∂ 2 logf (X; μ))/∂μ2 }
(2.2)
where we use either representation of the bound on the right hand side of (2.2) depending on which one is the simplest to derive. Example 1: Consider the normal density logf (Xi ; μ) = (−1/2)logσ 2 − (1/2)log2π − (1/2)(Xi − μ)2 /σ 2 ∂logf (Xi ; μ)/∂μ = (Xi − μ)/σ 2 ∂ 2 logf (Xi ; μ)/∂μ2 = −(1/σ 2 )
2.3
Properties of Estimators
17
f(x)
X ~ ( m , s 2 / n)
X i ~ ( m, s 2 )
m
x
Figure 2.1 Eﬃciency Comparisons
with E{∂ 2 logf (Xi ; μ)/∂μ2 } = −(1/σ 2 ). Therefore, the variance of any unbiased estimator of μ, say μ satisﬁes the property that var( μ) ≥ σ 2 /n. 2 2 Turning to σ ; let θ = σ , then logf (Xi ; θ) = −(1/2)logθ − (1/2)log2π − (1/2)(Xi − μ)2 /θ
∂logf (Xi ; θ)/∂θ = −1/2θ + (Xi − μ)2 /2θ 2 = {(Xi − μ)2 − θ}/2θ 2 ∂ 2 logf (Xi ; θ)/∂θ 2 = 1/2θ 2 − (Xi − μ)2 /θ 3 = {θ − 2(Xi − μ)2 }/2θ 3 E[∂ 2 logf (Xi ; θ)/∂θ 2 ] = −(1/2θ 2 ), since E(Xi − μ)2 = θ. Hence, for any unbiased estimator of θ, say θ, its variance satisﬁes the following property var( θ) ≥ 2θ2 /n, or var( σ 2 ) ≥ 2σ 4 /n. Note that, if one ﬁnds an unbiased estimator whose variance attains the Cram´erRao lower bound, then this is the MVU estimator. It is important to remember that this is only a lower ¯ ∼ N (μ, σ 2 /n). bound and sometimes it is not necessarily attained. If the Xi ’s are normal, X 2 ¯ Hence, X is unbiased for μ with variance σ /n equal to the Cram´erRao lower bound. Therefore, ¯ is MVU for μ. On the other hand, X ¯ 2 /n, σ 2M LE = ni=1 (Xi − X) and it can be shown that (n σ 2M LE )/(n − 1) = s2 is unbiased for σ 2 . In fact, (n − 1)s2 /σ 2 ∼ χ2n−1 and the expected value of a Chisquared variable with (n − 1) degrees of freedom is exactly its degrees of freedom. Using this fact, E{(n − 1)s2 /σ 2 } = E(χ2n−1 ) = n − 1. Therefore, E(s2 ) = σ 2 .1 Also, the variance of a Chisquared variable with (n − 1) degrees of freedom is twice these degrees of freedom. Using this fact, var{(n − 1)s2 /σ 2 } = var(χ2n−1 ) = 2(n − 1)
18
Chapter 2: Basic Statistical Concepts
or {(n − 1)2 /σ 4 }var(s2 ) = 2(n − 1). Hence, the var(s2 ) = 2σ 4 /(n − 1) and this does not attain the Cram´erRao lower bound. In fact, σ 2M LE ) = {(n − 1)2 /n2 }var(s2 ) = {2(n − 1)}σ 4 /n2 . it is larger than (2σ 4 /n). Note also that var( 4 This is smaller than (2σ /n)! How can that be? Remember that σ 2M LE is a biased estimator 2 2 of σ and hence, var( σ M LE ) should not be compared with the Cram´erRao lower bound. This lower bound pertains only to unbiased estimators. Warning: Attaining the Cram´erRao lower bound is only a suﬃcient condition for eﬃciency. Failing to satisfy this condition does not necessarily imply that the estimator is not eﬃcient. Example 2: For the Bernoulli case logf (Xi ; θ) = Xi logθ + (1 − Xi )log(1 − θ) ∂logf (Xi , θ)/∂θ = (Xi /θ) − (1 − Xi )/(1 − θ) ∂ 2 logf (Xi ; θ)/∂θ 2 = (−Xi /θ 2 ) − (1 − Xi )/(1 − θ)2 and E[∂ 2 logf (Xi ; θ)/∂θ 2 ] = (−1/θ) − 1/(1 − θ) = −1/[θ(1 − θ)]. Therefore, for any unbiased estimator of θ, say θ, its variance satisﬁes the following property: var( θ) ≥ θ(1 − θ)/n.
For the Bernoulli random sample, we proved that μ = E(Xi ) = θ. Similarly, it can be easily ¯ has mean μ = θ and var(X) ¯ = σ 2 /n = θ(1−θ)/n. veriﬁed that σ 2 = var(Xi ) = θ(1−θ). Hence, X ¯ is unbiased for θ and it attains the Cram´erRao lower bound. Therefore, X ¯ This means that X is MVU for θ. Unbiasedness and eﬃciency are ﬁnite sample properties (in other words, true for any ﬁnite sample size n). Once we let n tend to ∞ then we are in the realm of asymptotic properties.
Example 3: For a random sample from any distribution with mean μ it is clear that μ = ¯ + 1/n) is not an unbiased estimator of μ since E( ¯ + 1/n) = μ + 1/n. However, as (X μ) = E(X n → ∞ the lim E( μ) is equal to μ. We say, that μ is asymptotically unbiased for μ. Example 4: For the Normal case σ 2M LE = (n − 1)s2 /n
and E( σ 2M LE ) = (n − 1)σ 2 /n.
2M LE is asymptotically unbiased for σ 2 . But as n → ∞, lim E( σ 2M LE ) = σ 2 . Hence, σ Similarly, an estimator which attains the Cram´erRao lower bound in the limit is asymp¯ = σ 2 /n, and this tends to zero as n → ∞. Hence, we totically efficient. Note that var(X) √ ¯ √ ¯ ¯ = σ 2 . We say that the = n var(X) consider nX which has ﬁnite variance since var( nX) 2 ¯ ¯ asymptotic variance of X denoted by asymp.var(X) = σ /n and that it attains the Cram´er¯ is therefore asymptotically eﬃcient. Similarly, Rao lower bound in the limit. X √ 2 var( n σ M LE ) = n var( σ 2M LE ) = 2(n − 1)σ 4 /n which tends to 2σ 4 as n → ∞. This means that asymp.var( σ 2M LE ) = 2σ 4 /n and that it attains 2 the Cram´erRao lower bound in the limit. Therefore, σ M LE is asymptotically eﬃcient.
2.3
Properties of Estimators
19
(iii) Consistency ¯ − μ > c] = 0 Another asymptotic property is consistency. This says that as n → ∞ lim Pr[X ¯ for any arbitrary positive constant c. In other words, X will not differ from μ as n → ∞. Proving this property uses the Chebyshev’s inequality which states in this context that ¯ − μ > kσ ¯ ] ≤ 1/k2 . Pr[X X If we let c = kσ X¯ then 1/k2 = σ 2X¯ /c2 = σ 2 /nc2 and this tends to 0 as n → ∞, since σ 2 and c are ﬁnite positive constants. A suﬃcient condition for an estimator to be consistent is that it is asymptotically unbiased and that its variance tends to zero as n → ∞.2 ¯ =μ Example 1: For a random sample from any distribution with mean μ and variance σ 2 , E(X) ¯ = σ 2 /n → 0 as n → ∞, hence X ¯ is consistent for μ. and var(X)
Example 2: For the Normal case, we have shown that E(s2 ) = σ 2 and var(s2 ) = 2(n−1)σ 4 /n2 → 0 as n → ∞, hence s2 is consistent for σ 2 . ¯ = θ and var(X) ¯ = θ(1 − θ)/n → 0 as Example 3: For the Bernoulli case, we know that E(X) ¯ is consistent for θ. n → ∞, hence X Warning: This is only a suﬃcient condition for consistency. Failing to satisfy this condition does not necessarily imply that the estimator is inconsistent. (iv) Sufficiency ¯ is suﬃcient for μ, if X ¯ contains all the information in the sample pertaining to μ. In other X ¯ is independent of μ. To prove this fact one uses the factorization words, f (X1 , . . . , Xn /X) ¯ is suﬃcient for μ, if and only if one can theorem due to Fisher and Neyman. In this context, X factorize the joint p.d.f. ¯ μ) · g(X1 , . . . , Xn ) f (X1 , . . . , Xn ; μ) = h(X; where h and g are any two functions with the latter being only a function of the X’s and independent of μ in form and in the domain of the X’s. Example 1: For the Normal case, it is clear from equation (2.1) that by subtracting and adding ¯ in the summation we can write after some algebra X n 2 2 2 ¯ 2 ¯ f (X1 , .., Xn ; μ, σ 2 ) = (1/2πσ 2 )n/2 e−{(1/2σ ) i=1 (Xi −X) } e−{(n/2σ )(X−μ) } 2
¯
2
¯ μ) = e−(n/2σ )(X−μ ) and g(X1 , . . . , Xn ) is the remainder term which is independent Hence, h(X; ¯ is of μ in form. Also −∞ < Xi < ∞ and hence independent of μ in the domain. Therefore, X suﬃcient for μ. Example 2: For the Bernoulli case, ¯
¯
f (X1 , . . . , Xn ; θ) = θnX (1 − θ)n(1−X)
Xi = 0, 1
for i = 1, . . . , n.
¯ ¯ θ) = θ nX¯ (1 − θ)n(1−X) Therefore, h(X, and g(X1 , . . . , Xn ) = 1 which is independent of θ in form ¯ is suﬃcient for θ. and domain. Hence, X
20
Chapter 2: Basic Statistical Concepts
Under certain regularity conditions on the distributions we are sampling from, one can show that the MVU of any parameter θ is an unbiased function of a suﬃcient statistic for θ.3 Advantages of the maximum likelihood estimators is that (i) they are suﬃcient estimators when they exist. (ii) They are asymptotically eﬃcient. (iii) If the distribution of the MLE satisﬁes certain regularity conditions, then making the MLE unbiased results in a unique MVU estimator. A prime example of this is s2 which was shown to be an unbiased estimator of σ 2 for a random sample drawn from the Normal distribution. It can be shown that s2 is suﬃcient for σ 2 and that (n − 1)s2 /σ 2 ∼ χ2n−1 . Hence, s2 is an unbiased suﬃcient statistic for σ 2 and therefore it is MVU for σ 2 , even though it does not attain the Cram´erRao lower bound. (iv) Maximum likelihood estimates are invariant with respect to continuous transformations. To explain the last property, ¯ an obvious estimator is eμ M LE = eX¯ . This is in consider the estimator of eμ . Given μ M LE = X, μ fact the MLE of e . In general, if g(μ) is a continuous function of μ, then g( μM LE ) is the MLE of g(μ). Note that E(eμ M LE ) = eE(μM LE ) = eμ , in other words, expectations are not invariant to all continuous transformations, especially nonlinear ones and hence the resulting MLE estimator ¯ ¯ is unbiased for μ. may not be unbiased. eX is not unbiased for eμ even though X In summary, there are two routes for ﬁnding the MVU estimator. One is systematically following the derivation of a suﬃcient statistic, proving that its distribution satisﬁes certain regularity conditions, and then making it unbiased for the parameter in question. Of course, MLE provides us with suﬃcient statistics, for example, ¯ M LE = X X1 , . . . , Xn ∼ IIN(μ, σ 2 ) ⇒ μ
and σ 2M LE =
n
i=1 (Xi
¯ 2 /n − X)
¯ is unbiased for μ and X ¯ ∼ N (μ, σ 2 /n). The are both suﬃcient for μ and σ 2 , respectively. X ¯ to be MVU for μ. σ Normal distribution satisﬁes the regularity conditions needed for X 2M LE is 2 2 2 2 2 2 2 biased for σ , but s = n σ M LE /(n − 1) is unbiased for σ and (n − 1)s /σ ∼ χn−1 which also satisﬁes the regularity conditions for s2 to be a MVU estimator for σ 2 . Alternatively, one ﬁnds the Cram´erRao lower bound and checks whether the usual estimator (obtained from say the method of moments or the maximum likelihood method) achieves this lower bound. If it does, this estimator is eﬃcient, and there is no need to search further. If it does not, the former strategy leads us to the MVU estimator. In fact, in the previous example ¯ attains the Cram´erRao lower bound, whereas s2 does not. However, both are MVU for μ X and σ 2 respectively.
(v) Comparing Biased and Unbiased Estimators Suppose we are given two estimators θ1 and θ2 of θ where the ﬁrst is unbiased and has a large variance and the second is biased but with a small variance. The question is which one of these two estimators is preferable? θ1 is unbiased whereas θ2 is biased. This means that if we repeat the sampling procedure many times then we expect θ1 to be on the average correct, whereas θ2 would be on the average different from θ. However, in real life, we observe only one sample. With a large variance for θ 1 , there is a great likelihood that the sample drawn could result in a θ1 far away from θ. However, with a small variance for θ2 , there is a better chance of getting a θ2 close to θ. If our loss function is L( θ, θ) = ( θ − θ)2 then our risk is R( θ, θ) = E[L( θ, θ)] = E( θ − θ)2 = M SE( θ) 2 = E[θ − E(θ) + E(θ) − θ] = var(θ) + (Bias( θ))2 .
2.4
Hypothesis Testing
21
Minimizing the risk when the loss function is quadratic is equivalent to minimizing the Mean Square Error (MSE). From its deﬁnition the MSE shows the tradeoff between bias and variance. MVU theory, sets the bias equal to zero and minimizes var( θ). In other words, it minimizes the above risk function but only over θ’s that are unbiased. If we do not restrict ourselves to unbiased estimators of θ, minimizing MSE may result in a biased estimator such as θ2 which beats θ1 because the gain from its smaller variance outweighs the loss from its small bias, see Figure 2.2. f (Gˆ )
f (Gˆ 2 )
Bias
f (Gˆ1 )
Gˆ G E (Gˆ 2 ) Figure 2.2 Bias versus Variance
2.4
Hypothesis Testing
The best way to proceed is with an example. Example 1: The Economics Departments instituted a new program to teach microprinciples. We would like to test the null hypothesis that 80% of economics undergraduate students will pass the microprinciples course versus the alternative hypothesis that only 50% will pass. We draw a random sample of size 20 from the large undergraduate microprinciples class and as a simple rule we accept the null if x, the number of passing students is larger or equal to 13, otherwise the alternative hypothesis will be accepted. Note that the distribution we are drawing from is Bernoulli with the probability of success θ, and we have chosen only two states of the world H0 ; θ0 = 0.80 and H1 ; θ 1 = 0.5. This situation is known as testing a simple hypothesis versus another simple hypothesis because the distribution is completely speciﬁed under the null or alternative hypothesis. One would expect (E(x) = nθ0 ) 16 students under H0 and (nθ1 ) 10 students under H1 to pass the microprinciples exams. It seems then logical to take x ≥ 13 as the cutoff point distinguishing H0 form H1 . No theoretical justiﬁcation is given at this stage to this arbitrary choice except to say that it is the midpoint of [10, 16]. Figure 2.3 shows that one can make two types of errors. The ﬁrst is rejecting H0 when in fact it is true, this is known as type I error and the probability of committing this error is denoted by α. The second is accepting H1 when it is false. This is known as type II error and the corresponding probability is denoted by β. For this example
22
Chapter 2: Basic Statistical Concepts
α = Pr[rejecting H0 /H0 is true] = Pr[x < 13/θ = 0.8] = b(n = 20; x = 0; θ = 0.8) + .. + b(n = 20; x = 12; θ = 0.8) = b(n = 20; x = 20; θ = 0.2) + .. + b(n = 20; x = 8; θ = 0.2) = 0 + .. + 0 + 0.0001 + 0.0005 + 0.0020 + 0.0074 + 0.0222 = 0.0322 where we have used the fact that b(n; x; θ) = b(n; n − x; 1 − θ) and b(n; x; θ) = denotes the binomial distribution for x = 0, 1, . . . , n, see problem 4.
n x n−x x θ (1 − θ)
True World
Decision
θ 0 = 0.80
θ1 = 0.50
θ0
No error
Type II error
θ1
Type I error
No Error
Figure 2.3 Type I and II Error
β = Pr[accepting H0 /H0 is false] = Pr[x ≥ 13/θ = 0.5]
= b(n = 20; x = 13; θ = 0.5) + .. + b(n = 20; x = 20; θ = 0.5) = 0.0739 + 0.0370 + 0.0148 + 0.0046 + 0.0011 + 0.0002 + 0 + 0 = 0.1316
The rejection region for H0 , x < 13, is known as the critical region of the test and α = Pr[Falling in the critical region/H0 is true] is also known as the size of the critical region. A good test is one which minimizes both types of errors α and β. For the above example, α is low but β is high with more than a 13% chance of happening. This β can be reduced by changing the critical region from x < 13 to x < 14, so that H0 is accepted only if x ≥ 14. In this case, one can easily verify that α = Pr[x < 14/θ = 0.8] = b(n = 20; x = 0; θ = 0.8) + .. + b(n = 20, x = 13, θ = 0.8) = 0.0322 + b(n = 20; x = 13; θ = 0.8) = 0.0322 + 0.0545 = 0.0867 and β = Pr[x ≥ 14/θ = 0.5] = b(n = 20; x = 14; θ = 0.5) + .. + b(n = 20; x = 20; θ = 0.5) = 0.1316 − b(n = 20; x = 13; θ = 0.5) = 0.0577
By becoming more conservative on accepting H0 and more liberal on accepting H1 , one reduces β from 0.1316 to 0.0577 but the price paid is the increase in α from 0.0322 to 0.0867. The only way to reduce both α and β is by increasing n. For a ﬁxed n, there is a tradeoff between α and β as we change the critical region. To understand this clearly, consider the real life situation of trial by jury for which the defendant can be innocent or guilty. The decision of incarceration or release implies two types of errors. One can make α = Pr[incarcerating/innocence] = 0 and β = its maximum, by releasing every defendant. Or one can make β = Pr[release/guilty] = 0 and α = its maximum, by incarcerating every defendant. These are extreme cases but hopefully they demonstrate the tradeoff between α and β.
2.4
Hypothesis Testing
23
The NeymanPearson Theory The classical theory of hypothesis testing, known as the NeymanPearson theory, ﬁxes α = Pr(type I error) ≤ a constant and minimizes β or maximizes (1 − β). The latter is known as the Power of the test under the alternative. The NeymanPearson Lemma: If C is a critical region of size α and k is a constant such that (L0 /L1 ) ≤ k inside C and (L0 /L1 ) ≥ k outside C then C is a most powerful critical region of size α for testing H0 ; θ = θ0 , against H1 ; θ = θ1 . Note that the likelihood has to be completely speciﬁed under the null and alternative. Hence, this lemma applies only to testing a simple versus another simple hypothesis. The proof of this lemma is given in Freund (1992). Intuitively, L0 is the likelihood function under the null H0 and L1 is the corresponding likelihood function under H1 . Therefore, (L0 /L1 ) should be small for points inside the critical region C and large for points outside the critical region C. The proof of the theorem shows that any other critical region, say D, of size α cannot have a smaller probability of type II error than C. Therefore, C is the best or most powerful critical region of size α. Its power (1 − β) is maximum at H1 . Let us demonstrate this lemma with an example. Example 2: Given a random sample of size n from N (μ, σ 2 = 4), use the NeymanPearson lemma to ﬁnd the most powerful critical region of size α = 0.05 for testing H0 ; μ0 = 2 against the alternative H1 ; μ1 = 4. Note that this is a simple versus simple hypothesis as required by the lemma, since σ 2 = 4 is known and μ is speciﬁed by H0 and H1 . The likelihood function for the N (μ, 4) density is given by √ L(μ) = f (x1 , . . . , xn ; μ, 4) = (1/2 2π)n exp − ni=1 (xi − μ)2 /8
so that
and
√ L0 = L(μ0 ) = (1/2 2π)n exp − ni=1 (xi − 2)2 /8 √ L1 = L(μ1 ) = (1/2 2π)n exp − ni=1 (xi − 4)2 /8
Therefore
n n n 2 2 L0 /L1 = exp − i=1 (xi − 2) − i=1 (xi − 4) /8 = exp {− i=1 xi /2 + 3n/2}
and the critical region is deﬁned by exp {− ni=1 xi /2 + 3n/2} ≤ k
inside C
Taking logarithms of both sides, subtracting (3/2)n and dividing by (−1/2)n one gets x ¯≥K
inside C
24
Chapter 2: Basic Statistical Concepts
In practice, one need not keep track of K as long as one keeps track of the direction of the inequality. K can be determined by making the size of C = α = 0.05. In this case √ α = Pr[¯ x ≥ K/μ = 2] = Pr[z ≥ (K − 2)/(2/ n)]
√ where z = (¯ x − 2)/(2/ n) is distributed N (0, 1) under H0 . From the N (0, 1) tables, we have K −2 √ = 1.645 (2/ n) Hence, √ K = 2 + 1.645(2/ n) √ and x ¯ ≥ 2 + 1.645(2/ n) deﬁnes the most powerful critical region of size α = 0.05 for testing H0 ; μ0 = 2 versus H1 ; μ1 = 4. Note that, in this case √ β = Pr[¯ x < 2 + 1.645(2/ n)/μ = 4] √ √ √ = Pr[z < [−2 + 1.645(2/ n)]/(2/ n)] = Pr[z < 1.645 − n] For n = 4; β = Pr[z < −0.355] = 0.3613 shown by the shaded region in Figure 2.4. For n = 9; β = Pr[z < −1.355] = 0.0877, and for n = 16; β = Pr[z < −2.355] = 0.00925.
a = 0.05
b = 0.3613 m0 = 2
3.645
m1 = 4
x
Figure 2.4 Critical Region for Testing μ0 = 2 against μ1 = 4 for n = 4
This gives us an idea of how, for a ﬁxed α = 0.05, the minimum β decreases with larger sample size n. As n increases from 4 to 9 to 16, the var(¯ x) = σ 2 /n decreases and the two distributions shown in Figure 2.4 shrink in dispersion still centered around μ0 = 2 and μ1 = 4, respectively. This allows better decision making (based on larger sample size) as reﬂected by the critical region shrinking from x ≥ 3.65 for n = 4 to x ≥ 2.8225 for n = 16, and the power (1 − β) rising from 0.6387 to 0.9908, respectively, for a ﬁxed α ≤ 0.05. The power function is the probability of rejecting H0 . It is equal to α under H0 and 1 − β under H1 . The ideal power function is zero at H0 and one at H1 . The NeymanPearson lemma allows us to ﬁx α, say at 0.05, and ﬁnd the test with the best power at H1 . In example 2, both the null and alternative hypotheses are simple. In real life, one is more likely to be faced with testing H0 ; μ = 2 versus H1 ; μ = 2. Under the alternative hypothesis, the distribution is not completely speciﬁed, since the mean μ is not known, and this is referred to as a composite hypothesis. In this case, one cannot compute the probability of type II error
2.4
Hypothesis Testing
25
since the distribution is not known under the alternative. Also, the NeymanPearson lemma cannot be applied. However, a simple generalization allows us to compute a Likelihood Ratio test which has satisfactory properties but is no longer uniformly most powerful of size α. In this case, one replaces L1 , which is not known since H1 is a composite hypothesis, by the maximum value of the likelihood, i.e., λ=
maxL0 maxL
Since max L0 is the maximum value of the likelihood under the null while maxL is the maximum value of the likelihood over the whole parameter space, it follows that maxL0 ≤ maxL and λ ≤ 1. Hence, if H0 is true, λ is close to 1, otherwise it is smaller than 1. Therefore, λ ≤ k deﬁnes the critical region for the Likelihood Ratio test, and k is determined such that the size of this test is α. Example 3: For a random sample x1 , . . . , xn drawn from a Normal distribution with mean μ and variance σ 2 = 4, derive the Likelihood Ratio test for H0 ; μ = 2 versus H1 ; μ = 2. In this case √ maxL0 = (1/2 2π)n exp − ni=1 (xi − 2)2 /8 = L0 and
√ ¯)2 /8 = L( μM LE ) maxL = (1/2 2π)n exp − ni=1 (xi − x
where use is made of the fact that μ M LE = x ¯. Therefore, λ = exp − ni=1 (xi − 2)2 + ni=1 (xi − x ¯)2 /8 = exp −n(¯ x − 2)2 /8
Hence, the region for which λ ≤ k, is equivalent after some simple algebra to the following region (¯ x − 2)2 ≥ K
or
¯ x − 2 ≥ K 1/2
where K is determined such that Pr[¯ x − 2 ≥ K 1/2 /μ = 2] = α
√ We know that x ¯ ∼ N (2, 4/n) under H0 . Hence, z = (¯ x − 2)/(2/ n) is N (0, 1) under H0 , and the critical region of size α will be based upon z ≥ zα/2 where zα/2 is given in Figure 2.5 and is the value of a N (0, 1) random variable such that the probability of exceeding it is α/2. For α = 0.05, zα/2 = 1.96, and for α = 0.10, zα/2 = 1.645. This is a twotailed test with rejection of H0 obtained in case z ≤ −zα/2 or z ≥ zα/2 . Note that in this case LR ≡ −2logλ = (¯ x − 2)2 /(4/n) = z 2 which is distributed as χ21 under H0 . This is because it is the square of a N (0, 1) random variable under H0 . This is a ﬁnite sample result holding for any n. In general, other examples may lead to more complicated λ statistics for which it is diﬃcult to ﬁnd the corresponding distributions and hence the corresponding critical values. For these cases, we have an asymptotic result
26
Chapter 2: Basic Statistical Concepts
Reject Ho
Reject Ho
Do not reject Ho
=/2
=/2
– z= / 2
z= / 2
0
Figure 2.5 Critical Values
which states that, for large n, LR = −2logλ will be asymptotically distributed as χ2ν where ν denotes the number of restrictions that are tested by H0 . For example 2, ν = 1 and hence, LR is asymptotically distributed as χ21 . Note that we did not need this result as we found LR is exactly distributed as χ21 for any n. If one is testing H0 ; μ = 2 and σ 2 = 4 against the alternative that H1 ; μ = 2 or σ 2 = 4, then the corresponding LR will be asymptotically distributed as χ22 , see problem 5, part (f). Likelihood Ratio, Wald and Lagrange Multiplier Tests Before we go into the derivations of these three tests we start by giving an intuitive graphical explanation that will hopefully emphasize the differences among these tests. This intuitive explanation is based on the article by Buse (1982). Consider a quadratic loglikelihood function in a parameter of interest, say μ. Figure 2.6 shows this loglikelihood logL(μ), with a maximum at μ . The Likelihood Ratio test, tests the μ) where null hypothesis H0 ; μ = μ0 by looking at the ratio of the likelihoods λ = L(μ0 )/L( log L( m )
log L( mˆ )
A.
log L1 ( m 0 )
B.
log L2 ( m 0 )
C.
log L1 ( m )
log L2 ( m ) m0 Figure 2.6 Wald Test
mˆ
m
2.4
Hypothesis Testing
27
−2logλ, twice the difference in loglikelihood, is distributed asymptotically as χ21 under H0 . This test differentiates between the top of the hill and a preassigned point on the hill by evaluating the height at both points. Therefore, it needs both the restricted and unrestricted maximum of the likelihood. This ratio is dependent on the distance of μ0 from μ and the curvature of . In fact, for a ﬁxed ( μ − μ0 ), the larger C( μ), the loglikelihood, C(μ) = ∂ 2 logL(μ)/∂μ2 , at μ the larger is the difference between the two heights. Also, for a given curvature at μ , the larger ( μ − μ0 ) the larger is the difference between the heights. The Wald test works from the top of the hill, i.e., it needs only the unrestricted maximum likelihood. It tries to establish the distance μ − μ0 ), and the curvature at μ . In fact the Wald to μ0 , by looking at the horizontal distance ( statistic is W = ( μ − μ0 )2 C( μ) and this is asymptotically distributed as χ21 under H0 . The usual form of W has I(μ) = −E[∂ 2 logL(μ)/∂μ2 ] the information matrix evaluated at μ , rather than C( μ), but the latter is a consistent estimator of I(μ). The information matrix will be studied in details in Chapter 7. It will be shown, under fairly general conditions, that μ the MLE of μ − μ0 )2 /var( μ) all evaluated at the unrestricted MLE. μ, has var( μ) = I −1 (μ). Hence W = ( The LagrangeMultiplier test (LM), on the other hand, goes to the preassigned point μ0 , i.e., it only needs the restricted maximum likelihood, and tries to determine how far it is from the top of the hill by considering the slope of the tangent to the likelihood S(μ) = ∂logL(μ)/∂μ at μ0 , and the rate at which this slope is changing, i.e., the curvature at μ0 . As Figure 2.7 shows, for two loglikelihoods with the same S(μ0 ), the one that is closer to the top of the hill is the one with the larger curvature at μ0 . log L( m )
log L1 ( mˆ 1 )
log L1 ( m )
log L2 ( mˆ 2 ) log L( m0 ) log L2 ( m )
m m0
mˆ 2
mˆ 1
Figure 2.7 LM Test
This suggests the following statistic: LM = S 2 (μ0 ){C(μ0 )}−1 where the curvature appears in inverse form. In the Appendix to this chapter, we show that the E[S(μ)] = 0 and var[S(μ)] = I(μ). Hence LM = S 2 (μ0 )I −1 (μ0 ) = S 2 (μ0 )/var[S(μ0 )] all evaluated at the restricted MLE. Another interpretation of the LM test is that it is a measure of failure of the restricted estimator, in this case μ0 , to satisfy the ﬁrstorder conditions of maximization of the unrestricted likelihood. We know that S( μ) = 0. The question is: to what extent does S(μ0 ) differ from zero? S(μ) is known in the statistics literature as the score, and the LM test is also referred to as the score test. For a more formal treatment of these tests, let us reconsider example 3 of a random sample x1 , . . . , xn from a N (μ, 4) where we are interested in testing H0 ; μ0 = 2 versus H1 ; μ = 2. The likelihood function L(μ) as well as LR = −2logλ = n(¯ x − 2)2 /4 were given in example 3. In
28
Chapter 2: Basic Statistical Concepts
fact, the score function is given by n (xi âˆ’ Îź) n(ÂŻ x âˆ’ Îź) âˆ‚logL(Îź) = i=1 = S(Îź) = âˆ‚Îź 4 4 and under H0 n(ÂŻ x âˆ’ 2) 4 âˆ‚ 2 logL(Îź) n n C(Îź) =   =âˆ’ = âˆ‚Îź2 4 4
2 n âˆ‚ logL(Îź) = = C(Îź). and I(Îź) = âˆ’E 2 âˆ‚Îź 4 The Wald statistic is based on S(Îź0 ) = S(2) =
W = ( ÎźM LE âˆ’ 2)2 I( ÎźM LE ) = (ÂŻ x âˆ’ 2)2 Âˇ The LM statistic is based on LM = S 2 (Îź0 )I âˆ’1 (Îź0 ) =
n 4
n2 (ÂŻ x âˆ’ 2)2 4 n(ÂŻ x âˆ’ 2)2 Âˇ = 16 n 4
Therefore, W = LM = LR for this example with known variance Ďƒ 2 = 4. These tests are all based upon the ÂŻ x âˆ’ 2 â‰Ľ k critical region, where k is determined such that the size of the test is Îą. In general, these test statistics are not always equal, as is shown in the next example. Example 4: For a random sample x1 , . . . , xn drawn from a N (Îź, Ďƒ 2 ) with unknown Ďƒ 2 , test the hypothesis H0 ; Îź = 2 versus H1 ; Îź = 2. Problem 5, part (c), asks the reader to verify that
n 2 n2 (ÂŻ x âˆ’ 2)2 n2 (ÂŻ x âˆ’ 2)2 i=1 (xi âˆ’ 2) whereas W = and LM = . LR = nlog n n n 2 ÂŻ)2 ÂŻ)2 i=1 (xi âˆ’ x i=1 (xi âˆ’ x i=1 (xi âˆ’ 2)
One can easily show that LM/n = (W/n)/[1+(W/n)] and LR/n = log[1+(W/n)]. Let y = W/n, then using the inequality y â‰Ľ log(1 + y) â‰Ľ y/(1 + y), one can conclude that W â‰Ľ LR â‰Ľ LM . This inequality was derived by Berndt and Savin (1977), and will be considered again when we study test of hypotheses in the general linear model. Note, however that all three test statistics are based upon ÂŻ x âˆ’ 2 â‰Ľ k and for ďŹ nite n, the same exact critical value could be obtained from the Normally distributed x ÂŻ. This section introduced the W, LR and LM test statistics, all of which have the same asymptotic distribution. In addition, we showed that using the normal distribution, when Ďƒ 2 is known, W = LR = LM for testing H0 ; Îź = 2 versus H1 ; Îź = 2. However, when Ďƒ 2 is unknown, we showed that W â‰Ľ LR â‰Ľ LM for the same hypothesis. Example 5: For a random sample x1 , . . . , xn drawn from a Bernoulli distribution with parameter Î¸, test the hypothesis H0 ; Î¸ = Î¸ 0 versus H1 ; Î¸ = Î¸0 , where Î¸0 is a known positive fraction. This example is based on Engle (1984). Problem 4, part (i), asks the reader to derive LR, W and LM for H0 ; Î¸ = 0.2 versus H1 ; Î¸ = 0.2. The likelihood L(Î¸) and the Score S(Î¸) were derived in section 2.2. One can easily verify that n n âˆ’ ni=1 xi âˆ‚ 2 logL(Î¸) i=1 xi C(Î¸) =  = + (1 âˆ’ Î¸)2 âˆ‚Î¸ 2 Î¸2
2.4
and
Hypothesis Testing
29
n ∂ 2 logL(θ) = I(θ) = −E 2 θ(1 − θ) ∂θ
The Wald statistic is based on W = ( θ M LE − θ0 )2 I( θ M LE ) = (¯ x − θ0 )2 ·
n (¯ x − θ0 )2 = x ¯(1 − x ¯) x ¯(1 − x ¯)/n
using the fact that θM LE = x ¯. The LM statistic is based on LM = S 2 (θ0 )I −1 (θ0 ) =
θ0 (1 − θ0 ) (¯ x − θ0 )2 (¯ x − θ0 )2 · = [θ 0 (1 − θ0 )/n]2 n θ0 (1 − θ 0 )/n
Note that the numerator of the W and LM are the same. It is the denominator which is the var(¯ x) = θ(1 − θ)/n that is different. For Wald, this var(¯ x) is evaluated at θM LE , whereas for LM, this is evaluated at θ0 . The LR statistic is based on logL( θ M LE ) = ni=1 xi log¯ x + (n − ni=1 xi )log(1 − x ¯) and
logL(θ 0 ) = so that
n
i=1 xi logθ 0
+ (n −
n
i=1 xi )log(1
− θ0)
θM LE ) = −2[ ni=1 xi (logθ0 − log¯ x) LR = −2logL(θ0 ) + 2logL( n +(n − i=1 xi )(log(1 − θ0 ) − log(1 − x ¯))]
For this example, LR looks different from W and LM. However, a secondorder TaylorSeries expansion of LR around θ0 = x ¯ yields the same statistic. Also, for n → ∞, plim x ¯ = θ and if H0 is true, then all three statistics are asymptotically equivalent. Note also that all three test statistics are based upon ¯ x − θ0  ≥ k and for ﬁnite n, the same exact critical value could be obtained from the binomial distribution. See problem 19 for more examples of the conﬂict in test of hypotheses using the W, LR and LM test statistics. Bera and Permaratne (2001, p. 58) tell the following amusing story that can bring home the interrelationship among the three tests: “Once around 1946 Ronald Fisher invited Jerzy Neyman, Abraham Wald, and C.R. Rao to his lodge for afternoon tea. During their conversation, Fisher mentioned the problem of deciding whether his dog, who had been going to an “obedience school” for some time, was disciplined enough. Neyman quickly came up with an idea: leave the dog free for some time and then put him on his leash. If there is not much difference in his behavior, the dog can be thought of as having completed the course successfully. Wald, who lost his family in the concentration camps, was adverse to any restrictions and simply suggested leaving the dog free and seeing whether it behaved properly. Rao, who had observed the nuisances of stray dogs in Calcutta streets did not like the idea of letting the dog roam freely and suggested keeping the dog on a leash at all times and observing how hard it pulls on the leash. If it pulled too much, it needed more training. That night when Rao was back in his Cambridge dormitory after tending Fisher’s mice at the genetics laboratory, he suddenly realized the connection of Neyman and Wald’s recommendations to the NeymanPearson LR and Wald tests. He got an idea and the rest is history.”
30
Chapter 2: Basic Statistical Concepts
2.5
Conﬁdence Intervals
Estimation methods considered in section 2.2 give us a point estimate of a parameter, say μ, and that is the best bet, given the data and the estimation method, of what μ might be. But it is always good policy to give the client an interval, rather than a point estimate, where with some degree of conﬁdence, usually 95% conﬁdence, we expect μ to lie. We have seen in Figure 2.5 that for a N (0, 1) random variable z, we have Pr[−zα/2 ≤ z ≤ zα/2 ] = 1 − α and for α = 5%, this probability is 0.95, giving the required 95% conﬁdence. In fact, zα/2 = 1.96 and Pr[−1.96 ≤ z ≤ 1.96] = 0.95 This says that if we draw 100 random numbers from a N (0, 1) density, (using a normal random number generator) we expect 95 out of these 100 numbers to lie in the [−1.96, 1.96] interval. Now, let us get back to the problem of estimating μ from a random sample x1 , . . . , xn drawn M LE = x ¯ and x ¯ ∼ N (μ, σ 2 /n). Hence, from a N (μ, σ 2 ) distribution. We found out that μ √ ¯ observed from the sample, and the z = (¯ x − μ)/(σ/ n) is N (0, 1). The point estimate for μ is x 95% conﬁdence interval for μ is obtained by replacing z by its value in the above probability statement: Pr[−zα/2 ≤
x ¯−μ √ ≤ zα/2 ] = 1 − α σ/ n
Assuming σ is known for the moment, one can rewrite this probability statement after some simple algebraic manipulations as √ √ ¯ + zα/2 (σ/ n)] = 1 − α Pr[¯ x − zα/2 (σ/ n) ≤ μ ≤ x Note that this probability statement has random variables on both ends and the probability that these random variables sandwich the unknown parameter μ is 1 − α. With the same conﬁdence of drawing 100 random N (0, 1) numbers and ﬁnding 95 of them falling in the (−1.96, 1.96) range we are conﬁdent that if we drew a 100 samples and computed a 100 x ¯’s, and a 100 intervals √ (¯ x ± 1.96 σ/ n), μ will lie in these intervals in 95 out of 100 times. If σ is not known, and is replaced by s, then problem 12 shows that this is equivalent to dividing a N (0, 1) random variable by an independent χ2n−1 random variable divided by its degrees of freedom, leading to a tdistribution with (n − 1) degrees of freedom. Hence, using the ttables for (n − 1) degrees of freedom Pr[−tα/2;n−1 ≤ tn−1 ≤ tα/2;n−1 ] = 1 − α √ and replacing tn−1 by (¯ x − μ)/(s/ n) one gets √ √ ¯ + tα/2;n−1 (s/ n)] = 1 − α Pr[¯ x − tα/2;n−1 (s/ n) ≤ μ ≤ x Note that the degrees of freedom (n−1) for the tdistribution come from s and the corresponding critical value tn−1;α/2 is therefore sample speciﬁc, unlike the corresponding case for the normal density where zα/2 does not depend on n. For small n, the tα/2 values differ drastically from
2.6
Descriptive Statistics
31
Table 2.1 Descriptive Statistics for the Earnings Data Sample: 1 595
Mean Median Maximum Minimum Std. Dev. Skewness Kurtosis JarqueBera Probability
Observations
LWAGE
WKS
ED
EX
MS
FEM
BLK
UNION
6.9507 6.9847 8.5370 5.6768 0.4384 –0.1140 3.3937
46.4520 48.0000 52.0000 5.0000 5.1850 –2.7309 13.7780
12.8450 12.0000 17.0000 4.0000 2.7900 –0.2581 2.7127
22.8540 21.0000 51.0000 7.0000 10.7900 0.4208 2.0086
0.8050 1.0000 1.0000 0.0000 0.3965 –1.5400 3.3715
0.1126 0.0000 1.0000 0.0000 0.3164 2.4510 7.0075
0.0723 0.0000 1.0000 0.0000 0.2592 3.3038 11.9150
0.3664 0.0000 1.0000 0.0000 0.4822 0.5546 1.3076
5.13 0.0769
3619.40 0.0000
8.65 0.0132
41.93 0.0000
238.59 0.0000
993.90 0.0000
3052.80 0.0000
101.51 0.0000
595
595
595
595
595
595
595
595
zα/2 , emphasizing the importance of using the tdensity in small samples. When n is large the difference between zα/2 and tα/2 diminishes as the tdensity becomes more like a normal density. For n = 20, and α = 0.05, tα/2;n−1 = 2.093 as compared with zα/2 = 1.96. Therefore, Pr[−2.093 ≤ tn−1 ≤ 2.093] = 0.95
√ and μ lies in x ¯ ± 2.093(s/ n) with 95% conﬁdence. More examples of conﬁdence intervals can be constructed, but the idea should be clear. Note that these conﬁdence intervals are the other side of the coin for tests of hypotheses. For example, in testing H0 ; μ = 2 versus H1 ; μ = 2 for a known σ, we discovered that the Likelihood Ratio test is based on the same probability statement that generated the conﬁdence interval for μ. In classical tests of hypothesis, we choose the level of conﬁdence α = 5% and compute √ z = (¯ x − μ)/(σ/ n). This can be done since σ is known and μ = 2 under the null hypothesis H0 . Next, we do not reject H0 if z lies in the (−zα/2 , zα/2 ) interval and reject H0 otherwise. For conﬁdence intervals, on the other hand, we do not know μ, and armed with a level of conﬁdence (1 − α)% we construct the interval that should contain μ with that level of conﬁdence. Having done that, if μ = 2 lies in that 95% conﬁdence interval, then we cannot reject H0 ; μ = 2 at the 5% level. Otherwise, we reject H0 . This highlights the fact that any value of μ that lies in this 95% conﬁdence interval (assuming it was our null hypothesis) cannot be rejected at the 5% level by this sample. This is why we do not say “accept H0 ”, but rather we say “do not reject H0 ”.
2.6
Descriptive Statistics
In Chapter 4, we will consider the estimation of a simple wage equation based on 595 individuals drawn from the Panel Study of Income Dynamics for 1982. This data is available on the Springer web site as EARN.ASC. Table 2.1 gives the descriptive statistics using EViews for a subset of the variables in this data set.
32
Chapter 2: Basic Statistical Concepts
80 Series: LWAGE Sample 1 595 Observations 595
70 60
Mean Median Maximum Minimum Std. Dev. Skewness Kurtosis
50 40 30 20 10
6.950745 6.984720 8.537000 5.676750 0.438403 â€“0.114001 3.393651
JarqueBera Probability
0 6.0
6.5
7.0
7.5
8.0
5.130525 0.076899
8.5
Figure 2.8 Log (Wage) Histogram 250 Series: WKS Sample 1595 Observations 595
200
Mean Median Maximum Minimum Std. Dev. Skewness Kurtosis
150
100
46.45210 48.00000 52.00000 5.000000 5.185025 â€“2.730880 13.77787
50 JarqueBera Probability
0 10
20
30
40
3619.416 0.000000
50
Figure 2.9 Weeks Worked Histogram
The average log wage is $6.95 for this sample with a minimum of $5.68 and a maximum of $8.54. The standard deviation of log wage is 0.44. A plot of the log wage histogram is given in Figure 2.8. Weeks worked vary between 5 and 52 with an average of 46.5 and a standard deviation of 5.2. This variable is highly skewed as evidenced by the histogram in Figure 2.9. Years of education vary between 4 and 17 with an average of 12.8 and a standard deviation of 2.79. There is the usual bunching up at 12 years, which is also the median, as is clear from Figure 2.10. Experience varies between 7 and 51 with an average of 22.9 and a standard deviation of 10.79. The distribution of this variable is skewed to the left, as shown in Figure 2.11. Marital status is a qualitative variable indicating whether the individual is married or not. This information is recoded as a numeric (1, 0) variable, one if the individual is married and zero
2.6
Descriptive Statistics
33
250 Series: ED Sample 1595 Observation 595
200
Mean Median Maximum Minimum Std. Dev. Skewness Kurtosis
150
100
12.84538 12.00000 17.00000 4.000000 2.790006 â€“0.258116 2.712730
50 JarqueBera Probability
0 4
6
8
10
12
14
8.652780 0.013215
16
Figure 2.10 Years of Education Histogram 80 Series: EX Sample 1 595 Observation 595
70 60 50 40 30 20 10 0 10
20
30
40
Mean Median Maximum Minimum Std. Dev. Skewness Kurtosis
22.85378 21.00000 51.00000 7.000000 10.79018 0.420826 2.008578
JarqueBera Probability
41.93007 0.000000
50
Figure 2.11 Years of Experience Histogram
otherwise. This recoded variable is also known as a dummy variable. It is basically a switch turning on when the individual is married and off when he or she is not. Female is another dummy variable taking the value one when the individual is a female and zero otherwise. Black is a dummy variable taking the value one when the individual is black and zero otherwise. Union is a dummy variable taking the value one if the individualâ€™s wage is set by a union contract and zero otherwise. The minimum and maximum values for these dummy variables are obvious. But if they were not zero and one, respectively, you know that something is wrong. The average is a meaningful statistic indicating the percentage of married individuals, females, blacks and union contracted wages in the sample. These are 80.5, 11.3, 7.2 and 30.6%, respectively. We would like to investigate the following claims: (i) women are paid less than men; (ii) blacks are paid less than nonblacks; (iii) married individuals earn more than nonmarried individuals; and (iv) union contracted wages are higher than nonunion wages.
34
Chapter 2: Basic Statistical Concepts
Table 2.2 Test for the Difference in Means Average log wage Male Female NonBlack Black Not Married Married NonUnion Union
Difference
$7,004 $6,530 $6,978 $6,601 $6,664 $7,020 $6,945 $6,962
–0.474 (–8.86) –0.377 (–5.57) 0.356 (8.28) 0.017 (0.45)
Table 2.3 Correlation Matrix
LWAGE WKS ED EX MS FEM BLK UNION
LWAGE
WKS
ED
EX
MS
FEM
BLK
UNION
1.0000 0.0403 0.4566 0.0873 0.3218 –0.3419 –0.2229 0.0183
0.0403 1.0000 0.0002 –0.1061 0.0782 –0.0875 –0.0594 –0.1721
0.4566 0.0002 1.0000 –0.2219 0.0184 –0.0012 –0.1196 –0.2719
0.0873 –0.1061 –0.2219 1.0000 0.1570 –0.0938 0.0411 0.0689
0.3218 0.0782 0.0184 0.1570 1.0000 –0.7104 –0.2231 0.1189
–0.3419 –0.0875 –0.0012 –0.0938 –0.7104 1.0000 0.2086 –0.1274
–0.2229 –0.0594 –0.1196 0.0411 –0.2231 0.2086 1.0000 0.0302
0.0183 –0.1721 –0.2719 0.0689 0.1189 –0.1274 0.0302 1.0000
A simple ﬁrst check could be based on computing the average log wage for each of these categories and testing whether the difference in means is signiﬁcantly different from zero. This can be done using a ttest, see Table 2.2. The average log wage for males and females is given along with their difference and the corresponding tstatistic for the signiﬁcance of this difference. Other rows of Table 2.2 give similar statistics for other groupings. In Chapter 4, we will show that this ttest can be obtained from a simple regression of log wage on the categorical dummy variable distinguishing the two groups. In this case, the Female dummy variable. From Table 2.2, it is clear that only the difference between union and nonunion contracted wages are insigniﬁcant. One can also plot log wage versus experience, see Figure 2.12, log wage versus education, see Figure 2.13, and log wage versus weeks, see Figure 2.14. The data shows that, in general, log wage increases with education level, weeks worked, but that it exhibits a rising and then a declining pattern with more years of experience. Note that the ttests based on the difference in log wage across two groupings of individuals, by sex, race or marital status, or the ﬁgures plotting log wage versus education, log wage versus weeks worked are based on pairs of variables in each case. A nice summary statistic based also on pairwise comparisons of these variables is the correlation matrix across the data. This is given in Table 2.3. The signs of this correlation matrix give the direction of linear relationship between the corresponding two variables, while the magnitude gives the strength of this correlation. In Chapter 3, we will see that these simple correlations when squared give the percentage of
2.6
Descriptive Statistics
35
9 9
8
LWAGE
LWAGE
8
7
6
7
6
5
5 0
10
20
30
40
50
60
EX
0
4
8
12
16
20
ED
Figure 2.12 Log (Wage) versus Experience
Figure 2.13 Log (Wage) versus Education
9.0 8.5
LWAGE
8.0 7.5 7.0 6.5 6.0 5.5 0
10
20
30
40
50
60
WKS Figure 2.14 Log (Wage) versus Weeks
variation that one of these variables explain in the other. For example, the simple correlation coeﬃcient between log wage and marital status is 0.32. This means that marital status explains (0.32)2 or 10% of the variation in log wage. One cannot emphasize enough how important it is to check one’s data. It is important to compute the descriptive statistics, simple plots of the data and simple correlations. A wrong minimum or maximum could indicate some possible data entry errors. Troughs or peaks in these plots may indicate important events for time series data, like wars or recessions, or inﬂuential observations. More on this in Chapter 8. Simple correlation coeﬃcients that equal one indicate perfectly collinear variables and warn of the failure of a linear regression that has both variables included among the regressors, see Chapter 4.
36
Chapter 2: Basic Statistical Concepts
Notes 1. Actually E(s2 ) = σ 2 does not need the normality assumption. This fact along with the proof of (n − 1)s2 /σ 2 ∼ χ2n−1 , under Normality, can be easily shown using matrix algebra and is deferred to Chapter 7. 2. This can be proven using the Chebyshev’s inequality, see Hogg and Craig (1995). 3. See Hogg and Craig (1995) for the type of regularity conditions needed for these distributions.
Problems 1. Variance and Covariance of Linear Combinations of Random Variables. Let a, b, c, d, e and f be arbitrary constants and let X and Y be two random variables. (a) Show that var(a + bX) = b2 var(X). (b) var(a + bX + cY ) = b2 var(X) + c2 var(Y ) + 2bc cov(X, Y ). (c) cov[(a + bX + cY ), (d + eX + f Y )] = be var(X) + cf var(Y ) + (bf + ce) cov(X, Y ). 2. Independence and Simple Correlation. (a) Show that if X and Y are independent, then E(XY ) = E(X)E(Y ) = μx μy where μx = E(X) and μy = E(Y ). Therefore, cov(X, Y ) = E(X − μx )(Y − μy ) = 0.
(b) Show that if Y = a + bX, where a and b are arbitrary constants, then ρxy = 1 if b > 0 and −1 if b < 0.
3. Zero Covariance Does Not Necessarily Imply Independence. Let X = −2, −1, 0, 1, 2 with Pr[X = x] = 1/5. Assume a perfect quadratic relationship between Y and X, namely Y = X 2 . Show that cov(X, Y ) = E(X 3 ) = 0. Deduce that ρXY = correlation (X, Y ) = 0. The simple correlation coefﬁcient ρXY measures the strength of the linear relationship between X and Y . For this example, it is zero even though there is a perfect nonlinear relationship between X and Y . This is also an example of the fact that if ρXY = 0, then X and Y are not necessarily independent. ρxy = 0 is a necessary but not suﬃcient condition for X and Y to be independent. The converse, however, is true, i.e., if X and Y are independent, then ρXY = 0, see problem 2. 4. The Binomial Distribution is deﬁned as the number of successes in n independent Bernoulli trials with probability of success θ. This discrete probability function is given by n X f (X; θ) = θ (1 − θ)n−X X = 0, 1, . . . , n X n = n!/[X!(n − X)!]. and zero elsewhere, with X (a) Out of 20 candidates for a job with a probability of hiring of 0.1. Compute the probabilities of getting X = 5 or 6 people hired? n n = n−X and use that to conclude that b(n, X, ) = b(n, n − X, 1 − θ). (b) Show that X (c) Verify that E(X) = nθ and var(X) = nθ(1 − θ).
(d) For a random sample of size n drawn from the Bernoulli distribution with parameter θ, show ¯ is the MLE of θ. that X ¯ in part (d), is unbiased and consistent for θ. (e) Show that X, ¯ in part (d), is suﬃcient for θ. (f) Show that X,
Problems
37
¯ in part (d), MVU (g) Derive the Cram´erRao lower bound for any unbiased estimator of θ. Is X, for θ? (h) For n = 20, derive the uniformly most powerful critical region of size α ≤ 0.05 for testing H0 ; θ = 0.2 versus H1 ; θ = 0.6. What is the probability of type II error for this test criteria? (i) Form the Likelihood Ratio test for testing H0 ; θ = 0.2 versus H1 ; θ = 0.2. Derive the Wald and LM test statistics for testing H0 versus H1 . When is the Wald statistic greater than the LM statistic? 5. For a random sample of size n drawn from the Normal distribution with mean μ and variance σ 2 . (a) Show that s2 is a suﬃcient statistic for σ 2 . (b) Using the fact that (n − 1)s2 /σ 2 is χ2n−1 (without proof), verify that E(s2 ) = σ 2 and that var(s2 ) = 2σ 4 /(n − 1) as shown in the text.
(c) Given that σ 2 is unknown, form the Likelihood Ratio test statistic for testing H0 ; μ = 2 versus H1 ; μ = 2. Derive the Wald and Lagrange Multiplier statistics for testing H0 versus H1 . Verify that they are given by the expressions in example 4.
(d) Another derivation of the W ≥ LR ≥ LM inequality for the null hypothesis given in part (c) can be obtained as follows: Let μ , σ 2 be the restricted maximum likelihood estimators under 2 H0 ; μ = μ0 . Let μ , σ be the corresponding unrestricted maximum likelihood estimators under the alternative H1 ; μ = μ0 . Show that W = −2log[L( μ, σ 2 )/L( μ, σ 2 )] and LM = 2 2 2 μ, σ )] where L(μ, σ ) denotes the likelihood function. Conclude that W ≥ −2log[L( μ, σ )/L( LR ≥ LM , see Breusch (1979). This is based on Baltagi (1994). (e) Given that μ is unknown, form the Likelihood Ratio test statistic for testing H0 ; σ = 3 versus H1 ; σ = 3.
(f) Form the Likelihood Ratio test statistic for testing H0 ; μ = 2, σ2 = 4 against the alternative that H1 ; μ = 2 or σ 2 = 4.
(g) For n = 20, s2 = 9 construct a 95% conﬁdence interval for σ 2 .
6. The Poisson distribution can be deﬁned as the limit of a Binomial distribution as n → ∞ and θ → 0 such that nθ = λ is a positive constant. For example, this could be the probability of a rare disease and we are random sampling a large number of inhabitants, or it could be the rare probability of ﬁnding oil and n is the large number of drilling sights. This discrete probability function is given by f (X; λ) =
e−λ λX X!
X = 0, 1, 2, . . .
For a random sample from this Poisson distribution (a) Show that E(X) = λ and var(X) = λ. MLE = X. ¯ (b) Show that the MLE of λ is λ
¯ (c) Show that the method of moments estimator of λ is also X. ¯ (d) Show that X is unbiased and consistent for λ. ¯ is suﬃcient for λ. (e) Show that X ¯ attains (f) Derive the Cram´erRao lower bound for any unbiased estimator of λ. Show that X that bound. (g) For n = 9, derive the Uniformly Most Powerful critical region of size α ≤ 0.05 for testing H0 ; λ = 2 versus H1 ; λ = 4.
38
Chapter 2: Basic Statistical Concepts (h) Form the Likelihood Ratio test for testing H0 ; λ = 2 versus H1 ; λ = 2. Derive the Wald and LM statistics for testing H0 versus H1 . When is the Wald test statistic greater than the LM statistic? 7. The Geometric distribution is known as the probability of waiting for the ﬁrst success in independent repeated trials of a Bernoulli process. This could occur on the 1st, 2nd, 3rd,.. trials. g(X; θ) = θ(1 − θ)X−1 for X = 1, 2, 3, . . . (a) Show that E(X) = 1/θ and var(X) = (1 − θ)/θ2 .
(b) Given a random sample from this Geometric distribution of size n, ﬁnd the MLE of θ and the method of moments estimator of θ. ¯ is unbiased and consistent for 1/θ. (c) Show that X (d) For n = 20, derive the Uniformly Most Powerful critical region of size α ≤ 0.05 for testing H0 ; θ = 0.5 versus H1 ; θ = 0.3. (e) Form the Likelihood Ratio test for testing H0 ; θ = 0.5 versus H1 ; θ = 0.5. Derive the Wald and LM statistics for testing H0 versus H1 . When is the Wald statistic greater than the LM statistic? 8. The Uniform density, deﬁned over the unit interval [0, 1], assigns a unit probability for all values of X in that interval. It is like a roulette wheel that has an equal chance of stopping anywhere between 0 and 1. f (X) = 1 =0
0≤X ≤1 elsewhere
Computers are equipped with a Uniform (0,1) random number generator so it is important to understand these distributions. (a) Show that E(X) = 1/2 and var(X) = 1/12. (b) What is the Pr[0.1 < X < 0.3]? Does it matter if we ask for the Pr[0.1 ≤ X ≤ 0.3]? 9. The Exponential distribution is given by f (X; θ) =
1 −X/θ e θ
X > 0 and θ > 0
This is a skewed and continuous distribution deﬁned only over the positive quadrant. (a) Show that E(X) = θ and var(X) = θ2 . ¯ (b) Show that θ MLE = X.
¯ (c) Show that the method of moments estimator of θ is also X. ¯ is an unbiased and consistent estimator of θ. (d) Show that X ¯ is suﬃcient for θ. (e) Show that X ¯ MVU for θ? (f) Derive the Cram´erRao lower bound for any unbiased estimator of θ? Is X (g) For n = 20, derive the Uniformly Most Powerful critical region of size α ≤ 0.05 for testing H0 ; θ = 1 versus H1 ; θ = 2. (h) Form the Likelihood Ratio test for testing H0 ; θ = 1 versus H1 ; θ = 1. Derive the Wald and LM statistics for testing H0 versus H1 . When is the Wald statistic greater than the LM statistic?
Problems
39
10. The Gamma distribution is given by f (X; α, β)
1 X α−1 e−X/β Γ(α)β α =0 =
for X > 0 elsewhere
where α and β > 0 and Γ(α) = (α − 1)! This is a skewed and continuous distribution. (a) Show that E(X) = αβ and var(X) = αβ 2 . (b) For a random sample drawn from this Gamma density, what are the method of moments estimators of α and β? (c) Verify that for α = 1 and β = θ, the Gamma probability density function reverts to the Exponential p.d.f. considered in problem 9. (d) We state without proof that for α = r/2 and β = 2, this Gamma density reduces to a χ2 distribution with r degrees of freedom, denoted by χ2r . Show that E(χ2r ) = r and var(χ2r ) = 2r. (e) For a random sample from the χ2r distribution, show that (X1 X2 , . . . , Xn ) is a suﬃcient statistic for r. (f) One can show that the square of a N (0, 1) random variable is a χ2 random variable with 1 degree of freedom, see the Appendix to the chapter. Also, one can show that the sum of independent χ2 ’s is a χ2 random variable with degrees of freedom equal the sum of the corresponding degrees of freedom of the individual χ2 ’s, see problem 15. This will prove useful for testing later on. Using these results, verify that the sum of squares of m independent N (0, 1) random variables is a χ2 with m degrees of freedom. 11. The Beta distribution is deﬁned by f (X)
Γ(α + β) α−1 X (1 − X)β−1 Γ(α)Γ(β) =0 =
for 0 < X < 1 elsewhere
where α > 0 and β > 0. This is a skewed continuous distribution. (a) For α = β = 1 this reverts back to the Uniform (0, 1) probability density function. Show that E(X) = (α/α + β) and var(X) = αβ/(α + β)2 (α + β + 1). (b) Suppose that α = 1, ﬁnd the estimators of β using the method of moments and the method of maximum likelihood. 12. The tdistribution with r degrees of freedom can be deﬁned as the ratio of two independent random variables. The numerator being a N (0, 1) random variable and the denominator being the squareroot of a χ2r random variable divided by its degrees of freedom. The tdistribution is a symmetric distribution like the Normal distribution but with fatter tails. As r → ∞, the tdistribution approaches the Normal distribution. (a) Verify that if X1√ , . . . , Xn are a random sample drawn from a N (μ, σ 2 ) distribution, then ¯ z = (X − μ)/(σ/ n) is N (0, 1). ¯ − μ)/(s/√n) has a (b) Use the fact that (n − 1)s2 /σ 2 ∼ χ2n−1 to show that t = z/ s2 /σ 2 = (X ¯ tdistribution with (n − 1) degrees of freedom. We use the fact that s2 is independent of X without proving it. (c) For n = 16, x ¯ = 20 and s2 = 4, construct a 95% conﬁdence interval for μ.
40
Chapter 2: Basic Statistical Concepts
13. The Fdistribution can be deﬁned as the ratio of two independent χ2 random variables each divided by its corresponding degrees of freedom. It is commonly used to test the equality of variances. Let s21 be the sample variance from a random sample of size n1 drawn from N (μ1 , σ 21 ) and let s22 be the sample variance from another random sample of size n2 drawn from N (μ2 , σ 22 ). We know that (n1 − 1)s21 /σ 21 is χ2(n1 −1) and (n2 − 1)s22 /σ 22 is χ2(n2 −1) . Taking the ratio of those two independent χ2 random variables divided by their appropriate degrees of freedom yields F =
s21 /σ 21 s22 /σ 22
which under the null hypothesis H0 ; σ 21 = σ 22 gives F = s21 /s22 and is distributed as F with (n1 − 1) and (n2 − 1) degrees of freedom. Both s21 and s22 are observable, so F can be computed and compared to critical values for the F distribution with the appropriate degrees of freedom. Two inspectors drawing two random samples of size 25 and 31 from two shifts of a factory producing steel rods, ﬁnd that the sampling variance of the lengths of these rods are 15.6 and 18.9 inches squared. Test whether the variances of the two shifts are the same. 14. Moment Generating Function (MGF). (a) Derive the MGF of the Binomial distribution deﬁned in problem 4. Show that it is equal to [(1 − θ) + θet ]n . 1
(b) Derive the MGF of the Normal distribution deﬁned in problem 5. Show that it is eμt+ 2 σ (c) Derive the MGF of the Poisson distribution deﬁned in problem 6. Show that it is e
t
2 2
t
λ(e −1)
.
.
(d) Derive the MGF of the Geometric distribution deﬁned in problem 7. Show that it is θet /[1 − (1 − θ)et ].
(e) Derive the MGF of the Exponential distribution deﬁned in problem 9. Show that it is 1/(1 − θt). (f) Derive the MGF of the Gamma distribution deﬁned in problem 10. Show that it is (1−βt)−α . r Conclude that the MGF of a χ2r is (1 − 2t)− 2 .
(g) Obtain the mean and variance of each distribution by differentiating the corresponding MGF derived in parts (a) through (f). 15. Moment Generating Function Method. . . . , Xn are independent Poisson distributed with parameters (λi ) respec(a) Show that if X1 , tively, then Y = ni=1 Xi is Poisson with parameter ni=1 λi .
2 (b) Show that if X1 , . . . , Xn are independent Normally distributed with parameters (μi , σ i ), then Y = ni=1 Xi is Normal with mean ni=1 μi and variance ni=1 σ 2i . ¯ ∼ N (μ, σ 2 /n). (c) Deduce from part (b) that if X1 , . . . , Xn are IIN(μ, σ2 ), then X
(d) Show that if X1 , . . . , Xn are independent χ2 distributed n with parameters (ri ) respectively, n then Y = i=1 Xi is χ2 distributed with parameter i=1 ri .
16. Best Linear Prediction. (Problems 16 and 17 are based on Amemiya (1994)). Let X and Y be two random variables with means μX and μY and variances σ 2X and σ 2Y , respectively. Suppose that ρ = correlation(X, Y ) = σ XY /σ X σ Y
where σ XY = cov(X, Y ). Consider the linear relationship Y = α + βX where α and β are scalars: (a) Show that the best linear predictor of Y based on X, where best in this case means the minimum mean squared error predictor which minimizes E(Y − α − βX)2 with respect to α 2 where α and β is given by Y = α + βX = μY − βμ X and β = σ XY /σ X = ρσ Y /σ X .
Problems
41
(b) Show that the var(Y ) = ρ2 σ 2Y and that u = Y − Y , the prediction error, has mean zero and 2 2 variance equal to (1 − ρ )σ Y . Therefore, ρ2 can be interpreted as the proportion of σ 2Y that is explained by the best linear predictor Y . (c) Show that cov(Y , u ) = 0.
17. The Best Predictor. Let X and Y be the two random variables considered in problem 16. Now consider predicting Y by a general, possibly nonlinear, function of X denoted by h(X). (a) Show that the best predictor of Y based on X, where best in this case means the minimum mean squared error predictor that minimizes E[Y − h(X)]2 is given by h(X) = E(Y /X). Hint: Write E[Y − h(X)]2 as E{[Y − E(Y /X)] + [E(Y /X) − h(X)]}2 . Expand the square and show that the crossproduct term has zero expectation. Conclude that this mean squared error is minimized at h(X) = E(Y /X). (b) If X and Y are bivariate Normal, show that the best predictor of Y based on X is identical to the best linear predictor of Y based on X. 18. Descriptive Statistics. Using the data used in section 2.6 based on 595 individuals drawn from the Panel Study of Income Dynamics for 1982 and available on the Springer web site as EARN.ASC, replicate the tables and graphs given in that section. More speciﬁcally (a) replicate Table 2.1 which gives the descriptive statistics for a subset of the variables in this data set. (b) Replicate Figures 2.6–2.11 which plot the histograms for log wage, weeks worked, education and experience. (c) Replicate Table 2.2 which gives the average log wage for various groups and test the difference between these averages using a ttest. (d) Replicate Figure 2.12 which plots log wage versus experience. Figure 2.13 which plots log wage versus education and Figure 2.14 which plots log wage versus weeks worked. (e) Replicate Table 2.3 which gives the correlation matrix among a subset of these variables. 19. Conﬂict Among Criteria for Testing Hypotheses: Examples from NonNormal Distributions. This is based on Baltagi (2000). Berndt and Savin (1977) showed that W ≥ LR ≥ LM for the case of a multivariate regression model with normal distrubances. Ullah and ZindeWalsh (1984) showed that this inequality is not robust to nonnormality of the disturbances. In the spirit of the latter article, this problem considers simple examples from nonnormal distributions and illustrates how this conﬂict among criteria is affected. (a) Consider a random sample x1 , x2 , . . . , xn from a Poisson distribution with parameter λ. Show that for testing λ = 3 versus λ = 3 yields W ≥ LM for x ¯ ≤ 3 and W ≤ LM for x¯ ≥ 3. (b) Consider a random sample x1 , x2 , . . . , xn from an Exponential distribution with parameter θ. Show that for testing θ = 3 versus θ = 3 yields W ≥ LM for 0 < x ¯ ≤ 3 and W ≤ LM for x¯ ≥ 3. (c) Consider a random sample x1 , x2 , . . . , xn from a Bernoulli distribution with parameter θ. Show that for testing θ = 0.5 versus θ = 0.5, we will always get W ≥ LM. Show also, that for testing θ = (2/3) versus θ = (2/3) we get W ≤ LM for (1/3) ≤ x¯ ≤ (2/3) and W ≥ LM for (2/3) ≤ x ¯ ≤ 1 or 0 < x¯ ≤ (1/3).
42
Chapter 2: Basic Statistical Concepts
References More detailed treatment of the material in this chapter may be found in: Amemiya, T. (1994), Introduction to Statistics and Econometrics (Harvard University Press: Cambridge). Baltagi, B.H. (1994), “The Wald, LR, and LM Inequality,” Econometric Theory, Problem 94.1.2, 10: 223224. Baltagi, B.H. (2000), “Conﬂict Among Critera for Testing Hypotheses: Examples from NonNormal Distributions,” Econometric Theory, Problem 00.2.4, 16: 288. Bera A.K. and G. Permaratne (2001), “General Hypothesis Testing,” Chapter 2 in Baltagi, B.H. (ed.), A Companion to Theoretical Econometrics (Blackwell: Massachusetts). Berndt, E.R. and N.E. Savin (1977), “Conﬂict Among Criteria for Testing Hypotheses in the Multivariate Linear Regression Model,” Econometrica, 45: 12631278. Breusch, T.S. (1979), “Conﬂict Among Criteria for Testing Hypotheses: Extensions and Comments,” Econometrica, 47: 203207. Buse, A. (1982), “The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note,” The American Statistician, 36 :153157. DeGroot, M.H. (1986), Probability and Statistics (AddisonWesley: Mass.). Freedman, D., R. Pisani, R. Purves and A. Adhikari (1991), Statistics (Norton: New York). Freund, J.E. (1992), Mathematical Statistics (PrenticeHall: New Jersey). Hogg, R.V. and A.T. Craig (1995), Introduction to Mathematical Statistics (Prentice Hall: New Jersey). Jollife, I.T. (1995), “Sample Sizes and the Central Limit Theorem: The Poisson Distribution as an Illustration,” The American Statistician, 49: 269. Kennedy, P. (1992), A Guide to Econometrics (MIT Press: Cambridge). Mood, A.M., F.A. Graybill and D.C. Boes (1974), Introduction to the Theory of Statistics (McGrawHill: New York). Spanos, A. (1986), Statistical Foundations of Econometric Modelling (Cambridge University Press: Cambridge). Ullah, A. and V. ZindeWalsh (1984), “On the Robustness of LM, LR and W Tests in Regression Models,” Econometrica, 52: 10551065. Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics (Wiley: New York).
Appendix Score and Information Matrix: The likelihood function of a sample X1 , . . . , Xn drawn from f (Xi , θ) is really the joint probability density function written as a function of θ: L(θ) = f (X1 , . . . , Xn ; θ) This probability density function has the property that L(θ)dx = 1 where the integral is over all X1 , . . . , Xn written compactly as one integral over x. Differentiating this multiple integral with respect
Appendix
43
to θ, one gets ∂L dx = 0 ∂θ Multiplying and dividing by L, one gets 1 ∂L ∂logL L dx = L dx = 0 L ∂θ ∂θ But the score is by deﬁnition S(θ) = ∂logL/∂θ. Hence E[S(θ)] = 0. Differentiating again with respect to θ, one gets 2 ∂logL ∂L ∂ logL L+ dx = 0 ∂θ ∂θ ∂θ2 Multiplying and dividing the second term by L one gets 2 ∂logL ∂ 2 logL + =0 E ∂θ ∂θ2 or
∂ 2 logL E − ∂θ2
=E
∂logL ∂θ
2
= E[S(θ)]2
But var[S(θ)] = E[S(θ)]2 since E[S(θ)] = 0. Hence I(θ) = var[S(θ)]. Moment Generating Function (MGF): For the random variable X, the expected value of a special function of X, namely eXt is denoted by MX (t) = E(eXt ) = E(1 + Xt + X 2
t2 t3 + X 3 + ..) 2! 3!
where the second equality follows from the Taylor series expansion of eXt around zero. Therefore, MX (t) = 1 + E(X)t + E(X 2 )
t3 t2 + E(X 3 ) + .. 2! 3!
This function of t generates the moments of X as coeﬃcients of an inﬁnite polynomial in t. For example, μ = E(X) = coeﬃcient of t, and E(X 2 )/2 is the coeﬃcient of t2 , etc. Alternatively, one can differentiate ′ (0), i.e., the ﬁrst derivative of MX (t) with this MGF with respect to t and obtain μ = E(X) = MX r r respect to t evaluated at t = 0. Similarly, E(X ) = MX (0) which is the rth derivative of MX (t) with respect to t evaluated at t = 0. For example, for the Bernoulli distribution; 1 MX (t) = E(eXt ) = X=0 eXt θX (1 − θ)1−X = θet + (1 − θ)
′′ ′′ ′ ′ so that MX (0) = θ. (t) = θet which means that E(X 2 ) = MX (0) = θ = E(X) and MX (t) = θet and MX Hence,
var(X) = E(X 2 ) − (E(X))2 = θ − θ2 = θ(1 − θ). For the Normal distribution, see problem 14, it is easy to show that if X ∼ N (μ, σ 2 ), then MX (t) = 1 2 2 ′ ′′ eμt+ 2 σ t and MX (0) = E(X) = μ and MX (0) = E(X 2 ) = σ 2 + μ2 . There is a onetoone correspondence between the MGF when it exists and the corresponding p.d.f. 2 This means that if Y has a MGF given by e2t+4t then Y is normally distributed with mean 2 and variance 8. Similarly, if Z has a MGF given by (et + 1)/2, then Z is Bernoulli distributed with θ = 1/2.
44
Chapter 2: Basic Statistical Concepts
Change of Variable: If X ∼ N (0, 1), then one can ﬁnd the distribution function of Y = X by using the Distribution Function method. By deﬁnition the distribution function of y is deﬁned as G(y)
= Pr[Y ≤ y] = Pr[X ≤ y] = Pr[−y ≤ X ≤ y] = Pr[X ≤ y] − Pr[X ≤ −y] = F (y) − F (−y)
so that the distribution function of Y, G(y), can be obtained from the distribution function of X, F (x). Since the N (0, 1) distribution is symmetric around zero, then F (−y) = 1 − F (y) and substituting that in G(y) we get G(y) = 2F (y) − 1. Recall, that the p.d.f. of Y is given by g(y) = G′ (y). Hence, g(y) = f (y) + f (−y) and this reduces to 2f (y) if the distribution is symmetric around zero. So that if √ √ 2 2 f (x) = e−x /2 / 2π for −∞ < x < +∞ then g(y) = 2f (y) = 2e−y /2 / 2π for y ≥ 0. Let us now ﬁnd the distribution of Z = X 2 , the square of a N (0, 1) random variable. Note that dZ/dX = 2X which is positive when X > 0 and negative when X < 0. The change of variable method cannot be applied since Z = X 2 is not a monotonic transformation over the entire domain of X. However, using Y = X, we get Z = Y 2 = (X)2 and dZ/dY = 2Y which is always nonnegative since Y is nonnegative. In this case, the change of variable method√states that the p.d.f. of Z is obtained from that of Y by substituting the inverse transformation Y = Z into the p.d.f. of Y and multiplying it by the absolute value of the derivative of the inverse transformation: √ 2 1 dY 1  = √ e−z/2  √  = √ z −1/2 e−z/2 for z ≥ 0 h(z) = g( z) ·  dZ 2 z 2π 2π It is clear why this transformation will not work for X since Z = X 2 has two √ solutions for the inverse √ transformation, X = ± Z, whereas, there is one unique solution for Y = Z since it is nonnegative. Using the results of problem 10, one can deduce that Z has a gamma distribution with α = 1/2 and β = 2. This special Gamma density function is a χ2 distribution with 1 degree of freedom. Hence, we have shown that the square of a N (0, 1) random variable has a χ21 distribution. n Finally, if X1 , . . . , Xn are independently distributed then the distribution function of Y = i=1 Xi can be obtained from that of the Xi ’s using the Moment Generating Function (MGF) method: n
= E(eY t ) = E[e( i=1 Xi )t ] = E(eX1 t )E(eX2 t )..E(eXn t ) = MX1 (t)MX2 (t)..MXn (t)
MY (t)
If in addition these Xi ’s are identically distributed, then MXi (t) = MX (t) for i = 1, . . . , n and MY (t) = [MX (t)]n t For example, if X1 , . . . , X n are IID Bernoulli (θ), then MXi (t) = MX (t) = θe + (1 − θ) for i = 1, . . . , n. Hence the MGF of Y = ni=1 Xi is given by
MY (t) = [MX (t)]n = [θet + (1 − θ)]n
This can be easily shown to be the MGF of the Binomial distribution given in problem 14. This proves that the sum of n independent and identically distributed Bernoulli random variables with parameter θ is a Binomial random variable with same parameter θ. ¯ −μ X √ Central Limit Theorem: If X1 , . . . , Xn are IID(μ, σ2 ) from an unknown distribution, then Z = σ/ n is asymptotically distributed as N (0, 1). Proof : We assume that the MGF of the Xi ’s exist and derive the MGF of Z. Next, we show that lim 2 MZ (t) as n → ∞ is e1/2t which is the MGF of N (0, 1) distribution. First, note that Z=
n
Xi − nμ Y − nμ √ √ = σ n σ n
i=1
Appendix where Y =
45
n
MZ (t)
Xi with MY (t) = [MX (t)]n . Therefore, âˆš âˆš âˆš = E(eZt ) = E e(Y tâˆ’nÎźt)/Ďƒ n = eâˆ’nÎźt/Ďƒ n E eY t/Ďƒ n âˆš âˆš âˆš âˆš = eâˆ’nÎźt/Ďƒ n MY (t/Ďƒ n) = eâˆ’nÎźt/Ďƒ n [MX (t/Ďƒ n)]n
i=1
Taking log of both sides we get logMZ (t) =
t t2 t3 âˆ’nÎźt âˆš + nlog[1 + âˆš E(X) + 2 E(X 2 ) + 3 âˆš E(X 3 ) + ..] Ďƒ n Ďƒ n 2Ďƒ n 6Ďƒ n n
s2 s3 Using the Taylor series expansion log(1 + s) = s âˆ’ + âˆ’ .. we get 2 3
âˆš t t2 nÎź t3 2 3 logMZ (t) = âˆ’ t + n Îź âˆš + 2 E(X ) + 3 âˆš E(X ) + .. Ďƒ Ďƒ n 2Ďƒ n 6Ďƒ n n 2
2 3 t 1 t t 2 3 âˆš âˆš + E(X ) + 3 E(X ) + .. âˆ’ Îź 2 Ďƒ n 2Ďƒ 2 n 6Ďƒ n n
3 1 t2 t3 t 2 3 + Îź âˆš + 2 âˆš E(X ) + 3 âˆš E(X ) + .. âˆ’ .. 3 Ďƒ n 2Ďƒ n 6Ďƒ n n Collecting powers of t, we get âˆš âˆš (E(X 2 ) nÎź nÎź Îź2 + âˆ’ t2 log MZ (t) = âˆ’ t+ Ďƒ Ďƒ 2Ďƒ 2 2Ďƒ 2 1 2ÎźE(X 2 ) 1 Îź3 E(X 3 ) âˆš âˆš âˆš + âˆ’ Âˇ + t3 + .. 6Ďƒ 3 n 2 2Ďƒ 3 n 3 Ďƒ3 n Therefore
3 E(X 3 ) ÎźE(X 2 ) Îź3 t âˆš + .. âˆ’ + 6 2 3 Ďƒ3 n âˆš note that the coeďŹƒcient of t3 is 1/ n times a constant. Therefore, this coeďŹƒcient goes to zero as n â†’ âˆž. âˆš Similarly, it can be shown that the coeďŹƒcient of tr is 1/ nrâˆ’2 times a constant for r â‰Ľ 3. Hence, logMZ (t) =
1 2 t + 2
lim logMZ (t) =
nâ†’âˆž
1 2 t 2
and
1 2
lim MZ (t) = e 2 t
nâ†’âˆž
which is the MGF of a standard normal distribution. The Central Limit Theorem is a powerful tool for asymptotic inference. In real life we do not know what distribution we are sampling from, but as long as the sample drawn is random and we average (or sum) and standardize then as n â†’ âˆž, the resulting standardized statistic has an asymptotic N (0, 1) distribution that can be used for inference. Using a random number generator from say the uniform distribution on the computer, one can generate samples of size n = 20, 30, 50 from this distribution and show how the sampling distribution of the sum (or average) when it is standardized closely approximates the N (0, 1) distribution. The real question for the applied researcher is how large n should be to invoke the Central Limit Theorem. This depends on the distribution we are drawing from. For a Bernoulli distribution, a larger n is needed the more asymmetric this distribution is i.e., if Î¸ = 0.1 rather than 0.5. In fact, Figure 2.15 shows the Poisson distribution with mean = 15. This looks like a good approximation for a Normal distribution even though it is a discrete probability function. Problem 15 shows that the sum of n independent identically distributed Poisson random variables with parameter Îť is a Poisson random variable with parameter (nÎť). This means that if Îť = 0.15, an n of 100 will lead to the distribution of the sum being Poisson (nÎť = 15) and the Central Limit Theorem seems well approximated.
46
Chapter 2: Basic Statistical Concepts
p (x) 0.12 0.10 0.08 0.06 0.04 0.02 0.00 0
5
10
15
20
25
30
x
Figure 2.15 Poisson Probability Distribution, Mean = 15 p (x ) 0.4
0.3
0.2
0.1
0
0 2 4 6 Figure 2.16 Poisson Probability Distribution, Mean = 1.5
8
10
x
However, if λ = 0.015, an n of 100 will lead to the distribution of the sum being Poisson (nλ = 1.5) which is given in Figure 2.16. This Poisson probability function is skewed and discrete and does not approximate well a normal density. This shows that one has to be careful in concluding that n = 100 is a large enough sample for the Central Limit Theorem to apply. We showed in this simple example that this depends on the distribution we are sampling from. This is true for Poisson (λ = 0.15) but not Poisson (λ = 0.015), see Joliffe (1995). The same idea can be illustrated with a skewed Bernoulli distribution. Conditional Mean and Variance: Two random variables X and Y are bivariate Normal if they have the following joint distribution: x − μX 2 y − μY 2 1 1 exp − + f (x, y) = 2(1 − ρ2 ) σX σY 2πσ X σ Y 1 − ρ2 y − μY x − μX −2ρ σX σY
Appendix
47
where âˆ’âˆž < x < +âˆž, âˆ’âˆž < y < +âˆž, E(X) = ÎźX , E(Y ) = ÎźY , var(X) = Ďƒ 2X , var(Y ) = Ďƒ 2Y and Ď = correlation (X, Y ) = cov(X, Y )/Ďƒ X Ďƒ Y . This joint density can be rewritten as 2
1 ĎƒY 1 exp âˆ’ 2 f (x, y) = âˆš (x âˆ’ ÎźX ) y âˆ’ ÎźY âˆ’ Ď 2Ďƒ Y (1 âˆ’ Ď 2 ) ĎƒX 2Ď€Ďƒ Y 1âˆ’ Ď 2 1 1 exp âˆ’ 2 (x âˆ’ ÎźX )2 = f (y/x)f1 (x) Âˇâˆš 2ĎƒX 2Ď€Ďƒ X where f1 (x) is the marginal density of X and f (y/x) is the conditional density of Y given X. In this Ďƒ case, X âˆź N (ÎźX , Ďƒ 2X ) and Y /X is Normal with mean E(Y /X) = ÎźY + Ď Y (x âˆ’ Îźx ) and variance given ĎƒX by var(Y /X) = Ďƒ 2Y (1 âˆ’ Ď 2 ). By symmetry, the roles of X and Y can be interchanged and one can write f (x, y) = f (x/y) f2 (y) where f2 (y) is the marginal density of Y . In this case, Y âˆź N (ÎźY , Ďƒ 2Y ) and X/Y is Normal with mean Ďƒ E(X/Y ) = ÎźX + Ď X (y âˆ’ ÎźY ) and variance given by var(X/Y ) = Ďƒ 2X (1 âˆ’ Ď 2 ). If Ď = 0, then f (y/x) = ĎƒY f2 (y) and f (x, y) = f1 (x)f2 (y) proving that X and Y are independent. Therefore, if cov(X, Y ) = 0 and X and Y are bivariate Normal, then X and Y are independent. In general, cov(X, Y ) = 0 alone does not necessarily imply independence, see problem 3. One important and useful property is the law of iterated expectations. This says that the expectation of any function of X and Y say h(X, Y ) can be obtained as follows: E[h(X, Y )] = EX EY /X [h(X, Y )] where the subscript Y /X on E means the conditional expectation of Y given that X is treated as a constant. The next expectation EX treats X as a random variable. The proof is simple. +âˆž +âˆž h(x, y)f (x, y)dxdy E[h(X, Y )] = âˆ’âˆž
âˆ’âˆž
where f (x, y) is the joint density of X and Y . But f (x, y) can be written as f (y/x)f1 (x), hence +âˆž +âˆž âˆ’âˆž âˆ’âˆž h(x, y)f (y/x)dy f1 (x)dx = EX EY /X [h(X, Y )].
E[h(X, Y )] =
Example: This law of iterated expectation can be used to show that for the bivariate Normal density, the parameter Ď is indeed the correlation coeďŹƒcient of X and Y . In fact, let h(X, Y ) = XY , then Ďƒ E(XY ) = EX EY /X (XY /X) = EX XE(Y /X) = EX X[ÎźY + Ď Y (X âˆ’ ÎźX )] ĎƒX Ďƒ = ÎźX ÎźY + Ď Y Ďƒ 2X = ÎźX ÎźY + Ď Ďƒ Y Ďƒ X ĎƒX
Rearranging terms, one gets Ď = [E(XY ) âˆ’ ÎźX ÎźY ]/Ďƒ X Ďƒ Y = Ďƒ XY /Ďƒ X Ďƒ Y as required. Another useful result pertains to the unconditional variance of h(X, Y ) being the sum of the mean of the conditional variance and the variance of the conditional mean: var(h(X, Y )) = EX varY /X [h(X, Y )] + varX EY /X [h(X, Y )] Proof : We will write h(X, Y ) as h to simplify the presentation varY /X (h) = EY /X (h2 ) âˆ’ [EY /X (h)]2 and taking expectations with respect to X yields EX varY /X (h) = EX EY /X (h2 ) âˆ’ EX [EY /X (h)]2 = E(h2 ) âˆ’ EX [EY /X (h)]2 . Also, varX EY /X (h) = EX [EY /X (h)]2 âˆ’ (EX [EY /X (h)])2 = EX [EY /X (h)]2 âˆ’ [E(h)]2 adding these two terms yields E(h2 ) âˆ’ [E(h)]2 = var(h).
CHAPTER 3
Simple Linear Regression 3.1
Introduction
In this chapter, we study extensively the estimation of a linear relationship between two variables, Yi and Xi , of the form: Yi = α + βXi + ui
i = 1, 2, . . . , n
(3.1)
where Yi denotes the ith observation on the dependent variable Y which could be consumption, investment or output, and Xi denotes the ith observation on the independent variable X which could be disposable income, the interest rate or an input. These observations could be collected on ﬁrms or households at a given point in time, in which case we call the data a crosssection. Alternatively, these observations may be collected over time for a speciﬁc industry or country in which case we call the data a timeseries. n is the number of observations, which could be the number of ﬁrms or households in a crosssection, or the number of years if the observations are collected annually. α and β are the intercept and slope of this simple linear relationship between Y and X. They are assumed to be unknown parameters to be estimated from the data. A plot of the data, i.e., Y versus X would be very illustrative showing what type of relationship exists empirically between these two variables. For example, if Y is consumption and X is disposable income then we would expect a positive relationship between these variables and the data may look like Figure 3.1 when plotted for a random sample of households. If α and β were known, one could draw the straight line (α + βX) as shown in Figure 3.1. It is clear that not all the observations (Xi , Yi ) lie on the straight line (α + βX). In fact, equation (3.1) states that the difference between each Yi and the corresponding (α + βXi ) is due to a random error ui . This error may be due to (i) the omission of relevant factors that could inﬂuence consumption, other than disposable income, like real wealth or varying tastes, or unforseen events that induce households to consume more or less, (ii) measurement error, which could be the result of households not reporting their consumption or income accurately, or (iii) wrong choice of a linear relationship between consumption and income, when the true relationship may be nonlinear. These different causes of the error term will have different effects on the distribution of this error. In what follows, we consider only disturbances that satisfy some restrictive assumptions. In later chapters we relax these assumptions to account for more general kinds of error terms. In real life, α and β are not known, and have to be estimated from the observed data {(Xi , Yi ) for i = 1, 2, . . . , n}. This also means that the true line (α + βX) as well as the true disturbances (the ui ’s) are unobservable. In this case, α and β could be estimated by the best ﬁtting line through the data. Different researchers may draw different lines through the same data. What makes one line better than another? One measure of misﬁt is the amount of error from the i , where the hat (ˆ) denotes observed Yi to the guessed line, let us call the latter Yi = α + βX a guess on the appropriate parameter or variable. Each observation (Xi , Yi ) will have a corresponding observable error attached to it, which we will call ei = Yi − Yi , see Figure 3.2. In other words, we obtain the guessed Yi , (Yi ) corresponding to each Xi from the guessed line,
50
Chapter 3: Simple Linear Regression
i . Next, we ﬁnd our error in guessing that Yi , by subtracting the actual Yi from the α + βX guessed Yi . The only difference between Figure 3.1 and Figure 3.2 is the fact that Figure 3.1 draws the true consumption line which is unknown to the researcher, whereas Figure 3.2 is a guessed consumption line drawn through the data. Therefore, while the ui ’s are unobservable, the ei ’s are observable. Note that there will be n errors for each line, one error corresponding to every observation. Similarly, there will be another set of n errors for another guessed line drawn through the data. For each guessed line, we can summarize its corresponding errors by one number, the sum of squares of these errors, which seems to be a natural criterion for penalizing a wrong guess. Note that a simple sum of these errors is not a good choice for a measure of misﬁt since positive errors end up canceling negative errors when both should be counted in our measure. However, this does not mean that the sum of squared error is the only single measure of misﬁt. Other measures include the sum of absolute errors, but this latter measure is mathematically more diﬃcult to handle. Once the measure of misﬁt is chosen, α and β could then be estimated by minimizing this measure. In fact, this is the idea behind least squares estimation.
3.2
Least Squares Estimation and the Classical Assumptions
Least squares minimizes the residual sum of squares where the residuals are given by i ei = Yi − α − βX
i = 1, 2, . . . , n
denote guesses on the regression parameters α and β, respectively. The residual and α and β i )2 is minimized by the two − βX sum of squares denoted by RSS = ni=1 e2i = ni=1 (Yi − α ﬁrstorder conditions: n n Xi = 0 ∂( ni=1 e2i )/∂α = −2 ni=1 ei = 0; or α−β (3.2) i=1 Yi − n i=1 n n n X 2 = 0 ∂( ni=1 e2i )/∂β = −2 ni=1 ei Xi = 0; or Xi − β i=1 Yi Xi − α i=1 i
(3.3)
i=1
Solving the least squares normal equations given in (3.2) and (3.3) for α and β one gets OLS X OLS = n xi yi / n x2 ¯ and β (3.4) α OLS = Y¯ − β i=1 i=1 i ¯ = n Xi /n, yi = Yi − Y¯ , xi = Xi − X, ¯ n x2 = n X 2 − nX ¯ 2, where Y¯ = ni=1 Yi /n, X i=1 i=1 i i=1 i n n n n 2 2 2 ¯ ¯ ¯ i=1 yi = i=1 Yi − nY and i=1 xi yi = i=1 Xi Yi − nX Y . These estimators are subscripted by OLS denoting the ordinary least squares estimators. The OLS residuals ei = Yi − α OLS − β OLS Xi automatically satisfy the two numerical relationships given by (3.2) and (3.3). The ﬁrst relationship states that (i) ni=1 ei = 0, the residuals sum to zero. This is true as long as there is a constant in the regression. This numerical property of the least squares residuals also implies that the estimated regression line passes through the ¯ Y¯ ). To see this, average the residuals, or equation (3.2), this gives immediately sample means (X, n ¯ Y¯ = α OLS + β OLS X. The second relationship states that (ii) i=1 ei Xi = 0, the residuals and the explanatory variable are uncorrelated. Other numerical properties that the OLS estimators n n n satisfy are the following: (iii) i=1 Yi = i=1 Yi and (iv) i=1 ei Yi = 0. Property (iii) states that the sum of the estimated Yi ’s or the predicted Yi ’s from the sample is equal to the sum of the
3.2
Least Squares Estimation and the Classical Assumptions
Y
Y
ui
= + >X i
+ + + + + + + + + +
Consumption
+ + + + + + + + + + + +
+ + + + ++ + + + + +
=ˆ + >ˆX i
= + >X i
Yi
ei
=ˆ + >ˆX i
Disposable Income
Xi
Figure 3.1 ‘True’ Consumption Function
+ + + + + + + + + +
+ +
+ + +
Consumption
Yi
51
X
+ + + + + + + + + +
+ + + + ++ + + + + +
+ + +
Disposable Income
Xi
X
Figure 3.2 Estimated Consumption Function
actual Yi ’s. Property (iv) states that the OLS residuals and the predicted Yi ’s are uncorrelated. The proof of (iii) and (iv) follow from (i) and (ii) see problem 1. Of course, underlying our estimation of (3.1) is the assumption that (3.1) is the true model generating the data. In this case, (3.1) is linear in the parameters α and β, and contains only one explanatory variable Xi besides the constant. The inclusion of other explanatory variables in the model will be considered in Chapter 4, and the relaxation of the linearity assumption will be considered in Chapters 8 and 13. In order to study the statistical properties of the OLS estimators of α and β, we need to impose some statistical assumptions on the model generating the data. Assumption 1: The disturbances have zero mean, i.e., E(ui ) = 0 for every i = 1, 2, . . . , n. This assumption is needed to insure that on the average we are on the true line. To see what happens if E(ui ) = 0, consider the case where households consistently underreport their consumption by a constant amount of δ dollars, while their income is measured accurately, say by crossreferencing it with their IRS tax forms. In this case, (Observed Consumption) = (T rue Consumption) − δ and our regression equation is really (T rue Consumption)i = α + β(Income)i + ui But we observe, (Observed Consumption)i = α + β(Income)i + ui − δ This can be thought of as the old regression equation with a new disturbance term u∗i = ui − δ. Using the fact that δ > 0 and E(ui ) = 0, one gets E(u∗i ) = −δ < 0. This says that for all households with the same income, say $20, 000, their observed consumption will be on the average below that predicted from the true line [α+β($20, 000)] by an amount δ. Fortunately, one
52
Chapter 3: Simple Linear Regression
can deal with this problem of constant but nonzero mean of the disturbances by reparametizing the model as (Observed Consumption)i = α∗ + β(Income)i + ui where α∗ = α − δ. In this case, E(ui ) = 0 and α∗ and β can be estimated from the regression. Note that while α∗ is estimable, α and δ are nonestimable. Also note that for all $20, 000 income households, their average consumption is [(α − δ) + β($20, 000)].
Assumption 2: The disturbances have a constant variance, i.e., var(ui ) = σ 2 for every i = 1, 2, . . . , n. This insures that every observation is equally reliable. To see what this assumption means, consider the case where var(ui ) = σ 2i , for i = 1, 2, . . . , n. In this case, each observation has a different variance. An observation with a large variance is less reliable than one with a smaller variance. But, how can this differing variance happen? In the case of consumption, households with large disposable income (a large Xi , say $100, 000) may be able to save more (or borrow more to spend more) than households with smaller income (a small Xi , say $10, 000). In this case, the variation in consumption for the $100, 000 income household will be much larger than that for the $10, 000 income household. Therefore, the corresponding variance for the $100, 000 observation will be larger than that for the $10, 000 observation. Consequences of different variances for different observations will be studied more rigorously in Chapter 5. Assumption 3: The disturbances are not correlated, i.e., E(ui uj ) = 0 for i = j, i, j = 1, 2, . . . , n. Knowing the ith disturbance does not tell us anything about the jth disturbance, for i = j. For the consumption example, the unforseen disturbance which caused the ith household to consume more, (like a visit of a relative), has nothing to do with the unforseen disturbances of any other household. This is likely to hold for a random sample of households. However, it is less likely to hold for a time series study of consumption for the aggregate economy, where a disturbance in 1945, a war year, is likely to affect consumption for several years after that. In this case, we say that the disturbance in 1945 is related to the disturbances in 1946, 1947, and so on. Consequences of correlated disturbances will be studied in Chapter 5. Assumption 4: The explanatory variable X is nonstochastic, i.e., ﬁxed in repeated samples, and hence, not correlated with the disturbances. Also, ni=1 x2i /n = 0 and has a ﬁnite limit as n tends to inﬁnity.
This assumption states that we have at least two distinct values for X. This makes sense, since ¯ = X, the common we need at least two distinct points to draw a straight line. Otherwise X n 2 ¯ value, and x = X − X = 0, which violates i=1 xi = 0. In practice, one always has several distinct values of X. More importantly, this assumption implies that X is not a random variable and hence is not correlated with the disturbances. In section 5.3, we will relax the assumption of a nonstochastic X. Basically, X becomes a random variable and our assumptions have to be recast conditional on the set of X’s that are observed. This is the more realistic case with economic data. The zero mean assumption becomes E(ui /X) = 0, the constant variance assumption becomes var(ui /X) = σ 2 , the no serial correlation assumption becomes E(ui uj /X) = 0 for i = j. The conditional expectation here is with respect to every observation on Xi from i = 1, 2, . . . n. Of course, one can show that if E(ui /X) = 0 for all i, then Xi and ui are not correlated. The reverse is not necessarily true, see
3.2
Least Squares Estimation and the Classical Assumptions
53
Consumption
Y + +++ + ++ + + + + ++ + + + ++
a + bX
+ + + + + + + + + + + ++ ++ + ++ + + + + +
mx
Disposable Income
X
Figure 3.3 Consumption Function with Cov(X, u) > 0
problem 3 of Chapter 2. That problem shows that two random variables, say ui and Xi could be uncorrelated, i.e., not linearly related when in fact they are nonlinearly related with ui = Xi2 . Hence, E(ui /Xi ) = 0 is a stronger assumption than ui and Xi are not correlated. By the law of iterated expectations given in the Appendix of Chapter 2, E(ui /X) = 0 implies that E(ui ) = 0. It also implies that ui is uncorrelated with any function of Xi . This is a stronger assumption than ui is uncorrelated with Xi . Therefore, conditional on Xi , the mean of the disturbances is zero and does not depend on Xi . In this case, E(Yi /Xi ) = α + βXi is linear in α and β and is assumed to be the true conditional mean of Y given X. To see what a violation of assumption 4 means, suppose that X is a random variable and that X and u are positively correlated, then in the consumption example, households with income above the average income will be associated with disturbances above their mean of zero, and hence positive disturbances. Similarly, households with income below the average income will be associated with disturbances below their mean of zero, and hence negative disturbances. This means that the disturbances are systematically affected by values of the explanatory variable and the scatter of the data will look like Figure 3.3. Note that if we now erase the true line (α + βX), and estimate this line from the data, the least squares line drawn through the data is going to have a smaller intercept and a larger slope than those of the true line. The scatter should look like Figure 3.4 where the disturbances are random variables, not correlated with the Xi ’s, drawn from a distribution with zero mean and constant variance. Assumptions 1 and 4 insure that E(Yi /Xi ) = α + βXi , i.e., on the average we are on the true line. Several economic models will be studied where X and u are correlated. The consequences of this correlation will be studied in Chapters 5 and 11. We now generate a data set which satisﬁes all four classical assumptions. Let α and β take the arbitrary values, say 10 and 0.5 respectively, and consider a set of 20 ﬁxed X’s, say income classes from $10 to $105 (in thousands of dollars), in increments of $5, i.e., $10, $15, $20, $25, . . . , $105. Our consumption variable Yi is constructed as (10 + 0.5Xi + ui ) where ui is a
54
Chapter 3: Simple Linear Regression
f (u )
Y
= + >X
X1 X2 . . .
Xn
X Figure 3.4 Random Disturbances around the Regression
disturbance which is a random draw from a distribution with zero mean and constant variance, say σ 2 = 9. Computers generate random numbers with various distributions. In this case, Figure 3.4 would depict our data, with the true line being (10 + 0.5X) and ui being random draws from the computer which are by construction independent and identically distributed with mean zero and variance 9. For every set of 20 ui ’s randomly generated, given the ﬁxed Xi ’s, we obtain a corresponding set of 20 Yi ’s from our linear regression model. This is what we mean in assumption 4 when we say that the X’s are ﬁxed in repeated samples. Monte Carlo experiments generate a large number of samples, say a 1000, in the fashion described above. For each data set generated, least squares can be performed and the properties of the resulting estimators which are derived analytically in the remainder of this chapter, can be veriﬁed. For example, the average of the 1000 estimates of α and β can be compared to their true values to see whether these least squares estimates are unbiased. Note what will happen to Figure 3.4 if E(ui ) = −δ where δ > 0, or var(ui ) = σ 2i for i = 1, 2, . . . , n. In the ﬁrst case, the mean of f (u), the probability density function of u, will shift off the true line (10+ 0.5X) by −δ. In other words, we can think of the distributions of the ui ’s, shown in Figure 3.4 , being centered on a new imaginary line parallel to the true line but lower by a distance δ. This means that one is more likely to draw negative disturbances than positive disturbances, and the observed Yi ’s are more likely to be below the true line than above it. In the second case, each f (ui ) will have a different variance, hence the spread of this probability density function will vary with each observation. In this case, Figure 3.4 will have a distribution for the ui ’s which has a different spread for each observation. In other words, if the ui ’s are say normally distributed, then u1 is drawn from a N (0, σ 21 ) distribution, whereas u2 is drawn from a N (0, σ 22 ) distribution, and so on. Violation of the classical assumptions can also be studied using Monte Carlo experiments, see Chapter 5.
3.3
3.3
Statistical Properties of Least Squares
55
Statistical Properties of Least Squares
(i) Unbiasedness Given assumptions 14, it is easy to show that β OLS is unbiased for β. In fact, using equation (3.4) one can write OLS = n xi yi / n x2 = n xi Yi / n x2 = β + n xi ui / n x2 (3.5) β i=1 i=1 i i=1 i=1 i i=1 i=1 i where the second equality follows from the fact that yi = Yi − Y¯ and ni=1 xi Y¯ = Y¯ ni=1 xi = 0. The third equality follows from substituting Yi from (3.1) and using the fact that ni=1 xi = 0. Taking expectations of both sides of (3.5) and using assumptions 1 and 4, one can show that OLS ) = β. Furthermore, one can derive the variance of β OLS from (3.5) since E(β n n 2 2 2 (3.6) var(β OLS ) = E(β OLS − β) = E( i=1 xi ui / i=1 xi ) n n 2 2 n = var( i=1 xi ui / i=1 xi ) = σ / i=1 x2i
where the last equality uses assumptions 2 and 3, i.e., that the ui ’s are not correlated with each other and that their variance is constant, see problem 4. Note that the variance of the OLS estimator of β depends upon σ 2 , the variance of the disturbances in the true model, and on the variation in X. The larger the variation in X the larger is ni=1 x2i and the smaller is the OLS . variance of β (ii) Consistency
Next, we show that β OLS is consistent for β. A suﬃcient condition for consistency is that β OLS OLS is unbiased and its variance tends to zero as n tends to inﬁnity. We have already shown β to be unbiased, it remains to show that its variance tends to zero as n tends to inﬁnity. n 2 2 lim var(β OLS ) = lim [(σ /n)/( i=1 xi /n)] = 0 n→∞
n→∞
where the second equality follows from the fact that (σ 2 /n) → 0 and ( ni=1 x2i /n) = 0 and has a ﬁnite limit, see assumption 4. Hence, plim β OLS = β and β OLS is consistent for β. Similarly one can show that α OLS is unbiased and consistent for α with variance σ 2 ni=1 Xi2 /n ni=1 x2i , n 2 ¯ 2 and cov( αOLS , β OLS ) = −Xσ / i=1 xi , see problem 5. (iii) Best Linear Unbiased
OLS OLS as n wi Yi where wi = xi / n x2 . This proves that β Using (3.5) one can write β i=1 i=1 i is a linear combination of the Yi ’s, with weights wi satisfying the following properties: n n n n 2 2 (3.7) i=1 wi = 0; i=1 wi Xi = 1; i=1 wi = 1/ i=1 xi
The next theorem shows that among all linear unbiased estimators of β, it is β OLS which has the smallest variance. This is known as the GaussMarkov Theorem. = n ai Yi for β, where the ai ’s denote Theorem 1: Consider any arbitrary linear estimator β i=1 is unbiased for β, and assumptions 1 to 4 are satisﬁed, then var(β) ≥ arbitrary constants. If β var(β ). OLS
56
Chapter 3: Simple Linear Regression
one gets β = α n ai +β n ai Xi + n ai ui . For β Proof: Substituting Yi from (3.1) into β, i=1 i=1 i=1 n n =α to be unbiased for β it must followthat E(β) i=1 ai +β i=1 ai Xi = β for all observations i = 1, 2, . . . , n. This means that ni=1 ai = 0 and ni=1 ai Xi = 1 for all i = 1, 2, . . . , n. Hence, = β + n ai ui with var(β) = var(n ai ui ) = σ 2 n a2 where the last equality follows β i=1 i=1 i=1 i from assumptions 2 and 3. But the ai ’s are constants which differ from the wi ’s, the weights of the OLS estimator, by some other constants, say di ’s, i.e., ai = wi + di for i = 1, 2, . . . , n. Using n the properties of the a ’s and w one can deduce similar properties on the d ’s i.e., i i i i=1 di = 0 n and i=1 di Xi = 0. In fact, n n n n 2 2 2 i=1 ai = i=1 di + i=1 wi + 2 i=1 wi di n n n 2 where i=1 wi d i = i=1 xi di / i=1 xi = 0. This follows from the deﬁnition of wi and the fact n n that i=1 di = i=1 di Xi = 0. Hence, = σ2 var(β)
n
2 i=1 ai
= σ2
n
2 i=1 di
+ σ2
n
2 i=1 wi
2 = var(β OLS ) + σ
n
2 i=1 di
≥ var(β Since σ 2 ni=1 d2i is nonnegative, this proves that var(β) OLS ) with the equality holding reduces to β only if di = 0 for all i = 1, 2, . . . , n, i.e., only if ai = wi , in which case β OLS . Therefore, any linear estimator of β, like β that is unbiased for β has variance at least as large OLS ). This proves that β OLS is BLUE, Best among all Linear Unbiased Estimators of as var(β β. Similarly, one can show that α OLS is linear in Yi and has the smallest variance among all linear unbiased estimators of α, if assumptions 1 to 4 are satisﬁed, see problem 6. This result implies that the OLS estimator of α is also BLUE.
3.4
Estimation of σ 2
The variance of the regression disturbances σ 2 is unknown and has to be estimated. In fact, both the variance of β OLS depend upon σ 2 , see (3.6) and problem 5. An OLS and that n of 2α 2 2 unbiased estimator for σ is s = i=1 ei /(n − 2). To prove this, we need the fact that
ei = Yi − α OLS − β ¯) OLS Xi = yi − β OLS xi = (β − β OLS )xi + (ui − u n ¯ where u ¯ = OLS = Y¯ − β OLS X and the third i=1 ui /n. The second equality substitutes α equality substitutes yi = βxi + (ui − u ¯). Hence, n n n 2 n 2 2 ¯)2 − 2(β ¯), OLS − β) i=1 xi + i=1 (ui − u i=1 xi (ui − u i=1 ei = (β OLS − β) and
n n 2 2 2 n 2 E( ni=1 e2i ) = i=1 xi var(β OLS ) + (n − 1)σ − 2E( i=1 xi ui ) / i=1 xi
= σ 2 + (n − 1)σ 2 − 2σ 2 = (n − 2)σ 2 where the ﬁrst equality uses the fact that E( ni=1 (ui − u ¯)2 ) = (n − 1)σ 2 and β OLS − β = n n n 2 2 2 i=1 xi ui / i=1 xi . The second equality uses the fact that var(β OLS ) = σ / i=1 xi and E( ni=1 xi ui )2 = σ 2 ni=1 x2i .
3.5
Maximum Likelihood Estimation
57
Therefore, E(s2 ) = E( ni=1 e2i /(n − 2)) = σ 2 . Intuitively, the estimator of σ 2 could be obtained from ni=1 (ui − u ¯)2 /(n − 1) if the true disturbances were known. Since the u’s are not known, consistent estimates of them are used. These are the ei ’s. Since ni=1 ei = 0, our estimator of σ 2 becomes ni=1 e2i /(n − 1). Taking expectations we ﬁnd that the correct divisor ought to be (n−2) and not (n−1) for this estimator to be unbiased for σ 2 . This is plausible, since we have estimated two parameters α and β in obtaining the ei ’s, and there are only n − 2 independent pieces of information left in the data. To prove this fact, consider the OLS normal equations given in (3.2) and (3.3). These equations represent two relationships involving the ei ’s. Therefore, knowing (n − 2) of the ei ’s we can deduce the remaining two ei ’s from (3.2) and (3.3).
3.5
Maximum Likelihood Estimation
Assumption 5: The ui ’s are independent and identically distributed N (0, σ 2 ). This assumption allows us to derive distributions of estimators and other test statistics. In fact using (3.5) one can easily see that β OLS is a linear combination of the ui ’s. But, a linear combination of normal random variables see Chapter 2, is itself a normal random variable, problem 15. Hence, β is N (β, σ 2 / n x2 ). Similarly α OLS is N (α, σ 2 n X 2 /n n x2 ), OLS
i=1
i
i=1
i
i=1
i
and Yi is N (α+βXi , σ 2 ). Moreover, we can write the density function of the u’s njoint2 probability 2 2 2 n/2 as f (u1 , u2 , . . . , un ; α, β, σ ) = (1/2πσ ) exp(− i=1 ui /2σ ). To get the likelihood function we make the transformation ui = Yi − α− βXi and note that the Jacobian of the transformation is 1. Therefore, f (Y1 , Y2 , . . . , Yn ; α, β, σ 2 ) = (1/2πσ 2 )n/2 exp{− Taking the log of this likelihood, we get logL(α, β, σ 2 ) = −(n/2)log(2πσ 2 ) −
n
i=1 (Yi
n
i=1 (Yi
− α − βXi )2 /2σ 2 }
− α − βXi )2 /2σ 2
(3.8)
(3.9)
Maximizing this likelihood with respect to α, β and σ 2 one gets the maximum likelihood estimators (MLE). However, only the second term in the log likelihood contains α and β and that term (without the negative sign) has already been minimized with respect to α and β in (3.2) and (3.3) giving us the OLS estimators. Hence, α M LE = α OLS and β M LE = β OLS . Similarly, 2 and setting this derivative equal to zero one gets by differentiating logL with respect to σ σ 2M LE = ni=1 e2i /n, see problem 7. Note that this differs from s2 only in the divisor. In fact, E( σ 2M LE ) = (n − 2)σ 2 /n = σ 2 . Hence, σ 2M LE is biased but note that it is still asymptotically unbiased. So far, the gains from imposing assumption 5 are the following: The likelihood can be formed, maximum likelihood estimators can be derived, and distributions can be obtained for these estimators. One can also derive the Cram´erRao lower bound for unbiased estimators of the 2 parameters and show that the α OLS and β OLS attain this bound whereas s does not. This derivation is postponed until Chapter 7. In fact, one can show following the theory of complete 2 suﬃcient statistics that α OLS , β OLS and s are minimum variance unbiased estimators for α, β 2 and σ , see Chapter 2. This is a stronger result (for α OLS and β OLS ) than that obtained using the GaussMarkov Theorem. It says, that among all unbiased estimators of α and β, the OLS
58
Chapter 3: Simple Linear Regression
estimators are the best. In other words, our set of estimators include now all unbiased estimators and not just linear unbiased estimators. This stronger result is obtained at the expense of a stronger distributional assumption, i.e., normality. If the distribution of the disturbances is not normal, then OLS is no longer MLE. In this case, MLE will be more eﬃcient than OLS as long as the distribution of the disturbances is correctly speciﬁed. Some of the advantages and disadvantages of MLE were discussed in Chapter 2. 2 We found the distributions of α OLS , β OLS , now we give that of s . In Chapter 7, it is shown n 2 2 that i=1 ei /σ is a chisquared with (n − 2) degrees of freedom. Also, s2 is independent of α OLS and β OLS . This is useful for test of hypotheses. In fact, the major gain from assumption 5 is that we can perform test of hypotheses. OLS , one gets z = (β OLS − β)/(σ 2 / n x2 ) 12 ∼ Standardizing the normal random variable β i=1 i N (0, 1). Also, (n − 2)s2 /σ 2 is distributed as χ2n−2 . Hence, one can divide z, a N (0, 1) random variable, by the square root of (n − 2)s2 /σ 2 divided by its degrees of freedom (n − 2) to get a tstatistic with (n − 2) degrees of freedom. The resulting statistic is tobs = (β OLS − n 1 2 2 β)/(s / i=1 xi ) 2 ∼ tn−2 , see problem 8. This statistic can be used to test H0 ; β = β 0 , versus H1 ; β = β 0 , where β 0 is a known constant. Under H0 , tobs can be calculated and its value can be compared to a critical value from a tdistribution with (n − 2) degrees of freedom, at a speciﬁed critical value of α%. Of speciﬁc interest is the hypothesis H0 ; β = 0, which states that there is no linear relationship between Yi and Xi . Under H0 , OLS /(s2 / tobs = β
n
2 12 i=1 xi )
OLS /se( OLS ) =β β
n 2 2 12 where se( β OLS ) = (s / i=1 xi ) . If tobs  > tα/2;n−2 , then H0 is rejected at the α% signiﬁcance level. tα/2;n−2 represents a critical value obtained from a tdistribution with n − 2 degrees of freedom. It is determined such that the area to its right under a tn−2 distribution is equal to α/2. Similarly one can get a conﬁdence interval for β by using the fact that, Pr[−tα/2;n−2 < tobs < tα/2;n−2 ] = 1 − α and substituting for tobs its value derived above as (β β OLS − β)/se( OLS ). Since the critical values are known, β and se( β ) can be calculated from the data, the OLS OLS following (1 − α)% conﬁdence interval for β emerges OLS ). OLS ± tα/2;n−2 se( β β
Tests of hypotheses and conﬁdence intervals on α and σ 2 can be similarly constructed using the normal distribution of α OLS and the χ2n−2 distribution of (n − 2)s2 /σ 2 .
3.6
A Measure of Fit
We have obtained the least squares estimates of α, β and σ 2 and found their distributions under normality of the disturbances. We have also learned how to test hypotheses regarding these parameters. Now we turn to a measure of ﬁt for this estimated regression line. Recall, that ei = Yi − Yi where Yi denotes the predicted Yi from n the least squares regression nline at the nvalue Xi , i.e., α OLS + β OLS Xi . Using the fact that i=1 ei = 0, we deduce that i=1 Yi = i=1 Yi , and therefore, Y¯ = Y . The actual and predicted values of Y have the same sample mean, see numerical properties (i) and (iii) of the OLS estimators discussed in section 2. This is true
3.6
A Measure of Fit
59
as long as there is a constant in the regression. Adding and subtracting Y¯ from ei , we get ei = yi − yi , or yi = ei + yi . Squaring and summing both sides: n
2 i=1 yi
=
n
2 i=1 ei
+
n
i2 i=1 y
+2
n
i i=1 ei y
=
n
2 i=1 ei
+
n
i2 i=1 y
OLS xi and where the last equality follows from the fact that yi = β n
i i=1 ei y
=
n
=0
i=1 ei Yi
n
i=1 ei xi
(3.10)
= 0. In fact,
means that the OLS residuals are uncorrelated with the predicted values from the regression, see numerical properties (ii) and (iv) of the OLS estimates discussed in section 3.2. In other words, (3.10) says that the total variation in Yi , around its sample mean Y¯ i.e., ni=1 yi2 , can be 2 n x2 , decomposed into two parts: the ﬁrst is the regression of squares ni=1 yi2 = β OLS i=1 i n sums and the second is the residual sums of squares i=1 e2i . In fact, regressing Y on a constant yields α OLS = Y¯ , see problem 2, and the unexplained residual sums of squares of this naive model is n
i=1 (Yi
−α OLS )2 =
n
i=1 (Yi
− Y¯ )2 =
n
2 i=1 yi .
Therefore, ni=1 yi2 in (3.10) gives the explanatory power of X after the constant is ﬁt. Using this decomposition, one can deﬁne the explanatory power of the regression as the ratio of regression sums of squares to the total sums of squares. In other words, deﬁne the n n 2 = 2/ 2 and this value is clearly between 0 and 1. In fact, dividing (3.10) by y y R i i i=1 i=1 n 2 n 2 n n 2 2 2 i=1 ei is a measure of misﬁt which was i=1 yi . The i=1 ei / i=1 yi one gets R = 1 − minimized by least squares. If ni=1 e2i is large, this means that the regression is not explaining a lot of the variation in Y and hence, the R2 value would be small. Alternatively, if the ni=1 e2i is small, then the ﬁt is good and R2 is large. In fact, for a perfect where all the observations n ﬁt, 2 = 0 and R2 = 1. The other e lie on the ﬁtted line, Yi = Yi and ei = 0, which means that i=1 i extreme case is where the regression sums of squares ni=1 yi2 = 0. In other the linear n words, n 2 2 regression explains nothing of the variation in Yi . In this case, i=1 yi = i=1 ei and R2 = 0. Note that since ni=1 yi2 = 0 implies yi = 0 for every i, which in turn means that Yi = Y¯ for every i. The ﬁtted regression line is a horizontal line drawn at Y = Y¯ , and the independent variable X does not have any explanatory power in a linear relationship with Y . Note that R2 has two alternative meanings: (i) It is the simple squared correlation coeﬃcient between Yi and Yi , see problem 9. Also, for the simple regression case, (ii) it is the simple squared correlation between X and Y . This means that before one runs the regression of Y on 2 which in turn tells us the proportion of the variation in Y that will X, one can compute rxy be explained by X. If this number is pretty low, we have a weak linear relationship between Y and X and we know that a poor ﬁt will result if Y is regressed on X. It is worth emphasizing that R2 is a measure of linear association between Y and X. There could exist, for example, a perfect quadratic relationship between X and Y , yet the estimated least squares line through the data is a ﬂat line implying that R2 = 0, see problem 3 of Chapter 2. One should also be suspicious of least squares regressions with R2 that are too close to 1. In some cases, we may not want to include a constant in the regression. In such cases, one should use an uncentered R2 as a measure ﬁt. The appendix to this chapter deﬁnes both centered and uncentered R2 and explains the difference between them.
60
Chapter 3: Simple Linear Regression
3.7
Prediction
Let us now predict Y0 given X0 . Usually this is done for a time series regression, where the researcher is interested in predicting the future, say one period ahead. This new observation Y0 is generated by (3.1), i.e., Y0 = α + βX0 + u0
(3.11)
What is the Best Linear Unbiased Predictor (BLUP) of E(Y0 )? From (3.11), E(Y0 ) = α + βX0 is a linear combination of α and β. Using the GaussMarkov result, Y0 = α OLS + β OLS X0 is 2 2 ¯ BLUE for α+βX0 and the variance of this predictor of E(Y0 ) is σ [(1/n)+(X0 − X) / ni=1 x2i ], see problem 10. But, what if we are interested in the BLUP for Y0 itself ? Y0 differs from E(Y0 ) by u0 , and the best predictor of u0 is zero, so the BLUP for Y0 is still Y0 . The forecast error is Y0 − Y0 = [Y0 − E(Y0 )] + [E(Y0 ) − Y0 ] = u0 + [E(Y0 ) − Y0 ]
where u0 is the error committed even if the true regression line is known, and E(Y0 ) − Y0 is the difference between the sample and population regression lines. Hence, the variance of the forecast error becomes: ¯ 2 / n x2 ] var(u0 ) + var[E(Y0 ) − Y0 ] + 2cov[u0 , E(Y0 ) − Y0 ] = σ 2 [1 + (1/n) + (X0 − X) i=1 i
This says that the variance of the forecast error is equal to the variance of the predictor of E(Y0 ) plus the var(u0 ) plus twice the covariance of the predictor of E(Y0 ) and u0 . But, this last covariance is zero, since u0 is a new disturbance and is not correlated with the disturbances in the sample upon which Yi is based. Therefore, the predictor of the average consumption of a $20, 000 income household is the same as the predictor of consumption for a speciﬁc household whose income is $20, 000. The difference is not in the predictor itself but in the variance attached to it. The latter variance being larger only by σ 2 , the variance of u0 . The variance of the predictor therefore, depends upon σ 2 , the sample size, the variation in the X’s, and how far X0 is from To summarize, the smaller σ 2 is, the larger n nthe 2sample mean of the observed data. ¯ and i=1 xi are, and the closer X0 is to X, the smaller is the variance of the predictor. One can construct 95% conﬁdence intervals to these predictionsfor every value of X0 . In fact, this n 2 12 ¯ 2 is ( αOLS + β OLS X0 ) ± t.025;n−2 {s[1 + (1/n) + (X0 − X) / i=1 xi ] } where s replaces σ, and t.025;n−2 represents the 2.5% critical value obtained from a tdistribution with n − 2 degrees of freedom. Figure 3.5 shows this conﬁdence band around the estimated regression line. This is a ¯ as expected, and widens as we predict away from X. ¯ hyperbola which is the narrowest at X
3.8
Residual Analysis
A plot of the residuals of the regression is very important. The residuals are consistent estimates of the true disturbances. But unlike the ui ’s, these ei ’s are not independent. In fact, the OLS normal equations (3.2)and (3.3) give us two relationships between these residuals. Therefore, knowing (n − 2) of these residuals the remaining two residuals can be deduced. If we had the true ui ’s, and we plotted them, they should look like a random scatter around the horizontal axis with no speciﬁc pattern to them. A plot of the ei ’s that shows a certain pattern like a set of positive residuals followed by a set of negative residuals as shown in Figure 3.6(a) may be
3.8
Residual Analysis
61
Y + + + + + + ++ +
Y
+
Yˆi = =ˆ + >ˆX i
+ +
+ +
+ +
+ + +
+ ++ + +
+
X
X Figure 3.5 95% Conﬁdence Bands
indicative of a violation of one of the 5 assumptions imposed on the model, or simply indicating a wrong functional form. For example, if assumption 3 is violated, so that the ui ’s are say positively correlated, then it is likely to have a positive residual followed by a positive one, and a negative residual followed by a negative one, as observed in Figure 3.6(b). Alternatively, if we ﬁt a linear regression line to a true quadratic relation between Y and X, then a scatter of residuals like that in Figure 3.6(c) will be generated. We will study how to deal with this violation and how to test for it in Chapter 5. Large residuals are indicative of bad predictions in the sample. A large residual could be a typo, where the researcher entered this observation wrongly. Alternatively, it could be an inﬂuential observation, or an outlier which behaves differently from the other data points in the sample and therefore, is further away from the estimated regression line than the other data points. The fact that OLS minimizes the sum of squares of these residuals means that a large weight is put on this observation and hence it is inﬂuential. In other words, removing this observation from the sample may change the estimates and the regression line signiﬁcantly. For more on the study of inﬂuential observations, see Belsely, Kuh and Welsch (1980). We will focus on this issue in Chapter 8 of this book.
Yi
ei
Yˆi = =ˆ + >ˆX i
Yi = f ( X i ) Yi
+
Yˆi = =ˆ + >ˆX i
_ +
0

+
+
> Xi
>
(a) Figure 3.6 Positively Correlated Residuals
(b)
Xi
>
(c)
Xi
62
Chapter 3: Simple Linear Regression
ei
+ + + + ++ + + + + ++ + + + + + +++ + + + + + + ++ + + + + + + + + + + + + + + ++ + + +++ + + + + ++ + + + ++ + + ++ +
0
> Xi
Figure 3.7 Residual Variation Growing with X
One can also plot the residuals versus the Xi ’s. If a pattern like Figure 3.7 emerges, this could be indicative of a violation of assumption 2 because the variation of the residuals is growing with Xi when it should be constant for all observations. Alternatively, it could imply a relationship between the Xi ’s and the true disturbances which is a violation of assumption 4. In summary, one should always plot the residuals to check the data, identify inﬂuential observations, and check violations of the 5 assumptions underlying the regression model. In the next few chapters, we will study various tests of the violation of the classical assumptions. Most of these tests are based on the residuals of the model. These tests along with residual plots should help the researcher gauge the adequacy of his or her model.
Table 3.1 Simple Regression Computations OBS
Consumption yi
Income xi
yi=Yi−Y¯
1 2 3 4 5 6 7 8 9 10
4.6 3.6 4.6 6.6 7.6 5.6 5.6 8.6 8.6 9.6
5 4 6 8 8 7 6 9 10 12
–1.9 –2.9 –1.9 0.1 1.1 –0.9 –0.9 2.1 2.1 3.1
SUM
6.5
75
0
MEAN
6.5
7.5
xi yi
x2i
Yˆi
ei
–2.5 –3.5 –1.5 0.5 0.5 –0.5 –1.5 1.5 2.5 4.5
4.75 10.15 2.85 0.05 0.55 0.45 1.35 3.15 5.25 13.95
6.25 12.25 2.25 0.25 0.25 0.25 2.25 2.25 6.25 20.25
4.476190 3.666667 5.285714 6.904762 6.904762 6.095238 5.285714 7.714286 8.523810 10.142857
0.123810 –0.066667 –0.685714 –0.304762 0.695238 –0.495238 0.314286 0.885714 0.076190 –0.542857
0
42.5
52.5
65
0
¯ xi =XiX
6.5
3.9
3.9
Numerical Example
63
Numerical Example
Table 3.1 gives the annual consumption of 10 households each selected randomly from a group of households with a ﬁxed personal disposable income. Both income and consumption are measured in $10, 000, so that the ﬁrst household earns $50, 000 and consumes $46, 000 annually. It is worthwhile doing the computations necessary to obtain the least squares regression estimates of consumption on income in this simple case and to compare them with those obtained from ¯ = 7.5 and form two a regression package. In order to do this, we ﬁrst compute Y¯ = 6.5 and X OLS we need n xi yi , ¯ To get β new columns of data made up of yi = Yi − Y¯ and xi = Xi − X. i=1 so we multiply these last two columns by each other and sum to get 42.5. The denominator of n 2 2 β OLS is given by i=1 xi . This is why we square the xi column to get xi and sum to obtain 52.5. Our estimate of β OLS = 42.5/52.5 = 0.8095 which is the estimated marginal propensity to consume. This is the extra consumption brought about by an extra dollar of disposable income. OLS X ¯ = 6.5 − (0.8095)(7.5) = 0.4286 α OLS = Y¯ − β
This is the estimated consumption at zero personal disposable income. The ﬁtted values or predicted values from this regression are computed from Yi = α OLS + β OLS Xi = 0.4286 + 0.8095Xi and are given in Table 3.1. Note that the mean of Yi is equal to the mean of Yi conﬁrming one of the numerical from properties of least squares. The residuals are computed ei = Yi − Yi and they satisfy ni=1 ei = 0. It is left to the reader to verify that ni=1 ei Xi = 0. The residual sum of squares is obtained by squaring thecolumn of residuals and summing it. This gives us ni=1 e2i = 2.495238. This means that s2 = ni=1 e2i /(n − 2) = 0.311905. Its square root is given by s = 0.558. Thisis known as the standard error of the regression. In this case, n 2 2 the estimated var(β OLS ) is s / i=1 xi = 0.311905/52.5 = 0.005941 and the estimated
¯2 X (7.5)2 1 1 = 0.365374 + n + = 0.311905 var( α ) = s2 2 n 10 52.5 i=1 xi
Taking the square root of these estimated variances, we get the estimated standard errors of α OLS and β αOLS ) = 0.60446 and se( β OLS given by se( OLS ) = 0.077078. Since the disturbances are normal, the OLS estimators are also the maximum likelihood estimators, and are normally distributed themselves. For the null hypothesis H0a ; β = 0; the observed tstatistic is OLS − 0)/se( OLS ) = 0.809524/0.077078 = 10.50 tobs = (β β
and this is highly signiﬁcant, since Pr[t8  > 10.5] < 0.0001. This probability can be obtained using most regression packages. It is also known as the pvalue or probability value. It shows that this tvalue is highly unlikely and we reject Ha0 that β = 0. Similarly, the null hypothesis H0b ; α = 0, gives an observed tstatistic of tobs = ( αOLS − 0)/se( αOLS ) = 0.428571/0.604462 = 0.709, which is not signiﬁcant, since its pvalue is Pr[t8  > 0.709] < 0.498. Hence, we do not reject the null hypothesis H0b that α = 0. The total sum of squares is ni=1 yi2 = ni=1 (Yi −Y¯ )2 which can be obtained by squaring the yi column in Table 3.1 and summing. This yields ni=1 yi2 = 36.9. Also, the regression sum of squares = ni=1 yi2 = ni=1 (Yi − Y¯ )2 which can be obtained by subtracting Y¯ = Y = 6.5 from the Yi column, squaring that column and summing. This yields 34.404762. This could have also been obtained as
64
Chapter 3: Simple Linear Regression
2 n x2 = (0.809524)2 (52.5) = 34.404762. =β OLS i=1 i n = ni=1 yi2 − ni=1 e2i =36.9 − 2.495238 = 34.404762 as required. A ﬁnal check is that i=1 yi2 2 = ( n x y )2 /( n x2 )( n y 2 ) = (42.5)2 /(52.5)(36.9) = 0.9324. Recall, that R2 = rxy i=1 i i i=1 i i=1 i This could have also been obtained as R2 = 1 − ( ni=1 e2i / ni=1 yi2 ) = 1 − (2.495238/36.9) = 0.9324, or as R2 = ry2yˆ = ni=1 yi2 / ni=1 yi2 = 34.404762/36.9 = 0.9324. n
i2 i=1 y
This means that personal disposable income explains 93.24% of the variation in consumption. A plot of the actual, predicted and residual values versus time is given in Figure 3.8. This was done using EViews. 12
8
R esidual A ctual F itted
4
0
1
2
3
4
5
6
7
8
9
10
Figure 3.8 Residual Plot
3.10
Empirical Example
Table 3.2 gives (i) the logarithm of cigarette consumption (in packs) per person of smoking age (> 16 years) for 46 states in 1992, (ii) the logarithm of real price of cigarettes in each state, and (iii) the logarithm of real disposable income per capita in each state. This is drawn from Baltagi and Levin (1992) study on dynamic demand for cigarettes. It can be downloaded as Cigarett.dat from the Springer web site.
3.10
Empirical Example
Table 3.2 Cigarette Consumption Data LNC: log of consumption (in packs) per person of smoking age (>16) LNP: log of real price (1983$/pack) LNY: log of real disposable income percapita (in thousand 1983$) OBS
STATE
LNC
LNP
LNY
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
AL AZ AR CA CT DE DC FL GA ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY ND OH OK PA RI SC SD TN TX UT VT VA WA WV WI WY
4.96213 4.66312 5.10709 4.50449 4.66983 5.04705 4.65637 4.80081 4.97974 4.74902 4.81445 5.11129 4.80857 4.79263 5.37906 4.98602 4.98722 4.77751 4.73877 4.94744 4.69589 4.93990 5.06430 4.73313 4.77558 4.96642 5.10990 4.70633 4.58107 4.66496 4.58237 4.97952 4.72720 4.80363 4.84693 5.07801 4.81545 5.04939 4.65398 4.40859 5.08799 4.930651 4.66134 4.82454 4.83026 5.00087
0.20487 0.16640 0.23406 0.36399 0.32149 0.21929 0.28946 0.28733 0.12826 0.17541 0.24806 0.08992 0.24081 0.21642 â€“0.03260 0.23856 0.29106 0.12575 0.22613 0.23067 0.34297 0.13638 0.08731 0.15303 0.18907 0.32304 0.15852 0.30901 0.16458 0.34701 0.18197 0.12889 0.19554 0.22784 0.30324 0.07944 0.13139 0.15547 0.28196 0.19260 0.18018 0.11818 0.35053 0.12008 0.22954 0.10029
4.64039 4.68389 4.59435 4.88147 5.09472 4.87087 5.05960 4.81155 4.73299 4.64307 4.90387 4.72916 4.74211 4.79613 4.64937 4.61461 4.75501 4.94692 4.99998 4.80620 4.81207 4.52938 4.78189 4.70417 4.79671 4.83816 5.00319 5.10268 4.58202 4.96075 4.69163 4.75875 4.62730 4.83516 4.84670 4.62549 4.67747 4.72525 4.73437 4.55586 4.77578 4.85490 4.85645 4.56859 4.75826 4.71169
Data: Cigarette Consumption of 46 States in 1992
65
66
Chapter 3: Simple Linear Regression
Table 3.3 Cigarette Consumption Regression Analysis of Variance Source Model Error
DF 1 44
Root MSE Dep Mean C.V.
Sum of Squares 0.48048 1.16905
Mean Square 0.48048 0.02657
F Value 18.084
0.16300 4.84784 3.36234
Rsquare Adj Rsq
0.2913 0.2752
Prob > F 0.0001
Parameter Estimates Variable INTERCEP LNP
DF 1 1
Parameter Estimate 5.094108 –1.198316
Standard Error 0.06269897 0.28178857
T for H0: Parameter=0 81.247 –4.253
Prob > T 0.0001 0.0001
.4
Residuals
.2 .0 –.2 –.4 –.6 –.1
.0
.1
.2
.3
.4
Log of Real Price (1983$/Pack) Figure 3.9 Residuals versus LNP
Table 3.3 gives the SAS output for the regression of logC on logP . The price elasticity of demand for cigarettes in this simple model is (dlogC/logP ) which is the slope coeﬃcient. This is estimated to be −1.198 with a standard error of 0.282. This says that a 10% increase in real price of cigarettes has an estimated 12% drop in per capita consumption of cigarettes. The R2 of this regression is 0.29, s2 is given by the Mean Square Error of the regression which is 0.0266. Figure 3.9 plots the residuals of this regression versus the independent variable, while Figure 3.10 plots the predictions along with the 95% conﬁdence interval band for these predictions. One observation clearly stands out as an inﬂuential observation given its distance from the rest of the data and that is the observation for Kentucky, a producer state with very low real price. This observation almost anchors the straight line ﬁt through the data. More on inﬂuential observations in Chapter 8.
Problems
67
Predicted Value of LNC
5.6
5.2
4.8
4.4
4.0 –.1
.0
.1
.2
.3
.4
Log of Real Price (1983$/Pack) Figure 3.10 95% Conﬁdence Band for Predicted Values
Problems 1. For the simple regression with a constant Yi = α + βXi + ui , given in equation (3.1) verify the following numerical properties of the OLS estimators: n n n n n i=1 ei = 0, i=1 ei Xi = 0, i=1 ei Yi = 0, i=1 Yi = i=1 Yi
2. For the regression with only a constant Yi = α + ui with ui ∼ IID(0, σ 2 ), show that the least squares estimate is α OLS = Y¯ , var( αOLS ) = σ 2 /n, and the residual sums of squares is n of α n 2 2 ¯ i=1 (Yi − Y ) . i=1 yi = 3. For the simple regression without a constant Yi = βXi + ui , with ui ∼ IID(0, σ 2 ). (a) Derive the OLS estimator of β and ﬁnd its variance.
(b) What numerical properties of the OLS estimators described in problem 1 still hold for this model? (c) derive the maximum likelihood estimator of β and σ 2 under the assumption ui ∼ IIN(0, σ 2 ). (d) Assume σ 2 is known. Derive the Wald, LM and LR tests for H0 ; β = 1 versus H1 ; β = 1. 4. Use the fact that E( ni=1 xi ui )2 = ni=1 nj=1 xi xj E(ui uj ); and assumptions 2 and 3 to prove equation (3.6). 5. Using the regression given in equation (3.1): ¯ ¯; and deduce that E( (a) Show that α OLS = α + (β − β αOLS ) = α. OLS )X + u n n 2 (b) Using the fact that β OLS − β = i=1 xi ; use the results in part (a) to show that i=1 xi ui / ¯ 2 / n x2 )] = σ 2 n X 2 /n n x2 . var( αOLS ) = σ 2 [(1/n) + (X i=1 i i=1 i i=1 i
(c) Show that α OLS is consistent for α. n 2 ¯ 2 ¯ (d) Show that cov( αOLS , β OLS ) = −Xvar(β OLS ) = −σ X/ i=1 xi . This means that the sign ¯ If X ¯ is positive, this covariance will be of the covariance is determined by the sign of X. negative. This also means that if α OLS is overestimated, β OLS will be underestimated.
68
Chapter 3: Simple Linear Regression 6. Using the regression given in equation (3.1): n ¯ i and wi = xi / n x2 . (a) Prove that α OLS = i=1 λi Yi where λi = (1/n) − Xw i=1 i n n (b) Show that i=1 λi = 1 and i=1 λi Xi = 0. n n (c) Prove = i=1 bi Yi must satisfy i=1 bi = 1 and n that any other linear estimator of α, say α to be unbiased for α. i=1 bi Xi = 0 for α n n (d) Let bi = λi + fi ; show that i=1 fi = 0 and i=1 fi Xi = 0. n n n n αOLS ) + σ 2 i=1 fi2 . (e) Prove that var( α) = σ 2 i=1 b2i = σ 2 i=1 λ2i + σ 2 i=1 fi2 =var( 7.
(a) Differentiate (3.9) with respect to α and β and show that α MLE = α OLS , β MLE = β OLS . n 2 (b) Differentiate (3.9) with respect to σ 2 and show that σ MLE = i=1 e2i /n.
8. It is well known that a standard normal random variable N (0, 1) divided by a square root of a chi1 squared random variable divided by its degrees of freedom (χ2ν /ν) 2 results in a random variable that is tdistributed with ν degrees of freedom, provided the N (0, 1) and the χ2 variables are n 2 12 independent, see Chapter 2. Use this fact to show that (β OLS − β)/[s/( i=1 xi ) ] ∼ tn−2 . n n n n 2 9. (a) Using the fact that R2 = i=1 yi2 / i=1 yi2 ; yi = β OLS xi ; and β OLS = i=1 xi , i=1 xi yi / 2 2 show that R = rxy where, 2 = ( ni=1 xi yi )2 /( ni=1 x2i )( ni=1 yi2 ). rxy n n (b) Using the fact that y i + ei , show that i=1 yi yi = i=1 yi2 , and hence, deduce that i = y ry2yˆ = ( ni=1 yi yi )2 /( ni=1 yi2 )( ni=1 yi2 ) is equal to R2 .
10. Prediction. Consider the problem of predicting Y0 from (3.11). Given X0 , (a) Show that E(Y0 ) = α + βX0 . (b) Show that Y0 is unbiased for E(Y0 ).
(c) Show that var(Y0 ) = var( αOLS )+ X02 var(β αOLS , β OLS )+ 2X0 cov( OLS ). Deduce that var(Y0 ) ¯ 2 / n x2 ]. = σ 2 [(1/n) + (X0 − X) i=1 i n n (d) Consider a linear predictor of E(Y0 ), say Y0 = i=1 ai Yi , show that i=1 ai = 1 and n a X = X for this predictor to be unbiased for E(Y ). i i 0 0 i=1 n n 2 (e) Show that the var(Y0 ) = σ 2 i=1 a2i . Minimize i=1 ai subject to the restrictions given OLS + β in (d). Prove that the resulting predictor is Y0 = α OLS X0 and that the minimum n 2 2 2 ¯ variance is σ [(1/n) + (X0 − X) / i=1 xi ].
11. Optimal Weighting of Unbiased Estimators. This is based on Baltagi (1995). For the simple regression without a constant Yi = βXi + ui , i = 1, 2, . . . , N ; where β is a scalar and ui ∼ IID(0, σ 2 ) independent of Xi . Consider the following three unbiased estimators of β: = n Xi Yi / n X 2 , β = Y¯ /X ¯ β i 2 1 i=1
i=1
and
= n (Xi − X)(Y ¯ 2, ¯ i − Y¯ )/ n (Xi − X) β 3 i=1 i=1 ¯ = n Xi /n and Y¯ = n Yi /n. where X i=1 i=1
,β ) = var(β ) > 0, and that ρ = (the correlation coeﬃcient of β and (a) Show that cov(β 1 2 1 12 1 1 β 2 ) = [var(β 1 )/var(β 2 )] 2 with 0 < ρ12 ≤ 1. Show that the optimal combination of β 1 and , given by β = αβ + (1 − α)β where −∞ < α < ∞ occurs at α∗ = 1. Optimality here β 2 1 2 refers to minimizing the variance. Hint: Read the paper by SamuelCahn (1994).
Problems
69
,β ) = var(β ) > 0, and that ρ = (the correlation coeﬃcient of (b) Similarly, show that cov(β 1 3 1 13 1 1 β 1 and β 3 ) = [var(β 1 )/var(β 3 )] 2 = (1 − ρ212 ) 2 with 0 < ρ13 < 1. Conclude that the optimal is again α∗ = 1. and β combination β 1 3 and β is β = (1 − ρ2 )β + (c) Show that cov(β 2 , β 3 ) = 0 and that optimal combination of β 2 3 12 3 2 ρ12 β 2 = β 1 . This exercise demonstrates a more general result, namely that the BLUE of , has a positive correlation with any other linear unbiased estimator of β, β in this case β 1 and that this correlation can be easily computed from the ratio of the variances of these two estimators. denote the Best Linear Unbiased 12. Efficiency as Correlation. This is based on Oksanen (1993). Let β denote any linear unbiased estimator of β. Show that the relative eﬃciency Estimator of β and let β with respect to β is the squared correlation coeﬃcient between β and β. Hint: Compute the of β is BLUE. This variance of β + λ(β − β) for any λ. This variance is minimized at λ = 0 since β 2 β) which in turn proves the required result, see Zheng ) = E(β should give you the result that E(β (1994).
13. For the numerical illustration given in section 3.9, what happens to the least squares regression 2 αOLS ) and se(β OLS coeﬃcient estimates ( αOLS , β OLS ), s , the estimated se( OLS ), tstatistic for α a b 2 and β for H ; α = 0, and H ; β = 0 and R when: OLS 0 0 (a) Yi is regressed on Xi + 5 rather than Xi . In other words, we add a constant 5 to each observation of the explanatory variable Xi and rerun the regression. It is very instructive to see how the computations in Table 3.1 are affected by this simple transformation on Xi .
(b) Yi + 2 is regressed on Xi . In other words, a constant 2 is added to Yi . (c) Yi is regressed on 2Xi . (A constant 2 is multiplied by Xi ). 14. For the cigarette consumption data given in Table 3.2. (a) Give the descriptive statistics for logC, logP and logY . Plot their histogram. Also, plot logC versus logY and logC versus logP . Obtain the correlation matrix of these variables. (b) Run the regression of logC on logY . What is the income elasticity estimate? What is its standard error? Test the null hypothesis that this elasticity is zero. What is the s and R2 of this regression? (c) Show that the square of the simple correlation coeﬃcient between logC and logY is equal to R2 . Show that the square of the correlation coeﬃcient between the ﬁtted and actual values of logC is also equal to R2 . (d) Plot the residuals versus income. Also, plot the ﬁtted values along with their 95% conﬁdence band. 15. Consider the simple regression with no constant: Yi = βXi + ui
i = 1, 2, . . . , n
where ui ∼ IID(0, σ 2 ) independent of Xi . Theil (1971) showed that among all linear estimators in − β)2 is given by Yi , the minimum mean square estimator for β, i.e., that which minimizes E(β = β 2 n Xi Yi /(β 2 n X 2 + σ 2 ). β i i=1 i=1 = β/(1 + c), where c = σ 2 /β 2 n X 2 > 0. (a) Show that E(β) i=1 i (b) Conclude that the Bias (β) = E(β) − β = −[c/(1 + c)]β. Note that this bias is positive is biased towards zero. (negative) when β is negative (positive). This also means that β n 2 2 2 = E(β − β)2 = σ 2 /[ (c) Show that MSE(β) i=1 Xi + (σ /β )]. Conclude that it is smaller than the MSE(β OLS ).
70
Chapter 3: Simple Linear Regression
Table 3.4 Energy Data for 20 countries RGDP (in 106 1975 U.S.$’s)
Country Malta Iceland Cyprus Ireland norway Finland Portugal Denmark Greece Switzerland Austria Sweden Belgium Netherlands Turkey Spain Italy U.K. France W. Germany
1251 1331 2003 11788 27914 28388 30642 34540 38039 42238 45451 59350 62049 82804 91946 159602 265863 279191 358675 428888
EN 106 Kilograms Coal Equivalents 456 1124 1211 11053 26086 26405 12080 27049 20119 23234 30633 45132 58894 84416 32619 88148 192453 268056 233907 352677
16. Table 3.4 gives crosssection Data for 1980 on real gross domestic product (RGDP) and aggregate energy consumption (EN ) for 20 countries (a) Enter the data and provide descriptive statistics. Plot the histograms for RGDP and EN. Plot EN versus RGDP. (b) Estimate the regression: log(En) = α + βlog(RGDP ) + u. Be sure to plot the residuals. What do they show? (c) Test H0 ; β = 1. (d) One of your Energy data observations has a misplaced decimal. Multiply it by 1000. Now repeat parts (a), (b) and (c). (e) Was there any reason for ordering the data from the lowest to highest energy consumption? Explain. Lesson Learned : Always plot the residuals. Always check your data very carefully. 17. Using the Energy Data given in Table 3.4, corrected as in problem 16 part (d), is it legitimate to reverse the form of the equation? log(RDGP ) = γ + δlog(En) + ǫ (a) Economically, does this change the interpretation of the equation? Explain. (b) Estimate this equation and compare R2 of this equation with that of the previous problem. Why are they different? Also, check if δ = 1/β.
References
71
(c) Statistically, by reversing the equation, which assumptions do we violate? = R2 . (d) Show that δβ (e) Eﬀects of changing units in which variables are measured. Suppose you measured energy in BTU’s instead of kilograms of coal equivalents so that the original series was multiplied by 60. How does it change α and β in the following equations? log(En) = α + βlog(RDGP ) + u
En = α∗ + β ∗ RGDP + ν
for the loglog model, whereas both α Can you explain why α changed, but not β ∗ and ∗ β changed for the linear model? (f) For the loglog speciﬁcation and the linear speciﬁcation, compare the GDP elasticity for Malta and W. Germany. Are both equally plausible? (g) Plot the residuals from both linear and loglog models. What do you observe? (h) Can you compare the R2 and standard errors from both models in part (g)? Hint: Retrieve log(En) and log(En) in the loglog equation, exponentiate, then compute the residuals and s. These are comparable to those obtained from the linear model.
18. For the model considered in problem 16: log(En) = α + βlog(RGDP ) + u and measuring energy in BTU’s (like part (e) of problem 17). (a) What is the 95% conﬁdence prediction interval at the sample mean? (b) What is the 95% conﬁdence prediction interval for Malta? (c) What is the 95% conﬁdence prediction interval for West Germany?
References Additional readings on the material covered in this chapter can be found in: Baltagi, B.H. (1995), “Optimal Weighting of Unbiased Estimators,” Econometric Theory, Problem 95.3.1, 11:637. Baltagi, B.H. and D. Levin (1992), “Cigarette Taxation: Raising Revenues and Reducing Consumption,” Structural Change and Economic Dynamics, 3: 321335. Belsley, D.A., E. Kuh and R.E. Welsch (1980), Regression Diagnostics (Wiley: New York). Greene, W. (1993), Econometric Analysis (Macmillian: New York). Gujarati, D. (1995), Basic Econometrics (McGrawHill: New York). Johnston, J. (1984), Econometric Methods (McGrawHill: New York). Kelejian, H. and W. Oates (1989), Introduction to Econometrics (Harper and Row: New York). Kennedy, P. (1992), A Guide to Econometrics (MIT Press: Cambridge). Kmenta, J. (1986), Elements of Econometrics (Macmillan: New York). Maddala, G.S. (1992), Introduction to Econometrics (Macmillan: New York). Oksanen, E.H. (1993), “Eﬃciency as Correlation,” Econometric Theory, Problem 93.1.3, 9: 146. SamuelCahn, E. (1994), “Combining Unbiased Estimators,” The American Statistician, 48: 3436. Wallace, D. and L. Silver (1988), Econometrics: An Introduction (Addison Wesley: New York). Zheng, J.X. (1994), “Eﬃciency as Correlation,” Econometric Theory, Solution 93.1.3, 10: 228.
72
Chapter 3: Simple Linear Regression
Appendix Centered and Uncentered R2 From the OLS regression on (3.1) we get Yi = Yi + ei
i = 1, 2, . . . , n
(A.1)
OLS + Xi β where Yi = α OLS . Squaring and summing the above equation we get n n 2 n 2 2 i=1 Yi = i=1 Yi + i=1 ei
since
n
i=1 Yi ei
(A.2)
= 0. The uncentered R2 is given by
uncentered R2 = 1 −
n
2 n 2 i=1 ei / i=1 Yi
=
2 2 i=1 Yi i=1 Yi /
n
(A.3)
Note that the total sum of squares for Yi is not expressed in deviation from the sample mean Y¯ . In other words, the uncentered R2 is the proportion of variation of ni=1 Yi2 that is explained by the regression on X.Regression usually report the centered R2 which was deﬁned npackages n 2 2 in section 3.6 as 1 − ( i=1 ei / i=1 yi ) where yi = Yi − Y¯ . The latter measure focuses on explaining the variation in Yi after ﬁtting the constant. From section 3.6, we have seen that a naive model with only a constant in it gives Y¯ as the estimate of the constant, see also problem 2. The variation in Yi that is not explained by this naive model is ni=1 yi2 = ni=1 (Yi − Y¯ )2 . Subtracting nY¯ 2 from both sides of (A.2) we get n
2 i=1 yi
=
n
2 i=1 Yi
and the centered R2 is
− nY¯ 2 + ni=1 e2i
centered R2 = 1 − ( ni=1 e2i / ni=1 yi2 ) = ( ni=1 Yi2 − nY¯ 2 )/ ni=1 yi2
(A.4)
If there is a constant in the model Y¯ = Y , see section 3.6, and ni=1 yi2 = ni=1 (Yi − Y )2 = n 2 n 2 ¯2 i2 / ni=1 yi2 which is the R2 reported by i=1 Yi − nY . Therefore, the centered R = i=1 y regression packages. If there is no constant in the model, some regression packages give you the option of (no constant) and the R2 reported is usually the uncentered R2 . Check your regression package documentation to verify what you are getting. We will encounter uncentered R2 again in constructing test statistics using regressions, see for example Chapter 11.
CHAPTER 4
Multiple Regression Analysis 4.1
Introduction
So far we have considered only one regressor X besides the constant in the regression equation. Economic relationships usually include more than one regressor. For example, a demand equation for a product will usually include real price of that product in addition to real income as well as real price of a competitive product and the advertising expenditures on this product. In this case Yi = α + β 2 X2i + β 3 X3i + .. + β K XKi + ui
i = 1, 2, . . . , n
(4.1)
where Yi denotes the ith observation on the dependent variable Y, in this case the sales of this product. Xki denotes the ith observation on the independent variable Xk for k = 2, . . . , K in this case, own price, the competitor’s price and advertising expenditures. α is the intercept and β 2 , β 3 , . . . , β K are the (K − 1) slope coeﬃcients. The ui ’s satisfy the classical assumptions 14 given in Chapter 3. Assumption 4 is modiﬁed to include all the X’s appearing in the regression, i.e., every Xk for k = 2, . . . , K, is uncorrelated with the ui ’s with the property that n n ¯ 2 ¯ i=1 (Xki − Xk ) /n where Xk = i=1 Xki /n has a ﬁnite probability limit which is different from zero. Section 4.2 derives the OLS normal equations of this multiple regression model and discovers that an additional assumption is needed for these equations to yield a unique solution.
4.2
Least Squares Estimation
As explained in Chapter 3, least squares minimizes the residual sum of squares where the Xki and α denote guesses on the residuals are now given by ei = Yi − α − K β and β k k=2 k regression parameters α and β k , respectively. The residual sum of squares RSS =
n
2 i=1 ei
=
n
i=1 (Yi
X2i − .. − β XKi )2 −α −β 2 K
is minimized by the following K ﬁrstorder conditions: ∂( ni=1 e2i )/∂ α = −2 ni=1 ei = 0 = −2 n ei Xki = 0, for k = 2, . . . , K. ∂( ni=1 e2i )/∂ β k i=1
or, equivalently n
K n Xki 2 n X2i + .. + β n + β i=1 i=1 i=1 Yi = α n n n 2 i=1 X2i + β 2 i=1 X2i + .. + β K i=1 Yi X2i = α
n
.. .
i=1 Yi XKi
=α
n
.. .
i=1 XKi
.. .
+β 2
n
i=1 X2i XKi
n
i=1 X2i XKi
.. .
n X 2 + .. + β K i=1 Ki
(4.2)
(4.3)
74
Chapter 4: Multiple Regression Analysis
where the ﬁrst equation multiplies the regression equation by the constant and sums, the second equation multiplies the regression equation by X2 and sums, and the Kth equation multiplies the regression equation by XK and sums. ni=1 ui = 0 and ni=1 ui Xki = 0 for k = 2, . . . , K are implicitly imposed to arrive at (4.3). Solving these K equations in K unknowns, we get the OLS estimators. This can be done more succinctly in matrix form, see Chapter 7. Assumptions 14 insure that the OLS estimator is BLUE. Assumption 5 introduces normality and as a result the OLS estimator is also (i) a maximum likelihood estimator, (ii) it is normally distributed, and (iii) it is minimum variance unbiased. Normality also allows test of hypotheses. Without the normality assumption, one has to appeal to the Central Limit Theorem and the fact that the sample is large to perform hypotheses testing. In order to make sure we can solve for the OLS estimators in (4.3) we need to impose one further assumption on the model besides those considered in Chapter 3. Assumption 6: No perfect multicollinearity, i.e., the explanatory variables are not perfectly correlated with each other. This assumption states that, no explanatory variable Xk for k = 2, . . . , K is a perfect linear combination of the other X’s. If assumption 6 is violated, then one of the equations in (4.2) or (4.3) becomes redundant and we would have K − 1 linearly independent equations in K unknowns. This means that we cannot solve uniquely for the OLS estimators of the K coeﬃcients. Example 1: If X2i = 3X4i − 2X5i + X7i for i = 1, . . . , n, then multiplying this relationship by ei and summing over i we get n
i=1 X2i ei
=3
n
i=1 X4i ei
−2
n
i=1 X5i ei
+
n
i=1 X7i ei .
This means that the second OLS normal equation in (4.2) can be represented as a perfect linear combination of the fourth, ﬁfth and seventh OLS normal equations. Knowing the latter three equations, the second equation adds no new information. Alternatively, one could substitute this relationship in the original regression equation (4.1). After some algebra, X2 would be eliminated and the resulting equation becomes: Yi = α + β 3 X3i + (3β 2 + β 4 )X4i + (β 5 − 2β 2 )X5i + β 6 X6i + (β 2 + β 7 )X7i
(4.4)
+.. + β K XKi + ui .
Note that the coeﬃcients of X4i , X5i and X7i are now (3β 2 + β 4 ), (β 5 − 2β 2 ) and (β 2 + β 7 ), respectively. All of which are contaminated by β 2 . These linear combinations of β 2 , β 4 , β 5 and β 7 can be estimated from regression (4.4) which excludes X2i . In fact, the other X’s, not contaminated by this perfect linear relationship, will have coeﬃcients that are not contaminated by β 2 and hence are themselves estimable using OLS. However, β 2 , β 4 , β 5 and β 7 cannot be estimated separately. Perfect multicollinearity means that we cannot separate the inﬂuence on Y of the independent variables that are perfectly related. Hence, assumption 6 of no perfect multicollinearity is needed to guarantee a unique solution of the OLS normal equations. Note that it applies to perfect linear relationships and does not apply to perfect nonlinear relation2 like (years ships among the independent variables. In other words, one can include X1i and X1i of experience) and (years of experience)2 in an equation explaining earnings of individuals. Although, there is a perfect quadratic relationship between these independent variables, this is not a perfect linear relationship and therefore, does not cause perfect multicollinearity.
4.3
4.3
Residual Interpretation of Multiple Regression Estimates
75
Residual Interpretation of Multiple Regression Estimates
Although we did not derive an explicit solution for the OLS estimators of the β’s, we know , the that they are the solutions to (4.2) or (4.3). Let us focus on one of these estimators, say β 2 OLS estimator of β 2 , the partial derivative of Yi with respect to X2i . As a solution to (4.2) or is a multiple regression coeﬃcient estimate of β . Alternatively, we can interpret β (4.3), β 2 2 2 as a simple linear regression coeﬃcient.
Claim 1: (i) Run the regression of X2 on all the other X’s in (4.1), and obtain the residuals 2 + ν2 , i.e., X2 = X ν 2 . (ii) Run the simple regression of Y on ν 2 , the resulting estimate of the . slope coeﬃcient is β 2 The ﬁrst regression essentially cleans out the effect of the other X’s from X2 , leaving the vari can be interpreted as a simple linear regression ation unique to X2 in ν 2 . Claim 1 states that β 2 coeﬃcient of Y on this residual. This is in line with the partial derivative interpretation of β 2 . The proof of claim 1 is given in the Appendix. Using the results of the simple regression given ν 2 , we get in (3.4) with the regressor Xi replaced by the residual n = n β ν 22i 2 i=1 ν 2 Yi / i=1
(4.5)
and from (3.6) we get ) = σ2/ var(β 2
n
ν 22i i=1
(4.6)
as a simple regression coeﬃcient is the following: An alternative interpretation of β 2
Claim 2: (i) Run Y on all the other X’s and get the predicted Y and the residuals, say ω . (ii) is the resulting estimate of the slope coeﬃcient. Run the simple linear regression of ω on ν 2. β 2 This regression cleans both Y and X2 from the effect of the other X’s and then regresses the cleaned out residuals of Y on those of X2 . Once again this is in line with the partial derivative interpretation of β 2 . The proof of claim 2 is simple and is given in the Appendix. are important in that they provide an easy way of looking at These two interpretations of β 2 a multiple regression in the context of a simple linear regression. Also, it says that there is no need to clean the effects of one X from the other X’s to ﬁnd its unique effect on Y . All one has to do is to include all these X’s in the same multiple regression. Problem 1 veriﬁes this result with an empirical example. This will also be proved using matrix algebra in Chapter 7. Recall that R2 = 1 − RSS/T SS for any regression. Let R22 be the R2 for the regression n 2 n 2 ¯ 2 and of X2 on all the other X’s,then R2 = 1 − i=1 ν 2i / i=1 x22i wherex2i = X2i − X n n n n 2 2 2 ¯ ¯ X X2i /n; T SS = i=1 (X2i − X2 ) = i=1 x2i and RSS = i=1 ν 2i . Equivalently, 2n= 2 i=1 n 2 (1 − R2 ) and the = x ν 2i 2 2i i=1 i=1 ) = σ2/ var(β 2
n
ν 22i i=1
= σ2 /
n
2 i=1 x2i (1
− R22 )
(4.7)
) holding σ 2 This means that the larger R22 , the smaller is (1 − R22 ) and the larger is var(β 2 n 2 and i=1 x2i ﬁxed. This shows the relationship between multicollinearity and the variance of the OLS estimates. High multicollinearity between X2 and the other X’s will result in high . Perfect multicollinearity is the extreme case R22 which in turn implies high variance for β 2 2 . In general, high multicollinearity where R2 = 1. This in turn implies an inﬁnite variance for β 2 among the regressors yields imprecise estimates for these highly correlated variables. The least
76
Chapter 4: Multiple Regression Analysis
squares regression estimates are still unbiased as long as assumptions 1 and 4 are satisﬁed, but these estimates are unreliable asreﬂected by their high variances. However, it is important n x2 could counteract the effect of a high R2 leading to note that a low σ 2 and a high 2 i=1 2i . Maddala (2001) argues that high intercorrelation among the to a signiﬁcant tstatistic for β 2 explanatory variables are neither necessary nor suﬃcient to cause the multicollinearity problem. In practice, multicollinearity is sensitive to the addition or deletion of observations. More on this in Chapter 8. Looking at high intercorrelations among the explanatory variables is useful only as a complaint. It is more important to look at the standard errors and tstatistics to assess the seriousness of multicollinearity. Much has been written on possible solutions to the multicollinearity problem, see Hill and Adkins (2001) for a recent summary. Credible candidates include: (i) obtaining new and better data, but this is rarely available; (ii) introducing nonsample information about the model parameters based on previous empirical research or economic theory. The problem with the latter solution is that we never truly know whether the information we introduce is good enough to reduce estimator Mean Square Error.
4.4
Overspeciﬁcation and Underspeciﬁcation of the Regression Equation
So far we have assumed that the true linear regression relationship is always correctly speciﬁed. This is likely to be violated in practice. In order to keep things simple, we consider the case where the true model is a simple regression with one regressor X1 . True model: Yi = α + β 1 X1i + ui with ui ∼ IID(0, σ 2 ), but the estimated model is overspeciﬁed with the inclusion of an additional irrelevant variable X2 , i.e., X1i + β X2i Estimated model: Yi = α +β 1 2
n = n ν 21i where ν 1 is the OLS From the previous section, it is clear that β 1 i=1 i=1 ν 1i Yi / residuals of X1 on X2 . Substituting the true model for Y we get n = β n ν1i X1i / n ν2 + n ν 21i β 1 1 i=1 i=1 ν 1i ui / i=1 1i i=1
n
1i + 1i = 0. But, X1i = X ν 1i and ni=1 X ν 1i = 0 implying that ni=1 ν 1i X1i = 2 . Hence, ν i=1 1i
since n
ν 1i i=1
n = β + n β ν 21i 1 1 i=1 ν 1i ui / i=1
(4.8)
) = β since and E(β ν 1 is a linear combination of the X’s, and E(Xk u) = 0 for k = 1, 2. Also, 1 1 ) = σ2/ var(β 1
n
ν 21i i=1
= σ2 /
n
2 i=1 x1i (1
− R12 )
(4.9)
¯ 1 and R2 is the R2 of the regression of X1 on X2 . Using the true model where x1i = X1i − X 1 n n 2 to estimate β 1 , one would get b1 = i=1 x1i yi / i=1 x1i with E(b1 ) = β 1 and var(b1 ) =
4.4
Overspeciﬁcation and Underspeciﬁcation of the Regression Equation
77
1 ) ≥ var(b1 ). Note also that in the overspeciﬁed model, the estimate σ 2 / ni=1 x21i . Hence, var(β for β 2 which has a true value of zero is given by n = n β ν 22i 2 i=1 ν 2i Yi / i=1
(4.10)
where ν 2 is the OLS residual of X2 on X1 . Substituting the true model for Y we get n = n β ν 22i 2 i=1 ν 2i ui / i=1
(4.11)
) = 0 since ν 2i X1i = 0 and ni=1 ν2i = 0. Hence, E(β ν 2 is a linear combination since ni=1 2 of the X’s and E(Xk u) = 0 for k = 1, 2. In summary, overspeciﬁcation still yields unbiased estimates of β 1 and β 2 , but the price is a higher variance. Similarly, the true model could be a tworegressors model True model: Yi = α + β 1 X1i + β 2 X2i + ui where ui ∼ IID(0, σ 2 ) but the estimated model is X1i Estimated model: Yi = α +β 1
The estimated model omits a relevant variable X2 and underspeciﬁes the true relationship. In this case = n x1i Yi / n x2 β 1 i=1 i=1 1i
(4.12)
¯ 1 . Substituting the true model for Y we get where x1i = X1i − X 1 = β 1 + β 2 n x1i X2i / n x2 + n x1i ui / n x2 β i=1 i=1 1i i=1 i=1 1i
(4.13)
) = β + β b12 since E(x1 u) = 0 with b12 = n x1i X2i / n x2 . Note that b12 Hence, E(β 1 1 2 i=1 i=1 1i is the regression slope estimate obtained by regressing X2 on X1 and a constant. Also, the ) = E(β − E(β ))2 = E(n x1i ui / n x2 )2 = σ 2 / n x2 var(β 1 1 1 i=1 i=1 1i i=1 1i
which understates the variance of the estimate of β 1 obtained from the true model, i.e., b1 = n 1i Yi / ni=1 ν 21i with i=1 ν var(b1 ) = σ 2 /
n
ν 21i i=1
= σ2/
n
2 i=1 x1i (1
). − R12 ) ≥ var(β 1
(4.14)
In summary, underspeciﬁcation yields biased estimates of the regression coeﬃcients and understates the variance of these estimates. This is also an example of imposing a zero restriction on β 2 when in fact it is not true. This introduces bias, because the restriction is wrong, but reduces the variance because it imposes more information even if this information may be false. We will encounter this general principle again when we discuss distributed lags in Chapter 6.
78
Chapter 4: Multiple Regression Analysis
4.5
RSquared versus RBarSquared
Since OLS minimizes the residual sums of squares, adding one or more variables to the regression cannot increase this residual sums of squares. After all, we are minimizing over a larger dimension parameter set and the minimum there is smaller or equal to that over a subset of the parameter space, , adding more vari see problem 4. Therefore, for the same dependent variable Y ables makes ni=1 e2i nonincreasing and R2 nondecreasing, since R2 = 1 − ( ni=1 e2i / ni=1 yi2 ). Hence, a criteria of selecting a regression that “maximizes R2 ” does not make sense, since we can add more variables to this regression and improve on this R2 (or at worst leave it the same). In order to penalize the researcher for adding an extra variable, one computes ¯ 2 = 1 − [n e2i /(n − K)]/[n yi2 /(n − 1)] R (4.15) i=1 i=1 n n 2 2 where i=1 yi have been adjusted by their i=1 ei and degrees of freedom. Note that the 2 numerator is the s of the regression and is equal to ni=1 e2i /(n − K). This differs from the s2 in Chapter 3 in the degrees of freedom. Here, it is n − K, because we have estimated K coeﬃcients, or because (4.2) represents K relationships among the residuals. Therefore knowing (n − K) residuals we can deduce the other K residuals from (4.2). ni=1 e2i is nonincreasing as we add more variables, but the degrees of freedom decrease by one with every added variable. Therefore, s2 will decrease only if the effect of the ni=1 e2i decrease outweighs the effect of the ¯ 2 , i.e., penalizing each added one degree of freedom loss on s2 . This is exactly the idea behind R ¯2 variable by decreasing n the2degrees of freedom by one. Hence,2 this variable will increase R only if the reduction in i=1 ei outweighs this loss, i.e., only if s is decreased. Using the deﬁnition ¯ 2 , one can relate it to R2 as follows: of R ¯ 2 ) = (1 − R2 )[(n − 1)/(n − K)] (1 − R
4.6
(4.16)
Testing Linear Restrictions
In the simple linear regression chapter, we proved that the OLS estimates are BLUE provided assumptions 1 to 4 were satisﬁed. Then we imposed normality on the disturbances, assumption 5, and proved that the OLS estimators are in fact the maximum likelihood estimators. Then we derived the Cram´erRao lower bound, and proved that these estimates are eﬃcient. This will be done in matrix form in Chapter 7 for the multiple regression case. Under normality one can test hypotheses about the regression. Basically, any regression package will report the OLS estimates, their standard errors and the corresponding tstatistics for the null hypothesis that each individual coeﬃcient is zero. These are tests of signiﬁcance for each coeﬃcient separately. But one may be interested in a joint test of signiﬁcance for two or more coeﬃcients simultaneously, or simply testing whether linear restrictions on the coeﬃcients of the regression are satisﬁed. This will be developed more formally in Chapter 7. For now, all we assume is that the reader can perform regressions using his or her favorite software like EViews, STATA, SAS, TSP, SHAZAM, LIMDEP or GAUSS. The solutions to (4.2) or (4.3) result in the OLS estimates. These multiple regression coeﬃcient estimates can be interpreted as simple regression estimates as shown in section 4.3. This allows a simple derivation of their standard errors. Now, we would like to use these regressions to test linear restrictions. The strategy followed is to impose these restrictions on the model and run the resulting restricted regression. The
4.6
Testing Linear Restrictions
79
corresponding Restricted Residual Sums of Squares is denoted by RRSS. Next, one runs the regression without imposing these linear restrictions to obtain the Unrestricted Residual Sums of Squares, which we denote by URSS. Finally, one forms the following F statistic: F =
(RRSS − URSS )/ℓ ∼ Fℓ,n−K URSS /(n − K)
(4.17)
where ℓ denotes the number of restrictions, and n − K gives the degrees of freedom of the unrestricted model. The idea behind this test is intuitive. If the restrictions are true, then the RRSS should not be much different from the URSS. If RRSS is different from URSS, then we reject these restrictions. The denominator of the F statistic is a consistent estimate of the unrestricted regression variance. Dividing by the latter makes the F statistic invariant to units of measurement. Let us consider two examples: Example 2: Testing the joint signiﬁcance of two regression coeﬃcients. For e.g., let us test the following null hypothesis H0 ; β 2 = β 3 = 0. These are two restrictions β 2 = 0 and β 3 = 0 and they are to be tested jointly. We know how to test for β 2 = 0 alone or β 3 = 0 alone with individual ttests. This is a test of joint signiﬁcance of the two coeﬃcients. Imposing this restriction, means the removal of X2 and X3 from the regression, i.e., running the regression of Y on X4 , . . . , XK excluding X2 and X3 . Hence, the number of parameters to be estimated becomes (K − 2) and the degrees of freedom of this restricted regression are n − (K − 2). The unrestricted regression is the one including all the X’s in the model. Its degrees of freedom are (n − K). The number of restrictions are 2 and this can also be inferred from the difference between the degrees of freedom of the restricted and unrestricted regressions. All the ingredients are now available for computing F in (4.17) and this will be distributed as F2,n−K . Example 3: Test the equality of two regression coeﬃcients H0 ; β 3 = β 4 against the alternative that H1 ; β 3 = β 4 . Note that H0 can be rewritten as H0 ; β 3 − β 4 = 0. This can be tested using a tstatistic that tests whether d = β 3 − β 4 is equal to zero. From the unrestricted regression, we −β with var(d) = var(β )+var(β ) − 2cov(β ,β ). The variancecovariance can obtain d = β 3 4 3 4 3 4 matrix of the regression coeﬃcients can be printed out with any regression package. In section 4.3, we gave these variances and covariances a simple regression interpretation. This means and the tstatistic is simply t = (d − 0)/se(d) which is distributed as that se(d) = var(d)
tn−K under H0 . Alternatively, one can run an F test with the RRSS obtained from running the following regression Yi = α + β 2 X2i + β 3i (X3i + X4i ) + β 5 X5i + .. + β K XKi + ui with β 3 = β 4 substituted in for β 4 . This regression has the variable (X3i + X4i ) rather than X3i and X4i separately. The URSS is the regression of Y on all the X’s in the model. The degrees of freedom of the resulting F statistic are 1 and n − K. The numerator degree of freedom states that there is only one restriction. It will be proved in Chapter 7 that the square of the tstatistic is exactly equal to the F statistic just derived. Both methods of testing are equivalent. The ﬁrst one computes only the unrestricted regression and involves some further variance computations, while the latter involves running two regressions and computing the usual F statistic. Example 4: Test the joint hypothesis H0 ; β 3 = 1 and β 2 − 2β 4 = 0. These two restrictions are usually obtained from prior information or imposed by theory. The ﬁrst restriction is β 3 = 1.
80
Chapter 4: Multiple Regression Analysis
The value 1 could have been any other constant. The second restriction shows that a linear combination of β 2 and β 4 is equal to zero. Substituting these restrictions in (4.1) we get Yi = α + β 2 X2i + X3i + 12 β 2 X4i + β 5 X5i + .. + β K XKi + ui which can be written as Yi − X3i = α + β 2 (X2i + 12 X4i ) + β 5 X5i + .. + β K XKi + ui Therefore, the RRSS can be obtained by regressing (Y − X3 ) on (X2 + 21 X4 ), X5 , . . . , XK . This regression has n − (K − 2) degrees of freedom. The URSS is the regression with all the X’s included. The resulting F statistic has 2 and n − K degrees of freedom. Example 5: Testing constant returns to scale in a CobbDouglas production function. Q = AK α Lβ E γ M δ eu is a CobbDouglas production function with capital(K), labor(L), energy(E) and material(M ). Constant returns to scale means that a proportional increase in the inputs produces the same proportional increase in output. Let this proportional increase be λ, then K ∗ = λK, L∗ = λL, E ∗ = λE and M ∗ = λM . Q∗ = λ(α+β+γ+δ) AK α Lβ E γ M δ eu = λ(α+β+γ+δ) Q. For this last term to be equal to λQ, the following restriction must hold: α + β + γ + δ = 1. Hence, a test of constant returns to scale is equivalent to testing H0 ; α + β + γ + δ = 1. The CobbDouglas production function is nonlinear in the variables, and can be linearized by taking logs of both sides, i.e., logQ = logA + αlogK + βlogL + γlogE + δlogM + u
(4.18)
This is a linear regression with Y = logQ, X2 = logK, X3 = logL, X4 = logE and X5 = logM . Ordinary least squares is BLUE on this nonlinear model as long as u satisﬁes assumptions 14. Note that these disturbances entered the original CobbDouglas production function multiplicatively as exp(ui ). Had these disturbances entered additively as Q = AK α Lβ E γ M δ + u then taking logs does not simplify the right hand side and one has to estimate this with nonlinear least squares, see Chapter 8. Now we can test constant returns to scale as follows. The unrestricted regression is given by (4.18) and its degrees of freedom are n − 5. Imposing H0 means substituting the linear restriction by replacing say β by (1 − α − γ − δ). This results after collecting terms in the following restricted regression with one less parameter log(Q/L) = logA + αlog(K/L) + γlog(E/L) + δlog(M/L) + u
(4.19)
The degrees of freedom are n − 4. Once again all the ingredients for the test in (4.17) are there and this statistic is distributed as F1 , n − 5 under the null hypothesis. Example 6: Joint signiﬁcance of all the slope coeﬃcients. The null hypothesis is H0 ; β 2 = β 3 = .. = β K = 0 against the alternative H1 ; at least one β k = 0 for k = 2, . . . , K. Under the null, only the constant is left in the regression. Problem 3.2 showed that for a regression of Y on a constant only, the least estimate of α is Y¯ . This means that the corresponding residual sum of n squares 2 ¯ squares is i=1 (Yi −Y ) . Therefore, RRSS = Total sums of squares of regression (4.1) = Σni=1 yi2 .
4.7
Dummy Variables
81
The URSS is the usual residual sums of squares ni=1 e2i from the unrestricted regression given by (4.1). Hence, the corresponding F statistic for H0 is ( ni=1 yi2 − ni=1 e2i )/(K − 1) R2 n−K (T SS − RSS)/(K − 1) n 2 = = · (4.20) F = 2 RSS/(n − K) 1−R K −1 i=1 ei /(n − K) where R2 = 1−( ni=1 e2i / ni=1 yi2 ). This F statistic has (K −1) and (n−K) degrees of freedom under H0 , and is usually reported by regression packages.
4.7
Dummy Variables
Many explanatory variables are qualitative in nature. For example, the head of a household could be male or female, white or nonwhite, employed or unemployed. In this case, one codes these variables as “M ” for male and “F ” for female, or change this qualitative variable into a quantitative variable called FEMALE which takes the value “0” for male and “1” for female. This obviously begs the question: “why not have a variable MALE that takes on the value 1 for male and 0 for female?” Actually, the variable MALE would be exactly 1FEMALE. In other words, the zero and one can be thought of as a switch, which turns on when it is 1 and off when it is 0. Suppose that we are interested in the earnings of households, denoted by EARN, and MALE and FEMALE are the only explanatory variables available, then problem 10 asks the reader to verify that running OLS on the following model: EARN = αM MALE + αF FEMALE + u
(4.21)
gives α M = “average earnings of the males in the sample” and α F = “average earnings of the females in the sample.” Notice that there is no intercept in (4.21), this is because of what is known in the literature as the “dummy variable trap.” Brieﬂy stated, there will be perfect multicollinearity between MALE, FEMALE and the constant. In fact, MALE + FEMALE = 1. Some researchers may choose to include the intercept and exclude one of the sex dummy variables, say MALE, then EARN = α + βFEMALE + u
(4.22)
= and the OLS estimates give α = “average earnings of males in the sample” = α M , while β α F − α M = “the difference in average earnings between females and males in the sample.” Regression (4.22) is more popular when one is interested in contrasting the earnings between males and females and obtaining with one regression the markup or markdown in average earnings ( αF − α M ) as well as the test of whether this difference is statistically different from zero. in (4.22). On the other hand, if one is interested in This would be simply the tstatistic on β estimating the average earnings of males and females separately, then model (4.21) should be M = 0 would involve further calculathe one to consider. In this case, the ttest for α F − α tions not directly given from the regression in (4.21) but similar to the calculations given in Example 3. What happens when another qualitative variable is included, to depict another classiﬁcation of the individuals in the sample, say for example, race? If there are three race groups in the sample, WHITE, BLACK and HISPANIC. One could create a dummy variable for each of these classiﬁcations. For example, WHITE will take the value 1 when the individual is white
82
Chapter 4: Multiple Regression Analysis
and 0 when the individual is nonwhite. Note that the dummy variable trap does not allow the inclusion of all three categories as they sum up to 1. Also, even if the intercept is dropped, once MALE and FEMALE are included, perfect multicollinearity is still present because MALE + FEMALE = WHITE + BLACK + HISPANIC. Therefore, one category from race should be dropped. Suits (1984) argues that the researcher should use the dummy variable category omission to his or her advantage, in interpreting the results, keeping in mind the purpose of the study. For example, if one is interested in comparing earnings across the sexes holding race constant, the omission of MALE or FEMALE is natural, whereas, if one is interested in the race differential in earnings holding gender constant, one of the race variables should be omitted. Whichever variable is omitted, this becomes the base category for which the other earnings are compared. Most researchers prefer to keep an intercept, although regression packages allow for a no intercept option. In this case one should omit one category from each of the race and sex classiﬁcations. For example, if MALE and WHITE are omitted: EARN = α + β F FEMALE + β B BLACK + β H HISPANIC + u
(4.23)
Assuming the error u satisﬁes all the classical assumptions, and taking expected values of both sides of (4.23), one can see that the intercept α = the expected value of earnings of the omitted category which is “white males”. For this category, all the other switches are off. Similarly, α + β F is the expected value of earnings of “white females,” since the FEMALE switch is on. One can conclude that β F = difference in the expected value of earnings between white females and white males. Similarly, one can show that α + β B is the expected earnings of “black males” and α + β F + β B is the expected earnings of “black females.” Therefore, β F represents the difference in expected earnings between black females and black males. In fact, problem 11 asks the reader to show that β F represents the difference in expected earnings between hispanic females and hispanic males. In other words, β F represents the differential in expected earnings between females and males holding race constant. Similarly, one can show that β B is the difference in expected earnings between blacks and whites holding sex constant, and β H is the differential in expected earnings between hispanics and whites holding sex constant. The main key to the interpretation of the dummy variable coeﬃcients is to be able to turn on and turn off the proper switches, and write the correct expectations. The real regression will contain other quantitative and qualitative variables, like EARN = α + β F FEMALE + β B BLACK + β H HISPANIC + γ 1 EXP
(4.24)
+γ 2 EXP 2 + γ 3 EDUC + γ 4 UNION + u where EXP is years of job experience, EDUC is years of education, and UNION is 1 if the individual belongs to a union and 0 otherwise. EXP2 is the squared value of EXP. Once again, one can interpret the coeﬃcients of these regressions by turning on or off the proper switches. For example, γ 4 is interpreted as the expected difference in earnings between union and nonunion members holding all other variables included in (4.24) constant. Halvorsen and Palmquist (1980) warn economists about the interpretation of dummy variable coeﬃcients when the dependent variable is in logs. For example, if the earnings equation is semilogarithmic: log(Earnings) = α + βUNION + γEDUC + u then γ = % change in earnings for one extra year of education, holding union membership constant. But, what about the returns for union membership? If we let Y1 = log(Earnings)
4.7
Dummy Variables
83
when the individual belongs to a union, and Y0 = log(Earnings) when the individual does not belong to a union, then g = % change in earnings due to union membership = (eY1 − eY0 )/eY0 . Equivalently, one can write that log(1 + g) = Y1 − Y0 = β, or that g = eβ − 1. In other words, one should not hasten to conclude that β has the same interpretation as γ. In fact, the % change in earnings due to union membership is eβ − 1 and not β. The error involved in using β β rather than e − 1 to estimate g could be substantial, especially if β is large. For example, when = 0.5, 0.75, 1; g = eβ − 1 = 0.65, 1.12, 1.72, respectively. Kennedy (1981) notes that if β is β implies consistency unbiased for β, g is not necessarily unbiased for g. However, consistency of β β+0.5Var( β) β . Based on this for g. If one assumes lognormal distributed errors, then E(e ) = e is a consistent β) result, Kennedy (1981) suggests estimating g by g = eβ+0.5Var(β) 1, where Var( estimate of Var(β).
Another use of dummy variables is in taking into account seasonal factors, i.e., including 3 seasonal dummy variables with the omitted season becoming the base for comparison.1 For example: Sales = α + β W W inter + β S Spring + β F F all + γ 1 P rice + u
(4.25)
the omitted season being the Summer season, and if (4.25) models the sales of airconditioning units, then β F is the difference in expected sales between the Fall and Summer seasons, holding the price of an airconditioning unit constant. If these were heating units one may want to change the base season for comparison. Another use of dummy variables is for War years, where consumption is not at its normal level say due to rationing. Consider estimating the following consumption function Ct = α + βYt + δWARt + ut
t = 1, 2, . . . , T
(4.26)
where Ct denotes real per capita consumption, Yt denotes real per capita personal disposable income, and W ARt is a dummy variable taking the value 1 if it is a War time period and 0 otherwise. Note that the War years do not affect the slope of the consumption line with respect to income, only the intercept. The intercept is α in nonWar years and α + δ in War years. In other words, the marginal propensity out of income is the same in War and nonWar years, only the level of consumption is different. Of course, one can dummy other unusual years like periods of strike, years of natural disaster, earthquakes, ﬂoods, hurricanes, or external shocks beyond control, like the oil embargo of 1973. If this dummy includes only one year like 1973, then the dummy variable for 1973, call it D73 , takes the value 1 for 1973 and zero otherwise. Including D73 as an extra variable in the regression has the effect of removing the 1973 observation from estimation purposes, and the resulting regression coeﬃcients estimates are exactly the same as those obtained excluding the 1973 observation and its corresponding dummy variable. In fact, using matrix algebra in Chapter 7, we will show that the coeﬃcient estimate of D73 is the forecast error for 1973, using the regression that ignores the 1973 observations. In addition, the standard error of the dummy coeﬃcient estimates is the standard error of this forecast. This is a much easier way of obtaining the forecast error and its standard error from the regression package without additional computations, see Salkever (1976). More on this in Chapter 7.
84
Chapter 4: Multiple Regression Analysis
Interaction Eﬀects So far the dummy variables have been used to shift the intercept of the regression keeping the slopes constant. One can also use the dummy variables to shift the slopes by letting them interact with the explanatory variables. For example, consider the following earnings equation: EARN = α + αF FEMALE + βEDUC + u
(4.27)
In this regression, only the intercept shifts from males to females. The returns to an extra year of education is simply β, which is assumed to be the same for males as well as females. But if we now introduce the interaction variable (FEMALE × EDUC), then the regression becomes: EARN = α + αF FEMALE + βEDUC + γ(FEMALE × EDUC ) + u
(4.28)
In this case, the returns to an extra year of education depends upon the sex of the individual. In fact, ∂(EARN )/∂(EDUC ) = β + γ(FEMALE ) = β if male, and β + γ if female. Note that the interaction variable = EDUC if the individual is female and 0 if the individual is male. Estimating (4.28) is equivalent to estimating two earnings equations, one for males and another one for females, separately. The only difference is that (4.28) imposes the same variance across the two groups, whereas separate regressions do not impose this, albeit restrictive, equality of the variances assumption. This setup is ideal for testing the equality of slopes, equality of intercepts, or equality of both intercepts and slopes across the sexes. This can be done with the F test described in (4.17). In fact, for H0 ; equality of slopes, given different intercepts, the restricted residuals sum of squares (RRSS) is obtained from (4.27), while the unrestricted residuals sum of squares (URSS) is obtained from (4.28). Problem 12 asks the reader to set up the F test for the following null hypothesis: (i) equality of slopes and intercepts, and (ii) equality of intercepts given the same slopes. Dummy variables have many useful applications in economics. For example, several tests including the Chow (1960) test, and Utts (1982) Rainbow test described in Chapter 8, can be applied using dummy variable regressions. Additionally, they can be used in modeling splines, see Poirier (1976) and Suits, Mason and Chan (1978), and ﬁxed effects in panel data, see Chapter 12. Finally, when the dependent variable is itself a dummy variable, the regression equation needs special treatment, see Chapter 13 on qualitative limited dependent variables. Empirical Example: Table 4.1 gives the results of a regression on 595 individuals drawn from the Panel Study of Income Dynamics (PSID) in 1982. This data is provided on the Springer web site as EARN.ASC. A description of the data is given in Cornwell and Rupert (1988). In particular, log wage is regressed on years of education (ED), weeks worked (WKS), years of fulltime work experience (EXP), occupation (OCC = 1, if the individual is in a bluecollar occupation), residence (SOUTH = 1, SMSA = 1, if the individual resides in the South, or in a standard metropolitan statistical area), industry (IND = 1, if the individual works in a manufacturing industry), marital status (MS = 1, if the individual is married), sex and race (FEM = 1, BLK = 1, if the individual is female or black), union coverage (UNION = 1, if the individual’s wage is set by a union contract). These results show that the returns to an extra year of schooling is 5.7%, holding everything else constant. It shows that Males on the average earn more than Females. Blacks on the average earn less than Whites, and Union workers earn more than nonunion workers. Individuals residing in the South earn less than those living elsewhere. Those residing in a standard metropolitan statistical area earn more on the average than those
Note
85
Table 4.1 Earnings Regression for 1982 Dependent Variable: LWAGE Analysis of Variance Source Model Error C Total
DF 12 582 594
Root MSE Dep Mean C.V.
Sum of Squares 52.48064 61.68465 114.16529
Mean Square 4.37339 0.10599
F Value 41.263
0.32556 6.95074 4.68377
Rsquare Adj Rsq
0.4597 0.4485
Prob > F 0.0001
Parameter Estimates
Variable INTERCEP WKS SOUTH SMSA MS EXP EXP2 OCC IND UNION FEM BLK ED
DF 1 1 1 1 1 1 1 1 1 1 1 1 1
Parameter
Standard
T for H0:
Estimate 5.590093 0.003413 –0.058763 0.166191 0.095237 0.029380 –0.000486 –0.161522 0.084663 0.106278 –0.324557 –0.190422 0.057194
Error 0.19011263 0.00267762 0.03090689 0.02955099 0.04892770 0.00652410 0.00012680 0.03690729 0.02916370 0.03167547 0.06072947 0.05441180 0.00659101
Parameter=0 29.404 1.275 –1.901 5.624 1.946 4.503 –3.833 –4.376 2.903 3.355 –5.344 –3.500 8.678
Prob > T 0.0001 0.2030 0.0578 0.0001 0.0521 0.0001 0.0001 0.0001 0.0038 0.0008 0.0001 0.0005 0.0001
who do not. Individuals who work in a manufacturing industry or are not blue collar workers or are married earn more on the average than those who are not. For EXP 2 = (EXP )2 , this regression indicates a signiﬁcant quadratic relationship between earnings and experience. All the variables were signiﬁcant at the 5% level except for WKS, SOUTH and MS.
Note 1. There are more sophisticated ways of seasonal adjustment than introducing seasonal dummies, see Judge et al. (1985).
Problems 1. For the Cigarette Data given in Table 3.2. Run the following regressions: (a) Real per capita consumption of cigarettes on real price and real per capita income. (All variables are in log form, and all regressions in this problem include a constant).
86
Chapter 4: Multiple Regression Analysis (b) Real per capita consumption of cigarettes on real price. (c) Real per capita income on real price. (d) Real per capita consumption on the residuals of part (c). (e) Residuals from part (b) on the residuals in part (c). (f) Compare the regression slope estimates in parts (d) and (e) with the regression coeﬃcient estimate of the real income coeﬃcient in part (a), what do you conclude? 2. Simple versus Multiple Regression Coefficients. This is based on Baltagi (1987b). Consider the multiple regression Yi = α + β 2 X2i + β 3 X3i + ui
i = 1, 2, . . . , n
along with the following auxiliary regressions: X2i X3i
= a + bX3i + ν 2i 2i + = c + dX ν 3i
, the OLS estimate of β can be interpreted as a simple regression In section 4.3, we showed that β 2 2 . Kennedy (1981, p. 416) of Y on the OLS residuals ν 2 . A similar interpretation can be given to β 3 claims that β 2 is not necessarily the same as δ 2 , the OLS estimate of δ 2 obtained from the regression Y on ν2, ν 3 and a constant, Yi = γ + δ 2 ν 2i + δ 3 ν 3i + wi . Prove this claim by ﬁnding a relationship and the between the β’s δ’s.
3. For the simple regression Yi = α + βXi + ui considered in Chapter 3, show that n n 2 (a) β OLS = i=1 xi can be obtained using the residual interpretation by regressing i=1 xi yi / X on a constant ﬁrst, getting the residuals ν and then regressing Y on ν. ¯ X can be obtained using the residual interpretation by regressing 1 on X (b) α OLS = Y¯ − β OLS and obtaining the residuals ω and then regressing Y on ω . (c) Check the var( αOLS ) and var(β OLS ) in parts (a) and (b) with those obtained from the residualing interpretation. 4. Eﬀect of Additional Regressors on R2 . This is based on Nieswiadomy (1986).
(a) Suppose that the multiple regression given in (4.1) has K1 regressors in it. Denote the least squares sum of squared errors by SSE1 . Now add K2 regressors so that the total number of regressors is K = K1 + K2 . Denote the corresponding least squares sum of squared errors by SSE2 . Show that SSE2 ≤ SSE1 , and conclude that the corresponding Rsquares satisfy R22 ≥ R12 . ¯2. (b) Derive the equality given in (4.16) starting from the deﬁnition of R2 and R ¯ 2 when the F statistic for the joint ¯ ¯2 ≥ R (c) Show that the corresponding Rsquares satisfy R 2 1 signiﬁcance of these additional K2 regressors is less than or equal to one. 5. Let Y be the output and X2 = skilled labor and X3 = unskilled labor in the following relationship: 2 2 + β 6 X3i + ui Yi = α + β 2 X2i + β 3 X3i + β 4 (X2i + X3i ) + β 5 X2i
What parameters are estimable by OLS? 6. Suppose that we have estimated the parameters of the multiple regression model: Yt = β 1 + β 2 Xt2 + β 3 Xt3 + ut by Ordinary Least Squares (OLS) method. Denote the estimated residuals by (et , t = 1, . . . , T ) and the predicted values by (Yt , t = 1, . . . , T ).
Problems
87
(a) What is the R2 of the regression of e on a constant, X2 and X3 ? (b) If we regress Y on a constant and Y , what are the estimated intercept and slope coeﬃcients? What is the relationship between the R2 of this regression and the R2 of the original regression? (c) If we regress Y on a constant and e, what are the estimated intercept and slope coeﬃcients? What is the relationship between the R2 of this regression and the R2 of the original regression? (d) Suppose that we add a new explanatory variable X4 to the original model and reestimate the parameters by OLS. Show that the estimated coeﬃcient of X4 and its estimated standard error will be the same as in the OLS regression of e on a constant, X2 , X3 and X4 . 7. Consider the CobbDouglas production function in example 5. How can you test for constant returns to scale using a tstatistic from the unrestricted regression given in (4.18). 8. For the multiple regression given in (4.1). Set up the F statistic described in (4.17) for testing (a) H0 ; β 2 = β 4 = β 6 . (b) H0 ; β 2 = −β 3 and β 5 − β 6 = 1. 9. Monte Carlo Experiments. Hanushek and Jackson (1977, pp. 6065) generated the following data Yi = 15 + 1X2i + 2X3i + ui for i = 1, 2, . . . , 25 with a ﬁxed set of X2i and X3i , and ui ’s that are IID ∼ N (0, 100). For each set of 25 ui ’s drawn randomly from the normal distribution, a corresponding set of 25 Yi ’s are created from the above equation. Then OLS is performed on the resulting data set. This can be repeated as many times as we can afford. 400 replications were performed by Hanushek and Jackson. This means that they generated 400 data sets each of size 25 and ran 400 regressions giving 400 OLS estimates of α, β 2 , β 3 and σ 2 . The classical assumptions are satisﬁed for this model, by construction, so we expect these OLS estimators to be BLUE, MLE and eﬃcient. (a) Replicate the Monte Carlo experiments of Hanushek and Jackson (1977) and generate the means of the 400 estimates of the regression coeﬃcients as well as σ 2 . Are these estimates unbiased? (b) Compute the standard deviation of these 400 estimates and call this σ b . Also compute the average of the 400 standard errors of the regression estimates reported by the regression. Denote this mean by sb . Compare these two estimates of the standard deviation of the regression coeﬃcient estimates to the true standard deviation knowing the true σ 2 . What do you conclude? (c) Plot the frequency of these regression coeﬃcients estimates? Does it resemble its theoretical distribution. (d) Increase the sample size form 25 to 50 and repeat the experiment. What do you observe? 10.
(a) Derive the OLS estimates of αF and αM for Yi = αF Fi + αM Mi + ui where Y is Earnings, F is FEMALE and M is MALE, see (4.21). Show that α F = Y¯F , the average of the Yi ’s only ¯ for females, and α M = YM , the average of the Yi ’s only for males. (b) Suppose that the regression is Yi = α + βFi + ui , see (4.22). Show that α = α M , and =α β F − α M . (c) Substitute M = 1 − F in (4.21) and show that α = αM and β = αF − αM . (d) Verify parts (a), (b) and (c) using the earnings data underlying Table 4.1.
11. For equation (4.23) EARN = α + β F F EM ALE + β B BLACK + β H HISP AN IC + u Show that
88
Chapter 4: Multiple Regression Analysis (a) E(Earnings/Hispanic Female) = α + β F + β H ; also E(Earnings/Hispanic Male) = α + β H . Conclude that β F = E(Earnings/Hispanic Female) – E(Earnings/Hispanic Male). (b) E(Earnings/Hispanic Female) – E(Earnings/White Female) = E(Earnings/Hispanic Male) – E(Earnings/White Male) = β H . (c) E(Earnings/Black Female) – E(Earnings/White Female) = E(Earnings/Black Male) – E(Earnings/White Male) = β B .
12. For the earnings equation given in (4.28), how would you set up the F test and what are the restricted and unrestricted regressions for testing the following hypotheses: (a) The equality of slopes and intercepts for Males and Females. (b) The equality of intercepts given the same slopes for Males and Females. Show that the resulting F statistic is the square of a tstatistic from the unrestricted regression. (c) The equality of intercepts allowing for different slopes for Males and Females. Show that the resulting F statistic is the square of a tstatistic from the unrestricted regression. (d) Apply your results in parts (a), (b) and (c) to the earnings data underlying Table 4.1. 13. For the earnings data regression underlying Table 4.1. (a) Replicate the regression results given in Table 4.1. (b) Verify that the joint signiﬁcance of all slope coeﬃcients can be obtained from (4.20). (c) How would you test the joint restriction that expected earnings are the same for Males and Females whether Black or NonBlack holding everything else constant? (d) How would you test the joint restriction that expected earnings are the same whether the individual is married or not and whether this individual belongs to a Union or not? (e) From Table 4.1 what is your estimate of the % change in earnings due to Union membership? If the disturbances are assumed to be lognormal, what would be the estimate suggested by Kennedy (1981) for this % change in earnings? (f) What is your estimate of the % change in earnings due to the individual being married? 14. Crude Quality. Using the data set of U.S. oil ﬁeld postings on crude prices ($/barrel), gravity (degree API) and sulphur (% sulphur) given in the CRUDES.ASC ﬁle on the Springer web site. (a) Estimate the following multiple regression model: POIL = β 1 +β 2 GRAVITY + β 3 SULPHUR + ǫ. (b) Regress GRAVITY = α0 + α1 SULPHUR + ν t then compute the residuals ( ν t ). Now perform the regression P OIL = γ 1 + γ 2 νt + ǫ
in part (a). What does this tell you? Verify that γ 2 is the same as β 2
(c) Regress POIL = φ1 + φ2 SULPHUR + w. Compute the residuals (w). Now regress w on ν obtained from part (b), to get w t = δ 1 + δ 2 ν t + residuals. Show that δ 2 = β 2 in part (a). Again, what does this tell you?
(d) To illustrate how additional data affects multicollinearity, show how your regression in part (a) changes when the sample is restricted to the ﬁrst 25 crudes. (e) Delete all crudes with sulphur content outside the range of 1 to 2 percent and run the multiple regression in part (a). Discuss and interpret these results. 15. Consider the U.S. gasoline data from 19501987 given in Table 4.2, and obtained from the ﬁle USGAS.ASC on the Springer web site.
Problems Table 4.2 U.S. Gasoline Data: 1950–1987
Year
CAR
QMG (1,000 Gallons)
PMG ($ )
POP (1,000)
RGNP (Billion)
PGNP
1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
49195212 51948796 53301329 56313281 58622547 62688792 65153810 67124904 68296594 71354420 73868682 75958215 79173329 82713717 86301207 90360721 93962030 96930949 101039113 103562018 106807629 111297459 117051638 123811741 127951254 130918918 136333934 141523197 146484336 149422205 153357876 155907473 156993694 161017926 163432944 168743817 173255850 177922000
40617285 43896887 46428148 49374047 51107135 54333255 56022406 57415622 59154330 61596548 62811854 63978489 62531373 64779104 67663848 70337126 73638812 76139326 80772657 85416084 88684050 92194620 95348904 99804600 100212210 102327750 106972740 110023410 113625960 107831220 100856070 100994040 100242870 101515260 102603690 104719230 107831220 110467980
0.272 0.276 0.287 0.290 0.291 0.299 0.310 0.304 0.305 0.311 0.308 0.306 0.304 0.304 0.312 0.321 0.332 0.337 0.348 0.357 0.364 0.361 0.388 0.524 0.572 0.595 0.631 0.657 0.678 0.857 1.191 1.311 1.222 1.157 1.129 1.115 0.857 0.897
152271 154878 157553 160184 163026 165931 168903 171984 174882 177830 180671 183691 186538 189242 191889 194303 196560 198712 200706 202677 205052 207661 209896 211909 213854 215973 218035 220239 222585 225055 227757 230138 232520 234799 237001 239279 241613 243915
1090.4 1179.2 1226.1 1282.1 1252.1 1356.7 1383.5 1410.2 1384.7 1481.0 1517.2 1547.9 1647.9 1711.6 1806.9 1918.5 2048.9 2100.3 2195.4 2260.7 2250.7 2332.0 2465.5 2602.8 2564.2 2530.9 2680.5 2822.4 3115.2 3192.4 3187.1 3248.8 3166.0 3279.1 3489.9 3585.2 3676.5 3847.0
26.1 27.9 28.3 28.5 29.0 29.3 30.3 31.4 32.1 32.6 33.2 33.6 34.0 34.5 35.0 35.7 36.6 37.8 39.4 41.2 43.4 45.6 47.5 50.2 55.1 60.4 63.5 67.3 72.2 78.6 85.7 94.0 100.0 103.9 107.9 111.5 114.5 117.7
CAR: RMG: PMG:
Stock of Cars Motor Gasoline Consumption Retail Price of Motor Gasoline
POP: Population RGNP: Real GNP in 1982 dollars PGNP: GNP Deﬂator (1982=100)
89
90
Chapter 4: Multiple Regression Analysis (a) For the period 19501972 estimate models (1) and (2): logQM G =
β 1 + β 2 logCAR + β 3 logP OP + β 4 logRGN P
(1)
+β 5 logP GN P + β 6 logP M G + u log
RGN P CAR PMG QM G = γ 1 + γ 2 log + γ 3 log + γ 4 log +ν CAR P OP P OP P GN P
(2)
(b) What restrictions should the β’s satisfy in model (1) in order to yield the γ’s in model (2)? (c) Compare the estimates and the corresponding standard errors from models (1) and (2). (d) Compute the simple correlations among the X’s in model (1). What do you observe? (e) Use the ChowF test to test the parametric restrictions obtained in part (b). (f) Estimate equations (1) and (2) now using the full data set 19501987. Discuss brieﬂy the effects on individual parameter estimates and their standard errors of the larger data set. (g) Using a dummy variable, test the hypothesis that gasoline demand per CAR permanently shifted downward for model (2) following the Arab Oil Embargo in 1973? (h) Construct a dummy variable regression that will test whether the price elasticity has changed after 1973. 16. Consider the following model for the demand for natural gas by residential sector, call it model (1): logConsit = β 0 + β 1 logP git + β 2 logP oit + β 3 logP eit + β 4 logHDDit + β 5 logP Iit + uit where i = 1, 2, . . . , 6 states and t = 1, 2, . . . , 23 years. Cons is the consumption of natural gas by residential sector, P g, P o and P e are the prices of natural gas, distillate fuel oil, and electricity of the residential sector. HDD is heating degree days and P I is real per capita personal income. The data covers 6 states: NY, FL, MI, TX, UT and CA over the period 19671989. It is given in the NATURAL.ASC ﬁle on the Springer web site. (a) Estimate the above model by OLS. Call this model (1). What do the parameter estimates imply about the relationship between the fuels? (b) Plot actual consumption versus the predicted values. What do you observe? (c) Add a dummy variable for each state except California and run OLS. Call this model (2). Compute the parameter estimates and standard errors and compare to model (1). Do any of the interpretations of the price coeﬃcients change? What is the interpretation of the New York dummy variable? What is the predicted consumption of natural gas for New York in 1989? (d) Test the hypothesis that the intercepts of New York and California are the same. (e) Test the hypothesis that all the states have the same intercept. (f) Add a dummy variable for each state and run OLS without an intercept. Call this model (3). Compare the parameter estimates and standard errors to the ﬁrst two models. What is the interpretation of the coeﬃcient of the New York dummy variable? What is the predicted consumption of natural gas for New York in 1989? (g) Using the regression in part (f), test the hypothesis that the intercepts of New York and California are the same.
References
91
References This chapter draws upon the material in Kelejian and Oates (1989) and Wallace and Silver (1988). Several econometrics books have an excellent discussion on dummy variables, see Gujarati (1978), Judge et al. (1985), Kennedy (1992), Johnston (1984) and Maddala (2001), to mention a few. Other readings referenced in this chapter include: Baltagi, B.H. (1987a), “To Pool or Not to Pool: The Quality Bank Case,” The American Statistician, 41: 150152. Baltagi, B.H. (1987b), “Simple versus Multiple Regression Coeﬃcients,” Econometric Theory, Problem 87.1.1, 3: 159. Chow, G.C. (1960), “Tests of Equality Between Sets of Coeﬃcients in Two Linear Regressions,” Econometrica, 28: 591605. Cornwell, C. and P. Rupert (1988), “Eﬃcient Estimation with Panel Data: An Empirical Comparison of Instrumental Variables Estimators,” Journal of Applied Econometrics, 3: 149155. Dufour, J.M. (1980), “Dummy Variables and Predictive Tests for Structural Change,” Economics Letters, 6: 241247. Dufour, J.M. (1982), “Recursive Stability of Linear Regression Relationships,” Journal of Econometrics, 19: 3176. Gujarati, D. (1970), “Use of Dummy Variables in Testing for Equality Between Sets of Coeﬃcients in Two Linear Regressions: A Note,” The American Statistician, 24: 1821. Gujarati, D. (1970), “Use of Dummy Variables in Testing for Equality Between Sets of Coeﬃcients in Two Linear Regressions: A Generalization,” The American Statistician, 24: 5052. Halvorsen, R. and R. Palmquist (1980), “The Interpretation of Dummy Variables in Semilogarithmic Equations,” American Economic Review, 70: 474475. Hanushek, E.A. and J.E. Jackson (1977), Statistical Methods for Social Scientists (Academic Press: New York). Hill, R. Carter and L.C. Adkins (2001), “Collinearity,” Chapter 12 in B.H. Baltagi (ed.) A Companion to Theoretical Econometrics (Blackwell: Massachusetts). Kennedy, P.E. (1981), “Estimation with Correctly Interpreted Dummy Variables in Semilogarithmic Equations,” American Economic Review, 71: 802. Kennedy, P.E. (1981), “The Balentine: A Graphical Aid for Econometrics,” Australian Economic Papers, 20: 414416. Kennedy, P.E. (1986), “Interpreting Dummy Variables,” Review of Economics and Statistics, 68: 174175. Nieswiadomy, M. (1986), “Effect of an Additional Regressor on R2 ,” Econometric Theory, Problem 86.3.1, 2:442. Poirier, D. (1976), The Econometrics of Structural Change(North Holland: Amsterdam). Salkever, D. (1976), “The Use of Dummy Variables to Compute Predictions, Prediction Errors, and Conﬁdence Intervals,” Journal of Econometrics, 4: 393397. Suits, D. (1984), “Dummy Variables: Mechanics vs Interpretation,” Review of Economics and Statistics, 66: 132139.
92
Chapter 4: Multiple Regression Analysis
Suits, D.B., A. Mason and L. Chan (1978), “Spline Functions Fitted by Standard Regression Methods,” Review of Economics and Statistics, 60: 132139. Utts, J. (1982), “The Rainbow Test for Lack of Fit in Regression,”Communications in StatisticsTheory and Methods, 11: 18011815.
Appendix Residual Interpretation of Multiple Regression Estimates Proof of Claim 1: Regressing X2 on all the other X’s yields residuals ν 2 that satisfy the usual properties of OLS residuals similar to those in (4.2), i.e., n n n n ν 2i XKi = 0 (A.1) ν 2i X4i = .. = i=1 ν 2i X3i = i=1 2i = 0, i=1 i=1 ν Note that X2 is the dependent nvariable of this regression, and X2 is the predicted value from this 2i = 0. This holds because X 2 is a linear combination of the regression. The latter satisﬁes i=1 ν2i X other X’s, all of which satisfy (A.1). Turn now to the estimated regression equation: X2i + .. + β XKi + ei Yi = α +β 2 K
(A.2)
Multiply (A.2) by X2i and sum n
i=1
X2i Yi = α
n
i=1
n
n X2i XKi n X 2 + .. + β X2i + β K 2 2i i=1 i=1
(A.3)
This uses the fact that i=1 X2i ei = 0. Alternatively, (A.3) is just the second equation from (4.3). 2i + Substituting X2i = X ν 2i , in (A.3) one gets n
i=1
2i Yi + n 2i + β 2 + .. n X X ν 2i Yi = α ni=1 X 2 2i i=1 i=1 n n 2 + β K i=1 X2i XKi + β 2 i=1 ν 2i
2i 2i and sum, we get using (A.1) and the fact that Σni=1 X ν 2i = 0. Multiply (A.2) by X n
i=1
n
2i Yi = α X
n
n X n X 2i XKi + n X 2i X2i + .. + β 2i + β X K 2 i=1 2i ei i=1 i=1
n
2 n X n X 2i 2i + β + .. + β X K 2 i=1 2i XKi i=1
i=1
(A.4)
(A.5)
linear combination of all the other X’s, all of which satisfy (4.2). Also, i=1 X2 ei = 0 since X2 is a 2i 2 since n X 2i X2i = n X X ν 2i = 0. Hence (A.5) reduces to i=1 i=1 2i i=1
But n
n
i=1
2i Yi = α X
i=1
(A.6)
Subtracting (A.6) from (A.4), we get n
i=1
ν2i Yi = β 2
n
i=1
ν22i
is the slope estimate of the simple regression of Y on and β ν 2 as given in (4.5). 2 By substituting for Yi its expression from equation (4.1) in (4.5) we get n n n = β n X2i β ν 22i ν 2i ui / i=1 ν 22i + i=1 ν 2i / i=1 2 2 i=1
(A.7)
(A.8)
n n n 2i + ν 2i = 0, which implies ν 2i = 0. But, X2i = X ν 2i and i=1 X ν 2i = 0 and i=1 where i=1 X1i 2i n n n =β + n is unbiased with ν 22i and β ν 22i . This means that β ν 2i = i=1 that i=1 X2i 2 2 2 i=1 ν 2i ui / i=1
Appendix
93
) = β since E(β ν 2 is a linear combination of the X’s and these in turn are not correlated with the u’s. 2 2 Also, n n − β )2 = E(n ) = E(β ν 22i )2 = σ 2 / i=1 ν 22i var(β 2 2 2 i=1 ν 2i ui / i=1
for k = 2, . . . , K, i.e., The same results apply for any β k n n 2 ν ki ν ki Yi / i=1 β k = i=1
(A.9)
where ν k is the OLS residual of Xk on all the other X’s in the regression. Similarly, β k = β k + ni=1 ν ki ui / ni=1 ν 2ki (A.10) n 2 ) = σ2 / ) = β with var(β and E(β ν ki for k = 2, . . . , K. Note also that k k k i=1 n n n ,β ) = E(β − β )(β − β ) = E(n cov(β ν 2ki ) ν 22i )( i=1 νki ui / i=1 2 k 2 2 k k i=1 i=1 ν 2i ui / n n n 2 2 ν ki ν 2i i=1 ν 2i ν ki / i=1 = σ 2 i=1
Proof of Claim 2: Regressing Y on all the other X’s yields, Yi = Yi + ω i . Substituting this expression for Yi in (4.5) one gets n n n = (n n ν 22i (A.11) ν 2i ω i / i=1 ν 22i = i=1 i )/ i=1 β 2 i=1 ν 2i ω i=1 ν 2i Yi + where the last equality follows from the fact that Y is a linear combination of all X’s excluding X2 , all is the estimate of the slope coeﬃcient in the linear regression of ω on of which satisfy (A.1). Hence β 2 ν2.
Simple, Partial and Multiple Correlation Coefficients
2 , as the proportion of In Chapter 3, we interpreted the square of the simple correlation coefficient, rY,X 2 2 the variation in Y that is explained by X2 . Similarly, rY,Xk is the Rsquared of the simple regression of Y on Xk for k = 2, . . . , K. In fact, one can compute these simple correlation coeﬃcients and ﬁnd out which Xk is most correlated with Y , say it is X2 . If one is selecting regressors to include in the regression equation, X2 would be the best one variable candidate. In order to determine what variable to include next, we look at partial correlation coefficients of the form rY,Xk .X2 for k = 2. The square of this ﬁrstorder partial gives the proportion of the residual variation in Y , not explained by X2 , that is explained by the addition of Xk . The maximum ﬁrstorder partial (‘ﬁrst’ because it has only one variable after the dot) determines the best candidate to follow X2 . Let us assume it is X3 . The ﬁrstorder partial correlation coeﬃcients can be computed from simple correlation coeﬃcients as follows: rY,X3 − rY,X2 rX2 ,X3 rY,X3 .X2 = 2 2 1 − rY,X 1 − rX 2 2 ,X3
see Johnston (1984). Next we look at secondorder partials of the form rY,Xk .X2 ,X3 for k = 2, 3, and so on. This method of selecting regressors is called forward selection. Suppose there is only X2 , X3and X4 2 ) is the proportion of the variation in Y , i.e., ni=1 yi2 , in the regression equation. In this case (1 − rY,X 2 2 2 that is not explained by X2 . Also (1 − rY,X3 .X2 )(1 − rY,X ) denotes the proportion of the variation in Y 2 2 2 2 ) )(1 − rY,X )(1 − rY,X not explained after the inclusion of both X2 and X3 . Similarly (1 − rY,X 2 3 .X2 4 .X2 ,X3 is the proportion of the variation in Y unexplained after the inclusion of X2 , X3 and X4 . But this is exactly (1 − R2 ), where R2 denotes the Rsquared of the multiple regression of Y on a constant, X2 , X3 2 . Hence and X4 . This R2 is called the multiple correlation coefficient, and is also written as RY.X 2 ,X3 ,X4 2 2 2 2 ) = (1 − rY,X )(1 − rY,X )(1 − rY,X ) (1 − RY.X 2 ,X3 ,X4 2 3 .X2 4 .X2 ,X3
and similar expressions relating the multiple correlation coeﬃcient to simple and partial correlation coeﬃcients can be written by including say X3 ﬁrst then X4 and X2 in that order.
CHAPTER 5
Violations of the Classical Assumptions 5.1
Introduction
In this chapter, we relax the assumptions made in Chapter 3 one by one and study the effect of that on the OLS estimator. In case the OLS estimator is no longer a viable estimator, we derive an alternative estimator and propose some tests that will allow us to check whether this assumption is violated.
5.2
The Zero Mean Assumption
Violation of assumption 1 implies that the mean of the disturbances is no longer zero. Two cases are considered:
Case 1: E(ui ) = μ = 0 The disturbances have a common mean which is not zero. In this case, one can subtract μ from the ui ’s and get new disturbances u∗i = ui − μ which have zero mean and satisfy all the other assumptions imposed on the ui ’s. Having subtracted μ from ui we add it to the constant α leaving the regression equation intact: Yi = α∗ + βXi + u∗i
i = 1, 2, . . . , n
(5.1)
where α∗ = α + μ. It is clear that only α∗ and β can be estimated, and not α nor μ. In other words, one cannot retrieve α and μ from an estimate of α∗ without additional assumptions or further information, see problem 10. With this reparameterization, equation (5.1) satisﬁes the four classical assumptions, and therefore OLS gives the BLUE estimators of α∗ and β. Hence, a constant nonzero mean for the disturbances affects only the intercept estimate but not the slope. Fortunately, in most economic applications, it is the slope coeﬃcients that are of interest and not the intercept.
Case 2: E(ui ) = μi The disturbances have a mean which varies with every observation. In this case, one can transform the regression equation as in (5.1) by adding and subtracting μi . The problem, however, is that α∗i = α + μi now varies with each observation, and hence we have more parameters than observations. In fact, there are n intercepts and one slope to be estimated with n observations. Unless we have repeated observations like in panel data, see Chapter 12 or we have some prior information on these α∗i , we cannot estimate this model.
96
Chapter 5: Violations of the Classical Assumptions
5.3
Stochastic Explanatory Variables
Sections 5.5 and 5.6 will study violations of assumptions 2 and 3 in detail. This section deals with violations of assumption 4 and its effect on the properties of the OLS estimators. In this case, X is a random variable which may be (i) independent; (ii) contemporaneously uncorrelated; or (iii) simply correlated with the disturbances. Case 1: If X is independent of u, then all the results of Chapter 3 still hold, but now they are conditional on the particular set of X’s drawn in the sample. To illustrate this result, recall that for the simple linear regression: OLS = β + n wi ui where wi = xi / n x2 (5.2) β i=1 i=1 i n n Hence, when we take expectations E( i=1 wi ui ) = i=1 E(wi )E(ui ) = 0. The ﬁrst equality holds because X and u are independent and the second equality holds because the u’s have zero mean. In other words the unbiasedness property of the OLS estimator still holds. However, the n n n 2 2 n 2 var(β OLS ) = E( i=1 wi ui ) = i=1 E(wi ) j=1 E(wi wj )E(ui uj ) = σ i=1 where the last equality follows from assumptions 2 and 3, homoskedasticity and no serial correlation. The only difference between this result and that of Chapter 3 is that we have expectations on the X’s rather than the X’s themselves. Hence, by conditioning on the particular set of X’s that are observed, we can use all the results of Chapter 3. Also, maximizing the likelihood involves both the X’s and the u’s. But, as long as the distribution of the X’s does not involve the parameters we are estimating, i.e., α, β and σ 2 , the same maximum likelihood estimators are obtained. Why? Because f (x1 , x2 , . . . , xn , u1 , u2 , . . . , un ) = f1 (x1 , x2 , . . . , xn )f2 (u1 , u2 , . . . , un ) since the X’s and the u’s are independent. Maximizing f with respect to (α, β, σ 2 ) is the same as maximizing f2 with respect to (α, β, σ 2 ) as long as f1 is not a function of these parameters. Case 2: Consider a simple model of consumption, where Yt , current consumption, is a function of Yt−1 , consumption in the previous period. This is the case for a habit forming consumption good like cigarette smoking. In this case our regression equation becomes Yt = α + βYt−1 + ut
t = 2, . . . , T
(5.3)
where we lost one observation due to lagging. It is obvious that Yt is correlated to ut , but the question here is whether Yt−1 is correlated to ut . After all, Yt−1 is our explanatory variable Xt . As long as assumption 3 is not violated, i.e., the u’s are not correlated across periods, ut represents a freshly drawn disturbance independent of previous disturbances and hence is not correlated with the already predetermined Yt−1 . This is what we mean by contemporaneously uncorrelated, i.e., ut is correlated with Yt , but it is not correlated with Yt−1 . The OLS estimator of β is T T T T 2 2 β (5.4) OLS = t=2 yt yt−1 / t=2 yt−1 = β + t=2 yt−1 ut / t=2 yt−1
and the expected value of (5.4) is not β because in general, 2 2 ) = E( Tt=2 yt−1 ut )/E( Tt=2 yt−1 ). E( Tt=2 yt−1 ut / Tt=2 yt−1
The expected value of a ratio is not the ratio of expected values. Also, even if E(Yt−1 ut ) = 0, one can easily show that E(yt−1 ut ) = 0. In fact, yt−1 = Yt−1 − Y¯ , and Y¯ contains Yt in it, and
5.3
Stochastic Explanatory Variables
97
we know that E(Yt ut ) = 0. Hence, we lost the unbiasedness property of OLS. However, all the asymptotic properties still hold. In fact, β OLS is consistent because plim β OLS = β + cov(Yt−1 , ut )/var(Yt−1 ) = β
(5.5)
where the second equality follows from (5.4) and the fact that plim( Tt=2 yt−1 ut /T ) is T 2 /T ) = var(Y cov(Yt−1 , ut ) which is zero, and plim( t=2 yt−1 t−1 ) which is positive and ﬁnite.
Case 3: X and u are correlated, in this case OLS is biased and inconsistent. This can be easily deduced from (5.2) since plim( ni=1 xi ui /n) is the cov(X, u) = 0, and plim( ni=1 x2i /n) is positive and ﬁnite. This means that OLS is no longer a viable estimator, and an alternative estimator that corrects for this bias has to be derived. In fact we will study three speciﬁc cases where this assumption is violated. These are: (i) the errors in measurement case; (ii) the case of a lagged dependent variable with correlated errors; and (iii) simultaneous equations. Brieﬂy, the errors in measurement case involves a situation where the true regression model is in terms of X ∗ , but X ∗ is measured with error, i.e., Xi = Xi∗ + ν i , so we observe Xi but not Xi∗ . Hence, when we substitute this Xi for Xi∗ in the regression equation, we get Yi = α + βXi∗ + ui = α + βXi + (ui − βν i )
(5.6)
where the composite error term is now correlated with Xi because Xi is correlated with ν i . After all, Xi = Xi∗ + ν i and E(Xi ν i ) = E(ν 2i ) if Xi∗ and ν i are uncorrelated. Similarly, in case (ii) above, if the u’s were correlated across time, i.e., ut−1 is correlated with ut , then Yt−1 , which is a function of ut−1 , will also be correlated with ut , and E(Yt−1 ut ) = 0. More on this and how to test for serial correlation in the presence of a lagged dependent variable in Chapter 6. Finally, if one considers a demand and supply equations where quantity Qt is a function of price Pt in both equations Qt = α + βPt + ut
(demand)
(5.7)
Qt = δ + γPt + ν t
(supply)
(5.8)
The question here is whether Pt is correlated with the disturbances ut and ν t in both equations. The answer is yes, because (5.7) and (5.8) are two equations in two unknowns Pt and Qt . Solving for these variables, one gets Pt as well as Qt as a function of a constant and both ut and ν t . This means that E(Pt ut ) = 0 and E(Pt ν t ) = 0 and OLS performed on either (5.7) or (5.8) is biased and inconsistent. We will study this simultaneous bias problem more rigorously in Chapter 11. For all situations where X and u are correlated, it would be illuminating to show graphically why OLS is no longer a consistent estimator. Let us consider the case where the disturbances are, say, positively correlated with the explanatory variable. Figure 3.3 of Chapter 3 shows the true regression line α + βXi . It also shows that when Xi and ui are positively correlated then an Xi higher than its mean will be associated with a disturbance ui above its mean, i.e., a positive disturbance. Hence, Yi = α + βXi + ui will always be above the true regression line whenever Xi is above its mean. Similarly Yi would be below the true regression line for every Xi below its mean. This means that not knowing the true regression line, a researcher ﬁtting OLS on this data will have a biased intercept and slope. In fact, the intercept will be understated and the slope will be overstated. Furthermore, this bias does not disappear with more data, since
98
Chapter 5: Violations of the Classical Assumptions
this new data will be generated by the same mechanism described above. Hence these OLS estimates are inconsistent. Similarly, if Xi and ui are negatively correlated, the intercept will be overstated and the slope will be understated. This story applies to any equation with at least one of its right hand side variables correlated with the disturbance term. Correlation due to the lagged dependent variable with autocorrelated errors, is studied in Chapter 6, whereas the correlation due to the simultaneous equations problem is studied in Chapter 11.
5.4
Normality of the Disturbances
If the disturbance are not normal, OLS is still BLUE provided assumptions 14 still hold. Normality made the OLS estimators minimum variance unbiased MVU and these OLS estimators turn out to be identical to the MLE. Normality allowed the derivation of the distribution of these estimators and this in turn allowed testing of hypotheses using the t and F tests considered in the previous chapter. If the disturbances are not normal, yet the sample size is large, one can still use the normal distribution for the OLS estimates asymptotically by relying on the Central Limit Theorem, see Theil (1978). Theil’s proof is for the case of ﬁxed X’s in repeated samples, zero mean and constant variance on the disturbances. A simple asymptotic test for the normality assumption is given by Jarque and Bera (1987). This is based on the fact that the normal distribution has a skewness measure of zero and a kurtosis of 3. Skewness (or lack of symmetry) is measured by S=
Square of the 3rd moment about the mean [E(X − μ)3 ]2 = [E(X − μ)2 ]3 Cube of the variance
Kurtosis (a measure of ﬂatness) is measured by κ=
E(X − μ)4 4th moment about the mean = [E(X − μ)2 ]2 Square of the variance
For the normal distribution S = 0 and κ = 3. Hence, the JarqueBera (JB) statistic is given by
2 (κ − 3)2 S JB = n + 6 24 where S represents skewness and κ represents kurtosis of the OLS residuals. This statistic is asymptotically distributed as χ2 with two degrees of freedom under H0 . Rejecting H0 , rejects normality of the disturbances but does not offer an alternative distribution. In this sense, the test is nonconstructive. In addition, not rejecting H0 does not necessarily mean that the distribution of the disturbances is normal, it only means we do not reject that the distribution of the disturbances is symmetric and has a kurtosis of 3. See the empirical example in section 5.5 for an illustration. The JarqueBera test is part of the standard output using EViews.
5.5
Heteroskedasticity
Violation of assumption 2, means that the disturbances have a varying variance, i.e., E(u2i ) = σ 2i , i = 1, 2, . . . , n. First, we study the effect of this violation on the OLS estimators. For the simple
5.5
Heteroskedasticity
99
OLS given in equation (5.2) is still unbiased and consistent linear regression it is obvious that β because these properties depend upon assumptions 1 and 4, and not assumption 2. However, the variance of β OLS is now different n n n 2 2 2 2 n 2 2 var(β (5.9) OLS ) = var( i=1 wi ui ) = i=1 wi σ i = i=1 xi σ i /( i=1 xi )
where the second equality follows from assumption 3 and the fact that var(ui ) is now σ 2i . Note that if σ 2i = σ 2 , this reverts back to σ 2 / ni=1 x2i , the usual formula for var(β OLS ) under 2 2 homoskedasticity. Furthermore, one can show that E(s ) will involve all of the σ i ’s and not one common σ 2 , see problem 1. This means that the regression package reporting s2 / ni=1 x2i as the estimate of the variance of β OLS is committing two errors. One, it is not using the right formula for the variance, i.e., equation (5.9). Second, it is using s2 to estimate a common σ 2 when in n fact the σ 2i ’s are different. The bias from using s2 / i=1 x2i as an estimate of var(β OLS ) will 2 depend upon the nature of the heteroskedasticity and the regressor. In fact, if σ i is positively related to x2i , one can show that s2 / ni=1 x2i understates the true variance and hence the tstatistic reported for β = 0 is overblown, and the conﬁdence interval for β is tighter than it is supposed to be, see problem 2. This means that the tstatistic in this case is biased towards rejecting H0 ; β = 0, i.e., showing signiﬁcance of the regression slope coeﬃcient, when it may not be signiﬁcant. The OLS estimator of β is linear unbiased and consistent, but is it still BLUE? In order to answer this question, we note that the only violation we have is that the var(ui ) = σ 2i . Hence, if we divided ui by σ i /σ, the resulting u∗i = σui /σ i will have a constant variance σ 2 . It is easy to show that u∗ satisﬁes all the classical assumptions including homoskedasticity. The regression model becomes σYi /σ i = ασ/σ i + βσXi /σ i + u∗i
(5.10)
and OLS on this model (5.10) is BLUE. The OLS normal equations on (5.10) are n n n 2 2 2 i=1 (Yi /σ i ) = α i=1 (1/σ i ) + β i=1 (Xi /σ i ) n
2 i=1 (Yi Xi /σ i ) = α
n
2 i=1 (Xi /σ i ) + β
n
(5.11)
2 2 i=1 (Xi /σ i )
Note that σ 2 drops out of these equations. Solving (5.11), see problem 3, one gets n (Xi /σ 2 )/ n (1/σ 2 )] = Y¯ ∗ − β X ¯∗ α = [ ni=1 (Y1i /σ 2i )/ ni=1 (1/σ 2i )] − β[ i i i=1 i=1 with Y¯ ∗ = [ ni=1 (Yi /σ 2i )/ ni=1 (1/σ 2i )] = ni=1 wi∗ Yi / ni=1 wi∗ and ¯ ∗ = [n (Xi /σ 2 )/ n (1/σ 2 )] = n w∗ Xi / n w∗ X i i i=1 i=1 i=1 i i=1 i where wi∗ = (1/σ 2i ). Similarly, [ ni=1 (1/σ 2i )][ ni=1 (Yi Xi /σ 2i )] − [ ni=1 (Xi /σ 2i )][ ni=1 (Yi /σ 2i )] β = [ ni=1 Xi2 /σ 2i ][ ni=1 (1/σ 2i )] − [ ni=1 (Xi /σ 2i )]2 ( ni=1 wi∗ )( ni=1 wi∗ Xi Yi ) − ( ni=1 wi∗ Xi )( ni=1 wi∗ Yi ) = ( ni=1 wi∗ )( ni=1 wi∗ Xi2 ) − ( ni=1 wi∗ Xi )2 n ¯ ∗ )(Yi − Y¯ ∗ ) w∗ (X − X i=1 ni i∗ = ¯∗ 2 i=1 wi (Xi − X )
(5.12a)
(5.12b)
100
Chapter 5: Violations of the Classical Assumptions
obtained from the regression in (5.10), are different It is clear that the BLU estimators α and β, 2 from the usual OLS estimators α OLS and β OLS since they depend upon the σ i ’s. It is also true 2 2 that when σ i = σ for all i = 1, 2, . . . , n, i.e., under homoskedasticity, (5.12) reduces to the usual OLS estimators given by equation (3.4) of Chapter 3. The BLU estimators weight the ith observation by (1/σ i ) which is a measure of precision of that observation. The more precise the observation, i.e., the smaller σ i , the larger is the weight attached to that observation. α are also known as Weighted Least Squares (WLS) estimators which are a speciﬁc form and β of Generalized Least Squares (GLS). We will study GLS in details in Chapter 9, using matrix notation. Under heteroskedasticity, OLS looses eﬃciency in that it is no longer BLUE. However, because it is still unbiased and consistent and because the true σ 2i ’s are never known some researchers compute OLS as an initial consistent estimator of the regression coeﬃcients. It is important to emphasize however, that the standard errors of these estimates as reported by the regression package are biased and any inference based on these estimated variances including the reported tstatistics are misleading. White (1980) proposed a simple procedure that would yield heteroskedasticity consistent standard errors of the OLS estimators. In equation (5.9), this amounts to replacing σ 2i by e2i , the square of the ith OLS residual, i.e., W hite’s var(β OLS ) =
n
2 2 n 2 2 i=1 xi ei /( i=1 xi )
(5.13)
Note that we can not consistently estimate σ 2i by e2i , since there is one observation per parameter estimated. As the sample size increases, so does the number of unknown σ 2i ’s. What White (1980) 2 consistently estimates is the var(β OLS ) which is a weighted average of the ei . The same analysis applies to the multiple regression OLS estimates. In this case, White’s (1980) heteroskedasticity consistent estimate of the variance of the kth OLS regression coeﬃcient β k , is given by 2 2 n ) = n 2ki )2 W hite’s var(β k i=1 ν ki ei /( i=1 ν
where ν 2k is the squared OLS residual obtained from regressing Xk on the remaining regressors in the equation being estimated. ei is the ith OLS residual from this multiple regression equation. Many regression packages provide White’s heteroskedasticityconsistent estimates of the variances and their corresponding robust tstatistics. For example, using EViews, one clicks on Quick, choose Estimate Equation. Now click on Options, a menu appears where one selects White to obtain the heteroskedasticityconsistent estimates of the variances. While the regression packages correct for heteroskedasticity in the tstatistics they do not usually do that for the F statistics studied, say in Example 2 in Chapter 4. Wooldridge (1991) suggests a simple way of obtaining a robust LM statistic for H0 ; β 2 = β 3 = 0 in the multiple regression (4.1). This involves the following steps: (1) Run OLS on the restricted model without X2 and X3 and obtain the restricted least squares residuals u . (2) Regress each of the independent variables excluded under the null (i.e., X2 and X3 ) on all of the other included independent variables (i.e., X4 , X5 , . . . , XK ) including the constant. Get the corresponding residuals v2 and v3 , respectively.
(3) Regress the dependent variable equal to 1 for all observations on v2 u , v3 u without a constant and obtain the robust LM statistic equal to the n the sum of squared residuals of
5.5
Heteroskedasticity
101
this regression. This is exactly nRu2 of this last regression. Under H0 this LM statistic is distributed as χ22 . The only problem is that the σ i ’s Since OLS is no longer BLUE, one should compute α and β. are rarely known. One example where the σ i ’s are known up to a scalar constant is the following simple example of aggregation. Example 5.1: Aggregation and Heteroskedasticity. Let Yij be the observation on the jth ﬁrm in the ith industry, and consider the following regression: Yij = α + βXij + uij
j = 1, 2, . . . , ni ; i = 1, 2, . . . , m
(5.14)
If only aggregate observations on each industry are available, then (5.14) is summed over ﬁrms, i.e., (5.15) Yi = αni + βXi + ui i = 1, 2, . . . , m ni ni ni where Yi = j=1 Yij , Xi = j=1 Xij , ui = j=1 uij for i = 1, 2, . . . , m. Note that although the uij ’s are IID(0, σ 2 ), by aggregating, we get ui ∼ (0, ni σ 2 ). This means that the disturbances in (5.15) are heteroskedastic. However, σ 2i = ni σ 2 and is known up to a scalar constant. In fact, σ/σ i is 1/(ni )1/2 . Therefore, premultiplying (5.15) by 1/(ni )1/2 and performing OLS on the transformed equation results in BLU estimators of α and β. In other words, BLU estimation reduces to performing OLS of Yi /(ni )1/2 on (ni )1/2 and Xi /(ni )1/2 , without an intercept. There may be other special cases in practice where σ i is known up to a scalar, but in general, σ i is usually unknown and will have to be estimated. This is hopeless with only n observations, since there are n σ i ’s, so we either have to have repeated observations, or know more about the σ i ’s. Let us discuss these two cases. Case 1: Repeated Observations Suppose that ni households are selected randomly with income Xi for i = 1, 2, . . . , m. For each household j = 1, 2, . . . , ni , we observe its consumption expenditures on food, say Yij . The regression equation is Yij = α + βXi + uij
i = 1, 2, . . . , m ; j = 1, 2, . . . , ni
(5.16)
where m is the number of income groups selected. Note that Xi has only one subscript, whereas Yij has two subscripts denoting the repeated observations on households with the same income Xi . The uij ’s are independently distributed (0, σ 2i ) reﬂecting the heteroskedasticity in consumption expenditures among the different income groups. In this case, there are n = m i=1 ni observations and m σ 2i ’s to be estimated. This is feasible, and there are two methods for estimating these σ 2i ’s. The ﬁrst is to compute i (Yij − Y¯i )2 /(ni − 1) s2i = nj=1
ni ni 2 2i = where Y¯i = j=1 Yij /ni . The second is to compute s j=1 eij /ni where eij is the OLS residual given by eij = Yij − α OLS − β OLS Xi
102
Chapter 5: Violations of the Classical Assumptions
Both estimators of σ 2i are consistent. Substituting either s2i or s2i for σ 2i in (5.12) will result in feasible estimators of α and β. However, the resulting estimates are no longer BLUE. The substitution of the consistent estimators of σ 2i is justiﬁed on the basis that the resulting α and β estimates will be asymptotically eﬃcient, see Chapter 9. Of course, this step could have si on (1/ si ) and (Xi / si ) without a constant, or the similar been replaced by a regression of Yij / regression in terms of si . For this latter estimate, s2i , one can iterate, i.e., obtaining new residuals based on the new regression estimates and therefore new s2i . The process continues until the estimates obtained from the rth iteration do not differ from those of the (r + 1)th iteration in absolute value by more than a small arbitrary positive number chosen as the convergence criterion. Once the estimates converge, the ﬁnal round estimators are the maximum likelihood estimators, see Oberhofer and Kmenta (1974). Case 2: Assuming More Information on the Form of Heteroskedasticity If we do not have repeated observations, it is hopeless to try and estimate n variances and α and β with only n observations. More structure on the form of heteroskedasticity is needed to estimate this model, but not necessarily to test it. Heteroskedasticity is more likely to occur with crosssection data where the observations may be on ﬁrms with different size. For example, a regression relating proﬁts to sales might have heteroskedasticity, because larger ﬁrms have more resources to draw upon, can borrow more, invest more, and loose or gain more than smaller ﬁrms. Therefore, we expect the form of heteroskedasticity to be related to the size of the ﬁrm, which is reﬂected in this case by the regressor, sales, or some other variable that measures size, like assets. Hence, for this regression we can write σ 2i = σ 2 Zi2 , where Zi denotes the sales or assets of ﬁrm i. Once again the form of heteroskedasticity is known up to a scalar constant and the BLU estimators of α and β can be obtained from (5.12), assuming Zi is known. Alternatively, one can run the regression of Yi /Zi on 1/Zi and Xi /Zi without a constant to get the same result. Special cases of Zi are Xi and E(Yi ). (i) If Zi = Xi the regression becomes that of Yi /Xi on 1/Xi and a constant. Note that the regression coeﬃcient of 1/Xi is the estimate of α, while the constant of the regression is now the estimate of β. But, is it possible to have ui uncorrelated with Xi when we are assuming var(ui ) related to Xi ? The answer is yes, as long as E(ui /Xi ) = 0, i.e., the mean of ui is zero for every value of Xi , see Figure 3.4 of Chapter 3. This, in turn, implies that the overall mean of the ui ’s is zero, i.e., E(ui ) = 0 and that cov(Xi , ui ) = 0. If the latter is not satisﬁed and say cov(Xi , ui ) is positive, then large values of Xi imply large values of ui . This would mean that for these values of Xi , we have a nonzero mean for the corresponding ui ’s. This contradicts E(ui /Xi ) = 0. Hence, if E(ui /Xi ) = 0, then cov(Xi , ui ) = 0. (ii) If Zi = E(Yi ) = α + βXi , then σ 2i is proportional to the population regression line, which is a linear function of α and β. Since the OLS estimates are consistent OLS + β one can estimate E(Yi ) by Yi = α OLS Xi use Zi = Yi instead of E(Yi ). In other words, run the regression of Yi /Yi on 1/Yi and Xi /Yi without a constant. The resulting estimates are asymptotically eﬃcient, see Amemiya (1973). One can generalize σ 2i = σ 2 Zi2 to σ 2i = σ 2 Ziδ where δ is an unknown parameter to be estimated. Hence rather than estimating n σ 2i ’s one has to estimate only σ 2 and δ. Assuming normality one can set up the likelihood function and derive the ﬁrstorder conditions by differentiating that likelihood with respect to α, β, σ 2 and δ. The resulting equations are highly nonlinear. Alternatively, one can search over possible values for δ = 0, 0.1, 0.2, . . . , 4, and get the δ/2 δ/2 δ/2 corresponding estimates of α, β, and σ 2 from the regression of Yi /Zi on 1/Zi and Xi /Zi
5.5
Heteroskedasticity
103
without a constant. This is done for every δ and the value of the likelihood function is reported. Using this search procedure one can get the maximum value of the likelihood and corresponding to it the MLE of α, β, σ 2 and δ. Note that as δ increases so does the degree of heteroskedasticity. Problem 4 asks the reader to compute the relative eﬃciency of the OLS estimator with respect to the BLU estimator for Zi = Xi for various values of δ. As expected the relative eﬃciency of the OLS estimator declines as the degree of heteroskedasticity increases. One can also generalize σ 2i = σ 2 Ziδ to include more Z variables. In fact, a general form of this multiplicative heteroskedasticity is logσ 2i = logσ 2 + δ 1 logZ1i + δ 2 logZ2i + . . . + δr logZri
(5.17)
with r < n, otherwise one cannot estimate with n observations. Z1 , Z2 , . . . , Zr are known variables determining the heteroskedasticity. Note that if δ 2 = δ 3 = . . . = δ r = 0, we revert back to σ 2i = σ 2 Ziδ , where δ = δ1 . For the estimation of this general multiplicative form of heteroskedasticity, see Harvey (1976). Another form for heteroskedasticity, is the additive form σ 2i = a + b1 Z1i + b2 Z2i + . . . + br Zri
(5.18)
where r < n, see Goldfeld and Quandt (1972). Special cases of (5.18) include σ 2i = a + b1 Xi + b2 Xi2
(5.19)
where if a and b1 are zero we have a simple form of multiplicative heteroskedasticity. In order to estimate the regression model with additive heteroskedasticity of the type given in (5.19), one can get the OLS residuals, the ei ’s, and run the following regression e2i = a + b1 Xi + b2 Xi2 + vi e2i
(5.20)
σ 2i .
where vi = − The vi ’s are heteroskedastic, and the OLS estimates of (5.20) yield the following estimates of σ 2i aOLS + b1,OLS Xi + b2,OLS Xi2 σ 2i =
(5.21)
(e2i / σ i ) = a(1/ σ i ) + b1 (Xi / σ i ) + b2 (Xi2 / σ i ) + wi
(5.22)
One can obtain a better estimate of the σ 2i ’s by computing the following regression which corrects for the heteroskedasticity in the vi ’s The new estimates of σ 2i are a + b1 Xi + b2 Xi2 σ 2i =
(5.23)
one can run the regression of where a, b1 and b2 are the OLS estimates from (5.22). Using the Yi / σ i on (1/ σ i ) and Xi / σ i without a constant to get asymptotically eﬃcient estimates of α and β. These have the same asymptotic properties as the MLE estimators derived in Rutemiller and Bowers (1968), see Amemiya (1977) and Buse (1984). The problem with this iterative procedure is that there is no guarantee that the σ 2i ’s are positive, which means that the square root σ i may not exist. This problem would not occur if σ 2i = (a + b1 Xi + b2 Xi2 )2 because in this case one regresses ei  on a constant, Xi and Xi2 and the predicted value from this regression would be an estimate of σ i . It would not matter if this predictor is negative, because we do not have to take its square root and because its sign cancels in the OLS normal equations of the ﬁnal regression of Yi / σ i on (1/ σ i ) and (Xi / σ i ) without a constant. σ 2i ’s
104
Chapter 5: Violations of the Classical Assumptions
Testing for Homoskedasticity In the repeated observation’s case, one can perform Bartlett’s (1937) test. The null hypothesis is H0 ; σ 21 = σ 22 = . . . = σ 2m . Under σ 2 which can be estimated m the2 null there is onevariance m 2 by the pooled variance s = i /ν where ν = i=1 ν i s i=1 ν i , and ν i = ni − 1. Under the alternative hypothesis there are m different variances estimated by s2i for i = 1, 2, . . . , m. The Likelihood Ratio test, which computes the ratio of the likelihoods under the null and alternative hypotheses, reduces to computing B = [νlogs2 −
m
i=1 ν i
logs2i ]/c
(5.24)
2 where c = 1 + [ m i=1 (1/ν i ) − 1/ν] /3(m − 1). Under H0 , B is distributed χm−1 . Hence, a large pvalue for the Bstatistic given in (5.24) means that we do not reject homoskedasticity whereas, a small pvalue leads to rejection of H0 in favor of heteroskedasticity. In case of no repeated observations, several tests exist in the literature. Among these are the following: (1) Glejser’s (1969) Test: In this case one regresses ei  on a constant and Ziδ for δ = 1, −1, 0.5 and −0.5. If the coeﬃcient of Ziδ is signiﬁcantly different from zero, this would lead to a rejection of homoskedasticity. The power of this test depends upon the true form of heteroskedasticity. One important result however, is that this power is not seriously impaired if the wrong value of δ is chosen, see Ali and Giaccotto (1984) who conﬁrmed this result using extensive Monte Carlo experiments. (2) The Goldfeld and Quandt (1965) Test: This is a simple and intuitive test. One orders the observations according to Xi and omits c central observations. Next, two regressions are run on the two separated sets of observations with (n − c)/2 observations in each. The c omitted observations separate the low value X’s from the high value X’s, and if heteroskedasticity exists and is related to Xi , the estimates of σ 2 reported from the two regressions should be different. Hence, the test statistic is s22 /s21 where s21 and s22 are the Mean Square Error of the two regressions, respectively. Their ratio would be the same as that of the two residual sums of squares because the degrees of freedom of the two regressions are the same. This statistic is F distributed with ((n−c)/2)−K degrees of freedom in the numerator as well as the denominator. The only remaining question for performing this test is the magnitude of c. Obviously, the larger c is, the more central observations are being omitted and the more conﬁdent we feel that the two samples are distant from each other. The loss of c observations should lead to loss of power. However, separating the two samples should give us more conﬁdence that the two variances are in fact the same if we do not reject homoskedasticity. This trade off in power was studied by Goldfeld and Quandt using Monte Carlo experiments. Their results recommend the use of c = 8 for n = 30 and c = 16 for n = 60. This is a popular test, but assumes that we know how to order the heteroskedasticity. In this case, using Xi . But what if there are more than one regressor on the right hand side? In that case one can order the observations using Yi . (3) Spearman’s Rank Correlation Test: This test ranks the Xi ’s and the absolute value of the OLS residuals, the ei ’s. Then it computes the difference between these rankings, i.e., di =
rank(ei )− rank(Xi ). The SpearmanCorrelation coeﬃcient is r = 1 − 6 ni=1 d2i /(n3 − n) . Finally, test H0 ; the correlation coeﬃcient between the rankings is zero, by computing t =
5.5
Heteroskedasticity
105
[r 2 (n − 2)/(1 − r 2 )]1/2 which is tdistributed with (n − 2) degrees of freedom. If this tstatistic has a large pvalue we do not reject homoskedasticity. Otherwise, we reject homoskedasticity in favor of heteroskedasticity. (4) Harvey’s (1976) Multiplicative Heteroskedasticity Test: If heteroskedasticity is related to Xi , it looks like the Goldfeld and Quandt test or the Spearman rank correlation test would detect it, and the Glejser test would establish its form. In case the form of heteroskedasticity is of the multiplicative type, Harvey (1976) suggests the following test which rewrites (5.17) as loge2i = logσ 2 + δ 1 logZ1i + . . . + δr logZri + vi
(5.25)
where vi = log(e2i /σ 2i ). This disturbance term has an asymptotic distribution that is logχ21 . This random variable has mean −1.2704 and variance 4.9348. Therefore, Harvey suggests performing the regression in (5.25) and testing H0 ; δ 1 = δ 2 = . . . = δ r = 0 by computing the regression sum of squares divided by 4.9348. This statistic is distributed asymptotically as χ2r . This is also asymptotically equivalent to an F test that tests for δ 1 = δ 2 = . . . = δr = 0 in the regression given in (5.25). See the F test described in example 6 of Chapter 4. (5) Breusch and Pagan (1979) Test: If one knows that σ 2i = f (a + b1 Z1 + b2 Z2 + .. + br Zr ) but does not know the form of this function f , Breusch and Pagan (1979) suggest following n the 2 2 test for homoskedasticity, i.e., H0 ; b1 = b2 = . . . = br = 0. Compute σ = i=1 ei /n, which σ 2 on the would be the MLE estimator of σ 2 under homoskedasticity. Run the regression of e2i / Z variables and a constant, and compute half the regression sum of squares. This statistic is distributed as χ2r . This is a more general test than the ones discussed earlier in that f does not have to be speciﬁed. (6) White’s (1980) Test: Another general test for homoskedasticity where nothing is known about the form of this heteroskedasticity is suggested by White (1980). This test is based on the difference between the variance of the OLS estimates under homoskedasticity and that under heteroskedasticity. For the case of a simple regression with a constant, White shows that OLS ) given by (5.13) with the usual var(β OLS ) = s2 / n x2 this test compares White’s var(β i=1 i under homoskedasticity. This test reduces to running the regression of e2i on a constant, Xi and Xi2 and computing nR2 . This statistic is distributed as χ22 under the null hypothesis of homoskedasticity. The degrees of freedom correspond to the number of regressors without the constant. If this statistic is not signiﬁcant, then e2i is not related to Xi and Xi2 and we can not reject that the variance is constant. Note that if there is no constant in the regression, we run e2i on a constant and Xi2 only, i.e., Xi is no longer in this regression and the degree of freedom of the test is 1. In general, White’s test is based on running e2i on the crossproduct of all the X’s in the regression being estimated, computing nR2 , and comparing it to the critical value of χ2r where r is the number of regressors in this last regression excluding the constant. For the case of two regressors, X2 and X3 and a constant, White’s test is again based on nR2 for the regression of e2i on a constant, X2 , X3 , X22 , X2 X3 , X32 . This statistic is distributed as χ25 . White’s test is standard using EViews. After running the regression, click on residuals tests then choose White. This software gives the user a choice between including or excluding the crossproduct terms like X2 X3 from the regression. This may be useful when there are many regressors. A modiﬁed BreuschPagan test was suggested by Koenker (1981) and Koenker and Bassett (1982). This attempts to improve the power of the BreuschPagan test, and make it more robust
106
Chapter 5: Violations of the Classical Assumptions
to the nonnormality of the disturbances. This amounts to multiplying the BreuschPagan statistic (half the regression sum of squares) by 2 σ 4 , and dividing it by n nthe 2second sample 2 2 2 2 moment of the squared residuals, i.e., (e − σ ) /n, where σ = i=1 i i=1 ei /n. Waldman (1983) showed that if the Zi ’s in the BreuschPagan test are in fact the Xi ’s and their crossproducts, as in White’s test, then this modiﬁed BreuschPagan test is exactly the nR2 statistic proposed by White. White’s (1980) test for heteroskedasticity without specifying its form lead to further work on estimators that are more eﬃcient than OLS while recognizing that the eﬃciency of GLS may not be achievable, see Cragg (1992). Adaptive estimators have been developed by Carroll (1982) and Robinson (1987). These estimators assume no particular form of heteroskedasticity but nevertheless have the same asymptotic distribution as GLS based on the true σ 2i . Many Monte Carlo experiments were performed to study the performance of these and other tests of homoskedasticity. One such extensive study is that of Ali and Giaccotto (1984). Six types of heteroskedasticity speciﬁcations were considered; (i) σ 2i = σ 2 (iv) σ 2i = σ 2 Xi2
(ii) σ 2i = σ 2 Xi  (v) σ 2i = σ 2 [E(Yi )]2
(iii) σ 2i = σ 2 E(Yi ) (vi) σ 2i = σ 2 for i ≤ n/2 and σ 2i = 2σ 2 for i > n/2
Six data sets were considered, the ﬁrst three were stationary and the last three were nonstationary (Stationary versus nonstationary regressors, are discussed in Chapter 14). Five models were entertained, starting with a model with one regressor and no intercept and ﬁnishing with a model with an intercept and 5 variables. Four types of distributions were imposed on the disturbances. These were normal, t, Cauchy and log normal. The ﬁrst three are symmetric, but the last one is skewed. Three sample sizes were considered, n = 10, 25, 40. Various correlations between the disturbances were also entertained. Among the tests considered were tests 1, 2, 5 and 6 discussed in this section. The results are too numerous to summarize, but some of the major ﬁndings are the following: (1) The power of these tests increased with sample size and trendy nature or the variability of the regressors. It also decreased with more regressors and for deviations from the normal distribution. The results were mostly erratic when the errors were autocorrelated. (2) There were ten distributionally robust tests using OLS residuals named TROB which were variants of Glejser’s, White’s and Bickel’s tests. The last one being a nonparametric test not considered in this chapter. These tests were robust to both longtailed and skewed distributions. (3) None of these tests has any signiﬁcant power to detect heteroskedasticity which deviates substantially from the true underlying heteroskedasticity. For example, none of these tests was powerful in detecting heteroskedasticity of the sixth kind, i.e., σ 2i = σ 2 for i ≤ n/2 and σ 2i = 2σ 2 for i > n/2. In fact, the maximum power was 9%. (4) Ali and Giaccotto (1984) recommend any of the TROB tests for practical use. They note that the similarity among these tests is the use of squared residuals rather than the absolute value of the residuals. In fact, they argue that tests of the same form that use absolute value rather than squared residuals are likely to be nonrobust and lack power. Empirical Example: For the Cigarette Consumption Data given in Table 3.2, the OLS regression yields: logC = 4.30 − 1.34 logP + 0.17 logY (0.91) (0.32) (0.20)
¯ 2 = 0.27 R
5.5
Heteroskedasticity
107
0.4
Residuals
0.2
0.0
â€“0.2
â€“0.4 4.4
4.6
4.8
5.0
5.2
LNY Figure 5.1 Plots of Residuals versus Log Y
Suspecting heteroskedasticity, we plotted the residuals from this regression versus logY in Figure 5.1. This ďŹ gure shows the dispersion of the residuals to decrease with increasing logY . Next, we performed several tests for heteroskedasticity studied in this chapter. The ďŹ rst is Glejserâ€™s (1969) test. We ran the following regressions: ei  = 1.16 âˆ’ 0.22 logY (0.46) (0.10) ei  = âˆ’0.95 + 5.13 (logY )âˆ’1 (0.47) (2.23) ei  = âˆ’2.00 + 4.65 (logY )âˆ’0.5 (0.93) (2.04) ei  = 2.21 âˆ’ 0.96 (logY )0.5 (0.93) (0.42) The tstatistics on the slope coeďŹƒcient in these regressions are âˆ’2.24, 2.30, 2.29 and âˆ’2.26, respectively. All are signiďŹ cant with pvalues of 0.03, 0.026, 0.027 and 0.029, respectively, indicating the rejection of homoskedasticity. The second test is the Goldfeld and Quandt (1965) test. The observations are ordered according to logY and c = 12 central observations are omitted. Two regressions are run on the ďŹ rst and last 17 observations. The ďŹ rst regression yields s21 = 0.04881 and the second regression yields s22 = 0.01554. This is a test of equality of variances and it is based on the ratio of two Ď‡2 random variables with 17 âˆ’ 3 = 14 degrees of freedom. In fact, s21 /s22 = 0.04881/0.01554 = 3.141 âˆź F14,14 under H0 . This has a pvalue of 0.02 and rejects H0 at the 5% level. The third test is the Spearman rank correlation test. First one obtains the rank(logYi ) and rank(ei ) and 2 3 compute di = rankei âˆ’ ranklogYi . From these r = 1 âˆ’ 6 46 i=1 di /(n âˆ’ n) = âˆ’0.282 and
108
Chapter 5: Violations of the Classical Assumptions
Table 5.1 White Heteroskedasticity Test Fstatistic Obs*Rsquared Test Equation: Dependent Variable: Method: Sample: Included observations: Variable C LNP LNPˆ2 LNP*LNY LNY LNYˆ2 Rsquared Adjusted Rsquared S.E. of regression Sum squared resid Log likelihood DurbinWatson stat
4.127779 15.65644
Probability Probability
0.004073 0.007897
RESIDˆ2 Least Squares 1 46 46 Coeﬃcient 18.22199 9.506059 1.281141 –2.078635 –7.893179 0.855726 0.340357 0.257902 0.029778 0.035469 99.58660 1.853360
Std. Error 5.374060 3.302570 0.656208 0.727523 2.329386 0.253048
tStatistic 3.390730 2.878382 1.952340 –2.857139 –3.388523 3.381670
Mean dependent var S.D. dependent var Akaike info criterion Schwarz criterion Fstatistic Prob (Fstatistic)
Prob. 0.0016 0.0064 0.0579 0.0068 0.0016 0.0016 0.024968 0.034567 –4.068982 –3.830464 4.127779 0.004073
t = [r 2 (n − 2)/(1 − r 2 )]1/2 = 1.948. This is distributed as a t with 44 degrees of freedom. This tstatistic has a pvalue of 0.058. The fourth test is Harvey’s (1976) multiplicative heteroskedasticity test which is based upon regressing log e2i on log(log Yi ) log e2i = 24.85 − 19.08 log(log Y ) (17.25) (11.03) Harvey’s (1976) statistic divides the regression sum of squares which is 14.360 by 4.9348. This yields 2.91 which is asymptotically distributed as χ21 under the null. This has a pvalue of 0.088 and does not reject the null of homoskedasticity at the 5% signiﬁcance level. The ﬁfth test is the Breusch and Pagan (1979) test which is based on the regression of e2i / σ2 46 2 2 (where σ = i=1 ei /46 = 0.024968) on log Yi . The teststatistic is half the regression sum of squares = (10.971 ÷ 2) = 5.485. This is distributed as χ21 under the null hypothesis. This has a pvalue of 0.019 and rejects the null of homoskedasticity. Finally, White’s (1980) test for heteroskedasticity is performed which is based on the regression of e2i on logP , logY , (logP )2 , (logY )2 , (logP )(logY ) and a constant. This is shown in Table 5.1 using EViews. The teststatistic is nR2 = (46)(0.3404) = 15.66 which is distributed as χ25 . This has a pvalue of 0.008 and rejects the null of homoskedasticity. Except for Harvey’s test, all the tests performed indicate the presence of heteroskedasticity. This is true despite the fact that the data are in logs, and both consumption and income are expressed in per capita terms.
5.6
Autocorrelation
109
Table 5.2 White HeteroskedasticityConsistent Standard Errors Dependent Variable: Method: Sample: Included observations:
LNC Least Squares 1 46 46
White HeteroskedasticityConsistent Standard Errors & Covariance Variable C LNP LNY
Coeﬃcient 4.299662 –1.338335 0.172386
Rsquared Adjusted Rsquared S.E. of regression Sum squared resid Log likelihood DurbinWatson stat
0.303714 0.271328 0.163433 1.148545 19.60218 2.315716
Std. Error 1.095226 0.343368 0.236610
tStatistic 3.925821 –3.897671 0.728565
Mean dependent var S.D. dependent var Akaike info criterion Schwarz criterion Fstatistic Prob (Fstatistic)
Prob. 0.0003 0.0003 0.4702 4.847844 0.191458 –0.721834 –0.602575 9.378101 0.000417
White’s heteroskedasticityconsistent estimates of the variances are as follows: logC = 4.30 − 1.34 logP + 0.17 logY (1.10) (0.34) (0.24) These are given in Table 5.2 using EViews. Note that in this case all of the heteroskedasticityconsistent standard errors are larger than those reported using a standard OLS package, but this is not necessarily true for other data sets. In section 5.4, we described the Jarque and Bera (1987) test for normality of the disturbances. For this cigarette consumption regression, Figure 5.2 gives the histogram of the residuals along with descriptive statistics of these residuals including their mean, median, skewness and kurtosis. This is done using EViews. The measure of skewness S is estimated to be −0.184 and the measure of kurtosis κ is estimated to be 2.875 yielding a JarqueBera statistic of
(2.875 − 3)2 (−0.184)2 = 0.29. + JB = 46 6 24 This is distributed as χ22 under the null hypothesis of normality and has a pvalue of 0.865. Hence we do not reject that the distribution of the disturbances is symmetric and has a kurtosis of 3.
5.6
Autocorrelation
Violation of assumption 3 means that the disturbances are correlated, i.e., E(ui uj ) = σ ij = 0, for i = j, and i, j = 1, 2, . . . , n. Since ui has zero mean, E(ui uj ) = cov(ui , uj ) and this is denoted by σ ij . This correlation is more likely to occur in timeseries than in crosssection studies. Consider estimating the consumption function of a random sample of households. An unexpected event, like a visit of family members will increase the consumption of this household. However, this
110
Chapter 5: Violations of the Classical Assumptions
10
Series: Residuals Sample 1 46 Observations 46
8
Mean Median Maximum Minimum Std. Dev. Skewness Kurtosis
6
4
–9.90E16 0.007568 0.328677 –0.418675 0.159760 –0.183945 2.875020
2
JarqueBera Probability
0.289346 0.865305
0 –0.4
–0.3
–0.2
–0.1
0.0
0.1
0.2
0.3
Figure 5.2 Normality Test (JarqueBera)
positive disturbance need not be correlated to the disturbances affecting consumption of other randomly drawn households. However, if we were estimating this consumption function using aggregate timeseries data for the U.S., then it is very likely that a recession year affecting consumption negatively this year may have a carry over effect to the next few years. A shock to the economy like an oil embargo in 1973 is likely to affect the economy for several years. A labor strike this year may affect production for the next few years. Therefore, we will switch the i and j subscripts to t and s denoting timeseries observations t, s = 1, 2, . . . , T and the sample size will be denoted by T rather than n. This covariance term is symmetric, so that σ 12 = E(u1 u2 ) = E(u2 u1 ) = σ 21 . Hence, only T (T − 1)/2 distinct σ ts ’s have to be estimated. For example, if T = 3, then σ 12 , σ 13 and σ 23 are the distinct covariance terms. However, it is hopeless to estimate T (T − 1)/2 covariances (σ ts ) with only T observations. Therefore, more structure on these σ ts ’s need to be imposed. A popular assumption is that the ut ’s follow a ﬁrstorder autoregressive process denoted by AR(1): ut = ρut−1 + ǫt
t = 1, 2, . . . , T
(5.26)
where ǫt is IID(0, σ 2ǫ ). It is autoregressive because ut is related to its lagged value ut−1 . One can also write (5.26), for period t − 1, as ut−1 = ρut−2 + ǫt−1
(5.27)
and substitute (5.27) in (5.26) to get ut = ρ2 ut−2 + ρǫt−1 + ǫt
(5.28)
Note that the power of ρ and the subscript of u or ǫ always sum to t. By continuous substitution of this form, one ultimately gets ut = ρt u0 + ρt−1 ǫ1 + .. + ρǫt−1 + ǫt
(5.29)
5.6
Autocorrelation
111
This means that ut is a function of current and past values of ǫt and u0 where u0 is the initial value of ut . If u0 has zero mean, then ut has zero mean. This follows from (5.29) by taking expectations. Also, from (5.26) var(ut ) = ρ2 var(ut−1 ) + var(ǫt ) + 2ρcov(ut−1 , ǫt )
(5.30)
Using (5.29), ut−1 is a function of ǫt−1 , past values of ǫt−1 and u0 . Since u0 is independent of the ǫ’s, and the ǫ’s are themselves not serially correlated, then ut−1 is independent of ǫt . This means that cov(ut−1 , ǫt ) = 0. Furthermore, for ut to be homoskedastic, var(ut ) = var(ut−1 ) = σ 2u , and (5.30) reduces to σ 2u = ρ2 σ 2u + σ 2ǫ , which when solved for σ 2u gives: σ 2u = σ 2ǫ /(1 − ρ2 )
(5.31)
Hence, u0 ∼ (0, σ 2ǫ /(1 − ρ2 )) for the u’s to have zero mean and homoskedastic disturbances. Multiplying (5.26) by ut−1 and taking expected values, one gets E(ut ut−1 ) = ρE(u2t−1 ) + E(ut−1 ǫt ) = ρσ 2u
(5.32)
since E(u2t−1 ) = σ 2u and E(ut−1 ǫt ) = 0. Therefore, cov(ut , ut−1 ) = ρσ 2u , and the correlation coef ﬁcient between ut and ut−1 is correl(ut , ut−1 ) = cov(ut , ut−1 )/ var(ut )var(ut−1 ) = ρσ 2u /σ 2u = ρ. Since ρ is a correlation coeﬃcient, this means that −1 ≤ ρ ≤ 1. In general, one can show that cov(ut , us ) = ρt−s σ 2u
t, s = 1, 2, . . . , T
(5.33)
see problem 6. This means that the correlation between ut and ut−r is ρr , which is a fraction raised to an integer power, i.e., the correlation is decaying between the disturbances the further apart they are. This is reasonable in economics and may be the reason why this autoregressive form (5.26) is so popular. One should note that this is not the only form that would correlate the disturbances across time. In Chapter 14, we will consider other forms like the Moving Average (MA) process, and higher order Autoregressive Moving Average (ARMA) processes, but these are beyond the scope of this chapter. Consequences for OLS How is the OLS estimator affected by the violation of the no autocorrelation assumption among the disturbances? The OLS estimator is still unbiased and consistent since these properties rely on assumptions 1 and 4 and have nothing to do with assumption 3. For the simple linear regression, using (5.2), the variance of β OLS is now T T T (5.34) var(β OLS ) = var( t=1 wt ut ) = t=1 s=1 wt ws cov(ut , us ) 2 T 2 t−s 2 = σ u / t=1 xt + wt ws ρ σu t=s
where cov(ut , us ) = ρt−s σ 2u as explained in (5.33). Note that the ﬁrst term in (5.34) is the usual variance of β OLS under the classical case. The second term in (5.34) arises because of the correlation between the ut ’s. Hence, the variance of OLS computed from a regression package, i.e., s2 / Tt=1 x2t is a wrong estimate of the variance of β OLS for two reasons. First, it is using T 2 2 the wrong formula for the variance, i.e., σ u / t=1 xt rather than (5.34). The latter depends on ρ through the extra term in (5.34). Second, one can show, see problem 7, that E(s2 ) = σ 2u and will
112
Chapter 5: Violations of the Classical Assumptions
involve ρ as well as σ 2u . Hence, s2 is not unbiased for σ 2u and s2 / Tt=1 x2t is a biased estimate of var(β In fact, if ρ OLS ). The direction and magnitude of this bias depends on ρ and the regressor. is positive, and the xt ’s are themselves positively autocorrelated, then s2 / Tt=1 x2t understates the true variance of β OLS . This means that the conﬁdence interval for β is tighter than it should be and the tstatistic for H0 ; β = 0 is overblown, see problem 8. As in the heteroskedastic case, but for completely different reasons, any inference based on var(β OLS ) reported from the standard regression packages will be misleading if the ut ’s are serially correlated. Newey and West (1987) suggested a simple heteroskedasticity and autocorrelationconsistent covariance matrix for the OLS estimator without specifying the functional form of the serial correlation. The basic idea extends White’s (1980) replacement of heteroskedastic variances with squared OLS residuals e2t by additionally including products of least squares residuals et et−s for s = 0, ±1, . . . , ±p where p is the maximum order of serial correlation we are willing to assume. The consistency of this procedure relies on p being very small relative to the number of observations T . This is consistent with popular serial correlation speciﬁcations considered in this chapter where the autocorrelation dies out quickly as j increases. Newey and West (1987) allow the higher order covariance terms to receive diminishing weights. This NeweyWest option for the least squares estimator is available using EViews. Andrews (1991) warns about the unreliability of such standard error corrections in some circumstances. Wooldridge (1991) shows that it is possible to construct serially correlated robust F statistics for testing joint hypotheses as considered in Chapter 4. However, these are beyond the scope of this book. Is OLS still BLUE? In order to determine the BLU estimator in this case, we lag the regression equation once, multiply it by ρ, and subtract it from the original regression equation, we get Yt − ρYt−1 = α(1 − ρ) + β(Xt − ρXt−1 ) + ǫt
t = 2, 3, . . . , T
(5.35)
This transformation, known as the CochraneOrcutt (1949) transformation, reduces the disturbances to classical errors. Therefore, OLS on the resulting regression renders the estimates t = Xt − ρXt−1 , for t = 2, 3, . . . , T . Note BLU, i.e., run Yt = Yt − ρYt−1 on a constant and X that we have lost one observation by lagging, and the resulting estimators are BLUE only for linear combinations of (T − 1) observations in Y .1 Prais and Winsten (1954) derive the BLU estimators for linear combinations of T observations in Y . This entails recapturing the initial observation as follows: (i) Multiply the ﬁrst observation of the regression equation by 1 − ρ2 ; 1 − ρ2 Y1 = α 1 − ρ2 + β 1 − ρ2 X1 + 1 − ρ2 u1 (ii) add this transformed initial observation to the CochraneOrcutt transformed observations for t = 2, . . . , T and run the regression on the T observations rather than the (T −1) observations. See Chapter 9, for a formal proof of this result. Note that Y1 = 1 − ρ2 Y1
and
Yt = Yt − ρYt−1 for t = 2, . . . , T 1 = 1 − ρ2 X1 and X t = Xt − ρXt−1 for t = 2, . . . , T . The constant variable Ct = 1 Similarly, X t = (1 − ρ) t which takes the values C 1 = 1 − ρ2 and C for t = 1, . . . , T is now a new variable C t and X t for t = 2, . . . , T . Hence, the PraisWinsten procedure is the regression of Yt on C
5.6
Autocorrelation
113
without a constant. It is obvious that the resulting BLU estimators will involve ρ and are therefore, different from the usual OLS estimators except in the case where ρ = 0. Hence, OLS is no longer BLUE. Furthermore, we need to know ρ in order to obtain the BLU estimators. In applied work, ρ is not known and has to be estimated, in which case the PraisWinsten regression is no longer BLUE since it is based on an estimate of ρ rather than the true ρ itself. However, as long as ρ is a consistent estimate for ρ then this is a suﬃcient condition for the corresponding estimates of α and β in the next step to be asymptotically eﬃcient, see Chapter 9. We now turn to various methods of estimating ρ. (1) The CochraneOrcutt (1949) Method: This method starts with an initial estimate of ρ, the most convenient is 0, and minimizes the residual sum of squares in (5.35). This gives us the OLS estimates of α and β. Then we substitute α OLS and β OLS in (5.35) and we get et = ρet−1 + ǫt
t = 2, . . . , T
(5.36)
where et denotes the OLS residual. An estimate of ρ can be obtained by minimizing the residual sum of squares in (5.36) or running the regression of et on et−1 without a constant. The resulting estimate of ρ is ρco = Tt=2 et et−1 / Tt=2 e2t−1 where both summations run over t = 2, 3, . . . , T . The second step of the CochraneOrcutt procedure (2SCO) is to perform the regression in (5.35) with ρco instead of ρ. One can iterate this procedure (ITCO) by computing new residuals based on the new estimates of α and β and hence a new estimate of ρ from (5.36), and so on, until convergence. Both the 2SCO and the ITCO are asymptotically eﬃcient, the argument for iterating must be justiﬁed in terms of small sample gains. (2) The HilderthLu (1960) Search Procedure: ρ is between −1 and 1. Therefore, this procedure searches over this range, i.e., using values of ρ say between −0.9 and 0.9 in intervals of 0.1. For each ρ, one computes the regression in (5.35) and reports the residual sum of squares corresponding to that ρ. The minimum residual sum of squares gives us our choice of ρ and the corresponding regression gives us the estimates of α, β and σ 2 . One can reﬁne this procedure around the best ρ found in the ﬁrst stage of the search. For example, suppose that ρ = 0.6 gave the minimum residual sum of squares, one can search next between 0.51 and 0.69 in intervals of 0.01. This search procedure guards against a local minimum. Since the likelihood in this case contains ρ as well as σ 2 and α and β, this search procedure can be modiﬁed to maximize the likelihood rather than minimize the residual sum of squares, since the two criteria are no longer equivalent. The maximum value of the likelihood will give our choice of ρ and the corresponding estimates of α, β and σ 2 . (3) Durbin’s (1960) Method: One can rearrange (5.35) by moving Yt−1 to the right hand side, i.e., Yt = ρYt−1 + α(1 − ρ) + βXt − ρβXt−1 + ǫt
(5.37)
and running OLS on (5.37). The error in (5.37) is classical, and the presence of Yt−1 on the right hand side reminds us of the contemporaneously uncorrelated case discussed under the violation of assumption 4. For that violation, we have shown that unbiasedness is lost, but not consistency. Hence, the estimate of ρ as a coeﬃcient of Yt−1 is biased but consistent. This is the Durbin estimate of ρ, call it ρD . Next, the second step of the CochraneOrcutt procedure is performed using this estimate of ρ.
114
Chapter 5: Violations of the Classical Assumptions
(4) BeachMacKinnon (1978) Maximum Likelihood Procedure: Beach and MacKinnon (1978) derived a cubic equation in ρ which maximizes the likelihood function concentrated with respect to α, β, and σ 2 . With this estimate of ρ, denoted by ρBM , one performs the PraisWinsten procedure in the next step. Correcting for serial correlation is not without its critics. Mizon (1995) argues this point forcefully in his article entitled “A simple message for autocorrelation correctors: Don’t.” The main point being that serial correlation is a symptom of dynamic misspeciﬁcation which is better represented using a general unrestricted dynamic speciﬁcation. Monte Carlo Results Rao and Griliches (1969) performed a Monte Carlo study using an autoregressive Xt , and various values of ρ. They found that OLS is still a viable estimator as long as ρ < 0.3, but if ρ > 0.3, then it pays to perform procedures that correct for serial correlation based on an estimator of ρ. Their recommendation was to compute a Durbin’s estimate of ρ in the ﬁrst step and to do the PraisWinsten procedure in the second step. Maeshiro (1976, 1979) found that if the Xt series is trended, which is usual with economic data, then OLS outperforms 2SCO, but not the twostep PraisWinsten (2SPW) procedure that recaptures the initial observation. These results were conﬁrmed by Park and Mitchell (1980) who performed an extensive Monte Carlo using trended and untrended Xt ’s. Their basic ﬁndings include the following: (i) For trended Xt ’s, OLS beats 2SCO, ITCO and even a CochraneOrcutt procedure that is based on the true ρ. However, OLS was beaten by 2SPW, iterative PraisWinsten (ITPW), and BeachMacKinnon (BM). Their conclusion is that one should not use regressions based on (T − 1) observations as in Cochrane and Orcutt. (ii) Their results ﬁnd that the ITPW procedure is the recommended estimator beating 2SPW and BM for high values of true ρ, for both trended as well as nontrended Xt ’s. (iii) Test of hypotheses regarding the regression coeﬃcients performed miserably for all estimators based on an estimator of ρ. The results indicated less bias in standard error estimation for ITPW, BM and 2SPW than OLS. However, the tests based on these standard errors still led to a high probability of type I error for all estimation procedures. Testing for Autocorrelation So far, we have studied the properties of OLS under the violation of assumption 3. We have derived asymptotically eﬃcient estimators of the coeﬃcients based on consistent estimators of ρ and studied their small sample properties using Monte Carlo experiments. Next, we focus on the problem of detecting this autocorrelation between the disturbances. A popular diagnostic for detecting such autocorrelation is the Durbin and Watson (1951) statistic2 d=
T
t=2 (et
− et−1 )2 /
T
2 t=1 et
(5.38)
If this was based on the true ut ’s and T was very large then d can be shown to tend in the limit as T gets large to 2(1 − ρ), see problem 9. This means that if ρ → 0, then d → 2; if ρ → 1, then d → 0 and if ρ → −1, then d → 4. Therefore, a test for H0 ; ρ = 0, can be based on whether d is close to 2 or not. Unfortunately, the critical values of d depend upon the Xt ’s, and these vary from one data set to another. To get around this, Durbin and Watson established upper (dU ) and lower (dL ) bounds for this critical value. Figure 5.3 shows these bounds.
5.6
Autocorrelation
115
f(d)
Reject H 0
Inconclusive
Inconclusive
Do not reject H 0
Reject H 0
d 0
d1
du
2
4– d u
4– d1
4
Figure 5.3 DurbinWatson Critical Values
It is obvious that if the observed d is less than dL , or larger than 4 − dL , we reject H0 . If the observed d is between dU and 4 − dU , then we do not reject H0 . If d lies in any of the two indeterminant regions, then one should compute the exact critical values depending on Xt . Most regression packages report the DurbinWatson statistic. SHAZAM gives the exact pvalue for this dstatistic. If one is interested in a single sided test, say H0 ; ρ = 0 versus H1 ; ρ > 0 then one would reject H0 if d < dL , and not reject H0 if d > dU . If dL < d < dU , then the test is inconclusive. Similarly for testing H0 ; ρ = 0 versus H1 ; ρ < 0, one computes (4 − d) and follow the steps for testing against positive autocorrelation. Durbin and Watson tables for dL and dU covered samples sizes from 15 to 100 and a maximum of 5 regressors. Savin and White (1977) extended these tables for 6 ≤ T ≤ 200 and up to 10 regressors. The DurbinWatson statistic has several limitations. We discussed the inconclusive region and the computation of exact critical values. The DurbinWatson statistic is appropriate when there is a constant in the regression. In case there is no constant in the regression, see Farebrother (1980). Also, the DurbinWatson statistic is inappropriate when there are lagged values of the dependent variable among the regressors. We now turn to an alternative test for serial correlation that does not have these limitations and that is also easy to apply. This test was derived by Breusch (1978) and Godfrey (1978) and is known as the BreuschGodfrey test for zero ﬁrstorder serial correlation. This is a Lagrange Multiplier test that amounts to running the regression of the OLS residuals et on et−1 and the original regressors in the model. The test statistic is T R2 . Its distribution under the null is χ21 . In this case, the regressors are a constant and Xt , and the test checks whether the coeﬃcient of et−1 is signiﬁcant. The beauty of this test is that (i) it is the same test for ﬁrstorder serial correlation, whether the disturbances are Moving Average of order one MA(1) or AR(1). (ii) This test is easily generalizable to higher autoregressive or Moving Average schemes. For secondorder serial correlation, like MA(2) or AR(2) one includes two lags of the residuals on the right hand side; i.e., both et−1 and et−2 . (iii) This test is still valid even when lagged values of the dependent variable are present among the regressors, see Chapter 6. The Breusch and Godfrey test is standard using EViews and it prompts the user
116
Chapter 5: Violations of the Classical Assumptions
with a choice of the number of lags of the residuals to include among the regressors to test for serial correlation. You click on residuals, then tests and choose BreuschGodfrey. Next, you input the number of lagged residuals you want to include. In conclusion, we focus on ﬁrst differencing the data as a possible solution for getting rid of serial correlation in the errors. Some economic behavioral equations have variables in ﬁrst difference form, but other equations are ﬁrst differenced for estimation purposes. In the latter case, if the original disturbances were not autocorrelated, (or even correlated, with ρ = 1), then the transformed disturbances are serially correlated. After all, ﬁrst differencing the disturbances is equivalent to setting ρ = 1 in ut − ρut−1 , and this new disturbance u∗t = ut − ut−1 has ut−1 in common with u∗t−1 = ut−1 − ut−2 , making E(u∗t u∗t−1 ) = −E(u2t−1 ) = −σ 2u . However, one could argue that if ρ is large and positive, ﬁrst differencing the data may not be a bad solution. Rao and Miller (1971) calculated the variance of the BLU estimator correcting for serial correlation, for various guesses of ρ. They assume a true ρ of 0.2, and an autoregressive Xt Xt = λXt−1 + wt
with λ = 0, 0.4, 0.8.
(5.39)
They ﬁnd that OLS (or a guess of ρ = 0), performs better than ﬁrst differencing the data, and is pretty close in terms of eﬃciency to the true BLU estimator for trended Xt (λ = 0.8). However, the performance of OLS deteriorates as λ declines to 0.4 and 0, with respect to the true BLU estimator. This supports the Monte Carlo ﬁnding by Rao and Griliches that for ρ < 0.3, OLS performs reasonably well relative to estimators that correct for serial correlation. However, the ﬁrstdifference estimator, i.e., a guess of ρ = 1, performs badly for trended Xt (λ = 0.8) giving the worst eﬃciency when compared to any other guess of ρ. Only when the Xt ’s are less trended (λ = 0.4) or random (λ = 0), does the eﬃciency of the ﬁrstdifference estimator improve. However, even for those cases one can do better by guessing ρ. For example, for λ = 0, one can always do better than ﬁrst differencing by guessing any positive ρ less than 1. Similarly, for true ρ = 0.6, a higher degree of serial correlation, Rao and Miller (1971) show that the performance of OLS deteriorates, while that of the ﬁrst difference improves. However, one can still do better than ﬁrst differencing by guessing in the interval (0.4, 0.9). This gain in eﬃciency increases with trended Xt ’s. Empirical Example: Table 5.3 gives the U.S. Real Personal Consumption Expenditures (C) and Real Disposable Personal Income (Y ) from the Economic Report of the President over the period 19501993. This data set is available as CONSUMP.DAT on the Springer web site. The OLS regression yields: Ct = −65.80 + 0.916 Yt + residuals (90.99) (0.009) Figure 5.4 plots the actual, ﬁtted and residuals using EViews. This shows positive serial correlation with a string of positive residuals followed by a string of negative residuals followed by positive residuals. The DurbinWatson statistic is d = 0.461 which is much smaller than the lower bound dL = 1.468 for T = 44 and one regressor. Therefore, we reject the null hypothesis of H0 ; ρ = 0 at the 5% signiﬁcance level. Regressing OLS residuals on their lagged values yields et = 0.792 et−1 + residuals (0.106)
5.6
Autocorrelation
117
Table 5.3 U.S. Consumption Data, 1950–1993 C = Real Personal Consumption Expenditures (in 1987 dollars) Y = Real Disposable Personal Income (in 1987 dollars) YEAR
Y
C
YEAR
Y
C
1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971
6284 6390 6476 6640 6628 6879 7080 7114 7113 7256 7264 7382 7583 7718 8140 8508 8822 9114 9399 9606 9875 10111
5820 5843 5917 6054 6099 6325 6440 6465 6449 6658 6698 6740 6931 7089 7384 7703 8005 8163 8506 8737 8842 9022
1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
10414 11013 10832 10906 11192 11406 11851 12039 12005 12156 12146 12349 13029 13258 13552 13545 13890 14005 14101 14003 14279 14341
9425 9752 9602 9711 10121 10425 10744 10876 10746 10770 10782 11179 11617 12015 12336 12568 12903 13029 13093 12899 13110 13391
Source: Economic Report of the President
The second step of the CochraneOrcutt (1949) procedure based on ρ = 0.792 yields the following regression: (Ct − 0.792Ct−1 ) = 168.49 + 0.926 (Yt − 0.792Yt−1 ) + residuals (301.7) (0.027)
The twostep PraisWinsten (1954) procedure yields ρP W = 0.707 with standard error (0.110) and the regression estimates are given by Ct = −52.63 + 0.917 Yt + residuals (181.9) (0.017)
Iterative PraisWinsten yields Ct = −48.40 + 0.916 Yt + residuals (191.1) (0.018) The MaximumLikelihood procedure yields ρM LE = 0.788 with standard error (0.11) and the resulting estimates are given by Ct = −24.82 + 0.915 Yt + residuals (233.5) (0.022)
118
Chapter 5: Violations of the Classical Assumptions
14000 12000 10000 400
8000
200
6000 4000
0 –200 –400 50
55
60
65
70
Residual
75
80
Actual
85
90
Fitted
Figure 5.4 Consumption and Disposable Income
Table 5.4 BreuschGodfrey LM Test Fstatistic Obs*Rsquared Test Equation: Dependent Variable: Method:
53.45697 24.90136
Probability Probability
0.00000 0.000001
RESID Least Squares
Presample missing value lagged residuals set to zero Variable
Coeﬃcient
Std. Error
tStatistic
Prob.
C Y RESID(–1)
–31.12035 0.003641 0.800205
60.82348 0.005788 0.109446
–0.511650 0.629022 7.311428
0.6116 0.5328 0.0000
Rsquared Adjusted Rsquared S.E. of regression Sum squared resid Log likelihood DurbinWatson stat
0.565940 0.544766 102.4243 430120.2 –264.5612 1.973552
Mean dependent var S.D. dependent var Akaike info criterion Schwarz criterion Fstatistic Prob (Fstatistic)
9.83E13 151.8049 12.16187 12.28352 26.72849 0.000000
Notes
119
Table 5.5 NeweyWest Standard Errors Dependent Variable: Method: Sample: Included observations:
CONSUM Least Squares 1950 1993 44
NeweyWest HAC Standard Errors & Covariance (lag truncation=3) Variable C Y
Coeﬃcient –65.79582 0.915623
Std. Error 133.3454 0.015458
tStatistic –0.493424 59.23190
Rsquared Adjusted Rsquared S.E. of regression Sum squared resid Log likelihood DurbinWatson stat
0.996267 0.996178 153.6015 990923.1 –282.9218 0.460778
Mean dependent var S.D. dependent var Akaike info criterion Schwarz criterion Fstatistic Prob (Fstatistic)
Prob. 0.6243 0.0000 9250.545 2484.624 12.95099 13.03209 11209.21 0.000000
Durbin’s (1960) Method yields the following regression Ct = 0.80 Ct−1 − 40.79 + 0.72 Yt − 0.53 Yt−1 + residuals (0.10) (59.8) (0.09) (0.13) Therefore, Durbin’s estimate of ρ is given by ρD = 0.80. The Breusch (1978) and Godfrey (1978) regression that tests for ﬁrstorder serial correlation is given in Table 5.4. This yields et = −31.12 + 0.004 Yt + 0.800 et−1 + residuals (60.82) (0.006) (0.109) The test statistic is T R2 which yields 43 × (0.565) = 24.9. This is distributed as χ21 under H0 ; ρ = 0. This rejects the null hypothesis of no serial correlation with a pvalue of 0.000001 shown in Table 5.4. The NeweyWest heteroskedasticity and autocorrelationconsistent standard errors for least squares with a threeyear lag truncation are given by Ct = −65.80 + 0.916 Yt + residuals (133.3) (0.015) This is given in Table 5.5 using EViews. Note that both standard errors are now larger than those reported by least squares. But once again, this is not necessarily the case for other data sets.
Notes 1. A computational warning is in order when one is applying the CochraneOrcutt transformation to crosssection data. Timeseries data has a natural ordering which is generally lacking in crosssection data. Therefore, one should be careful in applying the CochraneOrcutt transformation to crosssection data since it is not invariant to the ordering of the observations.
120
Chapter 5: Violations of the Classical Assumptions
2. Another test for serial correlation can be obtained as a byproduct of maximum likelihood estimation. The maximum likelihood estimator of ρ has a normal limiting distribution with mean ρ and variance (1 − ρ2 )/T . Hence, one can compute ρMLE /[(1 − ρ2MLE )/T ]1/2 and compare it to critical values from the normal distribution.
Problems 1. For the simple linear regression with heteroskedasticity, i.e., E(u2i ) = σ 2i , show that E(s2 ) is a function of the σ 2i ’s? 2. For the simple linear regression with heteroskedasticity of the form E(u2i ) = σ 2i = bx2i where b > 0, n show that E(s2 / i=1 x2i ) understates the variance of β OLS which is n n 2 2 2 2 i=1 xi ) . i=1 xi σ i /(
3. Weighted Least Squares. This is based on Kmenta (1986).
(a) Solve the two equations in (5.11) and show that the solution is given by (5.12). (b) Show that n 2 i=1 (1/σ i ) n n n 2 2 2 [ i=1 Xi /σ i ][ i=1 (1/σi )] − [ i=1 (Xi /σ 2i )]2 n ∗ w i=1 i = n ( i=1 wi∗ Xi2 )( ni=1 wi∗ ) − ( ni=1 wi∗ Xi )2 1 = n ∗ ¯∗ 2 i=1 wi (Xi − X ) ¯ ∗ = n w∗ Xi / n w∗ . where wi∗ = (1/σ 2i ) and X i=1 i i=1 i var(β)
=
4. Relative Efficiency of OLS Under Heteroskedasticity. Consider the simple linear regression with heteroskedasticity of the form σ 2i = σ 2 Xiδ where Xi = 1, 2, . . . , 10. (a) Compute var(β OLS ) for δ = 0.5, 1, 1.5 and 2. (b) Compute var(β BLUE ) for δ = 0.5, 1, 1.5 and 2. (c) Compute the eﬃciency of β OLS = var(β BLUE )/var(β OLS ) for δ = 0.5, 1, 1.5 and 2. What happens to this eﬃciency measure as δ increases?
5. Consider the simple regression with only a constant yi = α + ui for i = 1, 2, . . . , n; where the ui ’s are independent with mean zero and var(ui ) = σ 21 for i = 1, 2, . . . , n1 ; and var(ui ) = σ 22 for i = n1 + 1, . . . , n1 + n2 with n = n1 + n2 . (a) Derive the OLS estimator of α along with its mean and variance. (b) Derive the GLS estimator of α along with its mean and variance. (c) Obtain the relative eﬃciency of OLS with respect to GLS. Compute their relative eﬃciency for various values of σ 22 /σ 21 = 0.2, 0.4, 0.6, 0.8, 1, 1.25, 1.33, 2.5, 5; and n1 /n = 0.2, 0.3, 0.4, . . . , 0.8. Plot this relative eﬃciency. (d) Assume that ui is N (0, σ 21 ) for i = 1, 2, . . . , n1 ; and N (0, σ 21 ) for i = n1 + 1, . . . , n1 + n2 ; with ui ’s being independent. What is the maximum likelihood estimator of α, σ21 and σ 22 ? (e) Derive the LR test for testing H0 ; σ 21 = σ 22 in part (d).
Problems
121
6. Show that for an AR(1) model given in (5.26), E(ut us ) = ρt−s σ 2u for t, s = 1, 2, . . . , T . 7. Relative Efficiency of OLS Under the AR(1) Model. This problem is based on Johnston (1984, pp. 310312). For the simple regression without a constant yt = βxt + ut with ut = ρut−1 + ǫt and ǫt ∼ IID(0, σ 2ǫ ) (a) Show that var(β OLS )
σ 2u T
=
1 + 2ρ
2 t=1 xt
T −1
+ . . . + 2ρ
T −1
t=1 xt xt+1 T 2 t=1 xt
x1 xT T 2 t=1 xt
+ 2ρ2
!
T −2
t=1 xt xt+2 T 2 t=1 xt
and that the PraisWinsten estimator β P W has variance σ 2u 1 − ρ2 var(β P W ) = T T −1 T 2 2 1 + ρ2 − 2ρ t=1 xt xt+1 / t=1 xt t=1 xt
These expressions are easier to prove using matrix algebra, see Chapter 9.
(b) Let xt itself follow an AR(1) scheme with parameter λ, i.e., xt = λxt−1 + vt , and let T → ∞. Show that var(β 1 − ρ2 PW ) = lim asy eff(β OLS ) = 2 T →∞ var(β (1 + ρ − 2ρλ)(1 + 2ρλ + 2ρ2 λ2 + . . .) OLS ) =
(1 − ρ2 )(1 − ρλ) (1 + ρ2 − 2ρλ)(1 + ρλ)
(c) Tabulate this asy eff(β OLS ) for various values of ρ and λ where ρ varies between −0.9 to +0.9 in increments of 0.1, while λ varies between 0 and 0.9 in increments of 0.1. What do you conclude? How serious is the loss in eﬃciency in using OLS rather than the PW procedure? T (d) Ignoring this autocorrelation one would compute σ 2u / t=1 x2t as the var(β OLS ). The difference between this wrong formula and that derived in part (a) gives us the bias in estimating the variance of β OLS . Show that as T → ∞, this asymptotic proportionate bias is given by −2ρλ/(1 + ρλ). Tabulate this asymptotic bias for various values of ρ and λ as in part (c). What do you conclude? How serious is the asymptotic bias of using the wrong variances for β OLS when the disturbances are ﬁrstorder autocorrelated? (e) Show that
2
E(s ) =
σ 2u
T−
1 + 2ρ T −1
+ . . . + 2ρ
T −1
t=1 xt xt+1 T 2 t=1 xt
x1 xT T 2 t=1 xt
!
+ 2ρ2
/(T − 1)
T −2
t=1 xt xt+2 T 2 t=1 xt
Conclude that if ρ = 0, then E(s2 ) = σ 2u . If xt follows an AR(1) scheme with parameter λ, then for a large T , we get 1 + ρλ E(s2 ) = σ 2u T − /(T − 1) 1 − ρλ Compute this E(s2 ) for T = 101 and various values of ρ and λ as in part (c). What do you conclude? How serious is the bias in using s2 as an unbiased estimator for σ 2u ?
122
Chapter 5: Violations of the Classical Assumptions
8. For the AR(1) model given in (5.26), show that if ρ > 0 and the xt ’s are positively autocorrelated that E(s2 / x2t ) understates the var(β OLS ) given in (5.34). 9. Show that for the AR(1) model, the DurbinWatson statistic has plimd → 2(1 − ρ).
10. Regressions with NonZero Mean Disturbances. Consider the simple regression with a constant Yi = α + βXi + ui
i = 1, 2, . . . , n
where α and β are scalars and ui is independent of the Xi ’s. Show that: (a) If the ui ’s are independent and identically gamma distributed with f (ui ) = where ui ≥ 0 and θ > 0, then α OLS − s2 is unbiased for α.
θ−1 −ui 1 e Γ(θ) ui
(b) If the ui ’s are independent and identically χ2 distributed with ν degrees of freedom, then α OLS − s2 /2 is unbiased for α. (c) If the ui ’s are independent and identically exponentially distributed with f(ui ) = where ui ≥ 0 and θ > 0, then α OLS − s is consistent for α.
1 −ui /θ θe
11. The Heteroskedastic Consequences of an Arbitrary Variance for the Initial Disturbance of an AR(1) Model. This is based on Baltagi and Li (1990, 1992). Consider a simple AR(1) model ut = ρut−1 + ǫt
t = 1, 2, . . . , T
ρ < 1
with ǫt ∼ IID(0, σ 2ǫ ) independent of u0 ∼ (0, σ 2ǫ /τ ), and τ is an arbitrary positive parameter. (a) Show that this arbitrary variance on the initial disturbance u0 renders the disturbances, in general, heteroskedastic. (b) Show that var(ut ) = σ 2t is increasing if τ > (1 − ρ2 ) and decreasing if τ < (1 − ρ2 ). When is the process homoskedastic? (c) Show that cov(ut , ut−s ) = ρs σ 2t−s for t ≥ s. Hint: See the solution by Kim (1991).
(d) Consider the simple regression model yt = βxt + ut
t = 1, 2 . . . , T
with ut following the AR(1) process described above. Consider the common case where ρ > 0 and the xt ’s are positively autocorrelated. For this case, it is a standard result that 2 the var(β OLS ) is understated under the stationary case (i.e., (1 − ρ ) = τ ), see problem 8. This means that OLS rejects too often the hypothesis H0 ; β = 0. Show that OLS will reject more often than the stationary case if τ < 1 − ρ2 and less often than the stationary case if τ > (1 − ρ2 ). Hint: See the solution by Koning (1992). 12. ML Estimation of Linear Regression Model with AR(1) Errors and Two Observations. This is based on Magee (1993). Consider the regression model yi = xi β + ui , with only two observations x2  are scalars. Assume that ui ∼ N (0, σ 2 ) and u2 = ρu1 + ǫ i = 1, 2, and the nonstochastic x1  = 2 2 with ρ < 1. Also, ǫ ∼ N [0, (1 − ρ )σ ] where ǫ and u1 are independent. (a) Show that the OLS estimator of β is (x1 y1 + x2 y2 )/(x21 + x22 ). (b) Show that the ML estimator of β is (x1 y1 − x2 y2 )/(x21 − x22 ).
(c) Show that the ML estimator of ρ is 2x1 x2 /(x21 + x22 ) and thus is nonstochastic.
(d) How do the ML estimates of β and ρ behave as x1 → x2 and x1 → −x2 ? Assume x2 = 0. Hint: See the solution by Baltagi and Li (1995). 13. For the empirical example in section 5.5 based on the Cigarette Consumption Data in Table 3.2.
Problems
123
(a) Replicate the OLS regression of logC on logP , logY and a constant. Plot the residuals versus logY and verify Figure 3.2. (b) Run Glejser’s (1969) test by regressing ei  the absolute value of the residuals from part (a), on (logYi )δ for δ = 1, −1, −0.5 and 0.5. Verify the tstatistics reported in the text.
(c) Run Goldfeld and Quandt’s (1965) test by ordering the observations according to logYi and omitting 12 central observations. Report the two regressions based on the ﬁrst and last 17 observations and verify the F test reported in the text.
(d) Verify the Spearman rank correlation test based on the rank (logYi ) and rank ei .
(e) Verify Harvey’s (1976) multiplicative heteroskedasticity test based on regressing loge2i on log(logYi ). (f) Run the Breusch and Pagan (1979) test based on the regression of e2i / σ 2 on logYi , where 46 2 2 σ = i=1 ei /46.
(g) Run White’s (1980) test for heteroskedasticity.
(h) Run the Jarque and Bera (1987) test for normality of the disturbances.
(i) Compute White’s (1980) heteroskedasticity robust standard errors for the regression in part (a). 14. A Simple Linear Trend Model with AR(1) Disturbances. This is based on Kr¨ amer (1982). (a) Consider the following simple linear trend model Yt = α + β t + ut where ut = ρut−1 + ǫt with ρ < 1, ǫt ∼ IID(0, σ 2ǫ ) and var(ut ) = σ 2u = σ 2ǫ /(1 − ρ2 ). Our interest is focused on the estimates of the trend coeﬃcient, β, and the estimators to be considered are OLS, CO (assuming that the true value of ρ is known), the ﬁrstdifference estimator (FD), and the Generalized Least Squares (GLS), which is Best Linear Unbiased (BLUE) in this case. In the context of the simple linear trend model, the formulas for the variances of these estimators reduce to V (OLS) = 12σ2 {−6ρT +1 [(T − 1)ρ − (T + 1)]2 − (T 3 − T )ρ4 +2(T 2 − 1)(T − 3)ρ3 + 12(T 2 + 1)ρ2 − 2(T 2 − 1)(T + 3)ρ +(T 3 − T )}/(1 − ρ2 )(1 − ρ)4 (T 3 − T )2 V (CO) = 12σ2 (1 − ρ)2 (T 3 − 3T 2 + 2T ), V (F D) = 2σ 2 (1 − ρT −1 )/(1 − ρ2 )(T − 1)2 , V (GLS) = 12σ2 /(T − 1)[(T − 3)(T − 2)ρ2 − 2(T − 3)(T − 1)ρ + T (T + 1)]. (b) Compute these variances and their relative eﬃciency with respect to the GLS estimator for T = 10, 20, 30, 40 and ρ between −0.9 and 0.9 in 0.1 increments. (c) For a given T , show that the limit of var(OLS)/var(CO) is zero as ρ → 1. Prove that var(F D) and var(GLS) both tend in the limit to σ 2ǫ /(T − 1) < ∞ as ρ → 1. Conclude that var(GLS)/var(F D) tend to 1 as ρ → 1. Also, show that lim [var(GLS)/var(OLS)] = ρ→1
5(T 2 + T )/6(T 2 + 1) < 1 provided T > 3.
(d) For a given ρ, show that var(F D) = O(T −2 ) whereas the variance of the remaining estimators is O(T −3 ). Conclude that lim [var(F D)/var(CO)] = ∞ for any given ρ. T →∞
15. Consider the empirical example in section 5.6, based on the ConsumptionIncome data in Table 5.3. Obtain this data set from the CONSUMP.DAT ﬁle on the Springer web site.
124
Chapter 5: Violations of the Classical Assumptions (a) Replicate the OLS regression of Ct on Yt and a constant, and compute the DurbinWatson statistic. Test H0 ; ρ = 0 versus H1 ; ρ > 0 at the 5% signiﬁcance level. (b) Perform the CochraneOrcutt procedure and verify the regression results in the text. (c) Perform the twostep PraisWinsten procedure and verify the regression results in the text. Iterate on the PraisWinsten procedure. (d) Perform the maximum likelihood procedure and verify the results in the text. (e) Perform Durbin’s regression and verify the results in the text. (f) Test for ﬁrstorder serial correlation using the Breusch and Godfrey test. (g) Compute the NeweyWest heteroskedasticity and autocorrelationconsistent standard errors for the least squares estimates in part (a).
16. Benderly and Zwick (1985) considered the following equation RSt = α + βQt+1 + γPt + ut where RSt = the real return on stocks in year t, Qt+1 = the annual rate of growth of real GNP in year t + 1, and Pt = the rate of inﬂation in year t. The data is provided on the Springer web site and labeled BENDERLY.ASC. This data covers 31 annual observations for the U.S. over the period 19521982. This was obtained from Lott and Ray (1991). This equation is used to test the signiﬁcance of the inﬂation rate in explaining real stock returns. Use the sample period 19541976 to answer the following questions: (a) Run OLS to estimate the above equation. Remember to use Qt+1 . Is Pt signiﬁcant in this equation? Plot the residuals against time. Compute the NeweyWest heteroskedasticity and autocorrelationconsistent standard errors for these least squares estimates. (b) Test for serial correlation using the D.W. test. (c) Would your decision in (b) change if you used the BreuschGodfrey test for ﬁrstorder serial correlation? (d) Run the CochraneOrcutt procedure to correct for ﬁrstorder serial correlation. Report your estimate of ρ. (e) Run a PraisWinsten procedure accounting for the ﬁrst observation and report your estimate of ρ. Plot the residuals against time. 17. Using our crosssection Energy/GDP data set in Chapter 3, problem 3.16 consider the following two models: M odel 1: logEn = α + βlogRGDP + u M odel 2: En = α + βRGDP + v Make sure you have corrected the W. Germany observation on EN as described in problem 3.16 part (d). (a) Run OLS on both Models 1 and 2. Test for heteroskedasticity using the Goldfeldt/Quandt Test. Omit c = 6 central observations. Why is heteroskedasticity a problem in Model 2, but not Model 1? (b) For Model 2, test for heteroskedasticity using the Glejser Test. (c) Now use the BreuschPagan Test to test for heteroskedasticity on Model 2. (d) Apply White’s Test to Model 2. (e) Do all these tests give the same decision?
Problems
125
(f) Propose and estimate a simple transformation of Model 2, assuming heteroskedasticity of the form σ 2i = σ 2 RGDP 2 . (g) Propose and estimate a simple transformation of Model 2, assuming heteroskedasticity of the form σ 2i = σ 2 (a + bRGDP )2 . (h) Now suppose that heteroskedasticity is of the form σ 2i = σ 2 RGDP γ where γ is an unknown parameter. Propose and estimate a simple transformation for Model 2. Hint: You can write σ 2i as exp{α + γlogRGDP } where α = logσ 2 . (i) Compare the standard errors of the estimates for Model 2 from OLS, also obtain White’s heteroskedasticityconsistent standard errors. Compare them with the simple Weighted Least Squares estimates of the standard errors in parts (f), (g) and (h). What do you conclude?
18. You are given quarterly data from the ﬁrst quarter of 1965 (1965.1) to the fourth quarter of 1983 (1983.4) on employment in Orange County California (EMP) and real gross national product (RGNP). The data set is in a ﬁle called ORANGE.DAT on the Springer web site. (a) Generate the lagged variable of real GNP, call it RGN Pt−1 and estimate the following model by OLS: EM Pt = α + βRGN Pt−1 + ut . (b) What does inspection of the residuals and the DurbinWatson statistic suggest? (c) Assuming ut = ρut−1 + ǫt where ρ < 1 and ǫt ∼ IIN(0, σ 2ǫ ), use the CochraneOrcutt procedure to estimate ρ, α and β. Compare the latter estimates and their standard errors with those of OLS. (d) The CochraneOrcutt procedure omits the ﬁrst observation. Perform the PraisWinsten adjustment. Compare the resulting estimates and standard error with those in part (c). (e) Apply the BreuschGodfrey test for ﬁrst and second order autoregression. What do you conclude? (f) Compute the NeweyWest heteroskedasticity and autocorrelationconsistent covariance standard errors for the least squares estimates in part (a). 19. Consider the earning data underlying the regression in Table 4.1 and available on the Springer web site as EARN.ASC. (a) Apply White’s test for heteroskedasticity to the regression residuals. (b) Compute White’s heteroskedasticityconsistent standard errors. (c) Test the least squares residuals for normality using the JarqueBera test. 20. Harrison and Rubinﬁeld (1978) collected data on 506 census tracts in the Boston area in 1970 to study hedonic housing prices and the willingness to pay for clean air. This data is available on the Springer web site as HEDONIC.XLS. The dependent variable is the Median Value (MV) of owneroccupied homes. The regressors include two structural variables, RM the average number of rooms, and AGE representing the proportion of owner units built prior to 1940. In addition there are eight neighborhood variables: B, the proportion of blacks in the population; LSTAT, the proportion of population that is lower status; CRIM, the crime rate; ZN, the proportion of 25000 square feet residential lots; INDUS, the proportion of nonretail business acres; TAX, the full value property tax rate ($/$10000); PTRATIO, the pupilteacher ratio; and CHAS represents the dummy variable for Charles River: = 1 if a tract bounds the Charles. There are also two accessibility variables, DIS the weighted distances to ﬁve employment centers in the Boston region, and RAD the index of accessibility to radial highways. One more regressor is an air pollution variable NOX, the annual average nitrogen oxide concentration in parts per hundred million. (a) Run OLS of MV on the 13 independent variables and a constant. Plot the residuals.
126
Chapter 5: Violations of the Classical Assumptions (b) Apply White’s tests for heteroskedasticity. (c) Obtain the White heteroskedasticityconsistent standard errors. (d) Test the least squares residuals for normality using the JarqueBera test.
References For additional readings consult the econometrics books cited in the Preface. Recent chapters on heteroskedasticity and autocorrelation include Griﬃths (2001) and King (2001): Ali, M.M. and C. Giaccotto (1984), “A study of Several New and Existing Tests for Heteroskedasticity in the General Linear Model,” Journal of Econometrics, 26: 355373. Amemiya, T. (1973), “Regression Analysis When the Variance of the Dependent Variable is Proportional to the Square of its Expectation,” Journal of the American Statistical Association, 68: 928934. Amemiya, T. (1977), “A Note on a Heteroskedastic Model,” Journal of Econometrics, 6: 365370. Andrews, D.W.K. (1991), “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation,” Econometrica, 59: 817858. Baltagi, B. and Q. Li (1990), “The Heteroskedastic Consequences of an Arbitrary Variance for the Initial Disturbance of an AR(1) Model,” Econometric Theory, Problem 90.3.1, 6: 405. Baltagi, B. and Q. Li (1992), “The Bias of the Standard Errors of OLS for an AR(1) process with an Arbitrary Variance on the Initial Observations,” Econometric Theory, Problem 92.1.4, 8: 146. Baltagi, B. and Q. Li (1995), “ML Estimation of Linear Regression Model with AR(1) Errors and Two Observations,” Econometric Theory, Solution 93.3.2, 11: 641642. Bartlett’s test, M.S. (1937), “Properties of Suﬃciency and Statistical Tests,” Proceedings of the Royal Statistical Society, A, 160: 268282. Beach, C.M. and J.G. MacKinnon (1978), “A Maximum Likelihood Procedure for Regression with Autocorrelated Errors,” Econometrica, 46: 5158. Benderly, J. and B. Zwick (1985), “Inﬂation, Real Balances, Output and Real Stock Returns,” American Economic Review, 75: 11151123. Breusch, T.S. (1978), “Testing for Autocorrelation in Dynamic Linear Models,” Australian Economic Papers, 17: 334355. Breusch, T.S. and A.R. Pagan (1979), “A Simple Test for Heteroskedasticity and Random Coeﬃcient Variation,” Econometrica, 47: 12871294. Buse, A. (1984), “Tests for Additive Heteroskedasticity: Goldfeld and Quandt Revisited,” Empirical Economics, 9: 199216. Carroll, R.H. (1982), “Adapting for Heteroskedasticity in Linear Models,” Annals of Statistics, 10: 12241233. Cochrane, D. and G. Orcutt (1949), “Application of Least Squares Regression to Relationships Containing Autocorrelated Error Terms,” Journal of the American Statistical Association, 44: 3261. Cragg, J.G. (1992), “QuasiAitken Estimation for Heteroskedasticity of Unknown Form,” Journal of Econometrics, 54: 197202.
References
127
Durbin, J. (1960), “Estimation of Parameters in TimeSeries Regression Model,” Journal of the Royal Statistical Society, Series B, 22: 139153. Durbin, J. and G. Watson (1950), “Testing for Serial Correlation in Least Squares RegressionI,” Biometrika, 37: 409428. Durbin, J. and G. Watson (1951), “Testing for Serial Correlation in Least Squares RegressionII,” Biometrika, 38: 159178. Evans, M.A., and M.L. King (1980) “A Further Class of Tests for Heteroskedasticity,” Journal of Econometrics, 37: 265276. Farebrother, R.W. (1980), “The DurbinWatson Test for Serial Correlation When There is No Intercept in the Regression,” Econometrica, 48: 15531563. Glejser, H. (1969), “A New Test for Heteroskedasticity,” Journal of the American Statistical Association, 64: 316323. Godfrey, L.G. (1978), “Testing Against General Autoregressive and Moving Average Error Models When the Regressors Include Lagged Dependent Variables,” Econometrica, 46: 12931302. Goldfeld, S.M. and R.E. Quandt (1965), “Some Tests for Homoscedasticity,” Journal of the American Statistical Association, 60: 539547. Goldfeld, S.M. and R.E. Quandt (1972), Nonlinear Methods in Econometrics (NorthHolland: Amsterdam). Griﬃths, W.E. (2001), “Heteroskedasticity,” Chapter 4 in B.H. Baltagi, (ed.), A Companion to Theoretical Econometrics (Blackwell: Massachusetts). Harrison, M. and B.P. McCabe (1979), “A Test for Heteroskedasticity Based On Ordinary Least Squares Residuals,” Journal of the American Statistical Association, 74: 494499. Harrison, D. and D.L. Rubinfeld (1978), “Hedonic Housing Prices and the Demand for Clean Air,” Journal of Environmental Economics and Management, 5: 81102. Harvey, A.C. (1976), “Estimating Regression Models With Multiplicative Heteroskedasticity,” Econometrica, 44: 461466. Hilderth, C. and J. Lu (1960), “Demand Relations with Autocorrelated Disturbances,” Technical Bulletin 276 (Michigan State University, Agriculture Experiment Station). Jarque, C.M. and A.K. Bera (1987), “A Test for Normality of Observations and Regression Residuals,” International Statistical Review, 55: 163177. Kim, J.H. (1991), “The Heteroskedastic Consequences of an Arbitrary Variance for the Initial Disturbance of an AR(1) Model,” Econometric Theory, Solution 90.3.1, 7: 544545. King, M. (2001), “Serial Correlation,” Chapter 2 in B.H. Baltagi, (ed.), A Companion to Theoretical Econometrics (Blackwell: Massachusetts). Koenker, R. (1981), “A Note on Studentizing a Test for Heteroskedasticity,” Journal of Econometrics, 17: 107112. Koenker, R. and G.W. Bassett, Jr. (1982), “Robust Tests for Heteroskedasticity Based on Regression Quantiles,” Econometrica, 50:4361. Koning, R.H. (1992), “The Bias of the Standard Errors of OLS for an AR(1) process with an Arbitrary Variance on the Initial Observations,” Econometric Theory, Solution 92.1.4, 9: 149150.
128
Chapter 5: Violations of the Classical Assumptions
Kr¨ amer, W. (1982), “Note on Estimating Linear Trend When Residuals are Autocorrelated,” Econometrica, 50: 10651067. Lott, W.F. and S.C. Ray (1992), Applied Econometrics: Problems With Data Sets (The Dryden Press: New York). Maddala, G.S. (1977), Econometrics (McGrawHill: New York). Maeshiro, A. (1976), “Autoregressive Transformation, Trended Independent Variables and Autocorrelated Disturbance Terms,” The Review of Economics and Statistics, 58: 497500. Maeshiro, A. (1979), “On the Retention of the First Observations in Serial Correlation Adjustment of Regression Models,” International Economic Review, 20: 259265. Magee L. (1993), “ML Estimation of Linear Regression Model with AR(1) Errors and Two Observations,” Econometric Theory, Problem 93.3.2, 9: 521522. Mizon, G.E. (1995), “A Simple Message for Autocorrelation Correctors: Don’t,” Journal of Econometrics 69: 267288. Newey, W.K. and K.D. West (1987), “A Simple, Positive Semideﬁnite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix,” Econometrica, 55: 703708. Oberhofer, W. and J. Kmenta (1974), “A General Procedure for Obtaining Maximum Likelihood Estimates in Generalized Regression Models,” Econometrica, 42: 579590. Park, R.E. and B.M. Mitchell (1980), “Estimating the Autocorrelated Error Model With Trended Data,” Journal of Econometrics, 13: 185201. Prais, S. and C. Winsten (1954), “Trend Estimation and Serial Correlation,” Discussion Paper 383 (Cowles Commission: Chicago). Rao, P. and Z. Griliches (1969), “Some Small Sample Properties of Several TwoStage Regression Methods in the Context of Autocorrelated Errors,” Journal of the American Statistical Association, 64: 253272. Rao, P. and R. L. Miller (1971), Applied Econometrics (Wadsworth: Belmont). Robinson, P.M. (1987), “Asymptotically Eﬃcient Estimation in the Presence of Heteroskedasticity of Unknown Form,” Econometrica, 55: 875891. Rutemiller, H.C. and D.A. Bowers (1968), “Estimation in a Heteroskedastic Regression Model,” Journal of the American Statistical Association, 63: 552557. Savin, N.E. and K.J. White (1977), “The DurbinWatson Test for Serial Correlation with Extreme Sample Sizes or Many Regressors,” Econometrica, 45: 19891996. Szroeter, J. (1978), “A Class of Parametric Tests for Heteroskedasticity in Linear Econometric Models,” Econometrica, 46: 13111327. Theil, H. (1978), Introduction to Econometrics (PrenticeHall: Englewood Cliffs, NJ). Waldman, D.M. (1983), “A Note on Algebraic Equivalence of White’s Test and a Variation of the Godfrey/BreuschPagan Test for Heteroskedasticity,” Economics Letters, 13: 197200. White, H. (1980), “A Heteroskedasticity Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity,” Econometrica, 48: 817838. Wooldridge, J.M. (1991), “On the Application of Robust, RegressionBased Diagnostics to Models of Conditional Means and Conditional Variances,” Journal of Econometrics, 47: 546.
CHAPTER 6
Distributed Lags and Dynamic Models 6.1
Introduction
Many economic models have lagged values of the regressors in the regression equation. For example, it takes time to build roads and highways. Therefore, the effect of this public investment on growth in GNP will show up with a lag, and this effect will probably linger on for several years. It takes time before investment in research and development pays off in new inventions which in turn take time to develop into commercial products. In studying consumption behavior, a change in income may affect consumption over several periods. This is true in the permanent income theory of consumption, where it may take the consumer several periods to determine whether the change in real disposable income was temporary or permanent. For example, is the extra consulting money earned this year going to continue next year? Also, lagged values of real disposable income appear in the regression equation because the consumer takes into account his life time earnings in trying to smooth out his consumption behavior. In turn, one’s life time income may be guessed by looking at past as well as current earnings. In other words, the regression relationship would look like Yt = α + β 0 Xt + β 1 Xt−1 + .. + β s Xt−s + ut
t = 1, 2, . . . , T
(6.1)
where Yt denotes the tth observation on the dependent variable Y and Xt−s denotes the (ts)th observation on the independent variable X. α is the intercept and β 0 , β 1 , . . . , β s are the current and lagged coeﬃcients of Xt . Equation (6.1) is known as a distributed lag since it distributes the effect of an increase in income on consumption over s periods. Note that the shortrun effect of a unit change in X on Y is given by β o , while the longrun effect of a unit change in X on Y is (β 0 + β 1 + .. + β s ). Suppose that we observe Xt from 1955 to 1995. Xt−1 is the same variable but for the previous period, i.e., 19541994. Since 1954 is not observed, we start from 1955 for Xt−1 , and end at 1994. This means that when we lag once, the current Xt series will have to start at 1956 and end at 1995. For practical purposes, this means that when we lag once we loose one observation from the sample. So if we lag s periods, we loose s observations. Furthermore, we are estimating one extra β with every lag. Therefore, there is double jeopardy with respect to loss of degrees of freedom. The number of observations fall (because we are lagging the same series), and the number of parameters to be estimated increase with every lagging variable introduced. Besides the loss of degrees of freedom, the regressors in (6.1) are likely to be highly correlated with each other. In fact most economic time series are usually trended and very highly correlated with their lagged values. This introduces the problem of multicollinearity among the regressors and as we saw in Chapter 4, the higher the multicollinearity among these regressors, the lower is the reliability of the regression estimates. In this model, OLS is still BLUE because the classical assumptions are still satisﬁed. All we have done in (6.1) is introduce the additional regressors (Xt−1 , . . . , Xt−s ). These regressors are uncorrelated with the disturbances since they are lagged values of Xt , which are by assumption not correlated with ut for every t.
130
Chapter 6: Distributed Lags and Dynamic Models
>E
( s + 1)>
> E 0
1
2
3
s
( s + 1)
Figure 6.1 Linear Distributed Lag
In order to reduce the degrees of freedom problem, one could impose more structure on the β’s. One of the simplest forms imposed on these coeﬃcients is the linear arithmetic lag, (see Figure 6.1), which can be written as β i = [(s + 1) − i]β
for i = 0, 1, . . . , s
(6.2)
The lagged coeﬃcients of X follow a linear distributed lag declining arithmetically from (s+1)β for Xt to β for Xt−s . Substituting (6.2) in (6.1) one gets Yt = α +
s
i=0 β i Xt−i
+ ut = α + β
s
i=0 [(s
+ 1) − i]Xt−i + ut
(6.3)
where the latter equation can be estimated by the regression of Yt on a constant and Zt , where Zt =
s
i=0 [(s
+ 1) − i]Xt−i
This Zt can be calculated given s and Xt . Hence, we have reduced the estimation of β 0 , β 1 , . . . , β s is obtained, β can be deduced from (6.2), for i = into the estimation of just one β. Once β i 0, 1, . . . , s. Despite its simplicity, this lag is too restrictive to impose on the regression and is not usually used in practice. Alternatively, one can think of β i = f (i) for i = 0, 1, . . . , s. If f (i) is a continuous function, over a closed interval, then it can be approximated by an rth degree polynomial, f (i) = a0 + a1 i + . . . + ar ir For example, if r = 2, then β i = a0 + a1 i + a2 i2
for i = 0, 1, 2, . . . , s
6.1
Introduction
131
so that β 0 = a0 β 1 = a0 + a1 + a2 β 2 = a0 + 2a1 + 4a2 .. .. .. . . . β s = a0 + sa1 + s2 a2 Once a0 , a1 , and a2 are estimated, β 0 , β 1 , . . . , β s can be deduced. In fact, substituting β i = a0 + a1 i + a2 i2 in (6.1) we get (6.4) Yt = α + si=0 (a0 + a1 i + a2 i2 )Xt−i + ut s 2 s s = α + a0 i=0 Xt−i + a1 i=0 iXt−i + a2 i=0 i Xt−i + ut
This last equation, shows that α, a0 , a1 and a2 can be estimated from the regression of Yt on a constant, Z0 = si=0 Xt−i , Z1 = si=0 iXt−i and Z2 = si=0 i2 Xt−i . This procedure was proposed by Almon (1965) and is known as the Almon lag. One of the problems with this procedure is the choice of s and r, the number of lags on Xt , and the degree of the polynomial, respectively. In practice, neither is known. Davidson and MacKinnon (1993) suggest starting with a maximum reasonable lag s∗ that is consistent with the theory and then based on the unrestricted regression, given in (6.1), checking whether the ﬁt of the model deteriorates as s∗ ¯ 2 ; (ii) minimizing is reduced. Some criteria suggested for this choice include: (i) maximizing R Akaike’s (1973) Information Criterion (AIC) with respect to s. This is given by AIC(s) = (RSS/T )e2s/T ; or (iii) minimizing Schwarz (1978) Bayesian Information Criterion (BIC) with respect to s. This is given by BIC(s) = (RSS/T )T s/T where RSS denotes the residual sum ¯ 2 , reward good ﬁt but penalize loss of squares. Note that the AIC and BIC criteria, like R of degrees of freedom associated with a high value of s. These criteria are printed by most regression software including SHAZAM, EViews and SAS. Once the lag length s is chosen it is straight forward to determine r, the degree of the polynomial. Start with a high value of r and construct the Z variables as described s 4 in (6.4). If r = 4 is the highest degree polynomial chosen and a4 , the coeﬃcient of Z4 = i=0 i Xt−4 is insigniﬁcant, drop Z4 and run the regression for r = 3. Stop, if the coeﬃcient of Z3 is signiﬁcant, otherwise drop Z3 and run the regression for r = 2. Applied researchers usually impose end point constraints on this Almon lag. A near end point constraint means that β −1 = 0 in equation (6.1). This means that for equation (6.4), this constraint yields the following restriction on the second degree polynomial in a’s: β −1 = f (−1) = a0 − a1 + a2 = 0. This restriction allows us to solve for a0 given a1 and a2 . In fact, substituting a0 = a1 − a2 into (6.4), the regression becomes Yt = α + a1 (Z1 + Z0 ) + a2 (Z2 − Z0 ) + ut
(6.5)
and once a1 and a2 are estimated, a0 is deduced, and hence the β i ’s. This restriction essentially states that Xt+1 has no effect on Yt . This may not be a plausible assumption, especially in our consumption example, where income next year enters the calculation of permanent income or life time earnings. A more plausible assumption is the far end point constraint, where β s+1 = 0. This means that Xt−(s+1) does not affect Yt . The further you go back in time, the less is the effect on the current period. All we have to be sure of is that we have gone far back enough
132
Chapter 6: Distributed Lags and Dynamic Models
>E
.
. ..
. –1 0 1 2 3
I (I + 1)
E
Figure 6.2 A Polynomial Lag with End Point Constraints
to reach an insigniﬁcant effect. This far end point constraint is imposed by removing Xt−(s+1) from the equation as we have done above. But, some researchers impose this restriction on β i = f (i), i.e., by restricting β s+1 = f (s + 1) = 0. This yields for r = 2 the following constraint: a0 +(s+1)a1 +(s+1)2 a2 = 0. Solving for a0 and substituting in (6.4), the constrained regression becomes Yt = α + a1 [Z1 − (s + 1)Z0 ] + a2 [Z2 − (s + 1)2 Z0 ] + ut
(6.6)
One can also impose both end point constraints and reduce the regression into the estimation of one a rather than three a’s. Note that β −1 = β s+1 = 0 can be imposed by not including Xt+1 and Xt−(s+1) in the regression relationship. However, these end point restrictions impose the additional restrictions that the polynomial on which the a’s lie should pass through zero at i = −1 and i = (s + 1), see Figure 6.2. These additional restrictions on the polynomial may not necessarily be true. In other words, the polynomial could intersect the Xaxis at points other than −1 or (s + 1). Imposing a restriction, whether true or not, reduces the variance of the estimates, and introduces bias if the restriction is untrue. This is intuitive, because this restriction gives additional information which should increase the reliability of the estimates. The reduction in variance and the introduction of bias naturally lead to Mean Square Error criteria that help determine whether these restrictions should be imposed, see Wallace (1972). These criteria are beyond the scope of this chapter. In general, one should be careful in the use of restrictions that may not be plausible or even valid. In fact, one should always test these restrictions before using them. See Schmidt and Waud (1975). Empirical Example: Using the ConsumptionIncome data from the Economic Report of the President over the period 19501993, given in Table 5.1, we estimate a consumptionincome regression imposing a ﬁve year lag on income. In this case, all variables are in logs and s = 5 in equation (6.1). Table 6.1 gives the SAS output imposing the linear arithmetic lag given in equation (6.2). = 0.047 as the coeﬃcient of Yt−5 which is denoted by Note that the SAS output reports β Y LAG5. This is statistically signiﬁcant with a tvalue of 83.1. Note that the coeﬃcient of Yt−4
6.1
Introduction
133
Table 6.1 Regression with Arithmetic Lag Restriction Dependent Variable: C Analysis of Variance Source Model Error C Total
DF 1 37 38
Root MSE Dep Mean C.V.
0.01828 9.14814 0.19978
Sum of Squares 2.30559 0.01236 2.31794
Mean Square 2.30559 0.00033
Rsquare Adj Rsq
0.9947 0.9945
F Value 6902.791
Prob>F 0.0001
T for H0: Parameter=0 0.873 83.083 83.083 83.083 83.083 83.083 83.083 3.906 0.631 –3.058 –2.857 –1.991
Prob > T 0.3880 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0004 0.5319 0.0041 0.0070 0.0539
Parameter Estimates
Variable INTERCEP Y YLAG1 YLAG2 YLAG3 YLAG4 YLAG5 RESTRICT RESTRICT RESTRICT RESTRICT RESTRICT
DF 1 1 1 1 1 1 1 –1 –1 –1 –1 –1
Parameter Estimate 0.095213 0.280795 0.233996 0.187197 0.140398 0.093598 0.046799 0.007218 0.000781 –0.003911 –0.005374 –0.005208
Standard Error 0.10900175 0.00337969 0.00281641 0.00225313 0.00168985 0.00112656 0.00056328 0.00184780 0.00123799 0.00127903 0.00188105 0.00261513
and so on. The coeﬃcient of Yt is given by 6β = 0.281. which is denoted by Y LAG4 is 2β, At the bottom of the regression output, SAS tests each one of these ﬁve coeﬃcient restrictions individually. We can see that three of these restrictions are rejected at the 5% level. One can test the arithmetic lag restrictions jointly using an F test. The Unrestricted Residual Sum of Squares (URSS) is obtained by regressing Ct on Yt , Yt−1 , . . . , Yt−5 and a constant. This yields URSS = 0.00667. The RRSS is given in Table 6.1 as 0.01236 and it involves imposing 5 restrictions given in (6.2). Therefore, F =
(0.01236 − 0.00667)/5 = 5.4597 0.00667/32
and this is distributed as F5,32 under the null hypothesis. The observed F statistic has a pvalue of 0.001 and we reject the linear arithmetic lag restrictions. Next we impose an Almon lag based on a second degree polynomial as described in equation (6.4). Table 6.2 reports the SAS output for s = 5 imposing the near end point constraint. In this = 0.193, β = 0.299, . . . , β = case, the estimated regression coeﬃcients rise and then fall: β 0 1 5 −0.159. Only β 5 is statistically insigniﬁcant. In addition, SAS reports a ttest for the near end point restriction which is rejected with a pvalue of 0.0001. The Almon lag restrictions can be
134
Chapter 6: Distributed Lags and Dynamic Models
Table 6.2 Almon Polynomial, r = 2, s = 5 and Near EndPoint Constraint PDLREG Procedure Dependent Variable = C Ordinary Least Squares Estimates SSE MSE SBC Reg Rsq DurbinWatson
0.014807 0.000411 –185.504 0.9936 0.6958
DFE Root MSE AIC Total Rsq
36 0.020281 –190.495 0.9936
Variable Intercept Y**0 Y**1 Y**2
DF 1 1 1 1
B Value 0.093289 0.400930 –0.294560 –0.268490
Std Error 0.1241 0.00543 0.0892 0.0492
t Ratio 0.752 73.838 –3.302 –5.459
Approx Prob 0.4571 0.0001 0.0022 0.0001
Restriction Y(–1)
DF –1
L Value 0.005691
Std Error 0.00135
t Ratio 4.203
Approx Prob 0.0001
Variable Y(0) Y(1) Y(2) Y(3) Y(4) Y(5)
Parameter Value 0.19324 0.29859 0.31606 0.24565 0.08735 –0.15883
Std Error 0.027 0.038 0.033 0.012 0.026 0.080
t Ratio 7.16 7.88 9.67 21.20 3.32 –1.99
Approx Prob 0.0001 0.0001 0.0001 0.0001 0.0020 0.0539
Estimate of Lag Distribution −0.159 0 0.3161                  
jointly tested using Chow’s F statistic. The URSS is obtained from the unrestricted regression of Ct on Yt , Yt−1 , . . . , Yt−5 and a constant. This was reported above as URSS = 0.00667. The RRSS, given in Table 6.2, is 0.014807 and involves four restrictions. Therefore,
F =
(0.014807 − 0.00667)/4 = 9.76 0.00667/32
and this is distributed as F4,32 under the null hypothesis. The observed F statistic has a pvalue of 0.00003 and we reject the second degree polynomial Almon lag speciﬁcation with a near end point constraint. Table 6.3 reports the SAS output for s = 5, imposing the far end point constraint. Note are decreasing, that this restriction is rejected with a pvalue of 0.008. In this case, the β’s = 0.309, . . . , β = −0.026 with β and β being statistically insigniﬁcant. Most = 0.502, β β 0 1 5 4 5 packages have polynomial distributed lags as part of their standard commands. For example, using EViews, replacing the regressor Y by PDL(Y, 5, 2, 1) indicates the request to ﬁt a ﬁve year Almon lag on Y that is of the secondorder degree, with a near end point constraint.
6.2
Inﬁnite Distributed Lag
135
Table 6.3 Almon Polynomial, r = 2, s = 5 and Far EndPoint Constraint PDLREG Procedure Dependent Variable = C Ordinary Least Squares Estimates SSE MSE SBC Reg Rsq DurbinWatson
0.009244 0.000257 –203.879 0.9960 0.6372
DFE Root MSE AIC Total Rsq
36 0.016024 –208.87 0.9960
Variable Intercept Y**0 Y**1 Y**2
DF 1 1 1 1
B Value –0.015868 0.405244 –0.441447 –0.133484
Std Error 0.1008 0.00439 0.0706 0.0383
t Ratio –0.157 92.331 –6.255 –3.483
Approx Prob 0.8757 0.0001 0.0001 0.0013
Restriction Y(–1)
DF –1
L Value 0.002758
Std Error 0.00107
t Ratio 2.575
Approx Prob 0.0080
Variable Y(0) Y(1) Y(2) Y(3) Y(4) Y(5)
6.2
Parameter Value 0.50208 0.30916 0.15995 0.05442 –0.00741 –0.02555
Std Error 0.064 0.022 0.008 0.025 0.029 0.021
t Ratio 7.89 14.32 19.82 2.20 –0.26 –1.23
Approx Prob 0.0001 0.0001 0.0001 0.0343 0.7998 0.2268
Estimate of Lag Distribution −0.026 0.5021                  
Inﬁnite Distributed Lag
So far we have been dealing with a ﬁnite number of lags imposed on Xt . Some lags may be inﬁnite. For example, the investment in building highways and roads several decades ago may still have an effect on today’s growth in GNP. In this case, we write equation (6.1) as Yt = α + ∞ (6.7) i=0 β i Xt−i + ut t = 1, 2, . . . , T.
There are an inﬁnite number of β i ’s to estimate with only T observations. This can only be feasible if more structure is imposed ∞ on the β i ’s. First, we normalize these β i ’s by their sum, i.e., let wi = β i /β where β = i=0 β i . If all the β i ’s have the same sign, then the β i ’s take the sign of β and 0 ≤ wi ≤ 1 for all i, with ∞ i=0 wi = 1. This means that the wi ’s can be interpreted as probabilities. In fact, Koyck (1954) imposed the geometric lag on the wi ’s, i.e., wi = (1 − λ)λi for i = 0, 1, . . . , ∞1 . Substituting β i = βwi = β(1 − λ)λi in (6.7) we get Yt = α + β(1 − λ)
∞
i i=0 λ Xt−i
+ ut
(6.8)
136
Chapter 6: Distributed Lags and Dynamic Models
Equation (6.8) is known as the inﬁnite distributed lag form of the Koyck lag. The shortrun effect of a unit change in Xt on Yt isgiven by β(1 − λ); whereas the longrun effect of a unit ∞ change in Xt on Yt is ∞ i=0 β i = β i=0 wi = β. Implicit in the Koyck lag structure is that the effect of a unit change in Xt on Yt declines the further back we go in time. For example, if λ = 1/2, then β 0 = β/2, β 1 = β/4, β 2 = β/8, etc. Deﬁning LXt = Xt−1 , as the lag operator, we have Li Xt = Xt−i , and (6.8) reduces to i (6.9) Yt = α + β(1 − λ) ∞ i=0 (λL) Xt + ut = α + β(1 − λ)Xt /(1 − λL) + ut ∞ i where we have used the fact that i=0 c = 1/(1 − c). Multiplying the last equation by (1 − λL) one gets Yt − λYt−1 = α(1 − λ) + β(1 − λ)Xt + ut − λut−1 or Yt = λYt−1 + α(1 − λ) + β(1 − λ)Xt + ut − λut−1
(6.10)
This is the autoregressive form of the inﬁnite distributed lag. It is autoregressive because it includes the lagged value of Yt as an explanatory variable. Note that we have reduced the problem of estimating an inﬁnite number of β i ’s into estimating λ and β from (6.10). However, OLS would lead to biased and inconsistent estimates, because (6.10) contains a lagged dependent variable as well as serially correlated errors. In fact the error in (6.10) is a Moving Average process of order one, i.e., MA(1), see Chapter 14. We digress at this stage to give two econometric models which would lead to equations resembling (6.10). 6.2.1
Adaptive Expectations Model (AEM)
Suppose that output Yt is a function of expected sales Xt∗ and that the latter is unobservable, i.e., Yt = α + βXt∗ + ut where expected sales are updated according to the following method ∗ ∗ Xt∗ − Xt−1 = δ(Xt − Xt−1 )
(6.11)
that is, expected sales at time t is a weighted combination of expected sales at time t − 1 and actual sales at time t. In fact, ∗ Xt∗ = δXt + (1 − δ)Xt−1
(6.12)
Equation (6.11) is also an error learning model, where one learns from past experience and adjust expectations after observing current sales. Using the lag operator L, (6.12) can be rewritten as Xt∗ = δXt /[1 − (1 − δ)L]. Substituting this last expression in the above relationship, we get Yt = α + βδXt /[1 − (1 − δ)L] + ut
(6.13)
Multiplying both sides of (6.13) by [1 − (1 − δ)L], we get Yt − (1 − δ)Yt−1 = α[(1 − (1 − δ)] + βδXt + ut − (1 − δ)ut−1 (6.14) looks exactly like (6.10) with λ = (1 − δ).
(6.14)
6.3
6.2.2
Estimation and Testing of Dynamic Models with Serial Correlation
137
Partial Adjustment Model (PAM)
Under this model there is a cost of being out of equilibrium and a cost of adjusting to that equilibrium, i.e., Cost = a(Yt − Yt∗ )2 + b(Yt − Yt−1 )2
(6.15)
where Yt∗ is the target or equilibrium level for Y , whereas Yt is the current level of Y . The ﬁrst term of (6.15) gives a quadratic loss function proportional to the distance of Yt from the equilibrium level Yt∗ . The second quadratic term represents the cost of adjustment. Minimizing this quadratic cost function with respect to Y , we get Yt = γYt∗ +(1−γ)Yt−1 , where γ = a/(a+b). Note that if the cost of adjustment was zero, then b = 0, γ = 1, and the target is reached immediately. However, there are costs of adjustment, especially in building the desired capital stock. Hence, Yt = γYt∗ + (1 − γ)Yt−1 + ut
(6.16)
where we made this relationship stochastic. If the true relationship is Yt∗ = α + βXt , then from (6.16) Yt = γα + γβXt + (1 − γ)Yt−1 + ut
(6.17)
and this looks like (6.10) with λ = (1 − γ), except for the error term, which is not necessarily MA(1) with the Moving Average parameter λ.
6.3
Estimation and Testing of Dynamic Models with Serial Correlation
Both the AEM and the PAM give equations resembling the autoregressive form of the inﬁnite distributed lag. In all cases, we end up with a lagged dependent variable and an error term that is either Moving Average of order one as in (6.10), or just classical or autoregressive as in (6.17). In this section we study the testing and estimation of such autoregressive or dynamic models. If there is a Yt−1 in the regression equation and the ut ’s are classical disturbances, as may be the case in equation (6.17), then Yt−1 is said to be contemporaneously uncorrelated with the disturbance term ut . In fact, the disturbances satisfy assumptions 14 of Chapter 3 and E(Yt−1 ut ) = 0 even though E(Yt−1 ut−1 ) = 0. In other words, Yt−1 is not correlated with the current disturbance ut but it is correlated with the lagged disturbance ut−1 . In this case, as long as the disturbances are not serially correlated, OLS will be biased, but remains consistent and asymptotically eﬃcient. This case is unlikely with economic data given that most macro timeseries variables are highly trended. More likely, the ut ’s are serially correlated. In this case, OLS is biased and inconsistent. Intuitively, Yt is related to ut , so Yt−1 is related to ut−1 . If ut and ut−1 are correlated, then Yt−1 and ut are correlated. This means that one of the regressors, lagged Y , is correlated with ut and we have the problem of endogeneity. Let us demonstrate what happens to OLS for the simple autoregressive model with no constant Yt = βYt−1 + ν t
β < 1
t = 1, 2, . . . , T
(6.18)
138
Chapter 6: Distributed Lags and Dynamic Models
with ν t = ρν t−1 + ǫt , ρ < 1 and ǫt ∼ IIN(0, σ 2ǫ ). One can show, see problem 3, that T T T T 2 2 β OLS = t=2 Yt Yt−1 / t=2 Yt−1 = β + t=2 Yt−1 ν t / t=2 Yt−1
2 with plim(β OLS −β) = asymp. bias(β OLS ) = ρ(1−β )/(1+ρβ). This asymptotic bias is positive if ρ > 0 and negative if ρ < 0. Also, this asymptotic bias can be large for small values of β and large values of ρ. For example, if ρ = 0.9 and β = 0.2, the asymptotic bias for β is 0.73. This is more than 3 times the value of β.2 Also, ρ = Tt=2 ν t ν t−1 / Tt=2 ν t = Yt − β ν t−1 where OLS Yt−1 has
plim( ρ − ρ) = −ρ(1 − β 2 )/(1 + ρβ) = − asymp.bias(β OLS )
This means that if ρ > 0, then ρ would be negatively biased. However, if ρ < 0, then ρ is positively biased. In both cases, ρ is biased towards zero. In fact, the asymptotic bias of the D.W. statistic is twice the asymptotic bias of β OLS , see problem 3. This means that the D.W. statistic is biased towards not rejecting the null hypothesis of zero serial correlation. Therefore, if the D.W. statistic rejects the null of ρ = 0, it is doing that when the odds are against it, and therefore conﬁrming our rejection of the null and the presence of serial correlation. If on the other hand it does not reject the null, then the D.W. statistic is uninformative and has to be replaced by another conclusive test for serial correlation. Such an alternative test in the presence of a lagged dependent variable has been developed by Durbin (1970), and the statistic computed is called Durbin’s h. Using (6.10) or (6.17), one computes OLS ignoring its possible bias and ρ from OLS residuals as shown above. Durbin’s h is given by h= ρ[n/(1 − n var(coeff. of Yt−1 ))]1/2 .
(6.19)
This is asymptotically distributed N (0, 1) under null hypothesis of ρ = 0. If n[var(coeff. of Yt−1 )] is greater than one, then h cannot be computed, and Durbin suggests running the OLS residuals et on et−1 and the regressors in the model (including the lagged dependent variable), and testing whether the coeﬃcient of et−1 in this regression is signiﬁcant. In fact, this test can be generalized to higher order autoregressive errors. Let ut follow an AR(p) process ut = ρ1 ut−1 + ρ2 ut−2 + .. + ρp ut−p + ǫt then this test involves running et on et−1 , et−2 , . . . , et−p and the regressors in the model including Yt−1 . The test statistic for H0 ; ρ1 = ρ2 = .. = ρp = 0; is TR2 which is distributed χ2p . This is the Lagrange multiplier test developed independently by Breusch (1978) and Godfrey (1978) and discussed in Chapter 5. In fact, this test has other useful properties. For example, this test is the same whether the null imposes an AR(p) model or an MA(p) model on the disturbances, see Chapter 14. Kiviet (1986) argues that even though these are large sample tests, the BreuschGodfrey test is preferable to Durbin’s h in small samples. 6.3.1
A Lagged Dependent Variable Model with AR(1) Disturbances
A model with a lagged dependent variable and an autoregressive error term is estimated using instrumental variables (IV). This method will be studied extensively in Chapter 11. In short, the IV method corrects for the correlation between Yt−1 and the error term by replacing Yt−1 with its predicted value Yt−1 . The latter is obtained by regressing Yt−1 on some exogenous variables,
6.3
Estimation and Testing of Dynamic Models with Serial Correlation
139
say a set of Z’s, which are called a set of instruments for Yt−1 . Since these variables are exogenous and uncorrelated with ut , Yt−1 will not be correlated with ut . Suppose the regression equation is Yt = α + βYt−1 + γXt + ut
t = 2, . . . , T
(6.20)
and that at least one exogenous variable Zt exists which will be our instrument for Yt−1 . Regressing Yt−1 on Xt , Zt and a constant, we get Yt−1 = Yt−1 + νt = a1 + a2 Zt + a3 Xt + νt.
(6.21)
ν t) Yt = α + β Yt−1 + γXt + (ut + β
(6.22)
Then Yt−1 = a1 + a2 Zt + a3 Xt and is independent of ut , because it is a linear combination of exogenous variables. But, Yt−1 is correlated with ut . This means that ν t is the part of Yt−1 that is correlated with ut . Substituting Yt−1 = Yt−1 + ν t in (6.20) we get Yt−1 is uncorrelated with the new error term (ut + β ν t ) because ΣYt−1 ν t = 0 from (6.21). Also, Xt is uncorrelated with ut by assumption. But, from (6.21), Xt also satisﬁes ΣXt ν t = 0. Hence, Xt is uncorrelated with the new error term (ut + β ν t ). This means that OLS applied to (6.22) will lead to consistent estimates of α, β and γ. The only remaining question is where do we ﬁnd instruments like Zt ? This Zt should be (i) uncorrelated with ut , (ii) preferably predicting Yt−1 fairly well, but, not predicting it perfectly, otherwise Yt−1 = Yt−1 . If this happens, we are back to OLS which we know is inconsistent, (iii) Σzt2 /T should be ﬁnite and different from zero. Recall ¯ In this case, Xt−1 seems like a natural instrumental variable candidate. It is that zt = Zt − Z. an exogenous variable which would very likely predict Yt−1 fairly well, and satisﬁes Σx2t−1 /T being ﬁnite and different from zero. In other words, (6.21) regresses Yt−1 on a constant, Xt−1 and Xt , and gets Yt−1 . Additional lags on Xt can be used as instruments to improve the small sample properties of this estimator. Substituting Yt−1 in equation (6.22) results in consistent estimates of the regression parameters. Wallis (1967) substituted these consistent estimates in the original equation (6.20) and obtained the residuals u t . Then he computed T T ρ = [ t=2 u t u t−1 /(T − 1)]/[ t=1 u 2t /T ] + (3/T )
where the last term corrects for the bias in ρ. At this stage, one can perform a PraisWinsten procedure on (6.20) using ρ instead of ρ, see Fomby and Guilkey (1983). An alternative twostep procedure has been proposed by Hatanaka (1974). After estimating (6.22) and obtaining the residuals u t from (6.20), Hatanaka (1974) suggests running Yt∗ = Yt − ∗ ∗ ρYt−2 , Xt = Xt − ρXt−1 and u t−1 . Note that this is the CochraneOrcutt ρYt−1 on Yt−1 = Yt−1 − 2t ignores the t u t−1 / Tt=3 u transformation which ignores the ﬁrst observation. Also, ρ = Tt=3 u small sample bias correction factor suggested by Wallis (1967). Let δ be the coeﬃcient of u t−1 , then the eﬃcient estimator of ρ is given by ρ = ρ + δ. Hatanaka shows that the resulting estimators are asymptotically equivalent to the MLE in the presence of Normality.
Empirical Example: Consider the ConsumptionIncome data from the Economic Report of the President over the period 19501993 given in Table 5.1. Problem 5 asks the reader to verify that Durbin’s h obtained from the lagged dependent variable model described in (6.20) yields a value of 3.367. This is asymptotically distributed as N (0, 1) under the null hypothesis of no
140
Chapter 6: Distributed Lags and Dynamic Models
serial correlation of the disturbances. This null is soundly rejected. The Bruesch and Godfrey test runs the regression of OLS residuals on their lagged values and the regressors in the model. This yields a T R2 = 7.972. This is distributed as χ21 under the null. Therefore, we reject the hypothesis of no ﬁrstorder serial correlation. Next, we estimate (6.20) using current and lagged values of income (Yt , Yt−1 and Yt−2 ) as a set of instruments for lagged consumption (Ct−1 ). The regression given by (6.22), yields: t−1 + 0.802 Yt + residuals Ct = −0.053 + 0.196 C (0.1386) (0.0799) (0.1380)
Substituting these estimates in (6.20), one gets the residuals u t . Based on these u t ’s, the Wallis (1967) estimate of ρ yields ρ = 0.681. Using this ρ, the PraisWinsten regression on (6.20) gives the following result: Ct = 0.0007 + 0.169 Ct−1 + 0.822 Yt + residuals (0.007) (0.088) (0.088)
ρ = 0.597 Alternatively, based on u t , one can compute Hatanaka’s (1974) estimate of ρ given by and run Hatanaka’s regression ∗ + 0.820 Y ∗ + 0.068 u Ct∗ = −0.036 + 0.182 Ct−1 t−1 + residuals t (0.054) (0.098) (0.099) (0.142)
where Ct∗ = Ct − ρCt−1 . The eﬃcient estimate of ρ is given by ρ= ρ + 0.068 = 0.665. 6.3.2
A Lagged Dependent Variable Model with MA(1) Disturbances
Zellner and Geisel (1970) estimated the Koyck autoregressive representation of the inﬁnite distributed lag, given in (6.10). In fact, we saw that this could also arise from the AEM, see (6.14). In particular, it is a regression with a lagged dependent variable and an MA(1) error term with the added restriction that the coeﬃcient of Yt−1 is the same as the MA(1) parameter. For simplicity, we write Yt = α + λYt−1 + βXt + (ut − λut−1 )
(6.23)
Let wt = Yt − ut , then (6.23) becomes wt = α + λwt−1 + βXt
(6.24)
By continuous substitution of lagged values of wt in (6.24) we get wt = α(1 + λ + λ2 + .. + λt−1 ) + λt w0 + β(Xt + λXt−1 + .. + λt−1 X1 ) and replacing wt by (Yt − ut ), we get Yt = α(1 + λ + λ2 + .. + λt−1 ) + λt w0 + β(Xt + λXt−1 + .. + λt−1 X1 ) + ut
(6.25)
knowing λ, this equation can be estimated via OLS assuming that the disturbances ut are not serially correlated. Since λ is not known, Zellner and Geisel (1970) suggest a search procedure over λ, where 0 < λ < 1. The regression with the minimum residual sums of squares gives the
6.4
Autoregressive Distributed Lag
141
optimal λ, and the corresponding regression gives the estimates of α, β and w0 . The last coeﬃcient wo = Yo −uo = E(Yo ) can be interpreted as the expected value of the initial observation on the dependent variable. Klein (1958) considered the direct estimation of the inﬁnite Koyck lag, given in (6.8) and arrived at (6.25). The search over λ results in MLEs of the coeﬃcients. Note, however, that the estimate of wo is not consistent. Intuitively, as t tends to inﬁnity, λt tends to zero implying no new information to estimate wo . In fact, some applied researchers ignore the variable λt in the regression given in (6.25). This practice, known as truncating the remainder, is not recommended since the Monte Carlo experiments of Maddala and Rao (1971) and Schmidt (1975) have shown that even for T = 60 or 100, it is not desirable to omit λt from (6.25). In summary, we have learned how to estimate a dynamic model with a lagged dependent variable and serially correlated errors. In case the error is autoregressive of order one, we have outlined the steps to implement the Wallis TwoStage estimator and Hatanaka’s twostep procedure. In case the error is Moving Average of order one, we have outlined the steps to implement the ZellnerGeisel procedure.
6.4
Autoregressive Distributed Lag
So far, section 6.1 considered ﬁnite distributed lags on the explanatory variables, whereas section 6.2 considered an autoregressive relation including the ﬁrst lag of the dependent variable and current values of the explanatory variables. In general, economic relationships may be generated by an Autoregressive Distributed Lag (ADL) scheme. The simplest form is the ADL (1,1) model which is given by Yt = α + λYt−1 + β 0 Xt + β 1 Xt−1 + ut
(6.26)
where both Yt and Xt are lagged once. By specifying higher order lags for Yt and Xt , say an ADL (p, q) with p lags on Yt and q lags on Xt , one can test whether the speciﬁcation now is general enough to ensure White noise disturbances. Next, one can test whether some restrictions can be imposed on this general model, like reducing the order of the lags to arrive at a simpler ADL model, or estimating the simpler static model with the CochraneOrcutt correction for serial correlation, see problem 20 in Chapter 7. This general to speciﬁc modelling strategy is prescribed by David Hendry and is utilized by the econometric software PCGive, see Gilbert (1986). Returning to the ADL (1, 1) model in (6.26) one can invert the autoregressive form as follows: Yt = α(1 + λ + λ2 + ..) + (1 + λL + λ2 L2 + ..)(β 0 Xt + β 1 Xt−1 + ut )
(6.27)
provided λ < 1. This equation gives the effect of a unit change in Xt on future values of Yt . In fact, ∂Yt /∂Xt = β 0 while ∂Yt+1 /∂Xt = β 1 + λβ 0 , etc. This gives the immediate shortrun responses with the longrun effect being the sum of all these partial derivatives yielding (β 0 +β 1 )/(1−λ). This can be alternatively derived from (6.26) at the longrun static equilibrium (Y ∗ , X ∗ ) where Yt = Yt−1 = Y ∗ , Xt = Xt−1 = X ∗ and the disturbance is set equal to zero, i.e., Y∗ =
β + β1 ∗ α + 0 X 1−λ 1−λ
Replacing Yt by Yt−1 + ΔYt and Xt by Xt−1 + ΔXt in (6.26) one gets ΔYt = α + β 0 ΔXt − (1 − λ)Yt−1 + (β 0 + β 1 )Xt−1 + ut
(6.28)
142
Chapter 6: Distributed Lags and Dynamic Models
This can be rewritten as
ΔYt = β 0 ΔXt − (1 − λ) Yt−1 −
β + β1 α − 0 Xt−1 + ut 1−λ 1−λ
(6.29)
Note that the term in brackets contains the longrun equilibrium parameters derived in (6.28). In fact, the term in brackets represents the deviation of Yt−1 from the longrun equilibrium term corresponding to Xt−1 . Equation (6.29) is known as the Error Correction Model (ECM), see Davidson, Hendry, Srba and Yeo (1978). Yt is obtained from Yt−1 by adding the shortrun effect of the change in Xt and a longrun equilibrium adjustment term. Since, the disturbances are White noise, this model is estimated by OLS.
Note 1. Other distributions besides the geometric distribution can be considered. In fact, a Pascal distribution was considered by Solow (1960), a rationallag distribution was considered by Jorgenson (1966), and a Gamma distribution was considered by Schmidt (1974, 1975). See Maddala (1977) for an excellent review.
Problems 1. Consider the ConsumptionIncome data given in Table 5.1 and provided on the Springer web site as CONSUMP.DAT. Estimate a ConsumptionIncome regression in logs that allows for a six year lag on income as follows: (a) Use the linear arithmetic lag given in equation (6.2). Show that this result can also be obtained as an Almon lag ﬁrstdegree polynomial with a far end point constraint. (b) Use an Almon lag seconddegree polynomial, described in equation (6.4), imposing the near end point constraint. (c) Use an Almon lag seconddegree polynomial imposing the far end point constraint. (d) Use an Almon lag seconddegree polynomial imposing both end point constraints. (e) Using Chow’s F statistic, test the arithmetic lag restrictions given in part (a). (f) Using Chow’s F statistic, test the Almon lag restrictions implied by the model in part (b). (g) Repeat part (f) for the restrictions imposed in parts (c) and (d). 2. Consider ﬁtting an Almon lag third degree polynomial β i = a0 +a1 i+a2 i2 +a3 i3 for i = 0, 1, . . . , 5, on the ConsumptionIncome relationship in logarithms. In this case, there are ﬁve lags on income, i.e., s = 5. (a) Set up the estimating equation for the ai ’s and report the estimates using OLS. ) to the (b) What is your estimate of β 3 ? What is the standard error? Can you relate the var(β 3 variances and covariances of the ai ’s?
(c) How would the OLS regression in part (a) change if we impose the near end point constraint β −1 = 0?
(d) Test the near end point constraint.
Problems
143
(e) Test the Almon lag speciﬁcation given in part (a) against an unrestricted ﬁve year lag speciﬁcation on income. 3. For the simple dynamic model with AR(1) disturbances given in (6.18), 2 (a) Verify that plim(β OLS −β) = ρ(1−β )/(1+ρβ). Hint: From (6.18), Yt−1 = βYt−2 +ν t−1 and ρYt−1 = ρβYt−2 + ρν t−1 . Subtracting this last equation from (6.18) and rearranging terms, T one gets Yt = (β + ρ)Yt−1 − ρβYt−2 + ǫt . Multiply both sides by Yt−1 and sum t=2 Yt Yt−1 = T 2 2 (β + ρ) t=2 Yt−1 − ρβ Tt=2 Yt−1 Yt−2 + Tt=2 Yt−1 ǫt . Now divide by Tt=2 Yt−1 and take probability limits. See Griliches (1961). (b) For various values of ρ < 1 and β < 1, tabulate the asymptotic bias computed in part (a). (c) Verify that plim( ρ − ρ) = −ρ(1 − β 2 )/(1 + ρβ) = −plim(β − β). OLS
(d) Using part (c), show that plim d = 2(1− plim ρ) = 2[1 − ν t−1 )2 / Tt=1 ν2t denotes the DurbinWatson statistic.
βρ(β + ρ) ] where d = Tt=2 ( νt − 1 + βρ
T (e) Knowing the true disturbances, the DurbinWatson statistic would be d∗ = t=2 (ν t − T ν t−1 )2 / t=1 ν 2t and its plim d∗ = 2(1 − ρ). Using part (d), show that plim (d − d∗ ) = 2ρ(1 − β 2 ) = 2plim(β OLS − β) obtained in part (a). See Nerlove and Wallis (1966). For 1 + βρ various values of ρ < 1 and β < 1, tabulate d∗ and d and the asymptotic bias in part (d).
4. For the simple dynamic model given in (6.18), let the disturbances follow an MA(1) process ν t = ǫt + θǫt−1 with ǫt ∼ IIN(0, σ 2ǫ ). δ(1 − β 2 ) where δ = θ/(1 + θ2 ). 1 + 2βδ (b) Tabulate this asymptotic bias for various values of β < 1 and 0 < θ < 1. 1 T ν2 ) = σ 2ǫ [1 + θ(θ − θ∗ )] where θ∗ = δ(1 − β 2 )/(1 + 2βδ) and νt = (c) Show that plim( T t=2 t Yt − β OLS Yt−1 . (a) Show that plim(β OLS − β) =
5. Consider the lagged dependent variable model given in (6.20). Using the ConsumptionIncome data from the Economic Report of the President over the period 19501993 which is given in Table 5.1. (a) Test for ﬁrstorder serial correlation in the disturbances using Durbin’s h given in (6.19). (b) Test for ﬁrstorder serial correlation in the disturbances using the Breusch (1978) and Godfrey (1978) test. (c) Test for secondorder serial correlation in the disturbances. 6. Using the U.S. gasoline data in Chapter 4, problem 15 given in Table 4.2 and obtained from the USGAS.ASC ﬁle, estimate the following two models: RGN P CAR QM G = γ 1 + γ 2 log + γ 3 log Static: log CAR t P OP t P OP t PMG + ǫt +γ 4 log P GN P t Dynamic: log
QM G CAR
RGN P CAR = γ 1 + γ 2 log + γ3 P OP t t P OP t QM G PMG + λlog + ǫt +γ 4 log P GN P t CAR t−1
144
Chapter 6: Distributed Lags and Dynamic Models (a) Compare the implied shortrun and longrun elasticities for price (P M G) and income (RGN P ). (b) Compute the elasticities after 3, 5 and 7 years. Do these lags seem plausible? (c) Can you apply the DurbinWatson test for serial correlation to the dynamic version of this model? Perform Durbin’s htest for the dynamic gasoline model. Also, the BreuschGodfrey test for ﬁrstorder serial correlation.
7. Using the U.S. gasoline data in Chapter 4, problem 15, given in Table 4.2 estimate the following model with a six year lag on prices: log
QM G CAR
RGN P CAR + γ 3 log P OP t P OP t t PMG +γ 4 6i=0 wi log P GN P t−i
= γ 1 + γ 2 log
(a) Report the unrestricted OLS estimates. (b) Now, estimate a second degree polynomial lag for the same model. Compare the results with part (a) and explain why you got such different results. (c) Reestimate part (b) comparing the six year lag to a four year, and eight year lag. Which one would you pick? (d) For the six year lag model, does a third degree polynomial give a better ﬁt? (e) For the model outlined in part (b), reestimate with a far end point constraint. Now, reestimate with only a near end point constraint. Are such restrictions justiﬁed in this case?
References This chapter is based on the material in Maddala (1977), Johnston (1984), Kelejian and Oates (1989) and Davidson and MacKinnon (1993). Additional references on the material in this chapter include: Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood Principle,” in B. Petrov and F. Csake, eds. 2nd. International Symposium on Information Theory, Budapest: Akademiai Kiado. Almon, S. (1965), “The Distributed Lag Between Capital Appropriations and Net Expenditures,” Econometrica, 30: 407423. Breusch, T.S. (1978), “Testing for Autocorrelation in Dynamic Linear Models,” Australian Economic Papers, 17: 334355. Davidson, J.E.H., D.F. Hendry, F. Srba and S. Yeo (1978), “Econometric Modelling of the Aggregate TimeSeries Relationship Between Consumers’ Expenditure and Income in the United Kingdom,” Economic Journal, 88: 661692. Dhrymes, P.J. (1971), Distributed Lags: Problems of Estimation and Formulation (HoldenDay: San Francisco). Durbin, J. (1970), “Testing for Serial Correlation in Least Squares Regression when Some of the Regressors are Lagged Dependent Variables,” Econometrica, 38: 410421. Fomby, T.B. and D.K. Guilkey (1983), “An Examination of TwoStep Estimators for Models with Lagged Dependent and Autocorrelated Errors,” Journal of Econometrics, 22: 291300.
References
145
Gilbert, C.L. (1986), “Professor Hendry’s Econometric Methodology,” Oxford Bulletin of Economics and Statistics, 48: 283307. Godfrey, L.G. (1978), “Testing Against General Autoregressive and Moving Average Error Models when the Regressors Include Lagged Dependent Variables,” Econometrica, 46: 12931302. Griliches, Z. (1961), “A Note on Serial Correlation Bias in Estimates of Distributed Lags,” Econometrica, 29: 6573. Hatanaka, M. (1974), “An Eﬃcient TwoStep Estimator for the Dynamic Adjustment Model with Autocorrelated Errors,” Journal of Econometrics, 2: 199220. Jorgenson, D.W. (1966), “Rational Distributed Lag Functions,” Econometrica, 34: 135149. Kiviet, J.F. (1986), “On The Vigor of Some Misspeciﬁcation Tests for Modelling Dynamic Relationships,” Review of Economic Studies, 53: 241262. Klein, L.R. (1958), “The Estimation of Distributed Lags,” Econometrica, 26: 553565. Koyck, L.M. (1954), Distributed Lags and Investment Analysis (NorthHolland: Amsterdam). Maddala, G.S. and A.S. Rao (1971), “Maximum Likelihood Estimation of Solow’s and Jorgenson’s Distributed Lag Models,” Review of Economics and Statistics, 53: 8088. Nerlove, M. and K.F. Wallis (1967), “Use of the DurbinWatson Statistic in Inappropriate Situations,” Econometrica, 34: 235238. Schwarz, G. (1978), “Estimating the Dimension of a Model,” Annals of Statistics, 6: 461464. Schmidt, P. (1974), “An Argument for the Usefulness of the Gamma Distributed Lag Model,” International Economic Review, 15: 246250. Schmidt, P. (1975), “The Small Sample Effects of Various Treatments of Truncation Remainders on the Estimation of Distributed Lag Models,” Review of Economics and Statistics, 57: 387389. Schmidt, P. and R. N. Waud (1973), “The Almon lag Technique and the Monetary versus Fiscal Policy Debate,” Journal of the American Statistical Association, 68: 1119. Solow, R.M. (1960), “On a Family of Lag Distributions,” Econometrica, 28: 393406. Wallace, T.D. (1972), “Weaker Criteria and Tests for Linear Restrictions in Regression,” Econometrica, 40: 689698. Wallis, K.F. (1967), “Lagged Dependent Variables and Serially Correlated Errors: A Reappraisal of ThreePass Least Squares, ” Review of Economics and Statistics, 49: 555567. Zellner, A. and M.Geisel (1970), “Analysis of Distributed Lag Models with Application to Consumption Function Estimation,” Econometrica, 38: 865888.
Part II
CHAPTER 7
The General Linear Model: The Basics 7.1
Introduction
Consider the following regression equation y = Xβ + u where
⎤ ⎡ Y1 X11 X12 . . . X1k ⎢ X21 X22 . . . X2k ⎢ Y2 ⎥ ⎥ ⎢ y=⎢ ⎣ : ⎦;X = ⎣ : : : : Yn Xn1 Xn2 . . . Xnk ⎡
(7.1) ⎤
⎡
⎤ ⎡ β1 ⎥ ⎢ ⎥ ⎢ ⎥ ; β = ⎢ β2 ⎥ ; u = ⎢ ⎦ ⎣ : ⎦ ⎣ βk
⎤ u1 u2 ⎥ ⎥ : ⎦ un
with n denoting the number of observations and k the number of variables in the regression, with n > k. In this case, y is a column vector of dimension (n×1) and X is a matrix of dimension (n × k). Each column of X denotes a variable and each row of X denotes an observation on these variables. If y is log(wage) as in the empirical example in Chapter 4, see Table 4.1 then the columns of X contain a column of ones for the constant (usually the ﬁrst column), weeks worked, years of full time experience, years of education, sex, race, marital status, etc.
7.2
Least Squares Estimation
Least squares minimizes the residual sum of squares where the residuals are given by e = y −X β denotes a guess on the regression parameters β. The residual sum of squares and β RSS = ni=1 e2i = e′ e = (y − Xβ)′ (y − Xβ) = y ′ y − y ′ Xβ − β ′ X ′ y + β ′ X ′ Xβ
The last four terms are scalars as can be veriﬁed by their dimensions. It is essential that the reader keep track of the dimensions of the matrices used. This will insure proper multiplication, addition, subtraction of matrices and help the reader obtain the right answers. In fact the middle two terms are the same because the transpose of a scalar is a scalar. For a quick review of some matrix properties, see the Appendix to this chapter. Differentiating the RSS with respect to β one gets ∂RSS/∂β = −2X ′ y + 2X ′ Xβ
(7.2)
where use is made of the following two rules of differentiating matrices. The ﬁrst is that ∂a′ b/∂b = a and the second is ∂(b′ Ab)/∂b = (A + A′ )b = 2Ab where the last equality holds if A is a symmetric matrix. In the RSS equation a is y ′ X and A is X ′X. The firstorder condition for minimization equates the expression in (7.2) to zero. This yields X ′ Xβ = X ′ y
(7.3)
150
Chapter 7: The General Linear Model: The Basics
which is known as the OLS normal equations. As long as X is of full column rank , i.e., of rank k, ′ −1 ′ then X ′ X is nonsingular and the solution to the above equations is β OLS = (X X) X y. Full column rank means that no column of X is a perfect linear combination of the other columns. In other words, no variable in the regression can be obtained from a linear combination of the other variables. Otherwise, at least one of the OLS normal equations becomes redundant. This means that we only have (k − 1) linearly independent equations to solve for k unknown β’s. ′ ′ This yields no solution for β OLS and we say that X X is singular. X X is the sum of squares cross product matrix (SSCP). If it has a column of ones then it will contain the sums, the sum of squares, and the crossproduct sum between any two variables ⎡ n ⎤ . . . ni=1 Xik n n ni=1 Xi2 n 2 ⎢ ⎥ ... i=1 Xi2 i=1 Xi2 i=1 Xi2 Xik ⎥ X ′X = ⎢ ⎣ ⎦ : : : n n n 2 i=1 Xik i=1 Xik Xi2 . . . i=1 Xik
′ Of course y could be added to this matrix as another variable which will generate X yn and ′ yy automatically pertaining to the variable y will generate i=1 yi , nfor us, i.e., thecolumn n n 2 . To see this, let y X y , and X y , . . . , i=1 i i=1 ik i i=1 i1 i
′ yy y′X ′ Z = [y, X] then Z Z = X ′y X ′X
This matrix summarizes the data and we can compute any regression of one variable in Z on any subset of the remaining variables in Z using only Z ′ Z. Denoting the least squares residuals by e = y − X β OLS , the OLS normal equations given in (7.3) can be written as ′ X ′ (y − X β OLS ) = X e = 0
(7.4)
Note that if the regression includes a constant, the ﬁrst column of X will be a vector of ones and the ﬁrst equation of (7.4) becomes ni=1 ei = 0. This proves the well known result that if there is a constant in the regression, the OLS residuals sum to zero. Equation (7.4) also indicates that the regressor matrix X is orthogonal to the residuals vector e. This will become clear when we deﬁne e in terms of the orthogonal projection matrix on X. This representation allows another interpretation of OLS as a method of moments estimator which was considered in Chapter 2. This follows from the classical assumptions where X satisﬁes E(X ′ u) = 0. The sample counterpart of this condition yields X ′ e/n = 0. These are the OLS normal equations and therefore, yield the OLS estimates without minimizing the residual sums of squares. See Kelejian and Oates (1989) and the discussion of instrumental variable estimation in Chapter 11. Since data in economics are not generated using experiments like the physical sciences, the X’s are stochastic and we only observe one realization of this data. Consider for example, annual observations for GNP, money supply, unemployment rate, etc. One cannot repeat draws for this data in the real world or ﬁx the X’s to generate new y’s (unless one is performing a Monte Carlo study). So we have to condition on the set of X’s observed, see Chapter 5. Classical Assumptions: u ∼ (0, σ 2 In ) which means that (i) each disturbance ui has zero mean, (ii) constant variance, and (iii) ui and uj for i = j are not correlated. The u’s are known as spherical disturbances. Also, (iv) the conditional expectation of u given X is zero, E(u/X) = 0. Note that the conditioning here is with respect to every regressor in X and for all observations
7.2
Least Squares Estimation
151
i = 1, 2, . . . n. In other words, it is conditional on all the elements of the matrix X. Using (7.1), this implies that E(y/X) = Xβ is linear in β, var(ui /X) = σ 2 and cov(ui , uj /X) = 0. Additionally, we assume that plim X ′ X/n is ﬁnite and positive deﬁnite and plim X ′ u/n = 0 as n → ∞. Given these classical assumptions, and conditioning on the X’s observed, it is easy to show that β OLS is unbiased for β. In fact using (7.1) one can write OLS = β + (X ′ X)−1 X ′ u β
(7.5)
Taking expectations, conditioning on the X’s, and using assumptions (i) and (iv), one attains the unbiasedness result. Furthermore, one can derive the variancecovariance matrix of β OLS from (7.5) since ′ ′ −1 ′ ′ ′ −1 var(β = σ 2 (X ′ X)−1 (7.6) OLS ) = E(β OLS − β)(β OLS − β) = E(X X) X uu X(X X)
this uses assumption (iv) along with the fact that E(uu′ ) = σ 2 In . This variancecovariance ’s across the diagonal and the pairwise matrix is (k × k) and gives the variances of the β i off the diagonal. The next theorem shows that among all linear and β covariances of say β i j unbiased estimators of c′ β, it is c′ β OLS which has the smallest variance. This is known as the GaussMarkov Theorem. Theorem 1: Consider the linear estimator a′ y for c′ β, where both a and c are arbitrary vectors of constants. If a′ y is unbiased for c′ β then var(a′ y) ≥ var(c′ β OLS ).
Proof: For a′ y to be unbiased for c′ β it must follow from (7.1) that E(a′ y) = a′ Xβ + E(a′ u) = a′ Xβ = c′ β which means that a′ X = c′ . Also, var(a′ y) = E(a′ y −c′ β)(a′ y −c′ β)′ = E(a′ uu′ a) = ′ ′ 2 ′ σ 2 a′ a. Comparing this variance with that of c′ β OLS , one gets var(a y)− var(c β OLS ) = σ a a − 2 ′ ′ −1 ′ ′ 2 ′ ′ 2 σ c (X X) c. But, c = a X, therefore this difference becomes σ [a a − a PX a] = σ a′ P¯X a where PX is a projection matrix on the Xplane deﬁned as X(X ′ X)−1 X ′ and P¯X is deﬁned as In − PX . In fact, PX y = X β and P¯X y = y − PX y = y − y = e. So that y projects OLS = y the vector y on the Xplane and e is the projection of y on the plane orthogonal to X or perpendicular to X, see Figure 7.1. Both PX and P¯X are idempotent which means that the above difference σ 2 aP¯X a is greater or equal to zero since P¯X is positive semideﬁnite. To see this, deﬁne z = P¯X a, then the above difference is equal to σ 2 z ′ z ≥ 0. The implications of the theorem are important. It means for example, that for the choice of c′ = (1, 0, . . . , 0) one can pick β 1 = c′ β for which the best linear unbiased estimator would 1,OLS = c′ β OLS . Similarly any β j can be chosen by using c′ = (0, . . . , 1, . . . , 0) which has be β ′ 1 in the jth position and zero elsewhere. Again, the BLUE of β j = c′ β is β j,OLS = c β OLS . k Furthermore, any linear combination of these β’s such as their sum j=1 β j which corresponds as its BLUE. to c′ = (1, 1, . . . , 1) has the sum k β j=1
j,OLS
The disturbance variance σ 2 is unknown and has to be estimated. Note that E(u′ u) = E(tr(uu′ )) = tr(E(uu′ )) = tr(σ 2 In ) = nσ 2 , so that u′ u/n seems like a natural unbiased estimator for σ 2 . However, u is not observed and is estimated by the OLS residuals e. It is therefore, natural to investigate E(e′ e). In what follows, we show that s2 = e′ e/(n − k) is an unbiased estimator for σ 2 . To prove this, we need the fact that ′ −1 ′ ¯ ¯ e = y − Xβ OLS = y − X(X X) X y = PX y = PX u
(7.7)
152
Chapter 7: The General Linear Model: The Basics
Pxy = e
y
yˆ = P x y
x
Figure 7.1 The Orthogonal Decomposition of y
where the last equality follows from the fact that P¯X X = 0. Hence, E(e′ e) = E(u′ P¯X u) = E(tr{u′ P¯X u}) = E(tr{uu′ P¯X }) = tr(σ 2 P¯X ) = σ 2 tr(P¯X ) = σ 2 (n − k)
where the second equality follows from the fact that the trace of a scalar is a scalar. The third equality from the fact that tr(ABC) = tr(CAB). The fourth equality from the fact that E(trace) = trace{E(.)}, and E(uu′ ) = σ 2 In . The last equality from the fact that tr(P¯X ) = tr(In ) − tr(PX ) = n − tr(X(X ′ X)−1 X ′ )
= n − tr(X ′ X(X ′ X)−1 ) = n − tr(Ik ) = n − k. 2 ′ −1 is given by s2 (X ′ X)−1 . Hence, an unbiased estimator of var(β OLS ) = σ (X X) So far we have shown that β OLS is BLUE. It can also be shown that it is consistent for β. In fact, taking probability limits of (7.5) as n → ∞, one gets ′ −1 ′ plim(β OLS ) = plim(β) + plim(X X/n) (X u/n) = β
The ﬁrst equality uses the fact that the plim of a sum is the sum of the plims. The second equality follows from assumption 1 and the fact that plim of a product is the product of plims.
7.3
Partitioned Regression and the FrischWaughLovell Theorem
In Chapter 4, we studied a useful property of least squares which allows us to interpret multiple regression coeﬃcients as simple regression coeﬃcients. This was called the residualing interpretation of multiple regression coeﬃcients. In general, this property applies whenever the k regressors given by X can be separated into two sets of variables X1 and X2 of dimension (n×k1 ) and (n × k2 ) respectively, with X = [X1 , X2 ] and k = k1 + k2 . The regression in equation (7.1) becomes a partitioned regression given by y = Xβ + u = X1 β 1 + X2 β 2 + u
(7.8)
7.3
Partitioned Regression and the FrischWaughLovell Theorem
153
One may be interested in the least squares estimates of β 2 corresponding to X2 , but one has to control for the presence of X1 which may include seasonal dummy variables or a time trend, see Frisch and Waugh (1933) and Lovell (1963)1 . The OLS normal equations from (7.8) are as follows:
′ β X1′ y X1 X1 X1′ X2 1,OLS = (7.9) X2′ X1 X2′ X2 X2′ y β 2,OLS
These can be solved by partitioned inversion of the matrix on the left, see the Appendix to this chapter, or by solving two equations in two unknowns. Problem 2 asks the reader to verify that −1 ′ ¯ ′ ¯ β 2,OLS = (X2 PX1 X2 ) X2 PX1 y
(7.10)
′ −1 ′ β 2,OLS = (X2 X2 ) X2 y
(7.11)
P¯X1 y = P¯X1 X2 β 2 + P¯X1 u
(7.12)
where P¯X1 = In − PX1 and PX1 = X1 (X1′ X1 )−1 X1′ . P¯X1 is the orthogonal projection matrix of X1 and P¯X1 X2 generates the least squares residuals of each column of X2 regressed on all the 2 = P¯X X2 and y = P¯X y, then (7.10) can be written variables in X1 . In fact, if we denote by X 1 1 as using the fact that P¯X1 is idempotent. This implies that β 2,OLS can be obtained from the regression of y on X2 . In words, the residuals from regressing y on X1 are in turn regressed upon the residuals from each column of X2 regressed on all the variables in X1 . This was illustrated in Chapter 4 with some examples. Following Davidson and MacKinnon (1993) we denote this result more formally as the FrischWaughLovell (FWL) Theorem. In fact, if we premultiply (7.8) by P¯X1 and use the fact that P¯X1 X1 = 0, one gets
The FWL Theorem states that: (1) The least squares estimates of β 2 from equations (7.8) and (7.12) are numerically identical and (2) The least squares residuals from equations (7.8) and (7.12) are identical. ¯ X is idempotent, it immediately follows that, OLS on (7.12) yields β Using the fact that P 2,OLS 1 as given by equation (7.10). Alternatively, one can start from equation (7.8) and use the result that ¯ ¯ y = PX y + P¯X y = X β OLS + PX y = X1 β 1,OLS + X2 β 2,OLS + PX y
(7.13)
′ ¯ ¯ X2′ P¯X1 y = X2′ P¯X1 X2 β 2,OLS + X2 PX1 PX y
(7.14)
where PX = X(X ′ X)−1 X ′ and P¯X = In − PX . Premultiplying equation (7.13) by X2′ P¯X1 and using the fact that P¯X1 X1 = 0, one gets
But, PX1 PX = PX1 . Hence, P¯X1 P¯X = P¯X . Using this fact along with P¯X X = P¯X [X1 , X2 ] = 0, the last term of equation (7.14) drops out yielding the result that β 2,OLS from (7.14) is identical to the expression in (7.10). Note that no partitioned inversion was used in this proof. This proves part (1) of the FWL Theorem.
154
Chapter 7: The General Linear Model: The Basics
Also, premultiplying equation (7.13) by P¯X1 and using the fact that P¯X1 P¯X = P¯X , one gets ¯ P¯X1 y = P¯X1 X2 β 2,OLS + PX y
(7.15)
Now β 2,OLS was shown to be numerically identical to the least squares estimate obtained from equation (7.12). Hence, the ﬁrst term on the right hand side of equation (7.15) must be the ﬁtted values from equation (7.12). Since the dependent variables are the same in equations (7.15) and (7.12), P¯X y in equation (7.15) must be the least squares residuals from regression (7.12). But, P¯X y is the least squares residuals from regression (7.8). Hence, the least squares residuals from regressions (7.8) and (7.12) are numerically identical. This proves part (2) of the FWL Theorem. Several applications of the FWL Theorem will be given in this book. Problem 2 shows that if X1 is the vector of ones indicating the presence of a constant in the regression, then regression (7.15) is equivalent to running (yi − y¯) on the set of variables in X2 expressed as deviations from their respective sample means. Problem 3 shows that the FWL Theorem can be used to prove that including a dummy variable for one of the observations in the regression is equivalent to omitting that observation from the regression.
7.4
Maximum Likelihood Estimation
In Chapter 2, we introduced the method of maximum likelihood estimation which is based on specifying the distribution we are sampling from and writing the joint density of our sample. This joint density is then referred to as the likelihood function because it gives for a given set of parameters specifying the distribution, the probability of obtaining the observed sample. See Chapter 2 for several examples. For the regression equation, specifying the distribution of the disturbances in turn speciﬁes the likelihood function. These disturbances could be Poisson, Exponential, Normal, etc. Once this distribution is chosen, the likelihood function is maximized and the MLE of the regression parameters are obtained. Maximum likelihood estimators are desirable because they are (1) consistent under fairly general conditions,2 (2) asymptotically normal, (3) asymptotically eﬃcient and (4) invariant to reparameterizations of the model3 . Some of the undesirable properties of MLE are that (1) it requires explicit distributional assumptions on the disturbances, and (2) their ﬁnite sample properties can be quite different from their asymptotic properties. For example, MLE can be biased even though they are consistent, and their covariance estimates can be misleading for small samples. In this section, we derive the MLE under normality of the disturbances. The Normality Assumption: u ∼ N (0, σ 2 In ). This additional assumption allows us to derive distributions of estimators and other random variables. This is important for constructing conﬁdence intervals and tests of hypotheses. In fact using (7.5) one can easily see that β OLS is a linear combination of the u’s. But, a linear combination of normal random variables is itself OLS is N (β, σ 2 (X ′ X)−1 ). Similarly y is N (Xβ, σ 2 In ) and a normal random variable. Hence, β 2 ¯ e is N (0, σ PX ). Moreover, we can write the joint probability density function of the u’s as f (u1 , u2 , . . . , un ; β, σ 2 ) = (1/2πσ 2 )n/2 exp(−u′ u/2σ 2 ). To get the likelihood function we make the transformation u = y − Xβ and note that the Jacobian of the transformation is one. Hence f (y1 , y2 , . . . , yn ; β, σ 2 ) = (1/2πσ 2 )n/2 exp{−(y − Xβ)′ (y − Xβ)/2σ 2 }
(7.16)
7.4
Maximum Likelihood Estimation
155
Taking the log of this likelihood, we get logL(β, σ 2 ) = −(n/2)log(2πσ 2 ) − (y − Xβ)′ (y − Xβ)/2σ 2
(7.17)
Maximizing this likelihood with respect to β and σ 2 one gets the maximum likelihood estimators (MLE). Let θ = σ 2 and Q = (y − Xβ)′ (y − Xβ), then 2X ′ y − 2X ′ Xβ ∂logL(β, θ) = ∂β 2θ Q ∂logL(β, θ) n = − 2 ∂θ 2θ 2θ Setting these ﬁrstorder conditions equal to zero, one gets β M LE = β OLS
and θ=σ 2M LE = Q/n = RSS/n = e′ e/n.
Intuitively, only the second term in the log likelihood contains β and that term (without the negative sign) has already been minimized with respect to β in (7.2) giving us the OLS estimator. M LE is unbiased Note that σ 2M LE differs from s2 only in the degrees of freedom. It is clear that β for β while σ 2M LE is not unbiased for σ 2 . Substituting these MLE’s into (7.17) one gets the maximum value of logL which is 2M LE ) = −(n/2)log(2π σ 2M LE ) − e′ e/2 σ 2M LE logL(β M LE , σ
= −(n/2)log(2π) − (n/2)log(e′ e/n) − n/2
= constant − (n/2)log(e′ e).
In order to get the Cram´erRao lower bound for the unbiased estimators of β and σ 2 one ﬁrst computes the information matrix
2 ∂ logL/∂β∂β ′ ∂ 2 logL/∂β∂σ 2 2 (7.18) I(β, σ ) = −E ∂ 2 logL/∂σ 2 ∂β ′ ∂ 2 logL/∂σ 2 ∂σ 2 Recall, that θ = σ 2 and Q = (y − Xβ)′ (y − Xβ). It is easy to show (see problem 4) that ∂ 2 logL(β, θ) 1 ∂Q = 2 ∂β∂θ 2θ ∂β
and
∂ 2 logL(β, θ) −X ′ (y − Xβ) = ∂θ∂β θ2
Therefore, 2 ∂ logL(β, θ) −E(X ′ u) =0 E = ∂θ∂β θ2 Also ∂ 2 logL(β, θ) −X ′ X = θ ∂β∂β ′
and
−4Q 2n −Q n ∂ 2 logL(β, θ) = 2 3 + 2 = 3 + ∂θ 4θ 4θ θ 2θ 2
so that E
∂ 2 logL(β, θ) ∂θ 2
=
−nθ n −2n + n −n = 2 3 + 2 = 2 θ 2θ 2θ 2θ
156
Chapter 7: The General Linear Model: The Basics
using the fact that E(Q) = nσ 2 = nθ. Hence, I(β, σ 2 ) =
X ′ X/σ 2 0 0 n/2σ 4
(7.19)
The information matrix is blockdiagonal between β and σ 2 . This is an important property for regression models with normal disturbances. It implies that the Cram´erRao lower bound is I −1 (β, σ 2 ) =
σ 2 (X ′ X)−1 0 0 2σ 4 /n
(7.20)
M LE = β OLS attains the Cram´erRao lower bound. Under normality, β OLS is Note that β MVU (minimum variance unbiased). This is best among all unbiased estimators not only linear unbiased estimators. By assuming more (in this case normality) we get more (MVU rather than BLUE)4 . Problem 5 derives the variance of s2 under normality of the disturbances. This is found to be 2σ 4 /(n − k). This means that s2 does not attain the Cram´erRao lower bound. However, 2 following the theory of complete suﬃcient statistics one can show that both β OLS and s are MVU for their respective parameters and therefore both are small sample eﬃcient. Note also that σ 2M LE is biased, therefore it is not meaningful to compare its variance to the Cram´erRao lower bound. There is a tradeoff between bias and variance in estimating σ 2 . Problem 6 looks at all estimators of σ 2 of the type e′ e/r and derives r such that the mean squared error (MSE) is minimized. The choice of r turns out to be (n − k + 2). OLS , now we derive the distribution of s2 . In order to do that We found the distribution of β we need a result from matrix algebra, which is stated without proof, see Graybill (1961). Lemma 1: For every symmetric idempotent matrix A of rank r, there exists an orthogonal matrix P such that P ′ AP = Jr where Jr is a diagonal matrix with the ﬁrst r elements equal to one and the rest equal to zero. We use this lemma to show that the RSS/σ 2 is a chisquared with (n − k) degrees of freedom. To see this note that e′ e/σ 2 = u′ P¯X u/σ 2 and that P¯X is symmetric and idempotent of rank (n − k). Using the lemma there exists a matrix P such that P ′ P¯X P = Jn−k is a diagonal matrix with the ﬁrst (n − k) elements on the diagonal equal to 1 and the last k elements equal to zero. Now make the change of variable v = P ′ u. This makes v ∼ N (0, σ 2 In ) since the v’s are linear combinations of the u’s and P ′ P = In . Replacing u by v in RSS/σ 2 we get 2 2 v ′ P P¯X P v/σ 2 = v ′ Jn−k v/σ 2 = n−k i=1 vi /σ
where the last sum is only over i = 1, 2, . . . , n − k. But, the v’s are independent identically distributed N (0, σ 2 ), hence vi2 /σ 2 is the square of a standardized N (0, 1) random variable which is distributed as a χ21 . Moreover, the sum of independent χ2 random variables is a χ2 random variable with degrees of freedom equal to the sum of the respective degrees of freedom. Hence, RSS/σ 2 is distributed as χ2n−k . The beauty of the above result is that it applies to all quadratic forms u′ Au where A is symmetric and idempotent. We will use this result again in the test of hypotheses section.
7.5
7.5
Prediction
157
Prediction
Let us now predict To periods ahead. Those new observations are assumed to satisfy (7.1). In other words yo = Xo β + uo
(7.21)
What is the Best Linear Unbiased Predictor (BLUP) of E(yo )? From (7.21), E(yo ) = Xo β which OLS is BLUE for is a linear combination of the β’s. Using the GaussMarkov result yo = Xo β ′ 2 Xo β and the variance of this predictor of E(yo ) is Xo var(β OLS )Xo = σ Xo (X ′ X)−1 Xo′ . But, what if we are interested in the predictor for yo ? The best predictor of uo is zero, so the predictor for yo is still yo but its MSE is ′ E( yo − yo )( yo − yo )′ = E{Xo (β OLS − β) − uo }{Xo (β OLS − β) − uo } ′ 2 = Xo var(β OLS )Xo + σ ITo − 2cov{Xo (β OLS − β), uo } = σ 2 Xo (X ′ X)−1 Xo′ + σ 2 ITo
(7.22)
′ −1 ′ the last equality follows from the fact that (β OLS − β) = (X X) X u and uo have zero covariance. The latter holds because uo and u have zero covariance. Intuitively this says that the future To disturbances are not correlated with the current sample disturbances. Therefore, the predictor of the average consumption of a $20, 000 income household is the same as the predictor of consumption of a speciﬁc household whose income is $20, 000. The difference is not in the predictor itself but in the MSE attached to it. The latter MSE being larger. Salkever (1976) suggested a simple way to compute these forecasts and their standard errors. The basic idea is to augment the usual regression in (7.1) with a matrix of observationspeciﬁc dummies, i.e., a dummy variable for each period where we want to forecast:
u β X 0 y (7.23) + = uo γ Xo ITo yo or
y ∗ = X ∗ δ + u∗ ′
′
, γ ′ ).
(7.24) X∗
where δ = (β has in its second part a matrix of dummy variables one for each of the To periods for which we are forecasting. Since these To observations do not serve in the ′ ′ , γ′ ) where estimation, problem 7 asks the reader to verify that OLS on (7.23) yields δ = (β In other words, OLS on (7.23) yields the = (X ′ X)−1 X ′ y, β γ = yo − yo , and yo = Xo β. OLS estimate of β without the To observations, and the coeﬃcients of the To dummies are the forecast errors. This also means that the ﬁrst n residuals are the usual OLS residuals based on the ﬁrst n observations, whereas the next To residuals are all zero. e = y − Xβ Therefore, s∗2 = s2 = e′ e/(n − k), and the variance covariance matrix of δ is given by
(X ′ X)−1 (7.25) s2 (X ∗′ X ∗ )−1 = s2 [ITo + Xo (X ′ X)−1 Xo′ ] and the offdiagonal elements are of no interest. This means that the regression package gives and the estimated variance of the forecast error in one stroke. Note the estimated variance of β that if the forecasts rather than the forecast errors are needed, one can replace yo by zero, and as required. The variance ITo by −ITo in (7.23). The resulting estimate of γ will be yo = Xo β, of this forecast will be the same as that given in (7.25), see problem 7.
158
Chapter 7: The General Linear Model: The Basics
7.6
Conﬁdence Intervals and Test of Hypotheses
We start by constructing a conﬁdence interval for any linear combination of β, say c′ β. We ′ 2 ′ ′ −1 know that c′ β OLS ∼ N (c β, σ c (X X) c) and it is a scalar. Hence, ′ ′ ′ −1 1/2 zobs = (c′ β OLS − c β)/σ(c (X X) c)
(7.26)
OLS − c′ β)/s(c′ (X ′ X)−1 c)1/2 tobs = (c′ β
(7.27)
′ ′ −1 1/2 c′ β OLS ± tα/2 s(c (X X) c)
(7.28)
is a standardized N (0, 1) random variable. Replacing σ by s is equivalent to dividing zobs by the square root of a χ2 random variable divided by its degrees of freedom. The latter random variable is (n − k)s2 /σ 2 = RSS/σ 2 which was shown to be a χ2n−k . Problem 8 shows that zobs and RSS/σ 2 are independent. This means that
is a N (0, 1) random variable divided by the square root of an independent χ2n−k /(n − k). This is a tstatistic with (n − k) degrees of freedom. Hence, a 100(1 − α)% conﬁdence interval for c′ β is
Example: Let us say we are predicting one year ahead so that To = 1 and xo is a (1 × k) vector of next year’s observations on the exogenous variables. The 100(1 − α) conﬁdence interval for next year’s forecast of yo will be yo ± tα/2 s(1 + x′o (X ′ X)−1 xo )1/2 . Similarly (7.28) allows us to construct conﬁdence intervals or test any single hypothesis on any single β j (again by picking c to have 1 in its jth position and zero elsewhere). In this case we get the usual tstatistic reported in any regression package. More importantly, this allows us to test any hypothesis concerning any linear combination of the β’s, e.g., testing that the sum of coeﬃcients of input variables in a CobbDouglas production function is equal to one. This is known as a test for constant returns to scale, see Chapter 4.
7.7
Joint Conﬁdence Intervals and Test of Hypotheses
We have learned how to test any single hypothesis involving any linear combination of the β’s. But what if we are interested in testing two or three or more hypotheses involving linear combinations of the β’s. For example, testing that β 2 = β 4 = 0, i.e., that variables X2 and X4 are not signiﬁcant in the model. This can be written as c′2 β = c′4 β = 0 where c′j is a row vector of zeros with a one in the jth position. In order to test these two hypotheses simultaneously, we rearrange these restrictions on the β’s in matrix form Rβ = 0 where R′ = [c2 , c4 ]. In a similar fashion, we can rearrange g restrictions on the β’s into this matrix R which will now be of dimension (g × k). Also these restrictions need not be of the form Rβ = 0 and can be of the more general form Rβ = r where r is a (g × 1) vector of constants. For example, β 1 + β 2 = 1 and 3β 3 + 2β 4 = 5 are two such restrictions. Since Rβ is a collection of linear combinations 2 ′ −1 ′ of the β’s, the BLUE of these is Rβ OLS and the latter is distributed N (Rβ, σ R(X X) R ). ′ Standardization of the form encountered with the scalar c β gives us the following: ′ ′ −1 ′ −1 2 (Rβ OLS − Rβ) [R(X X) R ] (Rβ OLS − Rβ)/σ
(7.29)
7.8
Restricted MLE and Restricted Least Squares
159
rather than divide by the variance we multiply by its inverse, and since we divided by the variance rather than the standard deviation we square the numerator which means in vector form premultiplying by its transpose. Problem 9 replaces the matrix R by the vector c′ and shows that (7.29) reduces to the square of the zstatistic observed in (7.26). This also proves that the resulting statistic is distributed as χ21 . But, what is the distribution of (7.29)? The trick is to write it in terms of the original disturbances, i.e., (7.30) u′ X(X ′ X)−1 R′ [R(X ′ X)−1 R]−1 R(X ′ X)−1 X ′ u/σ 2 ′ −1 ′ where (Rβ OLS − Rβ) is replaced by R(X X) X u. Note that (7.30) is quadratic in the disturbances u of the form u′ Au/σ 2 . Problem 10 shows that A is symmetric and idempotent and of rank g. Applying the same proof as given below lemma 1 we get the result that (7.30) is distributed as χ2g . Again σ 2 is unobserved, so we divide by (n − k)s2 /σ 2 which is χ2n−k . This becomes a ratio of two χ2 ’s random variables. If we divide the numerator and denominator χ2 ’s by their respective degrees of freedom and prove that they are independent (see problem 11) the resulting statistic − r)′ [R(X ′ X)−1 R′ ]−1 (Rβ − r)/gs2 (7.31) (Rβ OLS
OLS
is distributed under the null Rβ = r as an F (g, n − k).
7.8
Restricted MLE and Restricted Least Squares
Maximizing the likelihood function given in (7.16) subject to Rβ = r is equivalent to minimizing the residual sum of squares subject to Rβ = r. Forming the Lagrangian function Ψ(β, μ) = (y − Xβ)′ (y − Xβ) + 2μ′ (Rβ − r)
(7.32)
∂Ψ(β, μ)/∂β = −2X ′ y + 2X ′ Xβ + 2R′ μ = 0
(7.33)
and differentiating with respect to β and μ one gets ∂Ψ(β, μ)/∂μ = 2(Rβ − r) = 0
Solving for μ, we premultiply (7.33) by μ = [R(X ′ X)−1 R′ ]−1 (Rβ − r)
R(X ′ X)−1
(7.34) and use (7.34) (7.35)
OLS
Substituting (7.35) in (7.33) we get ′ −1 ′ ′ −1 ′ −1 β RLS = β OLS − (X X) R [R(X X) R ] (Rβ OLS − r)
(7.36)
The restricted least squares estimator of β differs from that of the unrestricted OLS estimator by the second term in (7.36) with the term in parentheses showing the extent to which the unrestricted OLS estimator satisﬁes the constraint. Problem 12 shows that β RLS is biased unless the restriction Rβ = r is satisﬁed. However, its variance is always less than that of β OLS . This brings in the tradeoff between bias and variance and the MSE criteria which was discussed in Chapter 2. The Lagrange Multiplier estimator μ is distributed N (0, σ 2 [R(X ′ X)−1 R′ ]−1 ) under the null hypothesis. Therefore, to test μ = 0, we use μ/σ 2 = (Rβ − r)′ [R(X ′ X)−1 R′ ]−1 (Rβ − r)/σ 2 (7.37) μ ′ [R(X ′ X)−1 R′ ] OLS
OLS
Since μ measures the cost of imposing the restriction Rβ = r, it is no surprise that the right hand side of (7.37) was already encountered in (7.29) and is distributed as χ2g .
160
Chapter 7: The General Linear Model: The Basics
7.9
Likelihood Ratio, Wald and Lagrange Multiplier Tests
Before we go into the derivations of these three classical tests for the null hypothesis H0 ; Rβ = r, it is important for the reader to review the intuitive graphical explanation of these tests given in Chapter 2. The Likelihood Ratio test of H0 ; Rβ = r is based upon the ratio λ = maxℓr /maxℓu , where maxℓu and maxℓr are the maximum values of the unrestricted and restricted likelihoods, respectively. Let us assume for simplicity that σ 2 is known, then ′ 2 maxℓu = (1/2πσ 2 )n/2 exp{−(y − X β M LE ) (y − X β M LE )/2σ }
where β M LE = β OLS . Denoting the unrestricted residual sum of squares by URSS, we have maxℓu = (1/2πσ 2 )n/2 exp{−U RSS/2σ 2 }
Similarly, maxℓr is given by ′ 2 maxℓr = (1/2πσ 2 )n/2 exp{−(y − X β RM LE ) (y − X β RM LE )/2σ }
where β RM LE = β RLS . Denoting the restricted residual sum of squares by RRSS, we have maxℓr = (1/2πσ 2 )n/2 exp{−RRSS /2σ 2 }
Therefore, −2logλ = (RRSS − URSS )/σ 2 . Let us ﬁnd the relationship between these residual sums of squares. er = y − X β RLS = y − X β OLS − X(β RLS − β OLS ) = e − X(β RLS − β OLS ) ′ ′ e′r er = e′ e + (β RLS − β OLS ) X X(β RLS − β OLS )
(7.38)
where er denotes the restricted residuals and e′r er the RRSS. The crossproduct terms drop out RLS − β OLS ) from (7.36) into (7.38), we get: because X ′ e = 0. Substituting the value of (β ′ ′ −1 ′ −1 RRSS − URSS = (Rβ OLS − r) [R(X X) R ] (Rβ OLS − r)
(7.39)
M LE )I(β M LE )−1 R(β M LE )′ ]−1 r(β M LE ) M LE )′ [R(β W = r(β
(7.40)
M LE − r)′ [R(X ′ X)−1 R′ ]−1 (Rβ M LE − r)/σ 2 W = (Rβ
(7.41)
It is now clear that −2logλ is the right hand side of (7.39) divided by σ 2 . In fact, this Likelihood Ratio (LR) statistic is the same as that given in (7.37) and (7.29). Under the null hypothesis Rβ = r , this was shown to be a χ2g . The Wald test of Rβ = r is based upon the unrestricted estimator and the extent of which it satisﬁes the restriction. More formally, if r(β) = 0 denote the vector of g restrictions on β and ′ R(β M LE ) denotes the (g × k) matrix of partial derivatives ∂r(β)/∂β evaluated at β M LE , then the Wald statistic is given by where I(β) = −E(∂ 2 logL/∂β∂β ′ ). In this case, r(β) = Rβ − r, R(β M LE ) = R and I(β M LE ) = (X ′ X)/σ 2 as seen in (7.19). Therefore, which is the same as the LR statistic5 .
7.9
Likelihood Ratio, Wald and Lagrange Multiplier Tests
161
The Lagrange Multiplier test is based upon the restricted estimator. In section 7.8, we derived the restricted estimator and the estimated Lagrange Multiplier μ . The Lagrange Multiplier μ is the cost or shadow price of imposing the restrictions Rβ = r. If these restrictions are true, one would expect the estimated Lagrange Multiplier μ to have mean zero. Therefore, a test for the null hypothesis that μ = 0, is called the LM test and the corresponding test statistic is given in equation (7.37). Alternatively, one can derive the LM test as a score’ test based on the score or the ﬁrst derivative of the loglikelihood function i.e., S(β) = ∂logL/∂β. The score is zero for the unrestricted MLE, and the score test is based upon the departure of S(β), evaluated at the restricted estimator β RM LE , from zero. In this case, the score form of the LM statistic is given by ′ −1 LM = S(β RM LE ) I(β RM LE ) S(β RM LE )
(7.42)
For our model, S(β) = (X ′ y − X ′ Xβ)/σ 2 and from equation (7.36) we have
′ 2 S(β RM LE ) = X (y − X β RM LE )/σ OLS + R′ [R(X ′ X)−1 R′ ]−1 (Rβ OLS − r)}/σ 2 = {X ′ y − X ′ X β 2 = R′ [R(X ′ X)−1 R′ ]−1 (Rβ OLS − r)/σ
2 ′ −1 Using (7.20), one gets I −1 (β RM LE ) = σ (X X) . Therefore, the score form of the LM test becomes
LM
′ ′ −1 ′ −1 ′ −1 ′ ′ −1 ′ −1 2 = (Rβ OLS − r) [R(X X) R ] R(X X) R [R(X X) R ] (Rβ OLS − r)/σ ′ ′ −1 ′ −1 2 = (Rβ (7.43) OLS − r) [R(X X) R ] (Rβ OLS − r)/σ
This is numerically identical to the LM test derived in equation (7.37) and to the W and LR ′ /σ 2 from (7.35), so it is clear why the Score statistics derived above. Note that S(β RM LE ) = R μ and the Lagrangian Multiplier tests are identical. The score form of the LM test can also be obtained as a byproduct of an artiﬁcial regression. In fact, S(β) evaluated as β RM LE is given by ′ 2 S(β RM LE ) = X (y − X β RM LE )/σ
where y − X β RM LE is the vector of restricted residuals. If H0 is true, then this converges asymptotically to u and the asymptotic variance of the vector of scores becomes (X ′ X)/σ 2 . The score test is then based upon ′ ′ −1 ′ 2 (y − X β RM LE ) X(X X) X (y − X β RM LE )/σ
(7.44)
This expression is the explained sum of squares from the artiﬁcial regression of (y −X β RM LE )/σ on X. To see that this is exactly identical to the LM test in equation (7.37), recall from equation ′ on the left hand (7.33) that R′ μ = X ′ (y − X β RM LE ) and substituting this expression for R μ 2 side of equation (7.37) we get equation (7.44). In practice, σ is estimated by s2 the Mean Square Error of the restricted regression. This is an example of the GaussNewton Regression which will be discussed in Chapter 8. An alternative approach to testing H0 , is to estimate the restricted and unrestricted models and compute the following F statistic Fobs =
(RRSS − URSS )/g URSS /(n − k)
(7.45)
162
Chapter 7: The General Linear Model: The Basics
This statistic is known in the econometric literature as the Chow (1960) test and was encountered in Chapter 4. Note that from equation (7.39), if we divide the numerator by σ 2 we get a χ2g statistic divided by its degrees of freedom. Also, using the fact that (n − k)s2 /σ 2 is χ2n−k , the denominator divided by σ 2 is a χ2n−k statistic divided by its degrees of freedom. Problem 11 shows independence of the numerator and denominator and completes the proof that Fobs is distributed F (g, n − k) under H0 . Chow’s (1960) Test for Regression Stability Chow (1960) considered the problem of testing the equality of two sets of regression coeﬃcients y1 = X1 β 1 + u1
and
y2 = X2 β 2 + u2
(7.46)
where X1 is n1 × k and X2 is n2 × k with n1 and n2 > k. In this case, the unrestricted regression can be written as
X1 0 β1 u1 y1 = + (7.47) y2 0 X2 β2 u2 under the null hypothesis H0 ; β 1 = β 2 = β, the restricted model becomes
u1 X1 y1 β+ = u2 X2 y2
(7.48)
The URSS and the RRSS are obtained from these two regressions by stacking the n1 + n2 observations. It is easy to show that the URSS = e′1 e1 + e′2 e2 where e1 is the OLS residuals from y1 on X1 and e2 is the OLS residuals from y2 on X2 . In other words, the URSS is the sum of two residual sums of squares from the separate regressions, see problem 13. The Chow F statistic given in equation (7.45) has k and (n1 + n2 − 2k) degrees of freedom, respectively. Equivalently, one can obtain this Chow F statistic from running
u1 0 X1 y1 (7.49) (β 2 − β 1 ) + β1 + = u2 X2 X2 y2 Note that the second set of explanatory variables whose coeﬃcients are (β 2 − β 1 ) are interaction variables obtained by multiplying each independent variable in equation (7.48) by a dummy variable, say D2 , that takes on the value 1 if the observation is from the second regression and 0 if it is from the ﬁrst regression. A test for H0 ; β 1 = β 2 becomes a joint test of signiﬁcance for the coeﬃcients of these interaction variables. Gujarati (1970) points out that this dummy variable approach has the additional advantage of giving the estimates of (β 2 − β 1 ) and their tstatistics. If the Chow F test rejects stability, these individual interaction dummy variable coeﬃcients may point to the source of instability. Of course, one has to be careful with the interpretation of these individual tstatistics, after all they can all be insigniﬁcant with the joint F statistic still being signiﬁcant, see Maddala (1992). In case one of the two regressions does not have suﬃcient observations to estimate a separate regression say n2 < k, then one can proceed by running the regression on the full data set to get the RRSS. This is the restricted model because the extra n2 observations are assumed to be generated by the same regression as the ﬁrst n1 observations. The URSS is the residual sums
7.9
Likelihood Ratio, Wald and Lagrange Multiplier Tests
163
of squares based only on the longer period (n1 observations). In this case, the Chow F statistic given in equation (7.45) has n2 and n1 − k degrees of freedom, respectively. This is known as Chow’s predictive test since it tests whether the shorter n2 observations are different from their predictions using the model with the longer n1 observations. This predictive test can be performed with dummy variables as follows: Introduce n2 observation speciﬁc dummies, one for each of the observations in the second regression. Test the joint signiﬁcance of these n2 dummy variables. Salkever’s (1976) result applies and each dummy variable will have as its estimated coeﬃcient the prediction error with its corresponding standard error and its tstatistic. Once, again, the individual dummies may point out possible outliers, but it is their joint signiﬁcance that is under question. The W, LR and LM Inequality We have shown that the LR = W = LM for linear restrictions if the loglikelihood is quadratic. However, this is not necessarily the case for more general situations. In fact, in the next chapter where we consider more general variance covariance structure on the disturbances, estimating this variancecovariance matrix destroys this equality and may lead to conﬂict in hypotheses testing as noted by Berndt and Savin (1977). In this case, W ≥ LR ≥ LM . See also the problems at the end of this chapter. The LR, W and LM tests are based on the efficient MLE. When consistent rather than efficient estimators are used, an alternative way of constructing the scoretype test is known as Neyman’s C(α). For details, see Bera and Permaratne (2001). Although, these three tests are asymptotically equivalent, one test may be more convenient than another for a particular problem. For example, when the model is linear but the restriction is nonlinear, the unrestricted model is easier to estimate than the restricted model. So the Wald test suggests itself in that it relies only on the unrestricted estimator. Unfortunately, the Wald test has a drawback that the LR and LM test do not have. In ﬁnite samples, the Wald test is not invariant to testing two algebraically equivalent formulations of the nonlinear restriction. This fact has been pointed out in the econometric literature by Gregory and Veall (1985, 1986) and Lafontaine and White (1986). In what follows, we review some of Gregory and Veall’s (1985) ﬁndings: Consider the linear regression with two regressors yt = β o + β 1 x1t + β 2 x2t + ut
(7.50)
where the ut ’s are IIN(0, σ 2 ), and the nonlinear restriction β 1 β 2 = 1. Two algebraically equivalent formulation of the null hypothesis are: H A ; r A (β) = β 1 − 1/β 2 = 0, and H B ; r B (β) = β 1 β 2 − 1 = 0. The unrestricted maximum likelihood estimator is β OLS and the Wald statistic given in (7.40) is ′ ′ −1 W = r(β OLS ) [R(β OLS )V (β OLS )R (β OLS )] r(β OLS )
(7.51)
where V (β OLS ) is the usual estimated variancecovariance matrix of β OLS . Problem 19 asks the reader to verify that the Wald statistics corresponding to HA and HB using (7.51) are and
2 2 β 2 W A = (β 1 2 − 1) /(β 2 v11 + 2v12 + v22 /β 2 )
2 2 β 2 W B = (β 1 2 − 1) /(β 2 v11 + 2β 1 β 2 v12 + β 1 v22 )
(7.52)
(7.53)
164
Chapter 7: The General Linear Model: The Basics
OLS ) for i, j = 0, 1, 2. These Wald statistics are clearly not where the vij ’s are the elements of V (β identical, and other algebraically equivalent formulations of the null hypothesis can be generated with correspondingly different Wald statistics. Monte Carlo experiments were performed with 1000 replications on the model given in (7.50) with various values for β 1 and β 2 , and for a sample size n = 20, 30, 50, 100, 500. The experiments were run when the null hypothesis is true and when it is false. For n = 20 and β 1 = 10, β 2 = 0.1, so that H0 is satisﬁed, W A rejects the null when it is true 293 times out of a 1000, while W B rejects the null 65 times out of a 1000. At the 5% level one would expect 50 rejections with a 95% conﬁdence interval [36, 64]. Both W A and W B reject too often but W A performs worse than W B . When n is increased to 500, W A rejects 78 times while W B rejects 39 times out of a 1000. W A still rejects too often although its performance is better than that for n = 20, while W B performs well and is within the 95% conﬁdence region. When n = 20, β 1 = 1 and β 2 = 0.5, so that H0 is not satisﬁed, W A rejects the null when it is false 65 times out of a 1000 whereas W B rejects it 584 times out of a 1000. For n = 500, both test statistics reject the null in 1000 out of 1000 times. Even in cases where the empirical sizes of the tests appear similar, see Table 1 of Gregory and Veall (1985), in particular the case where β 1 = β 2 = 1, Gregory and Veall ﬁnd that W A and W B are in conﬂict about 5% of the time for n = 20, and this conﬂict drops to 0.5% at n = 500. Problem 20 asks the reader to derive four Wald statistics corresponding to four algebraically equivalent formulations of the common factor restriction analyzed by Hendry and Mizon (1978). Gregory and Veall (1986) give Monte Carlo results on the performance of these Wald statistics for various sample sizes. Once again they ﬁnd conﬂict among these tests even when their empirical sizes appear to be similar. Also, the differences among the Wald statistics are much more substantial, and persist even when n is as large as 500. Lafontaine and White (1985) consider a simple regression y = α + βx + γz + u where y is log of per capita consumption of textiles, x is log of per capita real income and z is log of relative prices of textiles, with the data taken from Theil (1971, p.102). The estimated equation is: y = 1.37 + 1.14x − 0.83z (0.31) (0.16) (0.04)
with σ 2 = 0.0001833, and n = 17, with standard errors shown in parentheses. Consider the null hypothesis H0 ; β = 1. Algebraically equivalent formulations of H0 are Hk ; β k = 1 for any exponent k. Applying (7.40) with r(β) = β k − 1 and R(β) = kβ k−1 , one gets the Wald statistic k − 1)2 /[(kβ k−1 )2 V (β)] Wk = (β
(7.54)
is the OLS estimate of β and V (β)is its corresponding estimated variance. For every k, where β 2 .05 = 4.6. Wk has a limiting χ1 distribution under H0 . The critical values are χ21,.05 = 3.84 and F1,14 The latter is an exact distribution test for β = 1 under H0 . Lafontaine and White (1985) try = 1.14 and V (β) = different integer exponents (±k) where k = 1, 2, 3, 6, 10, 20, 40. Using β (0.16)2 one gets W−20 = 24.56, W1 = 0.84, and W20 = 0.12. The authors conclude that one could get any Wald statistic desired by choosing an appropriate exponent. Since β > 1, Wk is inversely related to k. So, we can ﬁnd a Wk that exceeds the critical values given by the χ2 and F distributions. In fact, W−20 leads to rejection whereas W1 and W20 do not reject H0 .
Notes
165
For testing nonlinear restrictions, the Wald test is easy to compute. However, it has a serious problem in that it is not invariant to the way the null hypothesis is formulated. In this case, the score test may be diﬃcult to compute, but Neyman’s C(α) test is convenient to use and provide the invariance that is needed, see Dagenais and Dufour (1991).
Notes 1. For example, in a timeseries setting, including the time trend in the multiple regression is equivalent to detrending each variable ﬁrst, by residualing out the effect of time, and then running the regression on these residuals. 2. Two exceptions noted in Davidson and MacKinnon (1993) are the following: One, if the model is not identiﬁed asymptotically. For example, yt = β(1/t) + ut for t = 1, 2, . . . , T , will have (1/t) tend to zero as T → ∞. This means that as the sample size increase, there is no information on β. Two, if the number of parameters in the model increase as the sample size increase. For example, the ﬁxed effects model in panel data discussed in Chapter 11. 3. If the MLE of β is β MLE , then the MLE of (1/β) is (1/β MLE ). Note that this invariance property implies that MLE cannot be in general unbiased. For example, even if β MLE is unbiased for β, by the above reparameterization, (1/β MLE ) is not unbiased for (1/β).
4. If the distribution of disturbances is not normal, then OLS is still BLUE as long as the assumptions underlying the GaussMarkov Theorem are satisﬁed. The MLE in this case will be in general more eﬃcient than OLS as long as the distribution of the errors is correctly speciﬁed. 5. Using the Taylor Series approximation of r(β MLE ) around the true parameter vector β, one gets r(β MLE ) ≃ r(β) + R(β)(β MLE − β). Under the null hypothesis, r(β) = 0 and the var[r(β MLE )] ≃ ′ R(β) var(β MLE )R (β).
Problems 1. Invariance of the Fitted Values and Residuals to Nonsingular Transformations of the Independent Variables. Postmultiply the independent variables in (7.1) by a nonsingular transformation C, so that X ∗ = XC. (a) Show that PX ∗ = PX and P¯X ∗ = P¯X . Conclude that the regression of y on X has the same ﬁtted values and the same residuals as the regression of y on X ∗ . (b) As an application of these results, suppose that every X was multiplied by a constant, say, a change in the units of measurement. Would the ﬁtted values or residuals change when we rerun this regression? (c) Suppose that X contains two regressors X1 and X2 each of dimension n × 1. If we run the regression of y on (X1 − X2 ) and (X1 + X2 ), will this yield the same ﬁtted values and the same residuals as the original regression of y on X1 and X2 ? 2. The FWL Theorem. (a) Using partitioned inverse results from the Appendix, show that the solution to (7.9) yields β 2,OLS given in (7.10).
166
Chapter 7: The General Linear Model: The Basics (b) Alternatively, write (7.9) as a system of two equations in two unknowns β 1,OLS and β 2,OLS . Solve, by eliminating β 1,OLS and show that the resulting solution is given by (7.10).
(c) Using the FWL Theorem, show that if X1 = ιn a vector of ones indicating the presence of the constant in the regression, and X2 is a set of economic variables, then (i) β 2,OLS can be obtained by running yi − y¯ on the set of variables in X2 expressed as deviations from their respective sample means. (ii) The least squares estimate of the constant β 1,OLS can ′ ′ ′ ¯ ¯ be retrieved as y¯ − X2 β 2,OLS where X2 = ιn X2 /n is the vector of sample means of the independent variables in X2 .
3. Let y = Xβ + Di γ + u where y is n × 1, X is n × k and Di is a dummy variable that takes the value 1 for the ith observation and 0 otherwise. Using the FWL Theorem, prove that the least squares ∗′ ∗ −1 ∗′ ∗ X y and γ OLS = yi − x′i β estimates of β and γ from this regression are β OLS = (X X ) OLS , ∗ ∗ where X denotes the X matrix without the ith observation and y is the y vector without the ith observation and (yi , x′i ) denotes the ith observation on the dependent and independent variables. This means that γ OLS is the forecasted OLS residual from the regression of y ∗ on X ∗ for the ith observation which was essentially excluded from the regression by the inclusion of the dummy variable Di . 4. Maximum Likelihood Estimation. Given the loglikelihood in (7.17), (a) Derive the ﬁrstorder conditions for maximization and show that β MLE = β OLS and that σ 2MLE = RSS/n.
(b) Calculate the second derivatives given in (7.18) and verify that the information matrix reduces to (7.19). 5. Given that u ∼ N (0, σ 2 In ), we showed that (n − k)s2 /σ 2 ∼ χ2n−k . Use this fact to prove that, (a) s2 is unbiased for σ 2 . (b) var(s2 ) = 2σ 4 /(n − k). Hint: E(χ2r ) = r and var(χ2r ) = 2r. 6. Consider all estimators of σ 2 of the type σ 2 = e′ e/r = u′ P¯X u/r with u ∼ N (0, σ 2 In ). (a) Find E( σ 2MLE ) and the bias( σ 2MLE ).
(b) Find var( σ 2MLE ) and the MSE( σ 2MLE ).
(c) Compute MSE( σ 2 ) and minimize it with respect to r. Compare with the MSE of s2 and σ 2MLE .
7. Computing Forecasts and Forecast Standard Errors Using a Regression Package. This is based on Salkever (1976). From equations (7.23) and (7.24), show that ′ ′ −1 ′ ′ ′ X y, and γ ′OLS = yo − Xo β ′OLS ) where β (a) δ OLS = (β OLS = (X X) OLS . Hint: Set up OLS , γ the OLS normal equations and solve two equations in two unknowns. Alternatively, one can use the FWL Theorem to residual out the additional To dummy variables.
(b) e∗OLS = (e′OLS , 0′ )′ and s∗2 = s2 . (c) s∗2 (X ∗′ X ∗ )−1 is given by the expression in (7.25). Hint: Use partitioned inverse.
8.
(d) Replace yo by 0 and ITo by −ITo in (7.23) and show that γ = yo = Xo β OLS whereas all the results in parts (a), (b) and (c) remain the same.
(a) Show that cov(β OLS , e) = 0. (Since both random variables are normally distributed, this proves their independence).
Problems
9.
167
2 ′ (b) Show that β OLS and s are independent. Hint: A linear (Bu) and quadratic (u Au) forms in normal random variables are independent if BA = 0. See Graybill (1961) Theorem 4.17.
(a) Show that if one replaces R by c′ in (7.29) one gets the square of the zstatistic given in (7.26).
(b) Show that when we replace σ 2 by s2 , the χ21 statistic given in part (a) becomes the square of a tstatistic which is distributed as F (1, n − K). Hint: The square of a N (0, 1) is χ21 . Also the ratio of two independent χ2 random variables divided by their degrees of freedom is an F statistic with these corresponding degrees of freedom, see Chapter 2. 10.
(a) Show that the matrix A deﬁned in (7.30) by u′ Au/σ 2 is symmetric, idempotent and of rank g. (b) Using the same proof given below lemma 1, show that (7.30) is χ2g .
11.
(a) Show that the two quadratic forms s2 = u′ P¯X u/(n − k) and that given in (7.30) are independent. Hint: Two positive semideﬁnite quadratic forms u′ Au and u′ Bu are independent if and only if AB = 0, see Graybill (1961) Theorem 4.10. (b) Conclude that (7.31) is distributed as an F (g, n − k).
12. Restricted Least Squares. (a) Show that β RLS given by (7.36) is biased unless Rβ = r. ′ −1 ′ (b) Show that the var(β X u) where RLS ) = var(A(X X) A = IK − (X ′ X)−1 R′ [R(X ′ X)−1 R′ ]−1 R.
Prove that A2 = A, but A′ = A. Conclude that var(β RLS ) =
σ 2 A(X ′ X)−1 A′ = σ 2 {(X ′ X)−1 −(X ′ X)−1 R′ [R(X ′ X)−1 R′ ]−1 R(X ′ X)−1 }.
(c) Show that var(β OLS )− var(β RLS ) is a positive semideﬁnite matrix.
13. The Chow Test.
(a) Show that OLS on (7.47) yields OLS on each equation separately in (7.46). In other words, ′ −1 ′ ′ −1 ′ X1 y1 and β X2 y2 . β 1,OLS = (X1 X1 ) 2,OLS = (X2 X2 )
(b) Show that the residual sum of squares for equation (7.47) is given by RSS1 + RSS2 , where RSSi is the residual sum of squares from running yi on Xi for i = 1, 2.
(c) Show that the Chow F statistic can be obtained from (7.49) by testing for the joint signiﬁcance of Ho ; β 2 − β 1 = 0. 14. Suppose we would like to test Ho ; β 2 = 0 in the following unrestricted model given also in (7.8) y = Xβ + u = X1 β 1 + X2 β 2 + u (a) Using the FWL Theorem, show that the URSS is identical to the residual sum of squares obtained from P¯X1 y = P¯X1 X2 β 2 + P¯X1 u. Conclude that URSS = y ′ P¯X y = y ′ P¯X1 y − y ′ P¯X1 X2 (X2′ P¯X1 X2 )−1 X2′ P¯X1 y. (b) Show that the numerator of the F statistic for testing Ho ; β 2 = 0 which is given in (7.45), is y ′ P¯X1 X2 (X2′ P¯X1 X2 )−1 X2′ P¯X1 y/k2 . Substituting y = X1 β 1 + u under the null hypothesis, show that the above expression reduces to u′ P¯X1 X2 (X2′ P¯X1 X2 )−1 X2′ P¯X1 u/k2 .
168
Chapter 7: The General Linear Model: The Basics (c) Let v = X2′ P¯X1 u, show that if u ∼ IIN(0, σ 2 ) then v ∼ N (0, σ 2 X2′ P¯X1 X2 ). Conclude that the numerator of the F statistic given in part (b) when divided by σ 2 can be written as v ′ [var(v)]−1 v/k2 where v ′ [var(v)]−1 v is distributed as χ2k2 under Ho . Hint: See the discussion below lemma 1. (d) Using the result that (n − k)s2 /σ 2 ∼ χ2n−k where s2 is the URSS /(n − k), show that the F statistic given by (7.45) is distributed as F (k2 , n − k) under Ho . Hint: You need to show that u′ P¯X u is independent of the quadratic term given in part (b), see problem 11. (e) Show that the Wald Test for Ho ; β 2 = 0, given in (7.41), reduces in this case to W = ′ [R(X ′ X)−1 R′ ]−1 β denotes the OLS or equivalently the MLE /s2 were R = [0, Ik ], β β 2 2 2 2 2 of β 2 from the unrestricted model and s is the corresponding estimate of σ 2 given by URSS /(n − k). Using partitioned inversion or the FWL Theorem, show that the numerator of W is k2 times the expression in part (b). (f) Show that the score form of the LM statistic, given in (7.42) and (7.44), can be obtained as the explained sum of squares from the artiﬁcial regression of the restricted residuals on the matrix of regressors X. In this case, s2 = RRSS /(n − k1) (y − X1 β 1,RLS ) deﬂated by s is the Mean Square Error of the restricted regression. In other words, obtain the explained sum of squares from regressing P¯X1 y/ s on X1 and X2 .
15. Iterative Estimation in Partitioned Regression Models. This is based on Fiebig (1995). Consider the partitioned regression model given in (7.8) and let X2 be a single regressor, call it x2 of dimension n × 1 so that β 2 is a scalar. Consider the following strategy for estimating β 2 : Estimate β 1 from (1) the shortened regression of y on X1 . Regress the residuals from this regression on x2 to yield b2 . (1)
(a) Prove that b2 is biased. Now consider the following iterative strategy for reestimating β 2 : (1)
Reestimate β 1 by regressing y − x2 b2 following scheme:
(1)
on X1 to yield b1 . Next iterate according to the
(j)
(j)
b1 = (X1′ X1 )−1 X1′ (y − x2 b2 ) (j+1)
b2
(j)
= (x′2 x2 )−1 x′2 (y − X1 b1 ), (j+1)
(b) Determine the behavior of the bias of b2 (c) Show that as j increases on (7.8).
(j+1) b2
j = 1, 2, ... as j increases.
converges to the estimator of β 2 obtained by running OLS
16. Maddala (1992, pp. 120127). Consider the simple linear regression yi = α + βXi + ui
i = 1, 2, . . . , n.
where α and β are scalars and ui ∼ IIN(0, σ 2 ). For Ho ; β = 0, (a) Derive the Likelihood Ratio (LR) statistic and show that it can be written as nlog[1/(1 − r2)] where r2 is the square of the correlation coeﬃcient between X and y. as nr2 /(1 − (b) Derive the Wald (W) statistic for testing Ho ; β = 0. Show that it can nbe written 2 2 2 r ). This is the square of the usual tstatistic on β with σ MLE = i=1 ei /n used instead of is the unrestricted MLE which is OLS in this case, and the ei ’s are s2 in estimating σ 2 . β the usual least squares residuals. (c) Derive the Lagrange Multiplier (LM) statistic for testing Ho ; β = 0. Show that it can be 2RMLE = ni=1 (yi − y¯)2 /n written as nr2 . This is the square of the usual tstatistic on with σ 2RMLE is restricted MLE of σ 2 (i.e., imposing Ho used instead of s2 in estimating σ 2 . The σ and maximizing the likelihood with respect to σ 2 ).
Problems
169
(d) Show that LM/n = (W/n)/[1 + (W/n)], and LR/n = log[1 + (W/n)]. Using the following inequality x ≥ log(1 + x) ≥ x/(1 + x), conclude that W ≥ LR ≥ LM . Hint: Use x = W/n.
(e) For the cigarette consumption data given in Table 3.2, compute the W, LR, LM for the simple regression of logC on logP and demonstrate the above inequality given in part (d) for testing that the price elasticity is zero?
17. Engle (1984, pp.785786). Consider a set of T independent observations on a Bernoulli random variable which takes on the values yt = 1 with probability θ, and yt = 0 with probability (1 − θ). (a) Derive the loglikelihood function, the MLE of θ, the score S(θ), and the information I(θ). (b) Compute the LR, W and LM test statistics for testing Ho ; θ = θ o , versus HA ; θ = θo for θǫ(0, 1). 18. Engle (1984, pp. 787788). Consider the linear regression model y = Xβ + u = X1 β 1 + X2 β 2 + u given in (7.8), where u ∼ N (0, σ2 IT ). (a) Write down the loglikelihood function, ﬁnd the MLE of β and σ 2 . (b) Write down the score S(β) and show that the information matrix is blockdiagonal between β and σ 2 . (c) Derive the W, LR and LM test statistics in order to test Ho ; β 1 = β o1 , versus HA ; β 1 = β o1 , where β 1 is say the ﬁrst k1 elements of β. Show that if X = [X1 , X2 ], then 2 )/ )′ [X ′ P¯X X1 ](β o − β W = (β o1 − β 1 1 σ 1 1 2 LM = u ′ X1 [X1′ P¯X2 X1 ]−1 X1′ u / σ2
LR = T log( u′ u / u′ u )
u and σ is the unrestricted MLE, ′ u /T , σ 2 = u ′ u /T . β where u = y − X β, = y − Xβ 2 = u whereas β is the restricted MLE.
(d) Using the above results, show that W = T ( u′ u −u ′ u )/ u′ u
LM = T ( u′ u −u ′ u )/ u′ u
Also, that LR = T log[1 + (W/T )]; LM = W/[1 + (W/T )]; and (T − k)W/T k1 ∼ Fk1 ,T −k under Ho . As in problem 16, we use the inequality x ≥ log(1 + x) ≥ x/(1 + x) to conclude that W ≥ LR ≥ LM. Hint: Use x = W/T . However, it is important to note that all the test statistics are monotonic functions of the F statistic and exact tests for each would produce identical critical regions. (e) For the cigarette consumption data given in Table 3.2, run the following regression: logC = α + βlogP + γlogY + u compute the W, LR, LM given in part (c) for the null hypothesis Ho ; β = −1.
(f) Compute the Wald statistics for HoA ; β = −1, HoB ; β 5 = −1 and HoC ; β −5 = −1. How do these statistics compare? 19. Gregory and Veall (1985). Using equation (7.51) and the two formulations of the null hypothesis H A and H B given below (7.50), verify that the Wald statistics corresponding to these two formulations are those given in (7.52) and (7.53), respectively.
170
Chapter 7: The General Linear Model: The Basics
20. Gregory and Veall (1986). Consider the dynamic equation yt = ρyt−1 + β 1 xt + β 2 xt−1 + ut where ρ < 1, and ut ∼ NID(0, σ 2 ). Note that for this equation to be the CochraneOrcutt transformation yt − ρyt−1 = β 1 (xt − ρxt−1 ) + ut the following nonlinear restriction must be satisﬁed −β 1 ρ = β 2 called the common factor restriction by Hendry and Mizon (1978). Now consider the following four formulations of this restriction H A ; β 1 ρ + β 2 = 0; H B ; β 1 + (β 2 /ρ) = 0; H C ; ρ + (β 2 /β 1 ) = 0 and H D ; (β 1 ρ/β 2 ) + 1 = 0. (a) Using equation (7.51) derive the four Wald statistics corresponding to the four formulations of the null hypothesis. (b) Apply these four Wald statistics to the equation relating real personal consumption expenditures to real disposable personal income in the U.S. over the post World War II period 19501993, see Table 5.1. 21. Eﬀect of Additional Regressors on R2 . This problem was considered in nonmatrix form in Chapter 4, problem 4. Regress y on X1 which is T × K1 and compute SSE1 . Add X2 which is T × K2 so that the number of regressors in now K = K1 + K2 . Regress y on X = [X1 , X2 ] and get SSE2 . Show that SSE2 ≤ SSE1 . Conclude that the corresponding Rsquares satisfy R22 ≥ R12 . Hint: Show that PX − PX1 is a positive semideﬁnite matrix.
References Additional readings for the material covered in this chapter can be found in Davidson and MacKinnon (1993), Kelejian and Oates (1989), Maddala (1992), Fomby, Hill and Johnson (1984), Greene (1993), Johnston (1984), Judge et al. (1985) and Theil (1971). These econometrics texts were cited earlier. Other references cited in this chapter are the following: Bera A.K. and G. Permaratne (2001), “General Hypothesis Testing,” Chapter 2 in Baltagi, B.H. (ed.), A Companion to Theoretical Econometrics (Blackwell: Massachusetts). Berndt, E.R. and N.E. Savin (1977), “Conﬂict Among Criteria for Testing Hypotheses in the Multivariate Linear Regression Model,” Econometrica, 45: 12631278. Buse, A.(1982), “The Likelihood Ratio, Wald, and Lagrange Multiplier Tests: An Expository Note,” The American Statistician, 36: 153157. Chow, G.C. (1960), “Tests of Equality Between Sets of Coeﬃcients in Two Linear Regressions,” Econometrica, 28: 591605. Dagenais, M.G. and J. M. Dufour (1991), “Invariance, Nonlinear Models, and Asymptotic Tests,” Econometrica, 59: 16011615. Engle, R.F. (1984), “Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics,” In: Griliches, Z. and M.D. Intrilligator (eds) Handbook of Econometrics (NorthHolland: Amsterdam). Fiebig, D.G. (1995), “Iterative Estimation in Partitioned Regression Models,” Econometric Theory, Problem 95.5.1, 11:1177.
Appendix
171
Frisch, R., and F.V. Waugh (1933), “Partial Time Regression as Compared with Individual Trends,” Econometrica, 1: 387401. Graybill, F.A.(1961), An Introduction to Linear Statistical Models, Vol. 1 (McGrawHill: New York). Gregory, A.W. and M.R. Veall (1985), “Formulating Wald Tests of Nonlinear Restrictions,” Econometrica, 53: 14651468. Gregory, A.W. and M.R. Veall (1986), “Wald Tests of Common Factor Restrictions,” Economics Letters, 22: 203208. Gujarati, D. (1970), “Use of Dummy Variables in Testing for Equality Between Sets of Coeﬃcients in Two Linear Regressions: A Generalization,” The American Statistician, 24: 5052. Hendry, D.F. and G.E. Mizon (1978), “Serial Correlation as a Convenient Simpliﬁcation, Not as a Nuisance: A Comment on A Study of the Demand for Money by the Bank of England,” Economic Journal, 88: 549563. Lafontaine, F. and K.J. White (1986), “Obtaining Any Wald Statistic You Want,” Economics Letters, 21: 3540. Lovell, M.C. (1963), “Seasonal Adjustment of Economic Time Series,” Journal of the American Statistical Association, 58: 9931010. Salkever, D. (1976), “The Use of Dummy Variables to Compute Predictions, Prediction Errors, and Conﬁdence Intervals,” Journal of Econometrics, 4: 393397.
Appendix Some Useful Matrix Properties This book assumes that the reader has encountered matrices before, and knows how to add, subtract and multiply conformable matrices. In addition, that the reader is familiar with the transpose, trace, rank, determinant and inverse of a matrix. Unfamiliar readers should consult standard texts like Bellman (1970) or Searle (1982). The purpose of this Appendix is to review some useful matrix properties that are used in the text and provide easy access to these properties. Most of these properties are given without proof. Starting with Chapter 7, our data matrix X is organized such that it has n rows and k columns, so that each row denotes an observation on k variables and each column denotes n observations on one variable. This matrix is of dimension n × k. The rank of an n × k matrix is always less than or equal to its smaller dimension. Since n > k, the rank (X) ≤ k. When there is no perfect multicollinearity among the variables in X, this matrix is said to be of full column rank k. In this case, X ′ X, the matrix of crossproducts is of dimension k × k. It is square, symmetric and of full rank k. This uses the fact that the rank(X ′ X) = rank(X) = k. Therefore, (X ′ X) is nonsingular and the inverse (X ′ X)−1 exists. This is needed for the computation of Ordinary Least Squares. In fact, for least squares to be feasible, X should be of full column rank k and no variable in X should be a perfect linear combination of the other variables in X. If we write ⎡ ′ ⎤ x1 ⎥ ⎢ X = ⎣ ... ⎦ x′n
where x′i denotes the ith observation, in the data, then X ′ X = of dimension k × 1.
n
i=1
xi x′i where xi is a column vector
172
Chapter 7: The General Linear Model: The Basics
An important and widely encountered matrix is the Identity matrix which will be denoted by In and subscripted by its dimension n. This is a square n×n matrix whose diagonal elements are all equal to one and its off diagonal elements are all equal to zero. Also, σ 2 In will be a familiar scalar covariance matrix, with every diagonal element equal to σ 2 reﬂecting homoskedasticity or equal variances (see Chapter 5), and zero covariances or no serial correlation (see Chapter 5). Let ⎡ 2 ⎤ 0 σ1 ⎢ ⎥ .. Ω = diag[σ 2i ] = ⎣ ⎦ . 0
σ 2n
2, . . . , n. This matrix be an (n × n) diagonal matrix with the ith diagonal element equal to σ 2i for i = 1, n will be encountered under heteroskedasticity, see Chapter 9. Note that tr(Ω) = i=1 σ 2i is the sum of its diagonal elements. Also, tr(In ) = n and tr(σ 2 In ) = nσ 2 . Another useful matrix is the projection matrix PX = X(X ′ X)−1 X ′ which is of dimension n × n. This matrix is encountered in Chapter 7. If y denotes the n × 1 vector of observations on the dependent variable, then PX y generates the predicted values y from the least squares regression of y on X. This matrix PX is symmetric and idempotent. This means ′ 2 that PX = PX and PX = PX PX = PX as can be easily veriﬁed. Some of the properties of idempotent matrices is that their rank is equal to their trace. Hence, rank(PX ) = tr(PX ) = tr[X(X ′ X)−1 X ′ ] = tr[X ′ X(X ′ X −1 )] = tr(Ik ) = k. Here, we used the fact that tr(ABC) = tr(CAB) = tr(BCA). In other words, the trace is unaffected by the cyclical permutation of the product. Of course, these matrices should be conformable and the product should result in a square matrix. Note that P¯X = In − PX is also a symmetric and idempotent matrix. In this case, P¯X y = y − PX y = y − y = e where e denotes the least squares residuals, y − X β OLS where ′ −1 ′ β OLS = (X X) X y, see Chapter 7. Some properties of these projection matrices are the following: PX X = X, P¯X X = 0, P¯X e = e
and PX e = 0.
In fact, X ′ e = 0 means that the matrix X is orthogonal to the vector of least squares residuals e. Note ′ ′ that X ′ e = 0 means that X ′ (y − X β OLS ) = 0 or X y = X X β OLS . These k equations are known as the OLS normal equations and their solution yields the least squares estimates β OLS . By the deﬁnition of ¯ ¯ ¯ PX , we have (i) PX + PX = In . Also, (ii) PX and PX are idempotent and (iii) PX P¯X = 0. In fact, any two of these properties imply the third. The rank(P¯X ) = tr(P¯X ) = tr(In − PX ) = n − k. Note that PX and P¯X are of rank k and (n − k), respectively. Both matrices are not of full column rank. In fact, the only full rank, symmetric idempotent matrix is the identity matrix. Matrices not of full rank are singular, and their inverse do not exist. However, one can ﬁnd a generalized inverse of a matrix Ω which we will call Ω− which satisﬁes the following requirements: (i) ΩΩ− Ω =Ω (iii) Ω− Ω is symmetric
and
(ii) Ω− ΩΩ− = Ω− (iv) ΩΩ− is symmetric.
Even if Ω is not square, a unique Ω− can be found for Ω which satisﬁes the above four properties. This is called the MoorePenrose generalized inverse. Note that a symmetric idempotent matrix is its own MoorePenrose generalized inverse. For example, it is easy to verify that if Ω = PX , then Ω− = PX and that it satisﬁes the above four properties. Idempotent matrices have characteristic roots that are either zero or one. The number of nonzero characteristic roots is equal to the rank of this matrix. The characteristic roots of Ω−1 are the reciprocals of the characteristic roots of Ω, but the characteristic vectors of both matrices are the same. The determinant of a matrix is nonzero if and only if it has full rank. Therefore, if A is singular, then A = 0. Also, the determinant of a matrix is equal to the product of its characteristic roots. For two square matrices A and B, the determinant of the product is the product of the determinants AB = A·B. Therefore, the determinant of Ω−1 is the reciprocal of the determinant of Ω. This follows from the fact that ΩΩ−1  = ΩΩ−1  = I = 1. This property is used in writing the likelihood function for Generalized Least Squares (GLS) estimation, see Chapter 9. The determinant of a triangular matrix
Appendix
173
is equal to the product of its diagonal elements. Of course, it immediately follows that the determinant of a diagonal matrix is the product of its diagonal elements. The constant in the regression corresponds to a vector of ones in the matrix of regressors X. This vector of ones is denoted by ιn where n is the dimension of this column vector. Note that ι′n ιn = n and ιn ι′n = Jn where Jn is a matrix of ones of dimension n × n. Note also that Jn is not idempotent, but ¯ ¯ J¯n = Jn /n is idempotent as can be easily veriﬁed. The rank(J¯n ) = tr(Jn ) = 1. Note also that In − Jn is idempotent with rank (n − 1). J¯n y has a typical element y¯ = ni=1 yi /n whereas (In − J¯n )y has a typical element (yi − y¯). So that J¯n is the averaging matrix, whereas premultiplying by (In − J¯n ) results in deviations from the mean. For two nonsingular matrices A and B (AB)−1 = B −1 A−1 Also, the transpose of a product of two conformable matrices, (AB)′ = B ′ A′ . In fact, for the product of three conformable matrices this becomes (ABC)′ = C ′ B ′ A′ . The transpose of the inverse is the inverse of the transpose, i.e., (A−1 )′ = (A′ )−1 . The inverse of a partitioned matrix
A=
A11 A21
A12 A22
is A−1 =
E −A−1 22 A21 E
−EA12 A−1 22 −1 −1 A22 + A−1 22 A21 EA12 A22
−1 where E = (A11 − A12 A−1 . Alternatively, it can be expressed as 22 A21 )
−1 −1 A11 + A11 A12 F A21 A−1 −A−1 A12 F −1 11 11 A = −F A21 A−1 F 11 −1 where F = (A22 − A21 A−1 . These formulas are used in partitioned regression models, see for 11 A12 ) example the FrischWaugh Lovell Theorem and the computation of the variancecovariance matrix of forecasts from a multiple regression in Chapter 7. An n × n symmetric matrix Ω has n distinct characteristic vectors c1 , . . . , cn . The corresponding n characteristic roots λ1 , . . . , λn may not be distinct but they are all real numbers. The number of nonzero characteristic roots of Ω is equal to the rank of Ω. The characteristic roots of a positive deﬁnite matrix are positive. The characteristic vectors of the symmetric matrix Ω are orthogonal to each other, i.e., c′i cj = 0 for i = j and can be made orthonormal with c′i ci = 1 for i = 1, 2, . . . , n. Hence, the matrix of characteristic vectors C = [c1 , c2 , . . . , cn ] is an orthogonal matrix, such that CC ′ = C ′ C = In with C ′ = C −1 . By deﬁnition Ωci = λi ci or ΩC = CΛ where Λ = diag[λi ]. Premultiplying the last vectors C diagonalizes equation by C ′ we get C ′ ΩC = C ′ CΛ = Λ. Therefore, the matrix of characteristic n the symmetric matrix Ω. Alternatively, we can write Ω = CΛC ′ = i=1 λi ci c′i which is the spectral decomposition of Ω. A real symmetric n × n matrix Ω is positive semideﬁnite if for every n × 1 nonnegative vector y, we have y ′ Ωy ≥ 0. If y ′ Ωy is strictly positive for any nonzero y then Ω is said to be positive deﬁnite. A necessary and suﬃcient condition for Ω to be positive deﬁnite is that all the characteristic roots of Ω are positive. One important application is the comparison of eﬃciency of two unbiased estimators of a vector of parameters β. In this case, we subtract the variancecovariance matrix of the ineﬃcient estimator from the more eﬃcient one and show that the resulting difference yields a positive semideﬁnite matrix, see the GaussMarkov Theorem in Chapter 7. If Ω is a symmetric and positive deﬁnite matrix, there exists a nonsingular matrix P such that Ω = P P ′ . In fact, using the spectral decomposition of Ω given above, one choice for P = CΛ1/2 so
174
Chapter 7: The General Linear Model: The Basics
that Ω = CΛC ′ = P P ′ . This is a useful result which we use in Chapter 9 to obtain Generalized Least Squares (GLS) as a least squares regression after transforming the original regression model by P −1 = (CΛ1/2 )−1 = Λ−1/2 C ′ . In fact, if u ∼ (0, σ 2 Ω), then P −1 u has zero mean and var(P −1 u) = P −1′ var(u)P −1′ = σ 2 P −1 ΩP −1′ = σ 2 P −1 P P ′ P −1′ = σ 2 In . From Chapter have seen that if u ∼ N (0, σ 2 In ), then ui /σ ∼ N (0, 1), so that u2i /σ 2 ∼ χ21 n2, we 2 ′ 2 and u u/σ = i=1 ui /σ 2 ∼ χ2n . Therefore, u′ (σ 2 In )−1 u ∼ χ2n . If u ∼ N (0, σ 2 Ω) where Ω is positive deﬁnite, then u∗ = P −1 u ∼ N (0, σ 2 In ) and u∗′ u∗ /σ 2 ∼ χ2n . But u∗′ u∗ = u′ P −1′ P −1 u = u′ Ω−1 u. Hence, u′ Ω−1 u/σ2 ∼ χ2n . This is used in Chapter 9. Note that the OLS residuals are denoted by e = P¯X u. If u ∼ N (0, σ 2 In ), then e has mean zero and var(e) = σ 2 P¯X In P¯X = σ 2 P¯X so that e ∼ N (0, σ 2 P¯X ). Our estimator of σ 2 in Chapter 7 is s2 = e′ e/(n−k) so that (n − k)s2 /σ 2 = e′ e/σ2 . The last term can also be written as u′ P¯X u/σ 2 . In order to ﬁnd the distribution of this quadratic form in Normal variables, we use the following result stated as lemma 1 in Chapter 7. Lemma 1: For every symmetric idempotent matrix A of rank r, there exists an orthogonal matrix P such that P ′ AP = Jr where Jr is a diagonal matrix with the ﬁrst r elements equal to one and the rest equal to zero. We use this lemma to show that the e′ e/σ 2 is a chisquared with (n − k) degrees of freedom. To see this note that e′ e/σ2 = u′ P¯X u/σ2 and that P¯X is symmetric and idempotent of rank (n − k). Using the lemma there exists a matrix P such that P ′ P¯X P = Jn−k is a diagonal matrix with the ﬁrst (n − k) elements on the diagonal equal to 1 and the last k elements equal to zero. An orthogonal matrix P is by deﬁnition a matrix whose inverse, is its own transpose, i.e., P ′ P = In . Let v = P ′ u then v has mean zero and var(v) = σ 2 P ′ P = σ 2 In so that v is N (0, σ 2 In ) and u = P v. Therefore, n−k 2 2 e′ e/σ 2 = u′ P¯X u/σ2 = v ′ P ′ P¯X P v/σ 2 = v ′ Jn−k v/σ 2 = i=1 vi /σ
But, the v’s are independent identically distributed N (0, σ 2 ), hence vi2 /σ 2 is the square of a standardized N (0, 1) random variable which is distributed as a χ21 . Moreover, the sum of independent χ2 random variables is a χ2 random variable with degrees of freedom equal to the sum of the respective degrees of freedom, see Chapter 2. Hence, e′ e/σ2 is distributed as χ2n−k . The beauty of the above result is that it applies to all quadratic forms u′ Au where A is symmetric and idempotent. In general, for u ∼ N (0, σ 2 I), a necessary and suﬃcient condition for u′ Au/σ 2 to be distributed χ2k is that A is idempotent of rank k, see Theorem 4.6 of Graybill (1961). Another useful theorem on quadratic forms in normal random variables is the following: If u ∼ N (0, σ 2 Ω), then u′ Au/σ 2 is χ2k if and only if AΩ is an idempotent matrix of rank k, see Theorem 4.8 of Graybill (1961). If u ∼ N (0, σ 2 I), the two positive semideﬁnite quadratic forms in normal random variables say u′ Au and u′ Bu are independent if and only if AB = 0, see Theorem 4.10 of Graybill (1961). A suﬃcient condition is that tr(AB) = 0, see Theorem 4.15 of Graybill (1961). This is used in Chapter 7 to construct F statistics to test hypotheses, see for example problem 11. For u ∼ N (0, σ 2 I), the quadratic form u′ Au is independent of the linear form Bu if BA = 0, see Theorem 4.17 of Graybill (1961). This is used in , see problem 8. In general, if u ∼ N (0, Σ), then u′ Au Chapter 7 to prove the independence of s2 and β ols ′ and u Bu are independent if and only if AΣB = 0, see Theorem 4.21 of Graybill (1961). Many other useful matrix properties can be found. This is only a sample of them that will be implicitly or explicitly used in this book. The Kronecker product of two matrices say Σ⊗ In where Σ is m × m and In is the identity matrix of dimension n is deﬁned as follows: ⎤ ⎡ σ 11 In . . . σ1m In ⎦ : : Σ ⊗ In = ⎣ σ m1 In . . . σ mm In
In other words, we place an In next to every element of Σ = [σ ij ]. The dimension of the resulting matrix is mn × mn. This is useful when we have a system of equations like Seemingly Unrelated Regressions in Chapter 10. In general, if A is m × n and B is p × q then A ⊗ B is mp × nq. Some properties of Kronecker
References
175
products include (A ⊗ B)′ = A′ ⊗ B ′ . If both A and B are square matrices of order m × m and p × p then (A ⊗ B)−1 = A−1 ⊗ B −1 , A ⊗ B = Am Bp and tr(A ⊗ B) = tr(A)tr(B). Applying this result to Σ ⊗ In we get (Σ ⊗ In )−1 = Σ−1 ⊗ In
and
Σ ⊗ In  = Σm In n = Σm
and tr(Σ ⊗ In ) = tr(Σ)tr(In ) = n tr(Σ). Some useful properties of matrix differentiation are the following: ∂x′ b =x ∂b
where x′ is 1 × k and b is k × 1.
Also ∂b′ Ab = (A + A′ ) ∂b
where A is k × k.
If A is symmetric, then ∂b′ Ab/∂b = 2Ab. These two properties will be used in Chapter 7 in deriving the least squares estimator.
References Bellman, R. (1970), Introduction to Matrix Analysis (McGraw Hill: New York). Searle, S.R. (1982), Matrix Algebra Useful for Statistics (John Wiley and Sons: New York).
CHAPTER 8
Regression Diagnostics and Speciﬁcation Tests 8.1
Inﬂuential Observations1
Sources of inﬂuential observations include: (i) improperly recorded data, (ii) observational errors in the data, (iii) misspeciﬁcation and (iv) outlying data points that are legitimate and contain valuable information which improve the eﬃciency of the estimation. It is constructive to isolate extreme points and to determine the extent to which the parameter estimates depend upon these desirable data. One should always run descriptive statistics on the data, see Chapter 2. This will often reveal outliers, skewness or multimodal distributions. Scatter diagrams should also be examined, but these diagnostics are only the ﬁrst line of attack and are inadequate in detecting multivariate discrepant observations or the way each observation affects the estimated regression model. In regression analysis, we emphasize the importance of plotting the residuals against the explanatory variables or the predicted values y to identify patterns in these residuals that may indicate nonlinearity, heteroskedasticity, serial correlation, etc, see Chapter 3. In this section, we learn how to identify signiﬁcantly large residuals and compute regression diagnostics that may identify inﬂuential observations. We study the extent to which the deletion of any observation affects the estimated coeﬃcients, the standard errors, predicted values, residuals and test statistics. These represent the core of diagnostic tools in regression analysis. Accordingly, Belsley, Kuh and Welsch (1980, p.11) deﬁne an inﬂuential observation as “..one which, either individually or together with several other observations, has demonstrably larger impact on the calculated values of various estimates (coeﬃcients, standard errors, tvalues, etc.) than is the case for most of the other observations.” First, what is a signiﬁcantly large residual? We have seen that the least squares residuals of y on X are given by e = (In − PX )u, see equation (7.7). y is n × 1 and X is n × k. If u ∼ IID(0, σ 2 In ), then e has zero mean and variance σ 2 (In − PX ). Therefore, the OLS residuals are correlated and heteroskedastic with var(ei ) = σ 2 (1 − hii ) where hii is the ith diagonal element of the hat matrix H = PX , since y = Hy. The diagonal elements hii have the following properties: n hii = nj=1 h2ij ≥ h2ii ≥ 0. i=1 hii = tr(PX ) = k and
The last property follows from the fact that PX is symmetric and idempotent. Therefore, h2ii − hii ≤ 0 or hii (hii − 1) ≤ 0. Hence, 0 ≤ hii ≤ 1, (see problem 1). hii is called the leverage of the ith observation. For a simple regression with a constant, hii = (1/n) + (x2i / ni=1 x2i ) ¯ hii can be interpreted as a measure of the distance between X values of where xi = Xi − X; the ith observation and their mean over all n observations. A large hii indicates that the ith observation is distant from the center of the observations. This means that the ith observation with large hii (a function only of Xi values) exercises substantial leverage in determining the ﬁtted value yi . Also, the larger hii , the smaller the variance of the residual ei . Since observations
178
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
with high leverage tend to have smaller residuals, it may not be possible to detect them by an examination of the residuals alone. But, what is a large leverage? hii is large if it is more ¯ = 2k/n. Hence, hii ≥ 2k/n are considered outlying than twice the mean leverage value 2h observations with regards to X values. An alternative representation of hii is simply hii = d′i PX di = PX di 2 = x′i (X ′ X)−1 xi where di denotes the ith observation’s dummy variable, i.e., a vector of dimension n with 1 in the ith position and 0 elsewhere. x′i is the ith row of X and . denotes the Euclidian length. Note that d′i X = x′i . Let us standardize the ith OLS residual by dividing it by an estimate of its variance. A standardized residual would then be: (8.1) ei = ei /s 1 − hii where σ 2 is estimated by s2 , the MSE of the regression. This is an internal studentization of the residuals, see Cook and Weisberg (1982). Alternatively, one could use an estimate of σ 2 that is independent of ei . Deﬁning s2(i) as the MSE from the regression computed without the ith observation, it can be shown, see equation (8.18) below, that n − k − e2i (n − k)s2 − e2i /(1 − hii ) (8.2) = s2 s2(i) = (n − k − 1) n−k−1
Under normality, s2(i) and ei are independent and the externally studentized residuals are deﬁned by e∗i = ei /s(i) 1 − hii ∼ tn−k−1 (8.3)
Thus, if the normality assumption holds, we can readily assess the signiﬁcance of any single studentized residual. Of course, the e∗i will not be independent. Since this is a tstatistic, it is natural to think of e∗i as large if its value exceeds 2 in absolute value. Substituting (8.2) into (8.3) and comparing the result with (8.1), it is easy to show that e∗i is a monotonic transformation of ei e∗i
= ei
n−k−1 n − k − e2i
1 2
(8.4)
Cook and Wiesberg (1982) show that e∗i can be obtained as a tstatistic from the following augmented regression: y = Xβ ∗ + di ϕ + u
(8.5)
= ei /(1 − hii ) and e∗i is the where di is the dummy variable for the ith observation. In fact, ϕ tstatistic for testing that ϕ = 0. (see problem 4 and the proof given below). Hence, whether the ith residual is large can be simply determined by the regression (8.5). A dummy variable for the ith observation is included in the original regression and the tstatistic on this dummy tests whether this ith residual is large. This is repeated for all observations i = 1, . . . , n. This can be generalized easily to testing for a group of signiﬁcantly large residuals: y = Xβ ∗ + Dp ϕ∗ + u
(8.6)
8.1
Inﬂuential Observations
179
where Dp is an n × p matrix of dummy variables for the psuspected observations. One can test ϕ∗ = 0 using the Chow test described in (4.17) as follows: F =
[Residual SS(no dummies) − Residual SS(Dp dummies used)]/p Residual SS(Dp dummies used)/(n − k − p)
(8.7)
This will be distributed as Fp,n−k−p under the null, see Gentleman and Wilk (1975). Let ep = Dp′ e,
then
E(ep ) = 0 and
var(ep ) = σ 2 Dp′ P¯X Dp
(8.8)
Then one can show, (see problem 5), that F =
[e′p (Dp′ P¯X Dp )−1 ep ]/p ∼ Fp ,n−k−p [(n − k)s2 − e′p (Dp′ P¯X Dp )−1 ep ]/(n − k − p)
(8.9)
Another reﬁnement comes from estimating the regression without the ith observation: = [X ′ X(i) ]−1 X ′ y(i) β (i) (i) (i)
(8.10)
(A − a′ b)−1 = A−1 + A−1 a′ (I − bA−1 a′ )−1 bA−1
(8.11)
where the (i) subscript notation indicates that the ith observation has been deleted. Using the updating formula
with A = (X ′ X) and a = b = x′i , one gets ′ [X(i) X(i) ]−1 = (X ′ X)−1 + (X ′ X)−1 xi x′i (X ′ X)−1 /(1 − hii )
(8.12)
Therefore −β = (X ′ X)−1 xi ei /(1 − hii ) β (i)
(8.13)
Since the estimated coeﬃcients are often of primary interest, (8.13) describes the change in the estimated regression coeﬃcients that would occur if the ith observation is deleted. Note that a high leverage observation with hii large will be inﬂuential in (8.13) only if the corresponding residual ei is not small. Therefore, high leverage implies a potentially inﬂuential observation, but whether this potential is actually realized depends on yi . Alternatively, one can obtain this result from the augmented regression given in (8.5). Note that Pdi = di (d′i di )−1 d′i = di d′i is an n × n matrix with 1 in the ith diagonal position and 0 elsewhere. P¯di = In − Pdi , has the effect when postmultiplied by a vector y of deleting the ith observation. Hence, premultiplying (8.5) by P¯di one gets y(i) X(i) u(i) = β∗ + (8.14) P¯di y = 0 0 0 where the ith observation is moved to the bottom of the data, without loss of generality. The last observation has no effect on the least squares estimate of β ∗ since both the dependent and ∗ = β , and the ith observation’s independent variables are zero. This regression will yield β (i) residual is clearly zero. By the FrischWaughLovell Theorem given in section 7.3, the least squares estimates and the residuals from (8.14) are numerically identical to those from (8.5).
180
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
∗ = β (i) in (8.5) and the ith observation residual from (8.5) must be zero. This Therefore, β , and the ﬁtted values from this regression are given by y = X β +di ϕ implies that ϕ = yi −x′i β (i) (i) The difference in residuals is whereas those from the original regression (7.1) are given by X β. therefore + di ϕ − Xβ e − e(i) = X β (i)
(8.15)
ϕ = ei /(1 − hii )
(8.16)
premultiplying (8.15) by P¯X and using the fact that P¯X X = 0, one gets P¯X (e − e(i) ) = P¯X di ϕ . But, P¯X e = e and P¯X e(i) = e(i) , hence P¯X di ϕ = e−e(i) . Premultiplying both sides by d′i one gets = ei since the ith residual of e(i) from (8.5) is zero. By deﬁnition, d′i P¯X di = 1 − hii , d′i P¯X di ϕ therefore −β + (X ′ X)−1 X ′ di ϕ premultiplying (8.15) by (X ′ X)−1 X ′ one gets 0 = β . This uses the fact (i) that both residuals are orthogonal to X. Rearranging terms and substituting ϕ from (8.16), one gets −β = (X ′ X)−1 xi ϕ = (X ′ X)−1 xi ei /(1 − hii ) β (i)
as given in (8.13). : Note that s2(i) given in (8.2) can now be written in terms of β (i) s2(i) =
t=i (yt
(i) )2 /(n − k − 1) − x′t β
(8.17)
upon substituting (8.13) in (8.17) we get
e2i hit ei 2 − e + = (n − k − t t=1 1 − hii (1 − hii )2 n 2ei n e2i e2i 2 = (n − k)s2 + t=1 hit − t=1 et hit + 2 1 − hi (1 − hii ) (1 − hii )2 e2i = (n − k)s2 − (8.18) 1 − hii n which is (8.2). This uses the fact that He = 0 and H 2 = H. Hence, t=1 et hit = 0 and n 2 =h . h ii it t=1 (the jth component of β) that results from the deletion To assess whether the change in β j j , σ 2 (X ′ X)−1 . This is of the ith observation, is large or small, we scale by the variance of β jj denoted by j − β j(i) )/s(i) (X ′ X)−1 (8.19) DFBETAS ij = (β jj 1)s2(i)
n
Note that s(i) is used in order to make the denominator stochastically independent of the numerator in the Gaussian case. Absolute values of DFBETAS larger than 2 are considered √ inﬂuential. However, Belsley, Kuh, and Welsch (1980) suggest 2/ n as a sizeadjusted cutoff. In fact, it would be most unusual for the removal of a single observation from a sample of 100
8.1
Inﬂuential Observations
181
or more to result in a change in any estimate by two or more standard errors. The sizeadjusted cutoff tend to expose approximately the same proportion of potentially inﬂuential observations, regardless of sample size. The sizeadjusted cutoff is particularly important for large data sets. In case of Normality, it can also be useful to look at the change in the tstatistics, as a means of assessing the sensitivity of the regression output to the deletion of the ith observation: β β j(i) j − DFSTAT ij = −1 ′ X )−1 s(i) (X(i) s (X ′ X)jj (i) jj
(8.20)
Another way to summarize coeﬃcient changes and gain insight into forecasting effects when the ith observation is deleted is to look at the change in ﬁt, deﬁned as −β ] = hii ei /(1 − hii ) DFFIT i = yi − y(i) = x′i [β (i)
(8.21)
where the last equality is obtained from (8.13). √ We scale this measure by the variance of y(i) , i.e., σ hii , giving DFFITS i =
hii 1 − hii
1/2
e √i = s(i) 1 − hii
hii 1 − hii
1/2
e∗i
(8.22)
where σ has been estimated by s(i) and e∗i denotes the externally studentized residual given in (8.3). Values of DFFITS larger than 2 in absolute value are considered inﬂuential. A size adjusted cutoff for DFFITS suggested by Belsley, Kuh and Welsch (1980) is 2 k/n. In (8.3), the studentized residual e∗i was interpreted as a tstatistic that tests for the signiﬁcance of the coeﬃcient ϕ of di , the dummy variable which takes the value 1 for the ith observation and 0 otherwise, in the regression of y on X and di . This can now be easily proved as follows: Consider the Chow test for the signiﬁcance of ϕ. The RRSS = (n − k)s2 , the URSS = (n − k − 1)s2(i) and the Chow F test described in (4.17) becomes F1,n−k−1 =
[(n − k)s2 − (n − k − 1)s2(i) ]/1 (n − k − 1)s2(i) /(n − k − 1)
=
e2i s2(i) (1 − hii )
(8.23)
The square root of (8.23) is e∗i ∼ tn−k−1 . These studentized residuals provide a better way to examine the information in the residuals, but they do not tell the whole story, since some of the most inﬂuential data points can have small e∗i (and very small ei ). One overall measure of the impact of the ith observation on the estimated regression coeﬃcients is Cook’s (1977) distance measure Di2 . Recall, that the conﬁdence region for all k − β)′ X ′ X(β − β)/ks2 ∼ F (k, n − k). Cook’s (1977) distance mearegression coeﬃcients is (β 2 sure Di uses the same structure for measuring the combined impact of the differences in the estimated regression coeﬃcients when the ith observation is deleted: −β )′ X ′ X(β −β )/ks2 Di2 (s) = (β (i) (i)
(8.24)
Even though Di2 (s) does not follow the above F distribution, Cook suggests computing the percentile value from this F distribution and declaring an inﬂuential observation if this percentile and β will be large, implying that the ith value ≥ 50%. In this case, the distance between β (i)
182
Chapter 8: Regression Diagnostics and SpeciďŹ cation Tests
observation has a substantial inďŹ‚uence on the ďŹ t of the regression. Cookâ€™s distance measure can be equivalently computed as: hii e2i 2 (8.25) Di (s) = 2 ks (1 âˆ’ hii )2 Di2 (s) depends on ei and hii ; the larger ei or hii the larger is Di2 (s). Note the relationship between Cookâ€™s Di2 (s) and Belsley, Kuh, and Welsch (1980) DFFITS i (Ďƒ) in (8.22), i.e., âˆš )/(Ďƒ hii ) DFFITS i (Ďƒ) = kDi (Ďƒ) = ( yi âˆ’ xâ€˛i Î˛ (i)
Belsley, Kuh, and Welsch (1980) suggest nominating DFFITS based on s(i) exceeding 2 k/n âˆš for special attention. Cookâ€™s 50 percentile recommendation is equivalent to DFFITS > k, which is more conservative, see Velleman and Welsch (1981). Next, we study the inďŹ‚uence of the ith observation deletion on the covariance matrix of the regression coeďŹƒcients. One can compare the two covariance matrices using the ratio of their determinants: ! â€˛ X ]âˆ’1 â€˛ X ]âˆ’1 ) det[X(i) s2k det(s2(i) [X(i) (i) (i) (i) (8.26) = 2k COVRATIO i = det(s2 [X â€˛ X]âˆ’1 ) s det[X â€˛ X]âˆ’1 Using the fact that â€˛ X(i) ] = (1 âˆ’ hii )det[X â€˛ X] det[X(i)
(8.27)
see problem 8, one obtains COVRATIO i =
s2(i) s2
!k
Ă—
1 = 1 âˆ’ hii nâˆ’kâˆ’1 nâˆ’k
1 +
eâˆ—2 i nâˆ’k
k
(8.28) (1 âˆ’ hii )
where the last equality follows from (8.18) and the deďŹ nition of eâˆ—i in (8.3). Values of COVRATIO not near unity identify possible inďŹ‚uential observations and warrant further investigation. Belsley, Kuh and Welsch (1980) suggest investigating points with COVRATIO âˆ’1 near to or larger than 3k/n. The COVRATIO depends upon both hii and eâˆ—2 i . In fact, from (8.28), COVRATIO is large when hii is large and small when eâˆ—i is large. The two factors can offset each other, that is why it is important to look at hii and eâˆ—i separately as well as in combination as in COVRATIO. Finally, one can look at how the variance of yi changes when an observation is deleted. var( yi ) = s2 hii
and the ratio is
and
) = s2 (hii /(1 âˆ’ hii )) var( y(i) ) = var(xâ€˛i Î˛ (i) (i)
FVARATIO i = s2(i) /s2 (1 âˆ’ hii )
(8.29)
This expression is similar to COVRATIO except that [s2(i) /s2 ] is not raised to the kth power. As a diagnostic measure it will exhibit the same patterns of behavior with respect to different conďŹ gurations of hii and the studentized residual as described for COVRATIO.
8.1
Inﬂuential Observations
183
Table 8.1 Cigarette Regression Dependent Variable: LNC Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
DF 2 43 45 0.16343 4.84784 3.37125
Sum of Squares 0.50098 1.14854 1.64953
Mean Square 0.25049 0.02671
Rsquare Adj Rsq
0.3037 0.2713
F Value 9.378
Prob>F 0.0004
T for H0: Parameter=0 4.730 –4.123 0.876
Prob > T 0.0001 0.0002 0.3858
Parameter Estimates
Variable INTERCEP LNP LNY
DF 1 1 1
Parameter Estimate 4.299662 –1.338335 0.172386
Standard Error 0.90892571 0.32460147 0.19675440
Example 1: For the cigarette data given in Table 3.2, Table 8.1 gives the SAS least squares regression for logC on logP and logY . logC = 4.30 − 1.34 logP + 0.172 logY + residuals (0.909) (0.325) (0.197) ¯ 2 = 0.271. Table 8.2 gives the data The standard error of the regression is s = 0.16343 and R along with the predicted values of logC, the least squares residuals e, the internal studentized residuals e given in (8.1), the externally studentized residuals e∗ given in (8.3), the Cook statistic given in (8.25), the leverage of each observation h, the DFFITS given in (8.22) and the COVRATIO given in (8.28). Using the leverage column, one can identify four potential observations with high leverage, i.e., ¯ = 2k/n = 6/46 = 0.13043. These are the observations belonging to the following greater than 2h states: Connecticut (CT), Kentucky (KY), New Hampshire (NH) and New Jersey (NJ) with leverage 0.13535, 0.19775, 0.13081 and 0.13945, respectively. Note that the corresponding OLS residuals are −0.078, 0.234, 0.160 and −0.059, which are not necessarily large. The internally studentized residuals are computed using equation (8.1). For KY this gives 0.23428 eKY √ = 1.6005 = eKY = √ s 1 − hKY 0.16343 1 − 0.19775
From Table 8.2, two observations with a high internally studentized residuals are those belonging to Arkansas (AR) and Utah (UT) with values of 2.102 and −2.679 respectively, both larger than 2 in absolute value. The externally studentized residuals are computed from (8.3). For KY, we ﬁrst compute s2(KY ) , the MSE from the regression computed without the KY observation. From (8.2), this is
184
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
given by s2(KY ) = =
(n − k)s2 − e2KY /(1 − hKY ) (n − k − 1) (46 − 3)(0.16343)2 − (0.23428)2 /(1 − 0.19775) = 0.025716 (46 − 3 − 1)
From (8.3) we get e∗(KY ) =
0.23428 e √ √KY = 1.6311 = s(KY ) 1 − hKY 0.16036 1 − 0.19775
This externally studentized residual is distributed as a tstatistic with 42 degrees of freedom. However, e∗KY does not exceed 2 in absolute value. Again, e∗AR and e∗U T are 2.193 and −2.901 both larger than 2 in absolute value. From (8.13), the change in the regression coeﬃcients due to the omission of the KY observation is given by ′ −1 −β β (KY ) = (X X) xKY eKY /(1 − hKY )
Using the fact that
(X ′ X)−1
⎡
⎤ 30.929816904 4.8110214655 −6.679318415 = ⎣ 4.81102114655 3.9447686638 −1.177208398 ⎦ −6.679318415 −1.177208398 1.4493372835
and x′KY = (1, −0.03260, 4.64937) with eKY = 0.23428 and hKY = 0.19775 one gets ′ −β (β (KY ) ) = (−0.082249, −0.230954, 0.028492)
In order to assess whether this change is large or small, we compute DFBETAS given in (8.19). For the KY observation, these are given by DFBETAS KY,1 =
−β β −0.082449 1 1(KY ) √ = = −0.09222 0.16036 30.9298169 s(KY ) (X ′ X)−1 11
Similarly, DFBETAS KY,2 = −0.7251 and DFBETAS KY,3 = 0.14758. These are not larger than √ √ 2 in absolute value. However, DFBETAS KY,2 is larger than 2/ n = 2/ 46 = 0.2949 in absolute value. This is the sizeadjusted cutoff recommended by Belsley, Kuh and Welsch (1980) for large n. The change in the ﬁt due to the omission of the KY observation is given by (8.21). In fact, −β DFFIT KY = yKY − y(KY ) = x′KY [β (KY ) ] ⎛ ⎞ = (1, −0.03260, 4.64937) ⎜ −0.082249 ⎟ = 0.05775 ⎝ −0.230954 ⎠ −0.028492
or simply
DFFIT KY =
(0.19775)(0.23428) hKY eKY = = 0.05775 (1 − hKY ) 1 − 0.19775
8.2
Scaling it by the variance of y(KY ) we get from (8.22) DFFITS KY =
hKY 1 − hKY
1/2
e∗KY =
0.19775 1 − 0.19775
1/2
Recursive Residuals
185
(1.6311) = 0.8098
This larger than 2 in absolute value, but it is larger than the sizeadjusted cutoff of is not 2 k/n = 2 3/46 = 0.511. Note also that both DFFITS AR = 0.667 and DFFITS U T = −0.888 are larger than 0.511 in absolute value. Cook’s distance measure is given in (8.25) and for KY can be computed as e2KY hKY (0.23428)2 0.19775 2 DKY (s) = = = 0.21046 ks2 (1 − hKY )2 3(0.16343)2 (1 − 0.19775)2 2 (s) = 0.13623 and D 2 (s) = 0.22399, The other two large Cook’s distance measures are DRAR UT respectively. COVRATIO omitting the KY observation can be computed from (8.28) as !k s2(KY ) 1 1 0.025716 3 COVRATIO KY = = 1.1125 = s2 1 − hKY (0.16343)2 (1 − 0.019775)
which means that COVRATIO KY − 1/ = 0.1125 is less than 3k/n = 9/46 = 0.1956. Finally, FVARATIO omitting the KY observation can be computed from (8.29) as FVARATIO KY =
s2(KY ) s2 (1 − hKY )
=
0.025716 = 1.2001 (0.16343)2 (1 − 0.19775)
By several diagnostic measures, AR, KY and UT are inﬂuential observations that deserve special attention. The ﬁrst two states are characterized with large sales of cigarettes. KY is a producer state with a very low price on cigarettes, while UT is a low consumption state due to its high percentage of Mormon population (a religion that forbids smoking). Table 8.3 gives the predicted consumption along with the 95% conﬁdence band, the OLS residuals, and the internalized student residuals, Cook’s Dstatistic and a plot of these residuals. This last plot highlights the fact that AR, UT and KY have large studentized residuals.
8.2
Recursive Residuals
In Section 8.1, we showed that the least squares residuals are heteroskedastic with nonzero covariances, even when the true disturbances have a scalar covariance matrix. This section studies recursive residuals which are a set of linear unbiased residuals with a scalar covariance matrix. They are independent and identically distributed when the true disturbances themselves are independent and identically distributed.2 These residuals are natural in timeseries regressions and can be constructed as follows: = (X ′ Xt )−1 X ′ Yt where Xt denotes 1. Choose the ﬁrst t ≥ k observations and compute β t t t the t × k matrix of t observations on k variables and Yt′ = (y1 , . . . , yt ). The recursive residuals are basically standardized onestep ahead forecast residuals: )/ 1 + x′ (X ′ Xt )−1 xt+1 wt+1 = (yt+1 − x′t+1 β (8.30) t t t+1
186
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
Table 8.2 Diagnostic Statistics for the Cigarettes Example STATE
LNC
LNP
LNY
e
e˜
e⋆
1
AL
4.96213
0.20487
4.64039
4.8254
0.1367
0.857
0.8546
0.012
0.0480
0.1919
2
AZ
4.66312
0.16640
4.68389
4.8844
–0.2213
–1.376
–1.3906
0.021
0.0315
–0.2508
0.9681
3
AR
5.10709
0.23406
4.59435
4.7784
0.3287
2.102
2.1932
0.136
0.0847
0.6670
0.8469
4
CA
4.50449
0.36399
4.88147
4.6540
–0.1495
–0.963
–0.9623
0.033
0.0975
–0.3164
1.1138
5
CT
4.66983
0.32149
5.09472
4.7477
–0.0778
–0.512
–0.5077
0.014
0.1354
–0.2009
1.2186
6
DE
5.04705
0.21929
4.87087
4.8458
0.2012
1.252
1.2602
0.018
0.0326
0.2313
0.9924
7
DC
4.65637
0.28946
5.05960
4.7845
–0.1281
–0.831
–0.8280
0.029
0.1104
–0.2917
1.1491
8
FL
4.80081
0.28733
4.81155
4.7446
0.0562
0.352
0.3482
0.002
0.0431
0.0739
1.1118 1.1142
OBS
PREDICTED
Cook’s D
Leverage
DFFITS COVRATIO 1.0704
9
GA
4.97974
0.12826
4.73299
4.9439
0.0358
0.224
0.2213
0.001
0.0402
0.0453
10
ID
4.74902
0.17541
4.64307
4.8653
–0.1163
–0.727
–0.7226
0.008
0.0413
–0.1500
1.0787
11
IL
4.81445
0.24806
4.90387
4.8130
0.0014
0.009
0.0087
0.000
0.0399
0.0018
1.1178
12
IN
5.11129
0.08992
4.72916
4.9946
0.1167
0.739
0.7347
0.013
0.0650
0.1936
1.1046
13
IA
4.80857
0.24081
4.74211
4.7949
0.0137
0.085
0.0843
0.000
0.0310
0.0151
1.1070
14
KS
4.79263
0.21642
4.79613
4.8368
–0.0442
–0.273
–0.2704
0.001
0.0223
–0.0408
1.0919
15
KY
5.37906
–0.03260
4.64937
5.1448
0.2343
1.600
1.6311
0.210
0.1977
0.8098
1.1126
16
LA
4.98602
0.23856
4.61461
4.7759
0.2101
1.338
1.3504
0.049
0.0761
0.3875
1.0224
17
ME
4.98722
0.29106
4.75501
4.7298
0.2574
1.620
1.6527
0.051
0.0553
0.4000
0.9403
18
MD
4.77751
0.12575
4.94692
4.9841
–0.2066
–1.349
–1.3624
0.084
0.1216
–0.5070
1.0731
19
MA
4.73877
0.22613
4.99998
4.8590
–0.1202
–0.769
–0.7653
0.018
0.0856
–0.2341
1.1258
20
MI
4.94744
0.23067
4.80620
4.8195
0.1280
0.792
0.7890
0.005
0.0238
0.1232
1.0518
21
MN
4.69589
0.34297
4.81207
4.6702
0.0257
0.165
0.1627
0.001
0.0864
0.0500
1.1724
22
MS
4.93990
0.13638
4.52938
4.8979
0.0420
0.269
0.2660
0.002
0.0883
0.0828
1.1712
23
MO
5.06430
0.08731
4.78189
5.0071
0.0572
0.364
0.3607
0.004
0.0787
0.1054
1.1541
24
MT
4.73313
0.15303
4.70417
4.9058
–0.1727
–1.073
–1.0753
0.012
0.0312
–0.1928
1.0210
25
NE
4.77558
0.18907
4.79671
4.8735
–0.0979
–0.607
–0.6021
0.003
0.0243
–0.0950
1.0719
26
NV
4.96642
0.32304
4.83816
4.7014
0.2651
1.677
1.7143
0.065
0.0646
0.4504
0.9366
27
NH
5.10990
0.15852
5.00319
4.9500
0.1599
1.050
1.0508
0.055
0.1308
0.4076
1.1422
28
NJ
4.70633
0.30901
5.10268
4.7657
–0.0594
–0.392
–0.3879
0.008
0.1394
–0.1562
1.2337
29
NM
4.58107
0.16458
4.58202
4.8693
–0.2882
–1.823
–1.8752
0.076
0.0639
–0.4901
0.9007
30
NY
4.66496
0.34701
4.96075
4.6904
–0.0254
–0.163
–0.1613
0.001
0.0888
–0.0503
1.1755
31
ND
4.58237
0.18197
4.69163
4.8649
–0.2825
–1.755
–1.7999
0.031
0.0295
–0.3136
0.8848
32
OH
4.97952
0.12889
4.75875
4.9475
0.0320
0.200
0.1979
0.001
0.0423
0.0416
1.1174
33
OK
4.72720
0.19554
4.62730
4.8356
–0.1084
–0.681
–0.6766
0.008
0.0505
–0.1560
1.0940
34
PA
4.80363
0.22784
4.83516
4.8282
–0.0246
–0.153
–0.1509
0.000
0.0257
–0.0245
1.0997
35
RI
4.84693
0.30324
4.84670
4.7293
0.1176
0.738
0.7344
0.010
0.0504
0.1692
1.0876
36
SC
5.07801
0.07944
4.62549
4.9907
0.0873
0.555
0.5501
0.008
0.0725
0.1538
1.1324
37
SD
4.81545
0.13139
4.67747
4.9301
–0.1147
–0.716
–0.7122
0.007
0.0402
–0.1458
1.0786
38
TN
5.04939
0.15547
4.72525
4.9062
0.1432
0.890
0.8874
0.008
0.0294
0.1543
1.0457
39
TX
4.65398
0.28196
4.73437
4.7384
–0.0845
–0.532
–0.5271
0.005
0.0546
–0.1267
1.1129
40
UT
4.40859
0.19260
4.55586
4.8273
–0.4187
–2.679
–2.9008
0.224
0.0856
–0.8876
0.6786
41
VT
5.08799
0.18018
4.77578
4.8818
0.2062
1.277
1.2869
0.014
0.0243
0.2031
0.9794
42
VA
4.93065
0.11818
4.85490
4.9784
–0.0478
–0.304
–0.3010
0.003
0.0773
–0.0871
1.1556
43
WA
4.66134
0.35053
4.85645
4.6677
–0.0064
–0.041
–0.0404
0.000
0.0866
–0.0124
1.1747
44
WV
4.82454
0.12008
4.56859
4.9265
–0.1020
–0.647
–0.6429
0.011
0.0709
–0.1777
1.1216
45
WI
4.83026
0.22954
4.75826
4.8127
0.0175
0.109
0.1075
0.000
0.0254
0.0174
1.1002
46
WY
5.00087
0.10029
4.71169
4.9777
0.0232
0.146
0.1444
0.000
0.0555
0.0350
1.1345
Table 8.3 Regression of Real PerCapita Consumption of Cigarettes Dep Obs
4.8254 4.8844 4.7784 4.6540 4.7477 4.8458 4.7845 4.7446 4.9439 4.8653 4.8130 4.9946 4.7949 4.8368 5.1448 4.7759 4.7298 4.9841 4.8590 4.8195 4.6702 4.8979 5.0071 4.9058 4.8735 4.7014 4.9500 4.7657 4.8693 4.6904 4.8649 4.9475 4.8356 4.8282 4.7293 4.9907 4.9301 4.9062 4.7384 4.8273 4.8818 4.9784 4.6677 4.9265 4.8127 4.9777
0.036 0.029 0.048 0.051 0.060 0.030 0.054 0.034 0.033 0.033 0.033 0.042 0.029 0.024 0.073 0.045 0.038 0.057 0.048 0.025 0.048 0.049 0.046 0.029 0.025 0.042 0.059 0.061 0.041 0.049 0.028 0.034 0.037 0.026 0.037 0.044 0.033 0.028 0.038 0.048 0.025 0.045 0.048 0.044 0.026 0.039 0 1.1485 1.3406
Lower95% Mean 4.7532 4.8259 4.6825 4.5511 4.6264 4.7863 4.6750 4.6761 4.8778 4.7983 4.7472 4.9106 4.7368 4.7876 4.9982 4.6850 4.6523 4.8692 4.7625 4.7686 4.5733 4.8000 4.9147 4.8476 4.8221 4.6176 4.8308 4.6427 4.7859 4.5922 4.8083 4.8797 4.7616 4.7754 4.6553 4.9020 4.8640 4.8497 4.6614 4.7308 4.8304 4.8868 4.5708 4.8387 4.7602 4.9000
Upper95% Mean 4.8976 4.9429 4.8743 4.7570 4.8689 4.9053 4.8940 4.8130 5.0100 4.9323 4.8789 5.0786 4.8529 4.8860 5.2913 4.8668 4.8074 5.0991 4.9554 4.8703 4.7671 4.9959 5.0996 4.9640 4.9249 4.7851 5.0692 4.8888 4.9526 4.7886 4.9215 5.0153 4.9097 4.8811 4.8033 5.0795 4.9963 4.9626 4.8155 4.9237 4.9332 5.0701 4.7647 5.0143 4.8653 5.0553
Lower95% Predict
Upper95% Predict
Std Err Residual
Student Residual
Residual
4.4880 4.5497 4.4351 4.3087 4.3965 4.5109 4.4372 4.4079 4.6078 4.5290 4.4769 4.6544 4.4602 4.5036 4.7841 4.4340 4.3912 4.6351 4.5155 4.4860 4.3267 4.5541 4.6648 4.5711 4.5399 4.3613 4.5995 4.4139 4.5293 4.3465 4.5305 4.6110 4.4978 4.4944 4.3915 4.6494 4.5940 4.5718 4.4000 4.4839 4.5482 4.6363 4.3242 4.5854 4.4790 4.6391
5.1628 5.2191 5.1217 4.9993 5.0989 5.1808 5.1318 5.0812 5.2801 5.2016 5.1491 5.3347 5.1295 5.1701 5.5055 5.1178 5.0684 5.3332 5.2024 5.1530 5.0137 5.2418 5.3495 5.2405 5.2071 5.0414 5.3005 5.1176 5.2092 5.0343 5.1993 5.2840 5.1735 5.1621 5.0671 5.3320 5.2663 5.2406 5.0769 5.1707 5.2154 5.3205 5.0113 5.2676 5.1465 5.3163
0.1367 –0.2213 0.3287 –0.1495 –0.0778 0.2012 –0.1281 0.0562 0.0358 –0.1163 0.00142 0.1167 0.0137 –0.0442 0.2343 0.2101 0.2574 –0.2066 –0.1202 0.1280 0.0257 0.0420 0.0572 –0.1727 –0.0979 0.2651 0.1599 –0.0594 –0.2882 –0.0254 –0.2825 0.0320 –0.1084 –0.0246 0.1176 0.0873 –0.1147 0.1432 –0.0845 –0.4187 0.2062 –0.0478 –0.00638 –0.1020 0.0175 0.0232
0.159 0.161 0.156 0.155 0.152 0.161 0.154 0.160 0.160 0.160 0.160 0.158 0.161 0.162 0.146 0.157 0.159 0.153 0.156 0.161 0.156 0.156 0.157 0.161 0.161 0.158 0.152 0.152 0.158 0.156 0.161 0.160 0.159 0.161 0.159 0.157 0.160 0.161 0.159 0.156 0.161 0.157 0.156 0.158 0.161 0.159
0.857 –1.376 2.102 –0.963 –0.512 1.252 –0.831 0.352 0.224 –0.727 0.009 0.739 0.085 –0.273 1.600 1.338 1.620 –1.349 –0.769 0.792 0.165 0.269 0.364 –1.073 –0.607 1.677 1.050 –0.392 –1.823 –0.163 –1.755 0.200 –0.681 –0.153 0.738 0.555 –0.716 0.890 –0.532 –2.679 1.277 –0.304 –0.041 –0.647 0.109 0.146
–2 –1 0
1 2
Cook’s D 0.012 0.021 0.136 0.033 0.014 0.018 0.029 0.002 0.001 0.008 0.000 0.013 0.000 0.001 0.210 0.049 0.051 0.084 0.018 0.005 0.001 0.002 0.004 0.012 0.003 0.065 0.055 0.008 0.076 0.001 0.031 0.001 0.008 0.000 0.010 0.008 0.007 0.008 0.005 0.224 0.014 0.003 0.000 0.011 0.000 0.000
187
Sum of Residuals Sum of Squared Residuals Predicted Resid SS (Press)
Std Err Predict
Recursive Residuals
4.9621 4.6631 5.1071 4.5045 4.6698 5.0471 4.6564 4.8008 4.9797 4.7490 4.8145 5.1113 4.8086 4.7926 5.3791 4.9860 4.9872 4.7775 4.7388 4.9474 4.6959 4.9399 5.0643 4.7331 4.7756 4.9664 5.1099 4.7063 4.5811 4.6650 4.5824 4.9795 4.7272 4.8036 4.8469 5.0780 4.8155 5.0494 4.6540 4.4086 5.0880 4.9307 4.6613 4.8245 4.8303 5.0009
Predict Value
8.2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
Var LNC
188
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
t+1 = (X ′ Xt+1 )−1 X ′ Yt+1 . 2. Add the (t + 1)th observation to the data and obtain β t+1 t+1 Compute wt+2 . 3. Repeat step 2, adding one observation at a time. In timeseries regressions, one usually starts with the ﬁrst kobservations and obtain (T − k) forward recursive residuals. These recursive residuals can be computed using the updating formula given in (8.11) with A = (Xt′ Xt ) and a = −b = x′t+1 . Therefore, ′ (Xt+1 Xt+1 )−1 = (Xt′ Xt )−1 −(Xt′ Xt )−1 xt+1 x′t+1 (Xt′ Xt )−1 /[1+x′t+1 (Xt′ Xt )−1 xt+1 ] (8.31)
and only (Xt′ Xt )−1 have to be computed. Also, ′ −1 ′ β t+1 = β t + (Xt Xt ) xt+1 (yt+1 − xt+1 β t )/ft+1
(8.32)
where ft+1 = 1 + x′t+1 (Xt′ Xt )−1 xt+1 , see problem 13. Alternatively, one can compute these residuals by regressing Yt+1 on Xt+1 and dt+1 where dt+1 = 1 for the (t + 1)th observation, and zero otherwise, see equation (8.5). The estimated coeﬃcient of dt+1 is the numerator of wt+1 . The standard error of this estimate is st+1 times the denominator of wt+1 , where st+1 is the standard error of this regression. Hence, wt+1 can be retrieved as st+1 multiplied by the tstatistic corresponding to dt+1 . This computation has to be performed sequentially, in each case generating the corresponding recursive residual. This may be computationally ineﬃcient, but it is simple to generate using regression packages. It is obvious from (8.30) that if ut ∼ IIN(0, σ 2 ), then wt+1 has zero mean and var(wt+1 ) = σ 2 . Furthermore, wt+1 is linear in the y’s. Therefore, it is normally distributed. It remains to show that the recursive residuals are independent. Given normality, it is suﬃcient to show that cov(wt+1 , ws+1 ) = 0
for
t = s; t, s = k, . . . , T − 1
(8.33)
This is left as an exercise for the reader, see problem 13. Alternatively, one can express the T − k vector of recursive residuals as w = Cy where C is of dimension (T − k) × T as follows: ⎤ ⎡ x′ (X ′ X )−1 X ′ k k √1 0....0 − k+1 √k fk+1 fk+1 ⎥ ⎢ ⎥ ⎢ .. .. ⎥ ⎢ . . ⎥ ⎢ ′ ′ ⎥ ⎢ x′t (Xt−1 Xt−1 )−1 Xt−1 (8.34) C=⎢ − √ √1 0....0 ⎥ ft ft ⎥ ⎢ ⎥ ⎢ .. .. ⎥ ⎢ . . ⎦ ⎣ x′T (XT′ −1 XT −1 )−1 XT′ −1 √ √1 − f f T
T
Problem 14 asks the reader to verify that w = Cy, using (8.30). Also, that the matrix C satisﬁes the following properties: (i) CX = 0
(ii) CC ′ = IT −k
(iii) C ′ C = P¯X
(8.35)
This means that the recursive residuals w are (LUS) linear in y, unbiased with mean zero and have a scalar variancecovariance matrix: var(w) = CE(uu′ )C ′ = σ 2 IT −k . Property (iii) also
8.2
Recursive Residuals
189
means that w′ w = y ′ C ′ Cy = y ′ P¯X y = e′ e. This means that the sum of squares of (T − k) recursive residuals is equal to the sum of squares of T least squares residuals. One can also show from (8.32) that 2 RSSt+1 = RSSt + wt+1
for
t = k, . . . , T − 1
(8.36)
t ), see problem 14. Note that for t = k; RSS = 0, since t )′ (Yt − Xt β where RSSt = (Yt − Xt β with k observations one gets a perfect ﬁt and zero residuals. Therefore RSST =
T
2 t=k+1 wt
=
T
2 t=1 et
(8.37)
Applications of Recursive Residuals Recursive residuals have been used in several important applications: (1) Harvey (1976) used these recursive residuals to give an alternative proof of the fact that Chow’s postsample predictive test has an F distribution. Recall, from Chapter 7, that when the second sample n2 had fewer than k observations, Chow’s test becomes F =
(e′ e − e′1 e1 )/n2 ∼ F (n2 , n1 − k) e′1 e1 /(n1 − k)
(8.38)
where e′ e = RSS from the total sample (n1 + n2 = T observations), and e′1 e1 = RSS from the ﬁrst n1 observations. Recursive residuals can be computed for t = k + 1, . . . , n1 , and continued on for the extra n2 observations. From (8.36) we have e′ e = Therefore,
n1 +n2
2 t=k+1 wt
and
n1 +n2 2 wt /n2 F = n1 t=n1 +1 2 t=k+1 wt /(n1 − k)
e′1 e1 =
n1
2 t=k+1 wt
(8.39)
(8.40)
But the wt ’s are ∼IIN(0, σ 2 ) under the null, therefore the F statistic in (8.38) is a ratio of two independent chisquared variables, each divided by the appropriate degrees of freedom. Hence, F ∼ F (n2 , n1 − k) under the null, see Chapter 2. (2) Harvey and Phillips (1974) used recursive residuals to test the null hypothesis of homoskedasticity. If the alternative hypothesis is that σ 2i varies with Xj , the proposed test is as follows: 1) Order the data according to Xj and choose a base of at least k observations from among the central observations. 2) From the ﬁrst m observations compute the vector of recursive residuals w1 using the base constructed in step 1. Also, compute the vector of recursive residuals w2 from the last m observations. The maximum m can be is (T − k)/2.
190
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
3) Under the null hypothesis, it follows that F = w2′ w2 /w1′ w1 ∼ Fm,m
(8.41)
Harvey and Phillips suggest setting m at approximately (n/3) provided n > 3k. This test has the advantage over the GoldfeldQuandt test in that if one wanted to test whether σ 2i varies with some other variable Xs , one could simply regroup the existing recursive residuals according to low and high values of Xs and compute (8.41) afresh, whereas the GoldfeldQuandt test would require the computation of two new regressions. (3) Phillips and Harvey (1974) suggest using the recursive residuals to test the null hypothesis of no serial correlation using a modiﬁed von Neuman ratio: T (wt − wt−1 )2 /(T − k − 1) (8.42) M V N R = t=k+2T 2 t=k+1 wt /(T − k)
This is the ratio of the meansquare successive difference to the variance. It is arithmetically closely related to the DW statistic, but given that w ∼ N (0, σ 2 IT −k ) one has an exact test available and no inconclusive regions. Phillips and Harvey (1974) provide tabulations of the signiﬁcance points. If the sample size is large, a satisfactory approximation is obtained from a normal distribution with mean 2 and variance 4/(T − k).
(4) Harvey and Collier (1977) suggest a test for functional misspeciﬁcation based on recursive residuals. This is based on the fact that w ∼ N (0, σ 2 IT −k ). Therefore, √ w/(s ¯ w / T − k) ∼ tT −k−1 (8.43) where w ¯ = Tt=k+1 wt /(T − k) and s2w = Tt=k+1 (wt − w) ¯ 2 /(T − k − 1). Suppose that the true functional form relating y to a single explanatory variable X is concave (convex) and the data are ordered by X. A simple linear regression is estimated by regressing y on X. The recursive residuals would be expected to be mainly negative (positive) and the computed tstatistic will be large in absolute value. When there are multiple X’s, one could carry out this test based on any single explanatory variable. Since several speciﬁcation errors might have a selfcancelling effect on the recursive residuals, this test is not likely to be very effective in multivariate situations. Recently, Wu (1993) suggested performing this test using the following augmented regression: y = Xβ + zγ + v
(8.44)
where z = C ′ ιT −k is one additional regressor with C deﬁned in (8.34) and ιT −k denoting a vector of ones of dimension T − k. In fact, the F statistic for testing H0 ; γ = 0 turns out to be the square of the Harvey and Collier (1977) tstatistic given in (8.43), see problem 15. Alternatively, a Sign test may be used to test the null hypothesis of no functional misspeciﬁcation. Under the null hypothesis, the expected number of positive recursive residuals is equal to (T − k)/2. A critical region may therefore be constructed from the binomial distribution. However, Harvey and Collier (1977) suggest that the Sign test tends to lack power compared with the ttest described in (8.43). Nevertheless, it is very simple and it may be more robust to nonnormality.
8.2
Recursive Residuals
191
(5) Brown, Durbin and Evans (1975) used recursive residuals to test for structural change over time. The null hypothesis is β 1 = β 2 = .. = β T = β (8.45) H0 ; σ 21 = σ 22 = .. = σ 2T = σ 2 where β t is the vector of coeﬃcients in period t and σ 2t is the disturbance variance for that period. The authors suggest a pair of tests. The ﬁrst is the CUSUM test which computes Wr = rt=k+1 wt /sw for r = k + 1, . . . , T (8.46)
where s2w is an estimate of the variance of the wt ’s, given below (8.43). Wr is a cumulative sum and should be plotted against r. Under the null, E(Wr ) = 0. But, if there is a structural break, Wr will tend to diverge from the horizontal line. The authors suggest checking whether Wr √ T − k and cross a pair of straight lines (see Figure 8.1) which pass through the points k, ±a √ T, ±3a T − k where a depends upon the chosen signiﬁcance level α. For example, a = 0.850, 0.948, and 1.143 for α = 10%, 5%, and 1% levels, respectively. If the coeﬃcients are not constant, there may be a tendency for a disproportionate number of recursive residuals to have the same sign and to push Wr across the boundary. The second test is the cumulative sum of squares (CUSUMSQ) which is based on plotting Wr∗ = rt=k+1 wt2 / Tt=k+1 wt2 for t = k + 1, . . . , T (8.47)
against r. Under the null, E(Wr∗ ) = (r − k)/(T − k) which varies from 0 for r = k to 1 for r = T . The signiﬁcance of the departure of Wr∗ from its expected value is assessed by whether Wr∗ crosses a pair of lines parallel to E(Wr∗ ) at a distance co above and below this line. Brown, Durbin and Evans (1975) provide values of c0 for various sample sizes T and levels of signiﬁcance α. The CUSUM and CUSUMSQ should be regarded as data analytic techniques; i.e., the value of the plots lie in the information to be gained simply by inspecting them. The plots contain more information than can be summarized in a single test statistic. The signiﬁcance lines constructed are, to paraphrase the authors, best regarded as ‘yardsticks’ against which to assess the observed plots rather than as formal tests of signiﬁcance. See Brown et al. (1975) for various examples. Note that the CUSUM and CUSUMSQ are quite general tests for structural change in that they do not require a prior determination of where the structural break takes place. If this is known, the Chowtest will be more powerful. But, if this break is not known, the CUSUM and CUSUMSQ are more appropriate. Example 2: Table 8.4 reproduces the consumptionincome data, over the period 19501993, taken from the Economic Report of the President. In addition, the recursive residuals are computed as in (8.30) and exhibited in column 4, starting with 1952 and ending in 1993. Column 5 gives the CUSUM given by Wr in (8.46) and this is plotted against r in Figure 8.2. The CUSUM crosses the upper 5% line in 1993, showing structural instability in the latter years. This was done using EViews. The postsample predictive test for 1993, can be obtained from (8.38) by computing the RSS from 19501992 and comparing it with the RSS from 19501993. The observed F statistic is 5.3895 which is distributed as F (1, 41). The same F statistic is obtained from (8.40) using recursive residuals. In fact, 2 2 2 / 1992 F = w1993 t=1952 wt /41 = (339.300414) /875798.36 ÷ 41 = 5.3895
192
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
wr
3a T – k 2a T – k
a T –k 0
n
t
k
–a T – k
– 3a T – k
Figure 8.1 CUSUM Critical Values
The Phillips and Harvey (1974) modiﬁed von Neuman ratio can be computed from (8.42) and yields an MVNR of 0.4263. The asymptotic distribution of this statistic is N (2, 4/(44 − 2)) which yields a standardized N (0, 1) statistic z = (0.4263 − 2)/0.3086 = −5.105. This is highly signiﬁcant and indicates the presence of positive serial correlation. The Harvey and Collier (1977) functional misspeciﬁcation test, given in (8.43), yields w¯ = 66.585 and sw = 140.097 and an observed tstatistic of 0.0733. This is distributed as t41 under the null hypothesis and is not signiﬁcant. 30 20 10 0 –10
–20 55
60
65 CUSUM
70
75
80
85
90
5% Significance
Figure 8.2 CUSUM Plot of ConsumptionIncome Data
8.2
Recursive Residuals
193
Table 8.4 ConsumptionIncome Example YEAR 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
Consumption 5820 5843 5917 6054 6099 6365 6440 6465 6449 6658 6698 6740 6931 7089 7384 7703 8005 8163 8506 8737 8842 9022 9425 9752 9602 9711 10121 10425 10744 10876 10746 10770 10782 11179 11617 12015 12336 12568 12903 13029 13093 12899 13110 13391
Income
Recursive residuals
Wr of CUSUM
6284 6390 6476 6640 6628 6879 7080 7114 7113 7256 7264 7382 7583 7718 8140 8508 8822 9114 9399 9606 9875 10111 10414 11013 10832 10906 11192 11406 11851 12039 12005 12156 12146 12349 13029 13258 13552 13545 13890 14005 14101 14003 14279 14341
N/A N/A 24.90068062 30.35482733 50.89329055 63.26038894 –49.80590703 –28.40431109 –31.52055902 53.19425584 67.69611375 –2.64655577 9.67914674 39.65882720 –40.12655671 –30.26075567 2.60563312 –78.94146710 27.18506579 64.36319479 –64.90671688 –71.64101261 70.09586688 –113.47532273 –85.63317118 –29.42763002 128.32845919 220.69313327 126.59174925 78.39424741 –25.95557363 –124.17868561 –90.84519334 127.83058064 –30.79462855 159.78087196 201.70712713 405.31056105 390.95384121 373.37091949 316.43123530 188.10968277 134.46128461 339.30041400
N/A N/A 0.17773916 0.39440960 0.75768203 1.20922986 0.85371909 0.65097129 0.42597994 0.80567648 1.28888618 1.26999527 1.33908428 1.62216596 1.33574566 1.11974669 1.13834551 0.57486733 0.76891226 1.22833184 0.76503264 0.25366456 0.75400351 –0.05597469 –0.66721774 –0.87726992 0.03872884 1.61401959 2.51762185 3.07719401 2.89192510 2.00554711 1.35710105 2.26954599 2.04973628 3.16024996 4.63001005 7.52308592 10.31368463 12.97877777 15.23743980 16.58015237 17.53992676 19.96182724
194
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
8.3
Speciﬁcation Tests
Speciﬁcation tests are an important part of model speciﬁcation in econometrics. In this section, we only study a few of these diagnostic tests. For an excellent summary on this topic, see Wooldridge (2001). (1) Ramsey’s (1969) RESET (Regression Speciﬁcation Error Test) Ramsey suggests testing the speciﬁcation of the linear regression model yt = Xt′ β + ut by augmenting it with a set of regressors Zt so that the augmented model is yt = Xt′ β + Zt′ γ + ut
(8.48)
If the Zt ’s are available then the speciﬁcation test would reduce to the F test for H0 ; γ = 0. The crucial issue is the choice of Zt variables. This depends upon the true functional form under the alternative, which is usually unknown. However, this can be often well approximated by higher powers of the initial regressors, as in the case where the true form is quadratic or cubic. OLS . The popular Alternatively, one might approximate it with higher moments of yˆt = Xt′ β Ramsey RESET test is carried out as follows: (1) Regress yt on Xt and get yˆt . (2) Regress yt on Xt , yˆt2 , yˆt3 and yˆt4 and test that the coeﬃcients of all the powers of yˆt are zero. This is an F3,T −k−3 under the null. Note that yˆt is not included among the regressors because it would be perfectly multicollinear with Xt .3 Different choices of Zt ’s may result in more powerful tests when H0 is not true. Thursby and Schmidt (1977) carried out an extensive Monte Carlo and concluded that the test based on Zt = [Xt2 , Xt3 , Xt4 ] seems to be generally the best choice. (2) Utts’ (1982) Rainbow Test The basic idea behind the Rainbow test is that even when the true relationship is nonlinear, a good linear ﬁt can still be obtained over subsets of the sample. The test therefore rejects the null hypothesis of linearity whenever the overall ﬁt is markedly inferior to the ﬁt over a properly selected subsample of the data, see Figure 8.3. e be Let e′ e be the OLS residuals sum of squares from all available n observations and let e′ the OLS residual sum of squares from the middle half of the observations (T /2). Then (e′ e − e′ e)/ T2 (8.49) is distributed as F T ,( T −k) under H0 F = 2 2 e′ e/ T − k 2
Under H0 ; E(e′ e/T − k) = σ 2 = E e′ e/ T2 − k , while in general under HA ; E(e′ e/T − k) >
′ T E e e/ 2 − k > σ 2 . The RRSS is e′ e because all the observations are forced to ﬁt the straight line, whereas the URSS is e′ e because only a part of the observations are forced to ﬁt a straight line. The crucial issue of the Rainbow test is the proper choice of the subsample (the middle T /2 observations in case of one regressor). This affects the power of the test and
8.3
Speciﬁcation Tests
195
y
. . . . . .. . . . . . . . . . . . . . . . . . . . . . .
X
Figure 8.3 The Rainbow Test
not the distribution of the test statistic under the null. Utts (1982) recommends points close ¯ since an incorrect linear ﬁt will in general not be as far off there as it is in the outer to X, ¯ is measured by the magnitude of the corresponding diagonal elements region. Closeness to X of PX . Close points are those with low leverage hii , see section 8.1. The optimal size of the subset depends upon the alternative. Utts recommends about 1/2 of the data points in order to obtain some robustness to outliers. The F test in (8.49) looks like a Chow test, but differs in the selection of the subsample. For example, using the postsample predictive Chow test, the data are arranged according to time and the ﬁrst T observations are selected. The Rainbow ¯ and selects the ﬁrst T /2 of them. test arranges the data according to their distance from X
(3) Plosser, Schwert and White (1982) (PSW) Diﬀerencing Test The differencing test is a general test for misspeciﬁcation (like Hausman’s (1978) test, which will be introduced in the simultaneous equation chapter) but for timeseries data only. This test compares OLS and First Difference (FD) estimates of β. Let the differenced model be ˙ + u˙ y˙ = Xβ
(8.50)
where y˙ = Dy, X˙ = DX and u˙ = Du where ⎡
⎢ ⎢ D=⎢ ⎢ ⎣
1 −1 0 0 ... 0 1 −1 0 . . . 0 0 1 −1 . . . . . . . ... 0 0 0 0 ...
0 0 0 0 0 0 . . 1 −1
⎤
⎥ ⎥ ⎥ is the familiar (T − 1) × T differencing matrix. ⎥ ⎦
Wherever there is a constant in the regression, the ﬁrst column of X becomes zero and is dropped. From (8.50), the FD estimator is given by ˙ ′ ˙ −1 ˙ ′ β F D = (X X) X y˙
(8.51)
196
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
F D ) = σ 2 (X˙ ′ X) ˙ −1 since var(u) ˙ −1 X˙ ′ DD′ X( ˙ X˙ ′ X) with var(β ˙ = σ 2 (DD)′ and ⎡ ⎤ 2 −1 0 . . . 0 0 ⎢ −1 2 −1 . . . 0 0 ⎥ ⎢ ⎥ DD′ = ⎢ : : : : : : ⎥ ⎢ ⎥ ⎣ 0 0 0 . . . 2 −1 ⎦ 0 0 0 . . . −1 2
The differencing test is based on q = β F D − β OLS
) − V (β with V ( q ) = σ 2 [V (β FD OLS )]
A consistent estimate of V ( q ) is ⎡ !−1 ! ˙ ′ DD′ X˙ ˙ ′ X˙ X X 2 V ( q) = σ ⎣ T T
X˙ ′ X˙ T
!−1
−
X ′X T
(8.52)
−1
where σ 2 is a consistent estimate of σ 2 . Therefore,
⎤ ⎦
(8.53)
Δ = T q′ [V ( q )]−1 q ∼ χ2k under H0
(8.54)
yt = β 1 x1t + β 2 x2t + ut
(8.55)
where k is the number of slope parameters if V ( q ) is nonsingular. V ( q ) could be singular, in − which case we use a generalized inverse V ( q ) of V ( q ) and in this case is distributed as χ2 with degrees of freedom equal to the rank(V ( q )). This is a special case of the general Hausman (1978) test which will be studied extensively in Chapter 11. Davidson, Godfrey, and MacKinnon (1985) show that, like the Hausman test, the PSW test is equivalent to a much simpler omitted variables test, the omitted variables being the sum of the lagged and oneperiod ahead values of the regressors. Thus if the regression equation we are considering is
the PSW test involves estimating the expanded regression equation yt = β 1 x1t + β 2 x2t + γ 1 z1t + γ 2 z2t + ut
(8.56)
where z1t = x1,t+1 + x1,t−1 and z2t = x2,t+1 + x2,t−1 and testing the hypothesis γ 1 = γ 2 = 0 by the usual F test. If there are lagged dependent variables in the equation, the test needs a minor modiﬁcation. Suppose that the model is yt = β 1 yt−1 + β 2 xt + ut
(8.57)
Now the omitted variables would be deﬁned as z1t = yt + yt−2 and z2t = xt+1 + xt−1 . There is no problem with z2t but z1t would be correlated with the error term ut because of the presence of yt in it. The solution would be simply to transfer it to the left hand side and write the expanded regression equation in (8.56) as (1 − γ 1 )yt = β 1 yt−1 + β 2 xt + γ 1 yt−2 + γ 2 z2t + ut
(8.58)
8.3
Speciﬁcation Tests
197
This equation can be written as yt = β ∗1 yt−1 + β ∗2 xt + γ ∗1 yt−2 + γ ∗2 z2t + u∗t
(8.59)
where all the starred parameters are the corresponding unstarred ones divided by (1 − γ 1 ). The PSW now tests the hypothesis γ ∗1 = γ ∗2 = 0. Thus, in the case where the model involves the lagged dependent variable yt−1 as an explanatory variable, the only modiﬁcation needed is that we should use yt−2 as the omitted variable, not (yt + yt−2 ). Note that it is only yt−1 that creates a problem, not higherorder lags of yt , like yt−2 , yt−3 , and so on. For yt−2 , the corresponding zt will be obtained by adding yt−1 to yt−3 . This zt is not correlated with ut as long as the disturbances are not serially correlated. (4) Tests for Nonnested Hypothesis Consider the following two competing nonnested models: H1 ; y = X1 β 1 + ǫ1
(8.60)
H2 ; y = X2 β 2 + ǫ2
(8.61)
These are nonnested because the explanatory variables under one model are not a subset of the other model even though X1 and X2 may share some common variables. In order to test H1 versus H2 , Cox (1961) modiﬁed the LRtest to allow for the nonnested case. The idea behind Cox’s approach is to consider to what extent Model I under H1 , is capable of predicting the performance of Model II, under H2 . Alternatively, one can artiﬁcially nest the 2 models H3 ; y = X1 β 1 + X2∗ β ∗2 + ǫ3
(8.62)
where X2∗ excludes from X2 the common variables with X1 . A test for H1 is simply the F test for H0 ; β ∗2 = 0. Criticism: This tests H1 versus H3 which is a (Hybrid) of H1 and H2 and not H1 versus H2 . Davidson and MacKinnon (1981) proposed (testing α = 0) in the linear combination of H1 and H2 : y = (1 − α)X1 β 1 + αX2 β 2 + ǫ
(8.63)
′ −1 where α is an unknown scalar. Since α is not identiﬁed, we replace β 2 by β 2,OLS = (X2 X2 /T ) ′ (X2 y/T ) the regression coeﬃcient estimate obtained from running y on X2 under H2 , i.e., (1) 2,OLS ; (2) Run y on X1 and y2 and test that the coeﬃcient of y2 is Run y on X2 get yˆ2 = X2 β zero. This is known as the Jtest and this is asymptotically N (0, 1) under H1 . Fisher and McAleer (1981) suggested a modiﬁcation of the Jtest known as the JA test.
= plim(X ′ X2 /T )−1 plim(X ′ X1 /T )β + 0 Under H1 ; plimβ 2 1 2 2
(8.64)
′ −1 by β = (X ′ X2 )−1 (X ′ X1 )β Therefore, they propose replacing β 2 2 1,OLS where β 1,OLS = (X1 X1 ) 2 2 ′ X1 y. The steps for the JAtest are as follows:
198
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
1,OLS . 1. Run y on X1 get y1 = X1 β
2. Run y1 on X2 get y2 = X2 (X2′ X2 )−1 X2′ y1 .
3. Run y on X1 and y2 and test that the coeﬃcient of y2 is zero. This is the simple tstatistic on the coeﬃcient of y2 . The J and JA tests are asymptotically equivalent.
Criticism: Note the asymmetry of H1 and H2 . Therefore one should reverse the role of these hypotheses and test again. In this case one can get the four scenarios depicted in Table 8.5. In case both hypotheses are not rejected, the data are not rich enough to discriminate between the two hypotheses. In case both hypotheses are rejected neither model is useful in explaining the variation in y. In case one hypothesis is rejected while the other is not, one should remember that the nonrejected hypothesis may still be brought down by another challenger hypothesis. Small Sample Properties: (i) The Jtest tends to reject the null more frequently than it should. Also, the JA test has relatively low power when K1 , the number of parameters in H1 is larger than K2 , the number of parameters in H2 . Therefore, one should use the JA test when K1 is about the same size as K2 , i.e., the same number of nonoverlapping variables. (ii) If both H1 and H2 are false, these tests are inferior to the standard diagnostic tests. In practice, use higher signiﬁcance levels for the Jtest, and supplement it with the artiﬁcially nested F test and standard diagnostic tests. Table 8.5 Nonnested Hypothesis Testing α=0
α=1
Not Rejected
Rejected
Not Rejected
Both H1 and H2 are not rejected
H1 rejected H2 not rejected
Rejected
H1 not rejected H2 rejected
Both H1 and H2 are rejected
Note: J and JA tests are one degree of freedom tests, whereas the artiﬁcially nested F test is not. For a recent summary of nonnested hypothesis testing, see Pesaran and Weeks (2001). Examples of nonnested hypothesis encountered in empirical economic research include linear versus loglinear models, see section 8.5. Also, logit versus probit models in discrete choice, see Chapter 13 and exponential versus Weibull distributions in the analysis of duration data. In the logit versus probit speciﬁcation, the set of regressors is most likely to be the same. It is only the form of the distribution functions that separate the two models. Pesaran and Weeks (2001, p. 287) emphasize the differences between hypothesis testing and model selection: The model selection process treats all models under consideration symmetrically, while hypothesis testing attributes a diﬀerent status to the null and to the alternative hypotheses and by design treats the models asymmetrically. Model selection always
8.3
Speciﬁcation Tests
199
ends in a deﬁnite outcome, namely one of the models under consideration is selected for use in decision making. Hypothesis testing on the other hand asks whether there is any statistically signiﬁcant evidence (in the NeymanPearson sense) of departure from the null hypothesis in the direction of one or more alternative hypotheses. Rejection of the null hypothesis does not necessarily imply acceptance of any one of the alternative hypotheses; it only warns the investigator of possible shortcomings of the null that is being advocated. Hypothesis testing does not seek a deﬁnite outcome and if carried out with due care need not lead to a favorite model. For example, in the case of nonnested hypothesis testing it is possible for all models under consideration to be rejected, or all models to be deemed as observationally equivalent. They conclude that the choice between hypothesis testing and model selection depends on the primary objective of one’s study. Model selection may be more appropriate when the objective is decision making, while hypothesis testing is better suited to inferential problems. A model may be empirically adequate for a particular purpose, but of little relevance for another use... In the real world where the truth is elusive and unknowable both approaches to model evaluation are worth pursuing.
(5) White’s (1982) InformationMatrix (IM) Test This is a general speciﬁcation test much like the Hausman (1978) speciﬁcation test which will be considered in details in Chapter 11. The latter is based on two different estimates of the regression coeﬃcients, while the former is based on two different estimates of the Information Matrix I(θ) where θ ′ = (β ′ , σ 2 ) in the case of the linear regression studied in Chapter 7. The ﬁrst estimate of I(θ) evaluates the expectation of the second derivatives of the loglikelihood at the MLE, i.e., −E(∂ 2 logL/∂θ∂θ′ ) at θmle while the second sum up the outer products of the score vectors ni=1 (∂logLi (θ)/∂θ)(∂logLi (θ)/∂θ)′ evaluated at θmle . This is based on the fundamental identity that I(θ) = −E(∂ 2 logL/∂θ∂θ′ ) = E(∂logL/∂θ)(∂logL/∂θ)′
If the model estimated by MLE is not correctly speciﬁed, this equality will not hold. From Chapter 7, equation (7.19), we know that for the linear regression model with normal disturbances, the ﬁrst estimate of I(θ) denoted by I1 ( θ mle ) is given by
′ X X/ σ2 0 (8.65) I1 ( θM LE ) = 0 n/2 σ4 where σ 2 = e′ e/n is the MLE of σ 2 and e denotes the OLS residuals. Similarly, one can show that the second estimate of I(θ) denoted by I2 (θ) is given by n (∂logLi (θ) ′ i=1 (∂logLi (θ) I2 (θ) = ∂θ ∂θ ⎤ ⎡ 2 ′ −ui xi u3i xi ui xi xi + n ⎢ σ4 3 ′ 2σ 4 2 2σ 6 4 ⎥ (8.66) = ′ i=1 ⎣ 1 u x ui ui ⎦ ui x − + − 4i + i 6i 2σ 2σ 4σ 4 2σ 6 4σ 8
200
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
where xi is the ith row of X. Substituting the MLE we get n 3 ⎤ ⎡ n 2 ′ i=1 ei xi i=1 ei xi xi 6 ⎥ ⎢ 4 3 ′ 2 σ (8.67) θM LE ) = ⎣ n σ I2 ( n 4 ⎦ n i=1 ei i=1 ei xi − + 2 σ6 4 σ4 4 σ8 where we used the fact that ni=1 ei xi = 0. If the model is correctly speciﬁed and the disturbances are normal then plim I1 ( θM LE )/n = plim I2 ( θM LE )/n = I(θ)
Therefore, the Information Matrix (IM) test rejects the model when [I2 ( θM LE ) − I1 ( θM LE )]/n
(8.68)
is too large. These are two matrices with (k + 1) by (k + 1) elements since β is k × 1 and σ 2 is a scalar. However, due to symmetry, this reduces to (k + 2)(k + 1)/2 unique elements. Hall (1987) noted that the ﬁrst k(k elements obtained from the ﬁrst k × k block of + 1)/2 unique (8.68) have a typical element ni=1 (e2i − σ 2 )xir xis /n σ 4 where r and s denote the rth and sth explanatory variables with r, s = 1, 2, . . . , k. This term measures the discrepancy between the OLS estimates of the variancecovariance matrix of β OLS and its robust counterpart suggested by White (1980), see Chapter 5. The next k unique elements correspond to the offdiagonal block n 3 6 σ e x /2n σ and this measures the discrepancy between the estimates of the cov(β, 2 ). The i=1 i i last element correspond to the difference in the bottom right elements, i.e., the two estimates of σ 2 . This is given by
1 n 4 3 8 e /4 σ − 4+ n i=1 i 4 σ These (k +1)(k +2)/2 unique elements can be arranged in vector form D(θ) which has a limiting normal distribution with zero mean and some covariance matrix V (θ) under the null. One can show, see Hall (1987) or Kr¨ amer and Sonnberger (1986) that if V (θ) is estimated from the sample moments of these terms, that the IM test statistic is given by H
m = nD ′ (θ)[V (θ)]−1 D(θ) →0 χ2(k+1)(k+2)/2
(8.69)
In fact, Hall (1987) shows that this statistic is the sum of three asymptotically independent terms m = m1 + m2 + m3
(8.70)
where m1 = a particular version of White’s heteroskedasticity test; m2 = n times the explained σ 6 ; and sum of squares from the regression of e3i on xi divided by 6 m3 =
2 n n 4 σ4 i=1 ei /n − 3 8 24 σ
which is similar to the JarqueBera test for normality of the disturbances given in Chapter 5. It is clear that the IM test will have power whenever the disturbances are nonnormal or heteroskedastic. However, Davidson and MacKinnon (1992) demonstrated that the IM test
8.3
Speciﬁcation Tests
201
considered above will tend to reject the model when true, much too often, in ﬁnite samples. This problem gets worse as the number of degrees of freedom gets large. In Monte Carlo experiments, Davidson and MacKinnon (1992) showed that for a linear regression model with ten regressors, the IM test rejected the null at the 5% level, 99.9% of the time for n = 200. This problem did not disappear when n increased. In fact, for n = 1000, the IM test still rejected the null 92.7% of the time at the 5% level. These results suggest that it may be more useful to run individual tests for nonnormality, heteroskedasticity and other misspeciﬁcation tests considered above rather than run the IM test. These tests may be more powerful and more informative than the IM test. Alternative methods of calculating the IM test with better ﬁnitesample properties are suggested in Orme (1990), Chesher and Spady (1991) and Davidson and MacKinnon (1992). Example 3: For the consumptionincome data given in Table 5.1, we ﬁrst compute the RESET test from the consumptionincome regression given in Chapter 5. Using EViews, one clicks on stability tests and then selects RESET. You will be prompted with the option of the number of ﬁtted terms to include (i.e., powers of y). Table 8.6 shows the RESET test including y2 and y3 . The F statistic for their jointsigniﬁcance is equal to 40.98. This is signiﬁcant and indicates misspeciﬁcation. Next, we compute Utts (1982) Rainbow test. Table 8.7 gives the middle half of the sample, i.e., 19621983, and the SAS regression using this data. The RSS of these middle observations is given by e′ e = 178245.867, while the RSS for the entire sample is given by e′ e = 990923.131 so that the observed F statistic given in (8.49) can be computed as follows: F =
(990923.131 − 178245.867)/22 = 4.145 178245.867/20
Table 8.6 Ramsey RESET Test Fstatistic Log likelihood ratio Test Equation: Dependent Variable: Method: Sample: Included observations:
40.97877 49.05091
Probability Probability
0.00000 0.00000
CONSUM Least Squares 1950 1993 44
Variable C Y FITTEDˆ2 FITTEDˆ3
Coeﬃcient –3210.837 2.079184 –0.000165 6.66E–09
Std. Error 1299.234 0.395475 4.71E–05 1.66E–09
tStatistic –2.471330 5.257428 –3.494615 4.012651
Rsquared Adjusted Rsquared S.E. of regression Sum squared resid Log likelihood DurbinWatson stat
0.998776 0.998684 90.13960 325005.9 –258.3963 1.458024
Mean dependent var S.D. dependent var Akaike info criterion Schwarz criterion Fstatistic Prob (Fstatistic)
Prob. 0.0178 0.0000 0.0012 0.0003 9250.54 2484.62 11.9271 12.0893 10876.9 0.00000
202
Chapter 8: Regression Diagnostics and SpeciďŹ cation Tests
Table 8.7 Utts (1982) Rainbow Test OBS
YEAR
INCOME
CONSUMP
LEVERAGE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983
7583 7718 8140 8508 8822 9114 9399 9606 9875 10111 10414 11013 10832 10906 11192 11406 11851 12039 12005 12156 12146 12349
6931 7089 7384 7703 8005 8163 8506 8737 8842 9022 9425 9752 9602 9711 10121 10425 10744 10876 10746 10770 10782 11179
0.0440 0.0419 0.0359 0.0315 0.0285 0.0263 0.0246 0.0238 0.0230 0.0227 0.0229 0.0250 0.0241 0.0244 0.0260 0.0275 0.0316 0.0337 0.0333 0.0352 0.0350 0.0377
Dependent Variable: CONSUMPTION Analysis of Variance Source
DF
Model Error C Total
1 20 21
Root MSE Dep Mean C.V.
94.40494 9296.13636 1.01553
Sum of Squares 37715232.724 178245.86703 37893478.591
Mean Square 37715232.724 8912.29335
Rsquare Adj Rsq
F Value
Prob>F
4231.821
0.0001
0.9953 0.9951
Parameter Estimates
Variable INTERCEP INCOME
DF
Parameter Estimate
1 1
290.338091 0.872098
Standard Error 139.89449574 0.01340607
T for H0: Parameter=0 2.075 65.052
Prob > T 0.0511 0.0001
8.3
Speciﬁcation Tests
203
Table 8.8 PSW Differencing Test Dependent Variable: Method: Sample (adjusted): Included observations:
CONSUM Least Squares 1951 1993 43 after adjusting endpoints
Variable C Y Z
Coeﬃcient –60.28744 0.501803 0.208394
Std. Error 94.40916 0.273777 0.137219
tStatistic –0.638576 1.832891 1.518701
Rsquared Adjusted Rsquared S.E. of regression Sum squared resid Log likelihood DurbinWatson stat
0.996373 0.996191 151.5889 919167.2 –275.3699 0.456301
Mean dependent var S.D. dependent var Akaike info criterion Schwarz criterion Fstatistic Prob (Fstatistic)
Prob. 0.5267 0.0743 0.1367 9330.326 2456.343 12.94744 13.07031 5493.949 0.00000
This is distributed as F22,20 under the null hypothesis and rejects the hypothesis of linearity. The PSW differencing test, given in (8.54), is based upon β OLS = 0.915623 with V (β OLS ) = 0.0000747926, and β = (0.843298 − F D = 0.843298 with V (β F D ) = 0.00732539. Therefore, q 0.915623) = −0.072325 with V ( q )/T = 0.00732539 − 0.0000747926 = 0.0072506 and Δ = T q′ [V ( q )]−1 q = 0.721 which is distributed as χ21 under the null. This is not signiﬁcant and does not reject the null of no misspeciﬁcation. The artiﬁcial regression given in (8.56) with Zt = Yt + Yt−1 is given in Table 8.8. The tstatistic for Zt is 1.52 and has a pvalue of 0.137 which is not signiﬁcant. Now consider the two competing nonnested models: H1 ; Ct = β 0 + β 1 Yt + β 2 Yt−1 + ut
H2 ; Ct = γ 0 + γ 1 Yt + γ 2 Ct−1 + vt
The two nonnested models share Yt as a common variable. The artiﬁcial model that nests these two models is given by: H3 ; Ct = δ0 + δ 1 Yt + δ 2 Yt−1 + δ 3 Ct−1 + ǫt 2 (C 2HAT ). RegresTable 8.9, runs regression (1) given by H2 and obtains the predicted values C sion (2) runs consumption on a constant, income, lagged income and C 2HAT. The coeﬃcient of this last variable is 1.64 and is statistically signiﬁcant with a tvalue of 7.80. This is the Davidson and MacKinnon (1981) Jtest. In this case, H1 is rejected but H2 is not rejected. The JAtest, given by Fisher and McAleer (1981) runs the regression in H1 and keeps the predicted 1 (C 1HAT ). This is done in regression (3). Then C 1HAT is run on a constant, income values C 2 (C 2TILDE ). This is done and lagged consumption and the predicted values are stored as C in regression (5). The last step runs consumption on a constant, income, lagged income and C 2TILDE, see regression (6). The coeﬃcient of this last variable is 6.53 and is statistically signiﬁcant with a tvalue of 7.80. Again H1 is rejected but H2 is not rejected. Reversing the roles of H1 and H2 , the J and JAtests are repeated. In fact, regression (4) 1 (which was obtained from runs consumption on a constant, income, lagged consumption and C
204
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
1 is −2.54 and is statistically signiﬁcant with a tvalue of regression (3)). The coeﬃcient on C 2 on a constant, −4.13. This Jtest rejects H2 but does not reject H1 . Regression (7) runs C income and lagged income and the predicted values are stored as C1 (C1TILDE). The last step of 1 , see regression the JA test runs consumption on a constant, income, lagged consumption and C (8). The coeﬃcient of this last variable is −1.18 and is statistically signiﬁcant with a tvalue of −4.13. This JA test rejects H2 but not H1 . The artiﬁcial model, given in H3 , is also estimated, see regression (9). One can easily check that the corresponding F tests reject H1 against H3 and also H2 against H3 . In sum, all evidence indicates that both Ct−1 and Yt−1 are important to include along with Yt . Of course, the true model is not known and could include higher lags of both Yt and Ct .
8.4
Nonlinear Least Squares and the GaussNewton Regression4
So far we have been dealing with linear regressions. But, in reality, one might face a nonlinear regression of the form: yt = xt (β) + ut
for t = 1, 2, . . . , T
(8.71)
where ut ∼ IID(0, σ 2 ) and xt (β) is a scalar nonlinear regression function of k unknown parameters β. It can be interpreted as the expected value ofyt conditional on the values of the independent variables. Nonlinear least squares minimizes Tt=1 (yt − xt (β))2 = (y − x(β))′ (y − x(β)). The ﬁrstorder conditions for minimization yield =0 X ′ (β)(y − x(β))
(8.72)
where X(β) is a T × k matrix with typical element Xtj (β) = ∂xt (β)/∂β j for j = 1, . . . , k. The solution to these k equations yield the Nonlinear Least Squares (NLS) estimates of β denoted by β N LS . These normal equations given in (8.72) are similar to those in the linear case in that they to be orthogonal to the matrix of derivatives X(β). In the require the vector of residuals y −x(β) linear case, x(β) = X β OLS and X(β) = X where the latter is independent of β. Because of this as well as the matrix of derivatives X(β) on β, one in general dependence of the ﬁtted values x(β) cannot get explicit analytical solution to these NLS ﬁrstorder equations. Under fairly general conditions, see Davidson and MacKinnon (1993), one can show that the β N LS has asymptotically 2 ′ a normal distribution with mean β 0 and asymptotic variance σ 0 (X (β 0 )X(β 0 ))−1 , where β 0 and σ 0 are the true values of the parameters generating the data. Similarly, deﬁning ′ s2 = (y − x(β N LS )) (y − x(β N LS ))/(T − k)
−1 . If the disturbances we get a feasible estimate of this covariance matrix as s2 (X ′ (β)X( β)) are normally distributed then NLS is MLE and therefore asymptotically eﬃcient as long as the model is correctly speciﬁed, see Chapter 7. Taking the ﬁrstorder Taylor series approximation around some arbitrary parameter vector β ∗ , we get y = x(β ∗ ) + X(β ∗ )(β − β ∗ ) + higherorder terms + u
(8.73)
y − x(β ∗ ) = X(β ∗ )b + residuals
(8.74)
or
8.4
Nonlinear Least Squares and the GaussNewton Regression
205
Table 8.9 Nonnested J and JA Test Regression 1 Dependent Variable: C Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
DF 2 40 42 113.51027 9330.32558 1.21657
Sum of Squares 252896668.15 515383.28727 253412051.44 Rsquare Adj Rsq
Mean Square 126448334.08 12884.58218
F Value 9813.926
Prob>F 0.0001
Standard Error 70.43957345 0.07458032 0.08203649
T for H0: Parameter=0 –0.248 6.369 5.954
Prob > T 0.8051 0.0001 0.0001
Mean Square 84351096.151 9199.05101
F Value 9169.543
Prob>F 0.0001
T for H0: Parameter=0 –0.202 –0.446 –4.126 7.805
Prob > T 0.8409 0.6579 0.0002 0.0001
0.9980 0.9979
Parameter Estimates Variable INTERCEP Y CLAG
DF 1 1 1
Parameter Estimate –17.498706 0.475030 0.488458
DF 3 39 42
Sum of Squares 253053288.45 358762.98953 253412051.44
Regression 2 Dependent Variable: C Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
95.91168 9330.32558 1.02796
Rsquare Adj Rsq
0.9986 0.9985
Parameter Estimates
Variable INTERCEP Y YLAG C2HAT
DF 1 1 1 1
Parameter Estimate –12.132250 –0.058489 –0.529709 1.637806
Standard Error 60.05133697 0.13107211 0.12837635 0.20983763
206
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
Table 8.9 (continued) Regression 3 Dependent Variable: C Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
DF 2 40 42 151.58885 9330.32558 1.62469
Sum of Squares 252492884.22 919167.21762 253412051.44 Rsquare Adj Rsq
Mean Square 126246442.11 22979.18044
F Value 5493.949
Prob>F 0.0001
Standard Error 94.40916038 0.13669864 0.13721871
T for H0: Parameter=0 –0.639 5.195 1.519
Prob > T 0.5267 0.0001 0.1367
Mean Square 84351096.151 9199.05101
F Value 9169.543
Prob>F 0.0001
T for H0: Parameter=0 –2.647 5.042 7.805 –4.126
Prob > T 0.0117 0.0001 0.0001 0.0002
0.9964 0.9962
Parameter Estimates Variable INTERCEP Y YLAG
DF 1 1 1
Parameter Estimate –60.287436 0.710197 0.208394
DF 3 39 42
Sum of Squares 253053288.45 358762.98953 253412051.44
Regression 4 Dependent Variable: C Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
95.91168 9330.32558 1.02796
Rsquare Adj Rsq
0.9986 0.9985
Parameter Estimates
Variable INTERCEP Y CLAG C1HAT
DF 1 1 1 1
Parameter Estimate –194.034085 2.524742 0.800000 –2.541862
Standard Error 73.30021926 0.50073386 0.10249693 0.61602660
8.4
Nonlinear Least Squares and the GaussNewton Regression
207
Table 8.9 (continued) Regression 5 Dependent Variable: C1HAT Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
DF 2 40 42 24.61739 9330.32558 0.26384
Sum of Squares 252468643.58 24240.64119 252492884.22 Rsquare Adj Rsq
Mean Square 126234321.79 606.01603
F Value 208301.952
Prob>F 0.0001
Standard Error 15.27649082 0.01617451 0.01779156
T for H0: Parameter=0 –4.546 49.855 6.889
Prob > T 0.0001 0.0001 0.0001
Mean Square 84351096.151 9199.05101
F Value 9169.543
Prob>F 0.0001
T for H0: Parameter=0 4.849 –6.695 –4.126 7.805
Prob > T 0.0001 0.0001 0.0002 0.0001
0.9999 0.9999
Parameter Estimates
Variable INTERCEP Y CLAG
DF 1 1 1
Parameter Estimate –69.451205 0.806382 0.122564
DF 3 39 42
Sum of Squares 253053288.45 358762.98953 253412051.44
Regression 6 Dependent Variable: C Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
95.91168 9330.32558 1.02796
Rsquare Adj Rsq
0.9986 0.9985
Parameter Estimates
Variable INTERCEP Y YLAG C2TILDE
DF 1 1 1 1
Parameter Estimate 412.528849 –4.543882 –0.529709 6.527181
Standard Error 85.07504557 0.67869220 0.12837635 0.83626992
208
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
Table 8.9 (continued) Regression 7 Dependent Variable: C2HAT Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
DF 2 40 42 72.27001 9330.32558 0.77457
Sum of Squares 252687749.95 208918.20026 252896668.15 Rsquare Adj Rsq
Mean Square 126343874.98 5222.95501
F Value 24190.114
Prob>F 0.0001
Standard Error 45.00958513 0.06517110 0.06541905
T for H0: Parameter=0 –0.653 7.202 6.889
Prob > T 0.5173 0.0001 0.0001
Mean Square 84351096.151 9199.05101
F Value 9169.543
Prob>F 0.0001
T for H0: Parameter=0 –1.232 6.263 7.805 –4.126
Prob > T 0.2252 0.0001 0.0001 0.0002
0.9992 0.9991
Parameter Estimates
Variable INTERCEP Y YLAG
DF 1 1 1
Parameter Estimate –29.402245 0.469339 0.450666
DF 3 39 42
Sum of Squares 253053288.45 358762.98953 253412051.44
Regression 8 Dependent Variable: C Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
95.91168 9330.32558 1.02796
Rsquare Adj Rsq
0.9986 0.9985
Parameter Estimates
Variable INTERCEP Y CLAG C1TILDE
DF 1 1 1 1
Parameter Estimate –75.350919 1.271176 0.800000 –1.175392
Standard Error 61.14775168 0.20297803 0.10249693 0.28485929
8.4
Nonlinear Least Squares and the GaussNewton Regression
209
Table 8.9 (continued) Regression 9 Dependent Variable: C Analysis of Variance Source Model Error C Total Root MSE Dep Mean C.V.
DF 3 39 42 95.91168 9330.32558 1.02796
Sum of Squares 253053288.45 358762.98953 253412051.44 Rsquare Adj Rsq
Mean Square 84351096.151 9199.05101
F Value 9169.543
Prob>F 0.0001
T for H0: Parameter=0 –0.682 8.318 –4.126 7.805
Prob > T 0.4991 0.0001 0.0002 0.0001
0.9986 0.9985
Parameter Estimates Variable INTERCEP Y YLAG CLAG
DF 1 1 1 1
Parameter Estimate –40.791743 0.719519 –0.529709 0.800000
Standard Error 59.78575886 0.08649875 0.12837635 0.10249693
This is the simplest version of the GaussNewton Regression, see Davidson and MacKinnon (1993). In this case the higherorder terms and the error term are combined in the residuals and (β − β ∗ ) is replaced by b, a parameter vector that can be estimated. If the model is linear, N LS , the X(β ∗ ) is the matrix of regressors X and the GNR regresses a residual on X. If β ∗ =β unrestricted NLS estimator of β, then the GNR becomes + residuals y−x = Xb
(8.75)
where x ≡ x(β N LS ) and X ≡ X(β N LS ). From the ﬁrstorder conditions of NLS we get (y − = 0. In this case, OLS on this GNR yields bOLS = (X ′ X) −1 X ′ (y − x x )′ X ) = 0 and this GNR has no explanatory power. However, this regression can be used to (i) check that the ﬁrstorder conditions given in (8.72) are satisﬁed. For example, one could check that the tstatistics are of the 10−3 order, and that R2 is zero up to several decimal places; (ii) compute estimated ′ X) −1 , where s2 = (y − x covariance matrices. In fact, this GNR prints out s2 (X )′ (y − x )/(T − k) is the OLS estimate of the regression variance. This can be veriﬁed easily using the fact that this GNR has no explanatory power. This method of computing the estimated variancecovariance has been obtained by some method other than NLS. matrix is useful especially in cases where β For example, sometimes the model is nonlinear only in one or two parameters which are known to be in a ﬁnite range, say between zero and one. One can then search over this range, running OLS regressions and minimizing the residual sum of squares. This search procedure can be repeated over ﬁner grids to get more accuracy. Once the ﬁnal parameter estimate is found, one can run the GNR to get estimates of the variancecovariance matrix.
210
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
Testing Restrictions (GNR Based on the Restricted NLS Estimates) The best known use for the GNR is to test restrictions. These are based on the LM principle which requires only the restricted estimator. In particular, consider the following competing hypotheses: H0 ; y = x(β 1 , 0) + u
H1 ; y = x(β 1 , β 2 ) + u
the restricted where u ∼ IID(0, σ 2 I) and β 1 and β 2 are k ×1 and r ×1, respectively. Denote by β ′ ′ , 0). = (β NLS estimator of β, in this case β 1 The GNR evaluated at this restricted NLS estimator of β is 1 b1 + X 2 b2 + residuals (y − x ) = X
(8.76)
2 b2 + residuals P¯X1 (y − x ) = P¯X 1 X
(8.77)
2 b2 + residuals (y − x ) = P¯X 1 X
(8.78)
−1 ′ ′ P¯ X −1 ′ ¯ (y − x ′ P¯ X ) ) = (X b2,OLS = (X 2 X1 2 ) X2 (y − x 2 X1 2 ) X2 PX 1
(8.79)
with Xi (β) = ∂x/∂β i for i = 1, 2. and X i = Xi (β) where x = x(β) By the FWL Theorem this yields the same estimate of b2 as
′ (y − x ) = (y − x ) since X ) = 0 from the ﬁrstorder ) = (y − x ) − PX 1 (y − x But P¯X 1 (y − x 1 conditions of restricted NLS. Hence, (8.77) reduces to Therefore,
2 (X ′ P¯ X −1 ′ and the residual sums of squares is (y − x )′ (y − x ) − (y − x )′ X ). 2 X1 2 ) X2 (y − x ′ If X2 was excluded from the regression in (8.76), (y − x ) (y − x ) would be the residual sum of squares. Therefore, the reduction in the residual sum of squares brought about by the inclusion 2 is of X −1 ′ 2 (X ′ P¯ X ) (y − x )′ X 2 X1 2 ) X2 (y − x
1 has no explanatory This is also equal to the explained sum of squares from (8.76) since X 2 power. This sum of squares divided by a consistent estimate of σ is asymptotically distributed as χ2r under the null. Different consistent estimates of σ 2 yield different test statistics. The two most common test statistics for H0 based on this regression are the following: (1) T Ru2 where Ru2 is the uncentered R2 of (8.76) and (2) the F statistic for b2 = 0. The ﬁrst statistic is given by T Ru2 = −1 ′ 2 (X ′ P¯ X )/(y − x )′ (y − x ) where the uncentered R2 was deﬁned in the T (y − x )′ X 2 X1 2 ) X2 (y − x Appendix to Chapter 3. This statistic implicitly divides the explained sum of squares term by σ 2 = (restricted residual sums of squares)/T . This is equivalent to the LMstatistic obtained and getting the explained sum of squares. by running the artiﬁcial regression (y − x )/ σ on X 2 Regression packages print the centered R . This is equal to the uncentered Ru2 as long as there is a constant in the restricted regression so that (y − x ) sum to zero.
8.4
Nonlinear Least Squares and the GaussNewton Regression
211
The F statistic for b2 = 0 from (8.76) is 2 (X ′ P¯ X −1 ′ (y − x )′ X )/r (RRSS − URSS )/r 2 X1 2 ) X2 (y − x = (8.80) ′ ′ ′ ′ −1 2 (X P¯ X URSS /(T − k) [(y − x ) (y − x ) − (y − x ) X )]/(T − k) 2 X1 2 ) X2 (y − x
The denominator is the OLS estimate of σ 2 from (8.76) which tends to σ 20 as T → ∞. Hence (rF statistic → χ2r ). In small samples, use the F statistic. Diagnostic Tests for Linear Regression Models Variable addition tests suggested by Pagan and Hall (1983) consider the additional variables Z of dimension (T × r) and test whether their coeﬃcients are zero using an F test from the regression y = Xβ + Zγ + u
(8.81)
If H0 ; γ = 0 is true, the model is y = Xβ + u and there is no misspeciﬁcation. The GNR for this restriction would run the following regression: P¯X y = Xb + Zc + residuals
(8.82)
and test that c is zero. By the FWL Theorem, (8.82) yields the same residual sum of squares as P¯X y = P¯X Zc + residuals
(8.83)
Applying the FWL Theorem to (8.81) we get the same residual sum of squares as the regression in (8.83). The F statistic for γ = 0 from (8.81) is therefore identical to the F statistic for c = 0 from the GNR given in (8.82). Hence, “Tests based on the GNR are equivalent to variable addition tests when the latter are applicable,” see Davidson and MacKinnon (1993, p. 194). Note also, that the nRu2 test statistic for H0 ; γ = 0 based on the GNR in (8.82) is exactly the LM statistic based on running the restricted least squares residuals of y on X on the unrestricted set of regressors X and Z in (8.81). If X has a constant, then the uncentered R2 is equal to the centered R2 printed by the regression. Computational Warning: It is tempting to base tests on the OLS residuals u = P¯X y by simply regressing them on the test regressors Z. This is equivalent to running the GNR without the X variables on the right hand side of (8.82) yielding teststatistics that are too small. Functional Form Davidson and MacKinnon (1993, p. 195) show that the RESET with yt = Xt β + yt2 c+ residual which is based on testing for c = 0 is equivalent to testing for θ = 0 using the nonlinear model yt = Xt β(1 + θXt β) + ut . In this case, it is easy to verify from (8.74) that the GNR is yt − Xt β(1 + θXt β) = (2θ(Xt β)Xt + Xt )b + (Xt β)2 c + residual
2 At θ = 0 and β = β OLS , the GNR becomes (yt − Xt β OLS ) = Xt b + (Xt β OLS ) c+ residual. The tstatistic on c = 0 is equivalent to that from the RESET regression given in section 8.3, see problem 25.
212
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
Testing for Serial Correlation Suppose that the null hypothesis is the nonlinear regression model given in (8.71), and the alternative is the model yt = xt (β) + ν t with ν t = ρν t−1 + ut where ut ∼ IID(0, σ 2 ). Conditional on the ﬁrst observation, the alternative model can be written as yt = xt (β) + ρ(yt−1 − xt−1 (β)) + ut The GNR test for H0 ; ρ = 0, computes the derivatives of this regression function with respect to β and ρ evaluated at the restricted estimates under the null hypothesis, i.e., ρ = 0 and β=β N LS (the nonlinear least squares estimate of β assuming no serial correlation). Those yield N LS ) and (yt−1 − xt−1 (β N LS )) respectively. Therefore, the GNR runs u N LS ) = t = yt − xt (β Xt (β Xt (β N LS )b+ c ut−1 + residual, and tests that c = 0. If the regression model is linear, this reduces to running ordinary least squares residuals on their lagged values in addition to the regressors in the model. This is exactly the Breusch and Godfrey test for ﬁrstorder serial correlation considered in Chapter 5. For other applications as well as beneﬁts and limitations of the GNR, see Davidson and MacKinnon (1993).
8.5
Testing Linear versus LogLinear Functional Form5
In many economic applications where the explanatory variables take only positive values, econometricians must decide whether a linear or loglinear regression model is appropriate. In general, the linear model is given by yi =
k
j=1 β j Xij
+
ℓ
s=1 γ s Zis
+ ui
i = 1, 2, . . . , n
(8.84)
+ ui
(8.85)
and the loglinear model is logyi =
k
j=1 β j
logXij +
ℓ
s=1 γ s Zis
i = 1, 2, . . . , n
with ui ∼ NID(0, σ 2 ). Note that, the loglinear model is general in that only the dependent variable y and a subset of the regressors, i.e., the X variables are subject to the logarithmic transformation. Of course, one could estimate both models and compare their loglikelihood values. This would tell us which model ﬁts best, but not whether either is a valid speciﬁcation. Box and Cox (1964) suggested the following transformation ⎧ λ ⎪ ⎨ yi λ−1 when λ = 0 (8.86) B(yi , λ) = ⎪ ⎩ logy when λ = 0 i where yi > 0. Note that for λ = 1, as long as there is constant in the regression, subjecting the linear model to a BoxCox transformation is equivalent to not transformation yields the loglinear regression. Therefore, the following BoxCox model regression. Therefore, the following BoxCox model B(yi , λ) =
k
j=1 β j B(Xij , λ) +
ℓ
s=1 γ s Zis
+ ui
(8.87)
8.5
Testing Linear versus LogLinear Functional Form
213
encompasses as special cases the linear and loglinear models given in (8.84) and (8.85), respectively. Box and Cox (1964) suggested estimating these models by ML and using the LR test to test (8.84) and (8.85) against (8.87). However, estimation of (8.87) is computationally burdensome, see Davidson and MacKinnon (1993). Instead, we give an LM test involving a Double Length Regression (DLR) due to Davidson and MacKinnon (1985) that is easier to compute. In fact, Davidson and MacKinnon (1993, p. 510) point out that “everything that one can do with the GaussNewton Regression for nonlinear regression models can be done with the DLR for models involving transformations of the dependent variable.” The GNR is not applicable in cases where the dependent variable is subjected to a nonlinear transformation, so one should use a DLR in these cases. Conversely, in cases where the GNR is valid, there is no need to run the DLR, since in these cases the latter is equivalent to the GNR. For the linear model (8.84), the null hypothesis is that λ = 1. In this case, Davidson and MacKinnon suggest running a regression with 2n observations where the dependent variable σ , . . . , en / σ , 1, . . . , 1)′ , i.e., the ﬁrst n observations are the OLS residuals has observations (e1 / from (8.84) divided by the MLE of σ, where σ 2mle = e′ e/n. The second n observations are all equal to 1. The 2n observations for the regressors have typical elements: for for for for
β j : Xij − 1 for i = 1, . . . , n and 0 γ s : Zis for i = 1, . . . , n and 0 σ: ei / σ for i = 1, . . . , n and −1 k λ: j=1 β j (Xij logXij − Xij + 1) − (yi log yi − yi+1 ) and σ logyi
for for for for for
the second n the second n the second n i = 1, . . . , n the second n
elements elements elements
β j : logXij for i = 1, . . . , n for i = 1, . . . , n γ s : Zis σ: ei / σ for i = 1, . . . , n 1 k λ: 2 j=1 β j (logXij )2 − 12 (logyi )2
the second n elements the second n elements the second n elements
elements
The explained sum of squares for this DLR provides an asymptotically valid test for λ = 1. This will be distributed as χ21 under the null hypothesis. Similarly, when testing the loglinear model (8.85), the null hypothesis is that λ = 0. In this case, the dependent variable of the DLR has observations ( e1 / σ , e2 / σ , . . . , en / σ , 1, . . . , 1)′ , i.e., the ﬁrst n observations are the OLS residuals from (8.85) divided by the MLE for σ, i.e., σ where σ 2 = e′ e/n. The second n observations are all equal to 1. The 2n observations for the regressors have typical elements: for for for for
and 0 for and 0 for and −1 for for i = 1, . . . , n and σ logyi for
the second n elements
The explained sum of squares from this DLR provides an asymptotically valid test for λ = 0. This will be distributed as χ21 under the null hypothesis. For the cigarette data given in Table 3.2, the linear model is given by C = β 0 + β 1 P + β 2 Y + u whereas the loglinear model is given by logC = γ 0 + γ 1 logP + γ 2 logY + ǫ and the BoxCox model is given by B(C, λ) = δ0 + δ 1 B(P, λ) + δ 2 B(Y, λ) + ν, where B(C, λ) is deﬁned in (8.86). In this case, the DLR which tests the hypothesis that H0 ; λ = 1, i.e., the model is linear, gives an explained sum of squares equal to 15.55. This is greater than a χ21,0.05 = 3.84 and is therefore signiﬁcant at the 5% level. Similarly the DLR that tests the hypothesis that H0 ; λ = 0, i.e., the model is loglinear, gives an explained sum of squares equal to 8.86. This is also greater than χ21,0.05 = 3.84 and is therefore signiﬁcant at the 5% level. In this case, both the linear and loglinear models are rejected by the data.
214
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
Finally, it is important to note that there are numerous other tests for testing linear and loglinear models and the interested reader should refer to Davidson and MacKinnon (1993).
Notes 1. This section is based on Belsley, Kuh and Welsch (1980). 2. Other residuals that are linear unbiased with a scalar covariance matrix (LUS) are the BLUS residuals suggested by Theil (1971). Since we are explicitly dealing with timeseries data, we use subscript t rather than i to index observations and T rather than n to denote the sample size.
3. Ramsey’s (1969) initial formulation was based on BLUS residuals, but Ramsey and Schmidt (1976) showed that this is equivalent to using OLS residuals.
4. This section is based on Davidson and MacKinnon (1993, 2001). 5. This section is based on Davidson and MacKinnon (1993, pp. 502510).
Problems 1. We know that H = PX is idempotent. Also, (In − PX ) is idempotent. Therefore, b′ Hb ≥ 0 for any arbitrary vector b. Using these facts, show for b′ = (1, 0, . . . , 0) that 0 ≤ h11 ≤ 1. Deduce that 0 ≤ hii ≤ 1 for i = 1, . . . , n. 2. For the simple regression with no constant yi = xi β + ui for i = 1, . . . , n n (a) What is hii ? Verify that i=1 hii = 1.
−β , see (8.13)? What is s2 in terms of s2 and e2 , see (8.18)? What is DFBE(b) What is β (i) i (i) TAS ij , see (8.19)? (c) What are DFFIT i and DFFITS i , see (8.21) and (8.22)?
(d) What is Cook’s distance measure Di2 (s) for this simple regression with no intercept, see (8.24)? (e) Verify that (8.27) holds for this simple regression with no intercept. What is COVRATIO i , see (8.26)? 3. From the deﬁnition of s2(i) in (8.17), substitute (8.13) in (8.17) and verify (8.18). 4. Consider the augmented regression given in (8.5) y = Xβ ∗ + di ϕ + u where ϕ is a scalar and di = 1 for the ith observation and 0 otherwise. Using the FrischWaugh Lovell Theorem given in section 7.3, verify that ∗ = (X ′ X(i) )−1 X ′ y(i) = β . (a) β (i) (i) (i)
(b) ϕ = (d′i P¯X di )−1 d′i P¯X y = ei /(1 − hii ) where P¯X = I − PX .
(c) Residual Sum of Squares from (8.5) = (Residual Sum of Squares with di deleted) − e2i /(1 − hii ).
(d) Assuming Normality of u, show that the tstatistic for testing ϕ = 0 is t = ϕ /s.e.( ϕ) = e∗i as given in (8.3).
Problems
215
5. Consider the augmented regression y = Xβ ∗ + P¯X Dp ϕ∗ + u, where Dp is an n × p matrix of dummy variables for the p suspected observations. Note that P¯X Dp rather than Dp appear in this equation. Compare with (8.6). Let ep = Dp′ e, then E(ep ) = 0, var(ep ) = σ 2 Dp′ P¯X Dp . Verify that ∗ = (X ′ X)−1 X ′ y = β (a) β OLS and ∗ ′ ¯ −1 ′ ¯ (b) ϕ = (D PX Dp ) D PX y = (D′ P¯X Dp )−1 D′ e = (D′ P¯X Dp )−1 ep . p
p
p
p
p
(c) Residual Sum of Squares = (Residual Sum of Squares with Dp deleted) − e′p (Dp′ P¯X )Dp−1 ep . Using the FrischWaugh Lovell Theorem show this residual sum of squares is the same as that for (8.6).
(d) Assuming normality of u, verify (8.7) and (8.9). (e) Repeat this exercise for problem 4 with P¯X di replacing di . What do you conclude? 6. Using the updating formula in (8.11), verify (8.12) and deduce (8.13). 7. Verify that Cook’s distance measure given in (8.25) is related to DFFITS i (σ) as follows: DF√ FITS i (σ) = kDi (σ). 8. Using the matrix identity det(Ik − ab′ ) = 1 − b′ a, where a and b are column vectors of dimension ′ k, prove (8.27). Hint: Use a = xi and b′ = x′i (X ′ X)−1 and the fact that det[X(i) X(i) ] =det[{Ik − ′ ′ −1 ′ xi xi (X X) }X X]. 9. For the cigarette data given in Table 3.2 (a) Replicate the results in Table 8.2. β (b) For the New Hampshire observation (NH), compute eN H , e∗N H , β− (N H) , DFBETAS N H , 2 DFFIT N H , DFFITS N H , DN H (s), COVRATIO N H , and FVARATIO N H . (c) Repeat the calculations in part (b) for the following states: AR, CT, NJ and UT.
(d) What about the observations for NV, ME, NM and ND? Are they inﬂuential? 10. For the ConsumptionIncome data given in Table 5.1, compute (a) The internal studentized residuals e given in (8.1).
(b) The externally studentized residuals e∗ given in (8.3). (c) Cook’s statistic given in (8.25).
(d) The leverage of each observation h. (e) The DFFITS given in (8.22). (f) The COVRATIO given in (8.28). (g) Based on the results in parts (a) to (f), identify the observations that are inﬂuential. 11. Repeat problem 10 for the 1982 data on earnings used in Chapter 4. This data is provided on the Springer web site as EARN.ASC. 12. Repeat problem 10 for the Gasoline data provided on the Springer web site as GASOLINE.DAT. Use the gasoline demand model given in Chapter 10, section 5. Do this for Austria and Belgium separately. 13. Independence of Recursive Residuals. (a) Using the updating formula given in (8.11) with A = (Xt′ Xt ) and a = −b = x′t+1 , verify (8.31).
216
Chapter 8: Regression Diagnostics and Speciﬁcation Tests (b) Using (8.31), verify (8.32).
(c) For ut ∼ IIN(0, σ 2 ) and wt+1 deﬁned in (8.30) verify (8.33). Hint: deﬁne vt+1 = ft+1 wt+1 . From (8.30), we have = x′ (β − β ) + ut+1 for t = k, . . . , T − 1 vt+1 = ft+1 wt+1 = yt+1 − x′t+1 β t t+1 t Since ft+1 is ﬁxed, it suﬃces to show that cov(vt+1 , vs+1 ) = 0 for t = s.
14. Recursive Residuals are Linear Unbiased With Scalar Covariance Matrix (LUS). (a) Verify that the (T − k) recursive residuals deﬁned in (8.30) can be written in vector form as w = Cy where C is deﬁned in (8.34). This shows that the recursive residuals are linear in y. (b) Show that C satisﬁes the three properties given in (8.35) i.e., CX = 0, CC ′ = IT −k , and C ′ C = P¯X . Prove that CX = 0 means that the recursive residuals are unbiased with zero mean. Prove that the CC ′ = IT −k means that the recursive residuals have a scalar covariance matrix. Prove that C ′ C = P¯X means that the sum of squares of (T − k) recursive residuals is equal to the sum of squares of T least squares residuals. (c) If the true disturbances u ∼ N (0, σ 2 IT ), prove that the recursive residuals w ∼ N (0, σ 2 IT −k ) using parts (a) and (b). 2 (d) Verify (8.36), i.e., show that RSSt+1 = RSSt + wt+1 for t = k, . . . , T − 1 where RSSt = ′ (Yt − Xt β t ) (Yt − Xt β t ).
15. The Harvey and Collier (1977) Misspeciﬁcation ttest as a Variable Additions Test. This is based on Wu (1993). (a) Show that the F statistic for testing H0 ; γ = 0 versus γ = 0 in (8.44) is given by F =
y ′ P¯X y − y ′ P¯[X,z] y ′ y P¯[X,z] y/(T − k − 1)
=
y ′ (P¯X
y ′ Pz y − Pz )y/(T − k − 1)
and is distributed as F (1, T − k − 1) under the null hypothesis.
(b) Using the properties of C given in (8.35), show that the F statistic given in part (a) is the square of the Harvey and Collier (1977) tstatistic given in (8.43). 16. For the Gasoline data for Austria given on the Springer web site as GASOLINE.DAT and the model given in Chapter 10, section 5, compute: (a) The recursive residuals given in (8.30). (b) The CUSUM given in (8.46) and plot it against r. (c) Draw the 5% upper and lower lines given below (8.46) and see whether the CUSUM crosses these boundaries. (d) The postsample predictive test for 1978. Verify that computing it from (8.38) or (8.40) yields the same answer. (e) The modiﬁed von Neuman ratio given in (8.42). (f) The Harvey and Collier (1977) functional misspeciﬁcation test given in (8.43). 17. The Diﬀerencing Test in a Regression with Equicorrelated Disturbances. This is based on Baltagi (1990). Consider the timeseries regression Y = ιT α + Xβ + u
(1)
Problems
217
where ιT is a vector of ones of dimension T . X is T ×K and [ιT , X] is of full column rank. u ∼ (0, Ω) where Ω is positive deﬁnite. Differencing this model, we get DY = DXβ + Du
(2)
where D is a (T − 1) × T matrix given below (8.50). Maeshiro and Wichers (1989) show that GLS on (1) yields through partitioned inverse: = (X ′ LX)−1 X ′ LY β
(3)
= (X ′ M X)−1 X ′ M Y β
(4)
where L = Ω−1 − Ω−1 ιT (ι′T Ω−1 ιT )−1 ι′T Ω−1 . Also, GLS on (2) yields
where M = D′ (DΩD′ )−1 D. Finally, they show that M = L, and GLS on (2) is equivalent to GLS on (1) as long as there is an intercept in (1). Consider the special case of equicorrelated disturbances Ω = σ 2 [(1 − ρ)IT + ρJT ]
(5)
where IT is an identity matrix of dimension T and JT is a matrix of ones of dimension T . (a) Derive the L and M matrices for the equicorrelated case, and verify the Maeshiro and Wichers result for this special case. (b) Show that for the equicorrelated case, the differencing test given by Plosser, Schwert, and White (1982) can be obtained as the difference between the OLS and GLS estimators of the differenced equation (2). Hint: See the solution by Koning (1992). 18. For the 1982 data on earnings used in Chapter 4, provided as EARN.ASC on the Springer web site compute Ramsey’s (1969) RESET and the Thursby and Schmidt (1977) variant of this test. 19. Repeat problem 18 for the Hedonic housing data given on the Springer web site as HEDONIC.XLS. 20. Repeat problem 18 for the cigarette data given in Table 3.2. 21. Repeat problem 18 for the Gasoline data for Austria given on the Springer web site as GASOLINE.DAT. Use the model given in Chapter 10, section 5. Also compute the PSW differencing test given in (8.54). 22. Use the 1982 data on earnings used in Chapter 4, and provided on the Springer web site as EARN.ASC. Consider the two competing nonnested models H0 ; log(wage) = β 0 + β 1 ED + β 2 EXP + β 3 EXP 2 + β 4 W KS +β 5 M S + β 6 F EM + β 7 BLK + β 8 U N ION + u H1 ; log(wage) = γ 0 + γ 1 ED + γ 2 EXP + γ 3 EXP 2 + γ 4 W KS +γ 5 OCC + γ 6 SOU T H + γ 7 SM SA + γ 8 IN D + ǫ Compute: (a) The Davidson and MacKinnon (1981) Jtest for H0 versus H1 . (b) The Fisher and McAleer (1981) JAtest for H0 versus H1 . (c) Reverse the roles of H0 and H1 and repeat parts (a) and (b).
218
Chapter 8: Regression Diagnostics and Speciﬁcation Tests (d) Both H0 and H1 can be artiﬁcially nested in the model used in Chapter 4. Using the F test given in (8.62), test for H0 versus this augmented model. Repeat for H1 versus this augmented model. What do you conclude?
23. For the ConsumptionIncome data given in Table 5.1, (a) Test the hypothesis that the Consumption model is linear BoxCox alternative. (b) Test the hypothesis that the Consumption model is loglinear general BoxCox alternative. 24. Repeat problem 23 for the Cigarette data given in Table 3.2. 25. RESET as a GaussNewton Regression. This is based on Baltagi (1998). Davidson and MacKinnon (1993) showed that Ramsey’s (1969) regression error speciﬁcation test (RESET) can be derived as a GaussNewton Regression. This problem is a simple extension of their results. Suppose that the linear regression model under test is given by: yt = Xt′ β + ut
t = 1, 2, . . . , T
(1)
where β is a k × 1 vector of unknown parameters. Suppose that the alternative is the nonlinear regression model between yt and Xt : yt = Xt′ β[1 + θ(Xt′ β) + γ(Xt′ β)2 + λ(Xt′ β)3 ] + ut ,
(2)
where θ, γ, and λ are unknown scalar parameters. It is well known that Ramsey’s (1969) RESET is obtained by regressing yt on Xt , yt2 , yt3 and yt4 and by testing that the coeﬃcients of all powers of yt are jointly zero. Show that this RESET can be derived from a GaussNewton Regression on (2), which tests θ = γ = λ = 0.
References This chapter is based on Belsley, Kuh and Welsch (1980), Johnston (1984), Maddala (1992) and Davidson and MacKinnon (1993). Additional references are the following: Baltagi, B.H. (1990), “The Differencing Test in a Regression with Equicorrelated Disturbances,” Econometric Theory, Problem 90.4.5, 6: 488. Baltagi, B.H. (1998), “Regression Speciﬁcation Error Test as A GaussNewton Regression,” Econometric Theory, Problem 98.4.3, 14: 526. Belsley, D.A., E. Kuh and R.E. Welsch (1980), Regression Diagnostics (Wiley: New York). Box, G.E.P. and D.R. Cox (1964), “An Analysis of Transformations,” Journal of the Royal Statistical Society, Series B, 26: 211252. Brown, R.L., J. Durbin, and J.M. Evans (1975), “Techniques for Testing the Constancy of Regression Relationships Over Time,” Journal of the Royal Statistical Society 37:149192. Chesher, A. and R. Spady (1991), “Asymptotic Expansions of the Information Matrix Test Statistic,” Econometrica 59: 787815. Cook, R.D. (1977), “Detection of Inﬂuential Observations in Linear Regression,” Technometrics 19:1518. Cook, R.D. and S. Weisberg (1982), Residuals and Inﬂuences in Regression (Chapman and Hall: New York).
References
219
Cox, D.R. (1961), “Tests of Separate Families of Hypotheses,” Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1: 105123. Davidson, R., L.G. Godfrey and J.G. MacKinnon (1985), “A Simpliﬁed Version of the Differencing Test,” International Economic Review, 26: 63947. Davidson, R. and J.G. MacKinnon (1981), “Several Tests for Model Speciﬁcation in the Presence of Alternative Hypotheses,” Econometrica, 49: 781793. Davidson, R. and J.G. MacKinnon (1985), “Testing Linear and Loglinear Regressions Against BoxCox Alternatives,” Canadian Journal of Economics, 18: 499517. Davidson, R. and J.G. MacKinnon (1992), “A New Form of the Information Matrix Test,” Econometrica, 60: 145157. Davidson, R. and J.G. MacKinnon (2001), “Artiﬁcial Regressions,” Chapter 1 in Baltagi, B.H. (ed.) A Companion to Theoretical Econometrics (Blackwell: Massachusetts). Fisher, G.R. and M. McAleer (1981), “Alternative Procedures and Associated Tests of Signiﬁcance for NonNested Hypotheses,” Journal of Econometrics, 16: 103119. Gentleman, J.F. and M.B. Wilk (1975), “Detecting Outliers II: Supplementing the Direct Analysis of Residuals,” Biometrics, 31: 387410. Godfrey, L.G. (1988), Misspeciﬁcation Tests in Econometrics: The Lagrange Multiplier Principle and Other Approaches (Cambridge University Press: Cambridge). Hall, A. (1987), “The Information Matrix Test for the Linear Model,” Review of Economic Studies, 54: 257263. Harvey, A.C. (1976), “An Alternative Proof and Generalization of a Test for Structural Change,” The American Statistician, 30: 122123. Harvey, A.C. (1990), The Econometric Analysis of Time Series (MIT Press: Cambridge). Harvey, A.C. and P. Collier (1977), “Testing for Functional Misspeciﬁcation in Regression Analysis,” Journal of Econometrics, 6: 103119. Harvey, A.C. and G.D.A. Phillips (1974), “A Comparison of the Power of Some Tests for Heteroskedasticity in the General Linear Model,” Journal of Econometrics, 2: 307316. Hausman, J. (1978), “Speciﬁcation Tests in Econometrics,” Econometrica, 46: 12511271. Koning, R.H. (1992), “The Differencing Test in a Regression with Equicorrelated Disturbances,” Econometric Theory, Solution 90.4.5, 8: 155156. Kr¨ amer, W. and H. Sonnberger (1986), The Linear Regression Model Under Test (PhysicaVerlag: Heidelberg). Krasker, W.S., E. Kuh and R.E. Welsch (1983), “Estimation for Dirty Data and Flawed Models,” Chapter 11 in Handbook of Econometrics, Vol. I, eds. Z. Griliches and M.D. Intrilligator, Amsterdam, NorthHolland. Maeshiro, A. and R. Wichers (1989), “On the Relationship Between the Estimates of Level Models and Difference Models,” American Journal of Agricultural Economics, 71: 432434. Orme, C. (1990), “The Small Sample Performance of the Information Matrix Test,” Journal of Econometrics, 46: 309331. Pagan, A.R. and A.D. Hall (1983), “Diagnostic Tests as Residual Analysis,” Econometric Reviews, 2: 159254.
220
Chapter 8: Regression Diagnostics and Speciﬁcation Tests
Pesaran, M.H. and M. Weeks (2001), “Nonnested Hypothesis Testing: A Overview,” Chapter 13 in Baltagi, B.H. (ed.) A Companion to Theoretical Econometrics (Blackwell: Massachusetts). Phillips, G.D.A. and A.C. Harvey (1974), “A Simple Test for Serial Correlation in Regression Analysis,” Journal of the American Statistical Association, 69: 935939. Plosser, C.I., G.W. Schwert, and H. White (1982), “Differencing as a Test of Speciﬁcation,” International Economic Review, 23: 535552. Ramsey, J.B. (1969), “Tests for Speciﬁcation Errors in Classical Linear LeastSquares Regression Analysis,” Journal of the Royal Statistics Society, Series B, 31: 350371. Ramsey, J.B. and P. Schmidt (1976), “Some Further Results in the Use of OLS and BLUS Residuals in Error Speciﬁcation Tests,” Journal of the American Statistical Association, 71: 389390. Schmidt, P. (1976), Econometrics (Marcel Dekker: New York). Theil, H. (1971), Principles of Econometrics (Wiley: New York). Thursby, J. and P. Schmidt (1977), “Some Properties of Tests for Speciﬁcation Error in a Linear Regression Model,” Journal of the American Statistical Association, 72: 635641. Utts, J.M. (1982), “The Rainbow Test for Lack of Fit in Regression,” Communications in Statistics, 11: 28012815. Velleman, P. and R. Welsch (1981), “Eﬃcient Computing of Regression Diagnostics,” The American Statistician, 35: 234242. White, H. (1980), “A Heteroskedasticity–Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity,” Econometrica, 48: 817838. White, H. (1982), “Maximum Likelihood Estimation of Misspeciﬁed Models,” Econometrica, 50: 125. Wooldridge, J.M. (1995), “Diagnostic Testing,” Chapter 9 in B.H. Baltagi (ed.) A Companion to Theoretical Econometrics (Blackwell: Massachusetts). Wu, P. (1993), “Variable Addition Test,” Econometric Theory, Problem 93.1.2, 9: 145146.
CHAPTER 9
Generalized Least Squares 9.1
Introduction
This chapter considers a more general variance covariance matrix for the disturbances. In other words, u ∼ (0, σ 2 In ) is relaxed so that u ∼ (0, σ 2 Ω) where Ω is a positive deﬁnite matrix of dimension (n×n). First Ω is assumed known and the BLUE for β is derived. This estimator turns out to be different from β OLS , and is denoted by β GLS , the Generalized Least Squares estimator of β. Next, we study the properties of β OLS under this nonspherical form of the disturbances. It turns out that the OLS estimates are still unbiased and consistent, but their standard errors as computed by standard regression packages are biased and inconsistent and lead to misleading inference. Section 9.3 studies some special forms of Ω and derive the corresponding BLUE for β. It turns out that heteroskedasticity and serial correlation studied in Chapter 5 are special cases of Ω. Section 9.4 introduces normality and derives the maximum likelihood estimator. Sections 9.5 and 9.6 study the way in which test of hypotheses and prediction get affected by this general variancecovariance assumption on the disturbances. Section 9.7 studies the properties of this BLUE for β when Ω is unknown, and is replaced by a consistent estimator. Section 9.8 studies what happens to the W, LR and LM statistics when u ∼ N (0, σ 2 Ω). Section 9.9 gives another application of GLS to spatial autocorrelation.
9.2
Generalized Least Squares
The regression equation did not change, only the variancecovariance matrix of the disturbances. It is now σ 2 Ω rather than σ 2 In . However, we can rely once again on a result from matrix algebra to transform our nonspherical disturbances back to spherical form, see the Appendix to Chapter 7. This result states that for every positive deﬁnite matrix Ω, there exists a nonsingular matrix P such that P P ′ = Ω. In order to use this result, we transform the original model y = Xβ + u
(9.1)
by premultiplying it by P −1 . We get P −1 y = P −1 Xβ + P −1 u
(9.2)
Deﬁning y ∗ as P −1 y and X ∗ and u∗ similarly, we have y ∗ = X ∗ β + u∗
(9.3)
with u∗ having 0 mean and var(u∗ ) = P −1 var(u)P −1′ = σ 2 P −1 ΩP ′−1 = σ 2 P −1 P P ′ P ′−1 = σ 2 In . Hence, the variancecovariance of the disturbances in (9.3) is a scalar times an identity matrix. Therefore, using the results of Chapter 7, the BLUE for β in (9.1) is OLS on the transformed model in (9.3) ∗′ ∗ −1 ∗′ ∗ ′ −1′ −1 β P X)−1 X ′ P −1′ P −1 y = (X ′ Ω−1 X)−1 X ′ Ω−1 y (9.4) BLU E = (X X ) X y = (X P
222
Chapter 9: Generalized Least Squares
BLU E ) = σ 2 (X ∗′ X ∗ )−1 = σ 2 (X ′ Ω−1 X)−1 . This β BLU E is known as β GLS . Deﬁne with var(β ′ 2 2 Σ = E(uu ) = σ Ω, then Σ differs from Ω only by the positive scalar σ . One can easily verify ′ −1 −1 ′ −1 that β GLS can be alternatively written as β GLS = (X Σ X) X Σ y and that var(β GLS ) = ′ −1 −1 −1 −1 2 (X Σ X) . Just substitute Σ = Ω /σ in this last expression for β GLS and verify that this yields (9.4). It is clear that β GLS differs from β OLS . In fact, since β OLS is still a linear unbiased estimator of β, the GaussMarkov Theorem states that it must have a variance larger than that of β GLS . ′ −1 ′ Using equation (7.5) from Chapter 7, i.e., β OLS = β + (X X) X u it is easy to show that 2 ′ −1 ′ ′ −1 var(β OLS ) = σ (X X) (X ΩX)(X X)
(9.5)
GLS )′ Ω−1 (y − X β GLS )/(n − K) s∗2 = e∗′ e∗ /(n − K) = (y − X β
(9.6)
Problem 1 shows that var(β OLS )− var(β GLS ) is a positive semideﬁnite matrix. Note that 2 (X ′ X)−1 , and hence a regression package that is programmed to var(β ) is no longer σ OLS compute s2 (X ′ X)−1 as an estimate of the variance of β OLS is using the wrong formula. Fur2 2 thermore, problem 2 shows that E(s ) is not in general σ . Hence, the regression package is also wrongly estimating σ 2 by s2 . Two wrongs do not make a right, and the estimate of var(β OLS ) is biased. The direction of this bias depends upon the form of Ω and the X matrix. (We saw some examples of this bias under heteroskedasticity and serial correlation in Chapter 5). Hence, the standard errors and tstatistics computed using this OLS regression are biased. Under heteroskedasticity, one can use the White (1980) robust standard errors for OLS. In this case, = diag[e2 ] where ei denotes the least squares residuals. Σ = σ 2 Ω in (9.5) is estimated by Σ i The resulting tstatistics are robust to heteroskedasticity. Similarly Wald type statistics for 2 ′ −1 in (7.41) by (9.5) with H0 ; Rβ = r can be obtained based on β OLS by replacing σ (X X) 2 Σ = diag[ei ]. In the presence of both serial correlation and heteroskedasticity, one can use the consistent covariance matrix estimate suggested by Newey and West (1987). This was discussed in Chapter 5. To summarize, β OLS is no longer BLUE whenever Ω = In . However, it is still unbiased and consistent. The last two properties do not rely upon the form of the variancecovariance matrix of the disturbances but rather on E(u/X) = 0 and plim X ′ u/n = 0. The standard errors of β OLS as computed by the regression package are biased and any test of hypothesis based on this OLS regression may be misleading. So far we have not derived an estimator for σ 2 . We know however, from the results in Chapter 7, that the transformed regression (9.3) yields a mean squared error that is an unbiased estimator for σ 2 . Denote this by s∗2 which is equal to the transformed OLS residual sum of squares divided by (n − K). Let e∗ denote the vector of OLS residuals from (9.3), this means that −1 (y − X β −1 e e∗ = y ∗ − X ∗ β GLS and GLS = P GLS ) = P = e′GLS Ω−1 eGLS /(n − K)
Note that s∗2 now depends upon Ω−1 .
Necessary and Sufficient Conditions for OLS to be Equivalent to GLS There are several necessary and suﬃcient conditions for OLS to be equivalent to GLS, see Puntanen and Styan (1989) for a historical survey. For pedagogical reasons, we focus on the derivation of Milliken and Albohali (1984). Note that y = PX y + P¯X y. Therefore, replacing y
9.3
Special Forms of Ω
223
GLS by this expression we get in β
′ −1 −1 ′ −1 ′ −1 −1 ′ −1 ¯ ¯ β GLS = (X Ω X) X Ω [PX y + PX y] = β OLS + (X Ω X) X Ω PX y
The last term is zero for every y if and only if X ′ Ω−1 P¯X = 0
(9.7)
Therefore, β GLS = β OLS if and only if (9.7) is true. Another easy necessary and suﬃcient condition to check in practice is the following: PX Ω = ΩPX
(9.8)
see Zyskind (1967). This involves Ω rather than Ω−1 . There are several applications in economics where these conditions are satisﬁed and can be easily veriﬁed, see Balestra (1970) and Baltagi (1989). We will apply these conditions in Chapter 10 on Seemingly Unrelated Regressions, Chapter 11 on simultaneous equations and Chapter 12 on panel data. See also problem 9.
9.3
Special Forms of Ω
If the disturbances are heteroskedastic but not serially correlated, then Ω = diag[σ 2i ]. In this case, P = diag[σ i ], P −1 = Ω−1/2 = diag[1/σ i ] and Ω−1 = diag[1/σ 2i ]. Premultiplying the regression equation by Ω−1/2 is equivalent to dividing the ith observation of this model by σ i . This makes the new disturbance ui /σ i have 0 mean and homoskedastic variance σ 2 , leaving properties ∗ = X /σ for like no serial correlation intact. The new regression runs yi∗ = yi /σ i on Xik i ik i = 1, 2, . . . , n, and k = 1, 2, . . . , K. Speciﬁc assumptions on the form of these σ i ’s were studied in the heteroskedasticity chapter. If the disturbances follow an AR(1) process ut = ρut−1 + ǫt for t = 1, 2, . . . , T ; with ρ ≤ 1 and ǫt ∼IID(0, σ 2ǫ ), then cov(ut , ut−s ) = ρs σ 2u with σ 2u = σ 2ǫ /(1 − ρ2 ). This means that ⎡ ⎤ 1 ρ ρ2 . . . ρT −1 ⎢ ρ 1 ρ . . . ρT −2 ⎥ ⎥ Ω=⎢ (9.9) ⎣ : : : : ⎦ T −1 T −2 T −3 ρ ρ ρ ... 1 and
Ω−1 =
Then
P −1
1 1 − ρ2
⎡
⎢ ⎢ ⎢ =⎢ ⎢ ⎢ ⎣
⎡
⎢ ⎢ ⎢ ⎢ ⎣
1 −ρ 0 ... 0 0 0 −ρ 1 + ρ2 −ρ . . . 0 0 0 : : : : : : 0 0 0 . . . −ρ 1 + ρ2 −ρ 0 0 0 ... 0 −ρ 1
1 − ρ2 0 0 . . . 0 0 −ρ 1 0 ... 0 0 0 −ρ 1 . . . 0 0 : : : : : 0 0 0 . . . −ρ 1 0 0 0 . . . 0 −ρ
0 0 0 : 0 1
⎤ ⎥ ⎥ ⎥ ⎥ ⎦
(9.10)
⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦
(9.11)
224
Chapter 9: Generalized Least Squares
is the matrix that satisﬁes the following condition P −1′ P −1 = (1 − ρ2 )Ω−1 . Premultiplying the regression model by P −1 is equivalent to performing the PraisWinsten transformation. In particular the ﬁrst observation on y becomes y1∗ = 1 − ρ2 y1 and the remaining observations are given by yt∗ = (yt −ρyt−1 ) for t = 2, 3, . . . , T , with similar terms for the X’s and the disturbances. Problem 3 shows that the variance covariance matrix of the transformed disturbances u∗ = P −1 u is σ 2ǫ IT . Other examples where an explicit form for P −1 has been derived include, (i) the MA(1) model, see Balestra (1980); (ii) the AR(2) model, see Lempers and Kloek (1973); (iii) the specialized AR(4) model for quarterly data, see Thomas and Wallis (1971); and (iv) the error components model, see Fuller and Battese (1974) and Chapter 12.
9.4
Maximum Likelihood Estimation
Assuming that u ∼ N (0, σ 2 Ω), the new likelihood function can be derived keeping in mind that u∗ = P −1 u = Ω−1/2 u and u∗ ∼ N (0, σ 2 In ). In this case f (u∗1 , . . . , u∗n ; β, σ 2 ) = (1/2πσ 2 )n/2 exp{−u∗′ u∗ /2σ 2 }
(9.12)
Making the transformation u = P u∗ = Ω1/2 u∗ , we get f (u1 , . . . , un ; β, σ 2 ) = (1/2πσ 2 )n/2 Ω−1/2  exp{−u′ Ω−1 u/2σ 2 }