Data Mining Methods and Models


Zero cells play havoc with the logistic regression solution, causing instability in the analysis and leading to possibly unreliable results. Rather than omitting the categories with zero cells, we may try to collapse or redefine the categories so that the formerly empty cells contain some records. The logistic regression results should always be validated using either the model diagnostics and goodness-of-fit statistics shown in Hosmer and Lemeshow [1] or the traditional data mining cross-validation methods.
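As a minimal sketch of the collapsing idea, the following Python snippet tabulates a categorical predictor against a binary response, lists any zero cells, and then merges a sparse category into its neighbor. The records, the category names, and the helper functions `cell_counts` and `zero_cells` are hypothetical, invented purely for illustration; they do not come from any of the data sets cited in this chapter.

```python
from collections import Counter

# Hypothetical records: (predictor category, binary response).
# The counts are invented for illustration only.
records = (
    [("low", 0)] * 30 + [("low", 1)] * 10
    + [("medium", 0)] * 20 + [("medium", 1)] * 15
    + [("high", 0)] * 12 + [("high", 1)] * 8
    + [("very high", 1)] * 5   # no response-0 records here: a zero cell
)

def cell_counts(rows):
    """Count the records in each (category, response) cell."""
    return Counter(rows)

def zero_cells(rows, responses=(0, 1)):
    """Return the (category, response) cells that contain no records."""
    counts = cell_counts(rows)
    categories = sorted({cat for cat, _ in rows})
    return [(cat, r) for cat in categories for r in responses
            if counts[(cat, r)] == 0]

# Collapse "very high" into "high" (an assumed, analyst-chosen mapping)
# so that every cell of the contingency table contains some records.
collapse = {"very high": "high"}
collapsed = [(collapse.get(cat, cat), resp) for cat, resp in records]

print(zero_cells(records))    # cells with no records before collapsing
print(zero_cells(collapsed))  # cells with no records after collapsing
```

Which categories to merge is a modeling decision that should make sense for the variable in question; the mapping above is only one possible choice. After collapsing, the refitted model should still be validated as described above.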

REFERENCES

1. D. W. Hosmer and S. Lemeshow, Applied Logistic Regression, 2nd ed., Wiley, New York, 2000.
2. P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed., Chapman & Hall, London, 1989.
3. C. R. Rao, Linear Statistical Inference and Its Application, 2nd ed., Wiley, New York, 1973.
4. Churn data set, in C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, University of California, Department of Information and Computer Science, Irvine, CA, 1998. Also available at the book series Web site.
5. Y. M. M. Bishop, S. E. Feinberg, and P. Holland, Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA, 1975.
6. Adult data set, in C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, University of California, Department of Information and Computer Science, Irvine, CA, 1998. Adult data set compiled by Ron Kohavi. Also available at the book series Web site.
7. Cereals data set, in Data and Story Library, http://lib.stat.cmu.edu/DASL/. Also available at the book series Web site.
8. Breast cancer data set, compiled by Dr. William H. Wohlberg, University of Wisconsin Hospitals, Madison, WI; cited in O. L. Mangasarian and W. H. Wohlberg, Cancer diagnosis via linear programming, SIAM News, Vol. 23, No. 5, September 1990.
9. German data set, compiled by Professor Dr. Hans Hofmann, University of Hamburg, Germany. Available from the "Datasets from UCI" site, hosted by Silicon Graphics, Inc. at http://www.sgi.com/tech/mlc/db/. Also available at the book series Web site.

EXERCISES

Clarifying the Concepts

4.1. Determine whether the following statements are true or false. If a statement is false, explain why and suggest how one might alter the statement to make it true.
(a) Logistic regression refers to methods for describing the relationship between a categorical response variable and a set of categorical predictor variables.
(b) Logistic regression assumes that the relationship between the predictor and the response is nonlinear.
(c) π(x) may be interpreted as a probability.

