
HANDBOOK OF STATISTICAL ANALYSIS AND DATA MINING APPLICATIONS

List of Tutorials on the Elsevier Companion Web Page

Note: This list includes all the extra tutorials published with the 1st edition of this handbook (2009). These can be considered “enrichment” tutorials for readers of this 2nd edition. Since the 1st edition of the handbook will not be available after the release of the 2nd edition, these extra tutorials are carried over in their original format/versions of software, as they are still very useful in learning and understanding data mining and predictive analytics, and many readers will want to take advantage of them.

List of extra enrichment tutorials available only on the ELSEVIER COMPANION web page, with data sets as appropriate, for download and use by readers of this 2nd edition of the handbook:

1. TUTORIAL “O”—Boston Housing Using Regression Trees [Field: Demographics]

2. TUTORIAL “P”—Cancer Gene [Field: Medical Informatics & Bioinformatics]

3. TUTORIAL “Q”—Clustering of Shoppers [Field: CRM—Clustering Techniques]

4. TUTORIAL “R”—Credit Risk using Discriminant Analysis [Field: Financial—Banking]

5. TUTORIAL “S”—Data Preparation and Transformation [Field: Data Analysis]

6. TUTORIAL “T”—Model Deployment on New Data [Field: Deployment of Predictive Models]

7. TUTORIAL “V”—Heart Disease Visual Data Mining Methods [Field: Medical Informatics]

8. TUTORIAL “W”—Diabetes Control in Patients [Field: Medical Informatics]

9. TUTORIAL “X”—Independent Component Analysis [Field: Separating Competing Signals]

10. TUTORIAL “Y”—NTSB Aircraft Accidents Reports [Field: Engineering—Air Travel—Text Mining]

11. TUTORIAL “Z”—Obesity Control in Children [Field: Preventive Health Care]

12. TUTORIAL “AA”—Random Forests Example [Field: Statistics—Data Mining]

13. TUTORIAL “BB”—Response Optimization [Field: Data Mining—Response Optimization]

14. TUTORIAL “CC”—Diagnostic Tooling and Data Mining: Semiconductor Industry [Field: Industry—Quality Control]

15. TUTORIAL “DD”—Titanic—Survivors of Ship Sinking [Field: Sociology]

16. TUTORIAL “EE”—Census Data Analysis [Field: Demography—Census]

17. TUTORIAL “FF”—Linear & Logistic Regression—Ozone Data [Field: Environment]

18. TUTORIAL “GG”—R-Language Integration—DISEASE SURVIVAL ANALYSIS Case Study [Field: Survival Analysis—Medical Informatics]

19. TUTORIAL “HH”—Social Networks Among Community Organizations [Field: Social Networks—Sociology & Medical Informatics]

20. TUTORIAL “II”—Nairobi, Kenya Baboon Project: Social Networking Among Baboon Populations in Kenya on the Laikipia Plateau [Field: Social Networks]

21. TUTORIAL “JJ”—Jackknife and Bootstrap Data Miner Workspace and MACRO [Field: Statistics Resampling Methods]

22. TUTORIAL “KK”—Dahlia Mosaic Virus: A DNA Microarray Analysis of 10 Cultivars from a Single Source: Dahlia Garden in Prague, Czech Republic [Field: Bioinformatics]

The final companion site URL will be https://www.elsevier.com/books-and-journals/book-companion/9780124166325.

Foreword 2 for 1st Edition

A November 2008 search on https://www.amazon.com/ for “data mining” books yielded over 15,000 hits—including 72 to be published in 2009. Most of these books either describe data mining in very technical and mathematical terms, beyond the reach of most individuals, or approach data mining at an introductory level without sufficient detail to be useful to the practitioner. The Handbook of Statistical Analysis and Data Mining Applications is the book that strikes the right balance between these two treatments of data mining.

This volume is not a theoretical treatment of the subject—the authors themselves recommend other books for this—but rather contains a description of data mining principles and techniques in a series of “knowledge-transfer” sessions, where examples from real data mining projects illustrate the main ideas. This aspect of the book makes it most valuable for practitioners, whether novice or more experienced. While it would be easier for everyone if data mining were merely a matter of finding and applying the correct mathematical equation or approach for any given problem, the reality is that both “art” and “science” are necessary. The “art” in data mining requires experience: when one has seen and overcome the difficulties in finding solutions from among the many possible approaches, one can apply newfound wisdom to the next project. However, this process takes considerable time, and particularly for data mining novices, the iterative process inevitable in data mining can lead to discouragement when a “textbook” approach doesn't yield a good solution.

This book is different; it is organized with the practitioner in mind. The volume is divided into four parts. Part I provides an overview of analytics from a historical perspective and frameworks from which to approach data mining, including CRISP-DM and SEMMA. These chapters will provide a novice analyst an excellent overview by defining terms and methods to use and will provide program managers a framework from which to approach a wide variety of data mining problems. Part II describes algorithms, though without extensive mathematics. These will appeal to practitioners who are or will be involved with day-to-day analytics and need to understand the qualitative aspects of the algorithms. The inclusion of a chapter on text mining is particularly timely, as text mining has shown tremendous growth in recent years.

Part III provides a series of tutorials that are both domain-specific and software-specific. Any instructor knows that examples make the abstract concept more concrete, and these tutorials accomplish exactly that. In addition, each tutorial shows how the solutions were developed using popular data mining software tools, such as Clementine, Enterprise Miner, Weka, and STATISTICA. The step-by-step specifics will assist practitioners in learning not only how to approach a wide variety of problems but also how to use these software products effectively. Part IV presents a look at the future of data mining, including a treatment of model ensembles and “The Top 10 Data Mining Mistakes,” from the popular presentation by Dr. Elder.

However, the book is best read a few chapters at a time while actively doing the data mining, rather than cover to cover (a daunting task for a book this size). Practitioners will appreciate the tutorials that match their business objectives and can ignore the others. They may read the section on a particular algorithm to increase their insight into it and then add a second algorithm after the first is mastered. For those new to a particular software tool highlighted in the tutorials section, the step-by-step approach will operate much like a user's manual. Many chapters stand well on their own, such as the excellent “History of Statistics and Data Mining” chapter and Chapters 16, 17, and 18. These are broadly applicable and should be read by even the most experienced data miners.

The Handbook of Statistical Analysis and Data Mining Applications is an exceptional book that should be on every data miner's bookshelf or, better yet, found lying open next to their computer.

Dean Abbott
Abbott Analytics, San Diego, CA, United States

Preface

Much has happened in the professional discipline previously known as data mining since the first edition of this book was written in 2008. The discipline has broadened and deepened to a very large extent, requiring a major reorganization of its elements. A new parent discipline, data science, was formed; it includes the previous subjects and activities of data mining along with many new elements of the scientific study of data, including storage structures optimized for analytic use, data ethics, and the performance of many activities in business, industry, and education. Analytic aspects that used to be included in data mining have broadened considerably to include image analysis, facial recognition, industrial performance and control, threat detection, fraud detection, astronomy, national security, weather forecasting, and financial forensics. Consequently, several subdisciplines have arisen to contain the various specialized data analytic applications. These subdisciplines of data science include the following:

• Machine learning—analytic algorithm design and optimization

• Data mining—generally restricted in scope now to pattern recognition apart from causes and interpretation

• Predictive analytics—using algorithms to predict things, rather than describe them or manage them

• Statistical analysis—use of parametric statistical algorithms for analysis and prediction

• Industrial statistical analysis—analytic techniques to control and direct industrial operations

• Operations research—decision science and optimization of business processes

• Stock market quants—focused on stock market trading and portfolio optimization

• Data engineering—focused on optimizing data flow through memories and storage structures

• Business intelligence—focused primarily on descriptive aspects of data but predictive aspects are coming

• Business analytics—focused primarily on the predictive aspects of data but is merging with descriptives (based on an article by Vincent Granville published at http://www.datasciencecentral.com/profiles/blogs/17-analytic-disciplines-compared)

In this book, we will use the terms “data mining” and “predictive analytics” synonymously, even though data mining also includes many descriptive operations. Modern data mining tools, like the ones featured in this book, permit ordinary business analysts to follow a path through the data mining process to create models that are “good enough.” These less-than-optimal models are far better at leveraging faint patterns in databases to solve problems than the older ways of doing it. These tools provide default configurations and automatic operations, which shield the user from the technical complexity underneath. A crude analogy is the automobile interface: you don't have to be a chemical engineer or a physicist who understands moments of force to be able to operate a car. All you have to do is learn to turn the key in the ignition, step on the gas and the brake at the right times, and turn the wheel to change direction safely, and voilà, you are an expert user of the very complex technology under the hood. The other half of the story is the instruction manual and the driver's education course that help you learn how to drive.

This book provides the instruction manual and a series of tutorials to train you how to do data mining in many subject areas. We provide both the right tools and the right intuitive explanations (rather than formal mathematical definitions) of the data mining process and algorithms, which will enable even beginner data miners to understand the basic concepts necessary to understand what they are doing. In addition, we provide many tutorials in many different industries and businesses (using many of the most common data mining tools) to show how to do it.

OVERALL ORGANIZATION OF THIS BOOK

We have divided the chapters in this book into four parts to guide you through the aspects of predictive analytics. Part I covers the history and process of predictive analytics. Part II discusses the algorithms and methods used. Part III is a group of tutorials, which serve in principle as Rome served—as the central governing influence. Part IV presents some advanced topics. The central theme of this book is the education and training of beginning data mining practitioners, not the rigorous academic preparation of algorithm scientists. Hence, we located the tutorials in the middle of the book in Part III, flanked by topical chapters in Parts I, II, and IV.

This approach is “a mile wide and an inch deep” by design, but there is a lot packed into that inch. There is enough here to stimulate you to take deeper dives into theory, and there

is enough here to permit you to construct “smart enough” business operations with a relatively small amount of the right information. James Taylor developed this concept for automating operational decision-making in the area of enterprise decision management (Raden and Taylor, 2007). Taylor recognized that companies need decision-making systems that are automated enough to keep up with the volume and time-critical nature of modern business operations. These decisions should be deliberate, precise, and consistent across the enterprise; smart enough to serve immediate needs appropriately; and agile enough to adapt to new opportunities and challenges in the company. The same concept can be applied to nonoperational systems for customer relationship management (CRM) and marketing support. Even though a CRM model for cross-sell may not be optimal, it may enable several times the response rate in product sales following a marketing campaign. Models like this are “smart enough” to drive companies to the next level of sales. When models like this are proliferated throughout the enterprise to lift all sales to the next level, more refined models can be developed to do even better. This enterprise-wide “lift” in intelligent operations can drive a company through evolutionary rather than revolutionary changes to reach long-term goals, and companies can leverage such “smart enough” decision systems in their pursuit of optimal profitability. Clearly, the use of this book and these tools will not make you experts in data mining. Nor will the explanations in the book permit you to understand the complexity of the theory behind the algorithms and methodologies so necessary for the academic student. But we will conduct you through a relatively thin slice across the wide practice of data mining in many industries and disciplines. We can show you how to create powerful

• Determine the best ways to do each activity.

• Practice the best ways for each activity over and over again.

This practice trains the sensory and motor neurons to function together most efficiently. Individual neurons in the brain are “trained” in practice by adjusting signal strengths and firing thresholds of the motor nerve cells. The performance of a successful hurdler follows the “model” of these activities and the process of coordinating them to run the race. Creation of an analytic “model” of a business process to predict a desired outcome follows a very similar path to the training regimen of a hurdler. We will explore this subject further in Chapter 3 and apply it to develop a data mining process that expresses the basic activities and tasks performed in creating an analytic model.
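The idea of “training” by adjusting signal strengths and firing thresholds can be made concrete with a single artificial neuron. The sketch below is our own illustration, not from the book: a minimal perceptron that learns the logical AND function by nudging its weights and its firing threshold after each mistake, loosely mirroring the repeated practice described above.

```python
# A single artificial "neuron": weighted inputs compared to a firing threshold.
# Training nudges the weights (signal strengths) and the threshold after each
# mistake, much as repeated practice tunes motor neurons.
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]  # logical AND

weights = [0.0, 0.0]
threshold = 0.0
learning_rate = 0.1

for _ in range(20):                       # repeated "practice" passes
    for inputs, target in samples:
        fired = int(sum(w * x for w, x in zip(weights, inputs)) > threshold)
        error = target - fired
        weights = [w + learning_rate * error * x
                   for w, x in zip(weights, inputs)]
        threshold -= learning_rate * error    # raise/lower the firing threshold

def predict(inputs):
    return int(sum(w * x for w, x in zip(weights, inputs)) > threshold)

print([predict(i) for i, _ in samples])   # learned AND behavior
```

The analogy is loose, of course: real neural training involves vastly more units, but the adjust-on-error loop is the same basic mechanism.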

HUMAN INTUITION

In humans, the right side of the brain is the center for visual and esthetic sensibilities. The left side of the brain is the center for quantitative and time-regulated sensibilities. Human intuition is a blend of both sensibilities. This blend is facilitated by the neural connections between the right side of the brain and the left side. In women, the number of neural connections between the right and left sides of the brain is 20% greater (on average) than in men. This higher connectivity of women's brains enables them to exercise intuitive thinking to a greater extent than men. Intuition “builds” a model of reality from both quantitative building blocks and visual sensibilities (and memories).

PUTTING IT ALL TOGETHER

Biological taxonomy students claim (in jest) that there are two kinds of people in taxonomy—those who divide things up into two classes (for dichotomous keys) and those who don't. Along with this joke is a similar recognition from the outside that taxonomists are divided also into two classes: the “lumpers” (who combine several species into one) and the “splitters” (who divide one species into many). These distinctions point to a larger dichotomy in the way people think.

In ecology, there used to be two schools of thought: the autoecologists (chemistry, physics, and mathematics explain all) and the synecologists (organism relationships in their environment explain all). It wasn't until the 1970s that these two schools learned that both perspectives were needed to understand the complexities in ecosystems (but more about that later). In business, there are “big picture” people versus “detail” people. Some people learn by following an intuitive pathway from general to specific (deduction); often, we call them “big picture” people. Other people learn by following an intuitive pathway from specific to general (induction); often, we call them “detail” people. Similar distinctions are reflected in many aspects of our society. In Chapter 1, we will explore this distinction in greater depth in regard to the development of statistical and data mining theory through time.

Many of our human activities involve finding patterns in the data input to our sensory systems. An example is the mental pattern that we develop by sitting in a chair in the middle of a shopping mall and making some judgment about patterns among its clientele. In one mall, people of many ages and races may intermingle. You might conclude from this pattern that this mall is located in an ethnically diverse area. In another mall, you might see a very different pattern. In one mall in Toronto, a great many of the stores had Chinese titles and script on the windows. One observer noticed that he was the only non-Asian seen for a half hour. This led to the conclusion that the mall catered to the Chinese community and was owned (probably) by a Chinese company or person. Statistical methods employed in testing this “hypothesis” would include the following:

• Performing a survey of customers to gain empirical data on race, age, length of time in the United States, etc.

• Calculating means (averages) and standard deviations (an expression of the average variability of all the customers around the mean).

• Using the mean and standard deviation for all observations to calculate a metric (e.g., student's t-value) to compare with standard tables.

• If the metric exceeds the standard table value, this attribute (e.g., race) is present in the data at a higher rate than expected at random.
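As a rough sketch of the steps just listed, the calculation might look like the following. The survey data and the table value are invented for illustration; they are not figures from the book.

```python
import math

# Hypothetical survey results: 1 if a sampled shopper belongs to the
# attribute category of interest (e.g., a given ethnic group), else 0.
observations = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]

n = len(observations)
mean = sum(observations) / n                                  # sample proportion
variance = sum((x - mean) ** 2 for x in observations) / (n - 1)
std_dev = math.sqrt(variance)                                 # sample std deviation

# Expected proportion if the attribute occurred at random (assumed 0.5 here).
expected = 0.5
t_value = (mean - expected) / (std_dev / math.sqrt(n))        # Student's t-value

# Critical value from a standard t-table (alpha = 0.05, df = 19).
t_critical = 2.093
attribute_overrepresented = t_value > t_critical
print(mean, round(t_value, 2), attribute_overrepresented)
```

If the computed t-value exceeds the table value, as it does here, the attribute appears in the data at a higher rate than random chance would suggest.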

More advanced statistical techniques can accept data from multiple attributes and process them in combination to produce a metric (e.g., average squared error) that reflects how well a subset of attributes (selected by the processing method) predicts the desired outcome. This process “builds” an analytic equation, using standard statistical methods. This analytic “model” is based on averages across the range of variation of the input attribute data. This approach to finding the pattern in the data is basically a deductive, top-down process (general to specific).

The general part is the statistical model employed for the analysis (i.e., normal parametric model). This approach to model building is very “Platonic.” In Chapter 1, we will explore the distinctions between Aristotelian and Platonic approaches for understanding truth in the world around us.
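To make the idea of “building” an analytic equation concrete, here is a minimal sketch of our own (with made-up data, not an example from the book) that fits a one-variable least-squares equation and scores it with the average squared error mentioned above:

```python
# Hypothetical data: one predictor attribute (x) and a desired outcome (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for y = a + b*x ("building" the analytic equation).
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Average squared error: how well the fitted equation predicts the outcome.
mse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n
print(round(a, 3), round(b, 3), round(mse, 4))
```

The fitted equation is built from averages across the data, exactly as described: the slope and intercept are functions of the means, and the average squared error summarizes the remaining variation.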

Part I—Introduction and overview of data mining processes.

Both statistical analysis and data mining algorithms operate on patterns: statistical analysis uses a predefined pattern (i.e., the parametric model) and compares some measure of the observations to the standard metrics of the model. We will discuss this approach in more detail in Chapter 1. Data mining doesn't start with a model; it builds a model with the data. Thus, statistical analysis uses a model to characterize a pattern in the data; data mining uses the pattern in the data to build a model. The statistical approach uses deductive reasoning, following a Platonic approach to truth: from the “model” accepted in the beginning (based on the mathematical distributions assumed), outcomes are deduced. Data mining methods, on the other hand, discover patterns in data inductively rather than deductively, following a more Aristotelian approach to truth. We will unpack this distinction to a much greater extent in Chapter 1.
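The contrast between starting with a model and building one from the data can be illustrated with a toy example (ours, not the authors'): the deductive style assumes a functional form up front and merely estimates its parameters, while the inductive style assumes no form and searches the data for a pattern, here a simple one-level decision stump.

```python
# Contrived data: x = customer tenure in years, label = 1 if the customer churned.
data = [(1, 1), (2, 1), (3, 1), (4, 0), (5, 0), (6, 0), (7, 0), (8, 0)]

# Deductive (statistical) style: assume a model form in advance -- here,
# "churn falls linearly with tenure" -- and estimate its parameter from data.
xs = [x for x, _ in data]
ys = [y for _, y in data]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / \
        sum((x - mean_x) ** 2 for x in xs)

# Inductive (data mining) style: assume no form; search the data for the
# split point that best separates the classes (a one-level decision stump).
def misclassified(threshold):
    # predict churn when tenure is below the threshold
    return sum((x < threshold) != bool(y) for x, y in data)

best_split = min((x + 0.5 for x, _ in data), key=misclassified)
print(round(slope, 3), best_split)
```

The first approach deduces outcomes from an assumed linear form; the second induces a rule (split at 3.5 years) directly from the pattern in the data.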

Which is the best way to do it? The answer is that it depends. It depends on the data. Some data sets can be analyzed better with statistical analysis techniques, and other data sets can be analyzed better with data mining techniques. How do you know which approach to use for a given data set? Much ink has been spilled trying to answer that question. We will not add to that effort. Rather, we will provide a guide to general analytic theory (Chapter 2) and broad analytic procedures (Chapter 3) that can be used with techniques for either approach. For the sake of simplicity, we will refer to the joint body of techniques as analytics. In Chapter 4, we introduce some of the many data preparation procedures for analytics.

Chapter 5 presents various methods for selecting candidate predictor variables to be used in a statistical or machine-learning model (differences between statistical and machine-learning methods of model building are discussed in Chapter 1). Chapter 6 introduces accessory tools and some advanced features of many data mining tools.

Part II—Basic and advanced algorithms, and their application to common problems. Chapters 7 and 8 discuss various basic and advanced algorithms used in data mining modeling applications. Chapters 9 and 10

that combines art and science: intuition and expertise in collecting and understanding data in order to make accurate models that realistically predict the future, leading to informed strategic decisions and correct actions that ensure success, before it is too late. Today, numeracy is as essential as literacy. As John Elder likes to say: ‘Go data mining!’ It really does save enormous time and money. For those with the patience and faith to get through the early stages of business understanding and data transformation, the cascade of results can be extremely rewarding.”

Gary Miner, September, 2017

Biographies of the Primary Authors of This Book

BOB NISBET, PHD

Bob was trained initially in ecology and ecosystems analysis. He has over 30 years of experience in complex system analysis and modeling, most recently as a researcher (University of California, Santa Barbara). In business, he pioneered the design and development of configurable data mining applications for retail sales forecasting and for churn, propensity-to-buy, and customer-acquisition modeling in the telecommunications, insurance, banking, and credit industries. In addition to data mining, he has expertise in data warehousing technology for extract, transform, and load (ETL) operations, business intelligence reporting, and data quality analyses. He is the lead author of the Handbook of Statistical Analysis and Data Mining Applications (Academic Press, 2009) and a coauthor of Practical Text Mining (Academic Press, 2012) and Practical Predictive Analytics and Decisioning Systems for Medicine (Academic Press, 2015). Currently, he serves as an instructor in the University of California Predictive Analytics Certificate Program, teaching online courses in effective data preparation and coteaching introduction to predictive analytics. Additionally, Bob is in the last stages of writing another book on data preparation for predictive analytic modeling.

GARY D. MINER, PHD

Dr. Gary Miner received a BS from Hamline University, St. Paul, MN, with biology, chemistry, and education majors; an MS in zoology and population genetics from the University of Wyoming; and a PhD in biochemical genetics from the University of Kansas as the recipient of a NASA predoctoral fellowship. He pursued additional National Institutes of Health postdoctoral studies at the University of Minnesota and the University of Iowa, eventually becoming immersed in the study of affective disorders and Alzheimer’s disease.

In 1985, he and his wife, Dr. Linda Winters-Miner, founded the Familial Alzheimer’s Disease Research Foundation, which became a leading force in organizing both local and international scientific meetings, bringing together all the leaders in the field of genetics of Alzheimer’s from several countries,

resulting in the first major book on the genetics of Alzheimer’s disease. In the mid-1990s, Dr. Miner turned his data analysis interests to the business world, joining the team at StatSoft and deciding to specialize in data mining. He started developing what eventually became the Handbook of Statistical Analysis and Data Mining Applications (coauthored with Dr. Robert A. Nisbet and Dr. John Elder), which received the 2009 American Publishers Award for Professional and Scholarly Excellence (PROSE). Their follow-up collaboration, Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, also received a PROSE award in February 2013. Gary was also the coauthor of Practical Predictive Analytics and Decisioning Systems for Medicine (Academic Press, 2015). Overall, Dr. Miner’s career has focused on medicine and health issues and the use of data analytics (statistics and predictive analytics) in analyzing medical data to decipher fact from fiction.

Gary has also served as a merit reviewer for the Patient-Centered Outcomes Research Institute (PCORI), which awards grants for predictive analytics research into the comparative effectiveness and heterogeneous treatment effects of medical interventions, including drugs, among different genetic groups of patients. Additionally, he teaches online classes in “introduction to predictive analytics,” “text analytics,” “risk analytics,” and “healthcare predictive analytics” for the University of California, Irvine. Until his “official retirement” 18 months ago, he spent most of his time in his primary role as senior analyst/health-care applications specialist for the Dell | Information Management Group, Dell Software (through Dell’s acquisition of StatSoft (www.StatSoft.com) in April 2014). Currently, Gary is working on two new short popular books on “health-care solutions for the United States” and “patient-doctor genomics stories.”

KENNETH P. YALE, DDS, JD

Dr. Yale has a track record of business development, product innovation, and strategy in both entrepreneurial and large companies across health-care industry verticals, including health payers, life sciences, and government programs. He is an agile executive who identifies future industry trends and seizes opportunities to build sustainable businesses. His accomplishments include innovations in health insurance, care management, data science, big data health-care analytics, clinical decision support, and precision medicine. His prior experience includes medical director and vice president of clinical solutions at ActiveHealth Management/ Aetna, chief executive of innovation incubator business unit at UnitedHealth Group Community & State, strategic counsel for Johnson & Johnson, corporate vice president of CorSolutions and Matria Healthcare, senior vice president and general counsel at EduNeering, and founder and CEO of Advanced Health Solutions. Dr. Yale previously worked in the federal government as a commissioned officer in the US Public Health Service, legislative counsel in the US Senate, special assistant to the president and executive director of the White House Domestic Policy Council, and chief of staff of the White House Office of Science and Technology.

Dr. Yale provides leadership and actively participates with industry organizations, including the American Medical Informatics Association workgroup on genomics and translational bioinformatics, the Bloomberg/BNA Health Insurance Advisory Board, the Healthcare Information and Management Systems Society, and the URAC accreditation organization. He is a frequent speaker and author on health and technology topics, including the book Managed Care and Clinical Integration: Population Health and Accountable Care; he is also a tutorial author in Practical Predictive Analytics and Decisioning Systems for Medicine and an editor of the Handbook of Statistical Analysis and Data Mining Applications, Second Edition.

PART I

HISTORY OF PHASES OF DATA ANALYSIS, BASIC THEORY, AND THE DATA MINING PROCESS

1

The Background for Data Mining Practice

PREAMBLE

You must be interested in learning how to practice data mining; otherwise, you would not be reading this book. We know that there are many books available that will give a good introduction to the process of data mining. Most books on data mining focus on the features and functions of various data mining tools or algorithms. Some books do focus on the challenges of performing data mining tasks. This book is designed to give you an introduction to the practice of data mining in the real world of business.

One of the first things considered in building a business data mining capability in a company is the selection of the data mining tool. It is difficult to penetrate the hype erected around the descriptions of these tools by the vendors. The fact is that even the most mediocre of data mining tools can create models that are at least 90% as good as the best tools. A 90% solution performed with a relatively cheap tool might be more cost-effective in your organization than one built with a more expensive tool. How do you choose your data mining tool? Few reviews are available. The best listing of tools by popularity is maintained and updated yearly at http://KDNuggets.com. Some detailed reviews available in the literature go beyond just a discussion of the features and functions of the tools (see Nisbet, 2006, Parts 1–3). The interest in an unbiased and detailed comparison is great. We are told that the “most downloaded document in data mining” is the comprehensive but decade-old tool review by Elder and Abbott (1998).

The other considerations in building a business's data mining capability are forming the data mining team, building the data mining platform, and forming a foundation of good data mining practice. This book will not discuss the building of the data mining platform. This subject is discussed in many other books, some in great detail. A good overview of how to build a data mining platform is presented in Data Mining: Concepts and Techniques (Han and Kamber, 2006). The primary focus of this book is to present a practical approach to building cost-effective data mining models aimed at increasing company profitability, using tutorials and demo versions of common data mining tools.

Just as important as these considerations in practice is the background against which they must be performed. We must not imagine that the background doesn't matter… it does matter, whether or not we recognize it initially. It matters because the capabilities of statistical and data mining methodology were not developed in a vacuum. Analytic methodology was developed in the context of prevailing statistical and analytic theory. But the major driver in this development was a very pressing need to provide a simple and repeatable analysis methodology in medical science. From this beginning developed modern statistical analysis and data mining. To understand the strengths and limitations of this body of methodology and use it effectively, we must understand the strengths and limitations of the statistical theory from which it developed. This theory was developed by scientists and mathematicians who brought structure to previous thinking and combined it with original thinking of their own. But this thinking was not one-sided or unidirectional; several views arose on how to solve analytic problems. To understand how to approach the solving of an analytic problem, we must understand the different ways different people tend to think. The history of the statistical theory behind the development of various statistical techniques bears strongly on the ability of a technique to serve the tasks of a data mining project.

DATA MINING OR PREDICTIVE ANALYTICS?

This discipline used to be known as “data mining” or “knowledge discovery in databases” (KDD), and to some extent it still is. Fayyad et al. (1996) proposed the distinction between these classical terms: data mining is the application of mathematical algorithms for extracting patterns from data, while KDD includes many other process steps before and after the pattern-recognition operations. Currently, however, the vast majority of practitioners have traded those terms for the common term “predictive analytics” (PA). The new term was popularized initially by the fine conferences started by Eric Siegel, Predictive Analytics World (PAW). In this book, we will use the terms PA and data mining synonymously, recognizing that the latter term has broadened in the meantime to include many elements of KDD. In addition to data mining and predictive analytics, a third term, “data science,” has arisen, initially in academic circles, but it has spread recently to many other disciplines as well. How can we keep these terms straight?

According to Smith (2016), the current working definitions of predictive analytics (he calls it “data analytics”) and data science are inadequate for most purposes; at best, the distinctions are murky. One way to dispel this murk is to consider the goals each is designed to pursue. Predictive analytics seeks to provide operational insights into issues that we either know we know or know we don't know. It focuses on correlative analysis and predicts relationships between known random variables or sets of data in order to identify how an event will occur in the future, for example, identifying where to sell personal power generators and where to locate stores as a function of future weather conditions (e.g., storms). While the weather may not cause the buying behavior, it often correlates strongly with future sales.
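The generator example above can be made concrete with a toy sketch of correlative analysis. The sketch below is not from the book and uses made-up numbers: hypothetical monthly severe-storm counts for a region and generator units sold in the same months. It computes the Pearson correlation between the two series and fits a simple least-squares line so that a storm forecast can be turned into a sales prediction.

```python
# Toy illustration of predictive analytics as correlative analysis.
# All data below are invented for the sketch; stdlib only.
storms = [1, 0, 3, 2, 5, 4, 0, 6]      # hypothetical monthly storm counts
sales = [12, 8, 30, 22, 48, 41, 9, 55]  # hypothetical generator units sold

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def linear_fit(xs, ys):
    """Least-squares line ys ~ a + b * xs; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

r = pearson_r(storms, sales)
a, b = linear_fit(storms, sales)
forecast_storms = 3  # e.g., from a weather forecast
print(f"storm/sales correlation: {r:.3f}")
print(f"predicted sales for {forecast_storms} forecast storms: "
      f"{a + b * forecast_storms:.1f}")
```

The point of the sketch is the one the text makes: the model exploits correlation, not causation. A strong positive correlation and a positive slope let us predict sales from a storm forecast without claiming that storms cause the purchases.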

On the other hand, the goal of data science is to provide strategic actionable insights into the world where we don't know what we don't know. For example, it might focus on trying to identify a future technology that doesn't exist today, but that might have a big impact on an organization in the future.

Smith (2016) claims that assigning the central focus of predictive analytics to operations and the central focus of data science to strategy helps us understand how both activities are integrated into enterprise information management (EIM). EIM consists of the capabilities necessary for managing today's large-scale data assets. In addition to relational databases, data warehouses, and data marts, we see the emergence of big data solutions (e.g., Hadoop). Both data science and predictive analytics leverage data assets to provide day-to-day operational insights into very different functions, such as counting assets and predicting inventory.

There are many books available that will give a good introduction to the process of predictive analytics, such as Larose and Larose (2015), which focuses on the mathematical algorithms and model evaluation methods. Other books focus on the features and functions of data mining tools or algorithms. These books leave many users with the question, “How can I apply these techniques in my company?” This book is designed to give you an introduction to the practice of predictive analytics in the real world of business.

One of the first things to consider in building a predictive analytics capability in an organization is selecting the data mining tool. It is difficult to penetrate the hype that vendors erect around the descriptions of these tools. The fact is that even the most mediocre of data mining tools can create models that are at least 90% as good as the best tools. A 90% solution performed with a relatively cheap tool might be more cost-effective in your organization than one built with a more expensive tool. How, then, do you choose your data mining tool? Few deep reviews are available, although some go beyond a discussion of features and functions (see Nisbet, 2004, 2006). The best listing of tools by popularity is maintained and updated yearly by KDnuggets.com. In addition, many other blogs and newsletters provide much useful information. Several of the important blogs include the following:

• Data Science Central (available through LinkedIn) (https://www.linkedin.com/groups/35222)

• Machine Learning and Data Mining (available through LinkedIn) (https://www.linkedin.com/groups/4298680)

• Predictive Analytics World blog by Eric Siegel (http://www.predictiveanalyticsworld.com/blog/)

A much more complete list is available on KDnuggets.com.
