Applied Analytics Using SAS® Enterprise Miner™
Course Notes

Applied Analytics Using SAS® Enterprise Miner™ Course Notes was developed by Jim Georges, Jeff Thompson, and Chip Wells. Additional contributions were made by Tom Bohannon, Mike Hardin, Dan Kelly, Bob Lucas, and Sue Walsh. Editing and production support was provided by the Curriculum Development and Support Department.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Applied Analytics Using SAS® Enterprise Miner™ Course Notes

Copyright © 2010 SAS Institute Inc. Cary, NC, USA. All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

Book code E1755, course code LWAAEM61/AAEM61, prepared date 01 July 2010.
LWAAEM61_002
ISBN 9781607645931
For Your Information
Table of Contents

Course Description ............................................................... ix
Prerequisites ...................................................................... x

Chapter 1  Introduction ........................................................ 1-1
    1.1  Introduction to SAS Enterprise Miner ............................. 1-3

Chapter 2  Accessing and Assaying Prepared Data ................... 2-1
    2.1  Introduction .......................................................... 2-3
    2.2  Creating a SAS Enterprise Miner Project, Library, and Diagram .. 2-5
         Demonstration: Creating a SAS Enterprise Miner Project ........ 2-6
         Demonstration: Creating a SAS Library ........................... 2-10
         Demonstration: Creating a SAS Enterprise Miner Diagram ......... 2-13
         Exercises ............................................................ 2-14
    2.3  Defining a Data Source ............................................. 2-15
         Demonstration: Defining a Data Source ........................... 2-18
         Exercises ............................................................ 2-35
    2.4  Exploring a Data Source ............................................ 2-36
         Demonstration: Exploring Source Data ............................ 2-37
         Demonstration: Changing the Explore Window Sampling Defaults .. 2-64
         Exercises ............................................................ 2-66
         Demonstration: Modifying and Correcting Source Data ........... 2-67
    2.5  Chapter Summary ..................................................... 2-80

Chapter 3  Introduction to Predictive Modeling: Decision Trees .. 3-1
    3.1  Introduction .......................................................... 3-3
         Demonstration: Creating Training and Validation Data .......... 3-23
    3.2  Cultivating Decision Trees ......................................... 3-28
         Demonstration: Constructing a Decision Tree Predictive Model .. 3-43
    3.3  Optimizing the Complexity of Decision Trees ..................... 3-63
         Demonstration: Assessing a Decision Tree ........................ 3-79
    3.4  Understanding Additional Diagnostic Tools (Self-Study) ......... 3-98
         Demonstration: Understanding Additional Plots and Tables (Optional) .. 3-99
    3.5  Autonomous Tree Growth Options (Self-Study) .................... 3-106
         Exercises ............................................................ 3-114
    3.6  Chapter Summary ..................................................... 3-116
    3.7  Solutions ............................................................ 3-117
         Solutions to Exercises ............................................. 3-117
         Solutions to Student Activities (Polls/Quizzes) ................ 3-133

Chapter 4  Introduction to Predictive Modeling: Regressions ..... 4-1
    4.1  Introduction .......................................................... 4-3
         Demonstration: Managing Missing Values .......................... 4-18
         Demonstration: Running the Regression Node ..................... 4-25
    4.2  Selecting Regression Inputs ........................................ 4-29
         Demonstration: Selecting Inputs .................................. 4-33
    4.3  Optimizing Regression Complexity ................................. 4-38
         Demonstration: Optimizing Complexity ............................ 4-40
    4.4  Interpreting Regression Models .................................... 4-49
         Demonstration: Interpreting a Regression Model ................. 4-51
    4.5  Transforming Inputs ................................................. 4-52
         Demonstration: Transforming Inputs .............................. 4-56
    4.6  Categorical Inputs .................................................. 4-66
         Demonstration: Recoding Categorical Inputs ..................... 4-69
    4.7  Polynomial Regressions (Self-Study) .............................. 4-76
         Demonstration: Adding Polynomial Regression Terms Selectively .. 4-78
         Demonstration: Adding Polynomial Regression Terms Autonomously (Self-Study) .. 4-85
         Exercises ............................................................ 4-88
    4.8  Chapter Summary ..................................................... 4-89
    4.9  Solutions ............................................................ 4-91
         Solutions to Exercises ............................................. 4-91
         Solutions to Student Activities (Polls/Quizzes) ................ 4-102

Chapter 5  Introduction to Predictive Modeling: Neural Networks and Other Modeling Tools .. 5-1
    5.1  Introduction .......................................................... 5-3
         Demonstration: Training a Neural Network ........................ 5-12
    5.2  Input Selection ...................................................... 5-20
         Demonstration: Selecting Neural Network Inputs ................. 5-21
    5.3  Stopped Training ..................................................... 5-24
         Demonstration: Increasing Network Flexibility .................. 5-35
         Demonstration: Using the AutoNeural Tool (Self-Study) ......... 5-39
    5.4  Other Modeling Tools (Self-Study) ................................ 5-46
         Exercises ............................................................ 5-58
    5.5  Chapter Summary ..................................................... 5-59
    5.6  Solutions ............................................................ 5-60
         Solutions to Exercises ............................................. 5-60

Chapter 6  Model Assessment .................................................. 6-1
    6.1  Model Fit Statistics ................................................ 6-3
         Demonstration: Comparing Models with Summary Statistics ....... 6-6
    6.2  Statistical Graphics ................................................ 6-9
         Demonstration: Comparing Models with ROC Charts ............... 6-13
         Demonstration: Comparing Models with Score Rankings Plots ..... 6-18
         Demonstration: Adjusting for Separate Sampling ................. 6-21
    6.3  Adjusting for Separate Sampling ................................... 6-26
         Demonstration: Adjusting for Separate Sampling (continued) .... 6-29
         Demonstration: Creating a Profit Matrix ......................... 6-32
    6.4  Profit Matrices ...................................................... 6-39
         Demonstration: Evaluating Model Profit .......................... 6-42
         Demonstration: Viewing Additional Assessments .................. 6-44
         Demonstration: Optimizing with Profit (Self-Study) ............ 6-46
         Exercises ............................................................ 6-48
    6.5  Chapter Summary ..................................................... 6-49
    6.6  Solutions ............................................................ 6-50
         Solutions to Exercises ............................................. 6-50

Chapter 7  Model Implementation ............................................. 7-1
    7.1  Introduction .......................................................... 7-3
    7.2  Internally Scored Data Sets ........................................ 7-4
         Demonstration: Creating a Score Data Source .................... 7-5
         Demonstration: Scoring with the Score Tool ..................... 7-6
         Demonstration: Exporting a Scored Table (Self-Study) .......... 7-9
    7.3  Score Code Modules .................................................. 7-15
         Demonstration: Creating a SAS Score Code Module ............... 7-16
         Demonstration: Creating Other Score Code Modules .............. 7-22
         Exercises ............................................................ 7-24
    7.4  Chapter Summary ..................................................... 7-25
    7.5  Solutions to Exercises ............................................. 7-26

Chapter 8  Introduction to Pattern Discovery .............................. 8-1
    8.1  Introduction .......................................................... 8-3
    8.2  Cluster Analysis ..................................................... 8-8
         Demonstration: Segmenting Census Data ........................... 8-15
         Demonstration: Exploring and Filtering Analysis Data .......... 8-24
         Demonstration: Setting Cluster Tool Options .................... 8-37
         Demonstration: Creating Clusters with the Cluster Tool ........ 8-41
         Demonstration: Specifying the Segment Count .................... 8-44
         Demonstration: Exploring Segments ................................ 8-46
         Demonstration: Profiling Segments ................................ 8-56
         Exercises ............................................................ 8-61
    8.3  Market Basket Analysis (Self-Study) .............................. 8-63
         Demonstration: Market Basket Analysis ........................... 8-67
         Demonstration: Sequence Analysis ................................. 8-80
         Exercises ............................................................ 8-83
    8.4  Chapter Summary ..................................................... 8-85
    8.5  Solutions ............................................................ 8-86
         Solutions to Exercises ............................................. 8-86

Chapter 9  Special Topics ..................................................... 9-1
    9.1  Introduction .......................................................... 9-3
    9.2  Ensemble Models ...................................................... 9-4
         Demonstration: Creating Ensemble Models ......................... 9-5
    9.3  Variable Selection ................................................... 9-9
         Demonstration: Using the Variable Selection Node ............... 9-10
         Demonstration: Using Partial Least Squares for Input Selection .. 9-14
         Demonstration: Using the Decision Tree Node for Input Selection .. 9-19
    9.4  Categorical Input Consolidation ................................... 9-22
         Demonstration: Consolidating Categorical Inputs ................ 9-23
    9.5  Surrogate Models ..................................................... 9-31
         Demonstration: Describing Decision Segments with Surrogate Models .. 9-32

Appendix A  Case Studies ...................................................... A-1
    A.1  Banking Segmentation Case Study ................................... A-3
    A.2  Web Site Usage Associations Case Study ........................... A-16
    A.3  Credit Risk Case Study .............................................. A-19
    A.4  Enrollment Management Case Study .................................. A-36

Appendix B  Bibliography ...................................................... B-1
    B.1  References ............................................................ B-3

Appendix C  Index .............................................................. C-1
Course Description This course covers the skills required to assemble analysis flow diagrams using the rich tool set of SAS Enterprise Miner for both pattern discovery (segmentation, association, and sequence analyses) and predictive modeling (decision tree, regression, and neural network models).
To learn more...
SAS Education

For information on other courses in the curriculum, contact the SAS Education Division at 1-800-333-7660, or send e-mail to training@sas.com. You can also find this information on the Web at support.sas.com/training/ as well as in the Training Course Catalog.

SAS Publishing

For a list of other SAS books that relate to the topics covered in this Course Notes, USA customers can contact our SAS Publishing Department at 1-800-727-3228 or send e-mail to sasbook@sas.com. Customers outside the USA, please contact your local SAS office. Also, see the Publications Catalog on the Web at support.sas.com/pubs for a complete list of books and a convenient order form.
Prerequisites

Before attending this course, you should be acquainted with Microsoft Windows and Windows-based software. In addition, you should have at least an introductory-level familiarity with basic statistics and regression modeling. Previous SAS software experience is helpful but not required.
Chapter 1  Introduction

1.1  Introduction to SAS Enterprise Miner
SAS Enterprise Miner
The SAS Enterprise Miner 6.1 interface simplifies many common tasks associated with applied analysis. It offers secure analysis management and provides a wide variety of tools with a consistent graphical interface. You can customize it by incorporating your choice of analysis methods and tools.
SAS Enterprise Miner - Interface Tour: menu bar and shortcut buttons
The interface window is divided into several functional components. The menu bar and corresponding shortcut buttons perform the usual Windows tasks, in addition to starting, stopping, and reviewing analyses.
SAS Enterprise Miner - Interface Tour: Project panel
The Project panel is used to manage and view data sources, diagrams, results, and project users.
SAS Enterprise Miner - Interface Tour: Properties panel
The Properties panel enables you to view and edit the settings of data sources, diagrams, nodes, results, and users.
SAS Enterprise Miner - Interface Tour: Help panel
The Help panel displays a short description of the property that you select in the Properties panel. Extended help can be found in the Help Topics selection from the Help main menu.
SAS Enterprise Miner - Interface Tour: Diagram workspace
In the Diagram workspace, process flow diagrams are built, edited, and run. The workspace is where you graphically sequence the tools that you use to analyze your data and generate reports.
SAS Enterprise Miner - Interface Tour: process flow
The Diagram workspace contains one or more process flows. A process flow starts with a data source and sequentially applies SAS Enterprise Miner tools to complete your analytic objective.
SAS Enterprise Miner - Interface Tour: node
A process flow contains several nodes. Nodes are SAS Enterprise Miner tools connected by arrows to show the direction of information flow in an analysis.
SAS Enterprise Miner - Interface Tour: tools palette
The SAS Enterprise Miner tools available to your analysis are contained in the tools palette. The tools palette is arranged according to a process for data mining, SEMMA. SEMMA is an acronym for the following:

Sample: You sample the data by creating one or more data tables. The samples should be large enough to contain the significant information, but small enough to process.

Explore: You explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.

Modify: You modify the data by creating, selecting, and transforming the variables to focus the model selection process.

Model: You model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

Assess: You assess competing predictive models (build charts to evaluate the usefulness and reliability of the findings from the data mining process).
Additional tools are available under the Utility group and, if licensed, the Credit Scoring group.
SEMMA - Sample Tab
• Append
• Data Partition
• File Import
• Filter
• Input Data
• Merge
• Sample
• Time Series
The tools in each tool tab are arranged alphabetically.

The Append tool is used to append data sets that are exported by two different paths in a single process flow diagram. The Append node can also append train, validation, and test data sets into a new training data set.

The Data Partition tool enables you to partition data sets into training, test, and validation data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model during estimation and is also used for model assessment. The test data set is an additional holdout data set that you can use for model assessment. This tool uses simple random sampling, stratified random sampling, or cluster sampling to create partitioned data sets.

The File Import tool enables you to convert selected external flat files, spreadsheets, and database tables into a format that SAS Enterprise Miner recognizes as a data source.

The Filter tool creates and applies filters to your training data set, and optionally, to the validation and test data sets. You can use filters to exclude certain observations, such as extreme outliers and errant data that you do not want to include in your mining analysis.

The Input Data tool represents the data source that you choose for your mining analysis and provides details (metadata) about the variables in the data source that you want to use.

The Merge tool enables you to merge observations from two or more data sets into a single observation in a new data set. The Merge tool supports both one-to-one and match merging.

The Sample tool enables you to take simple random samples, observation samples, stratified random samples, first-n samples, and cluster samples of data sets. For any type of sampling, you can specify either a number of observations or a percentage of the population to select for the sample. If you are working with rare events, the Sample tool can be configured for oversampling or stratified sampling. Sampling is recommended for extremely large databases because it can significantly decrease model training time. If the sample is sufficiently representative, relationships found in the sample can be expected to generalize to the complete data set. The Sample tool writes the sampled observations to an output data set and saves the seed values that are used to generate the random numbers for the samples so that you can replicate the samples.
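The kind of stratified, seeded partitioning that the Data Partition tool performs can be illustrated outside SAS Enterprise Miner. The sketch below is a minimal Python analogy, not the tool's actual implementation; the function name and the 60/20/20 fractions are chosen only for illustration. Note how the saved seed makes the split replicable, and how stratifying on the target preserves the target proportions in each partition.

```python
import random
from collections import defaultdict

def partition(rows, key, fractions=(0.6, 0.2, 0.2), seed=12345):
    """Stratified random partition into train/validation/test sets.

    `rows` is a list of records; `key(row)` returns the stratification
    value (for example, the target level). Fixing `seed` lets the
    split be replicated later, much as saving the seed value does in
    the Sample and Data Partition tools.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:                      # group rows by stratum
        strata[key(row)].append(row)
    train, valid, test = [], [], []
    for level_rows in strata.values():    # split each stratum separately
        rng.shuffle(level_rows)
        n = len(level_rows)
        n_train = int(fractions[0] * n)
        n_valid = int(fractions[1] * n)
        train += level_rows[:n_train]
        valid += level_rows[n_train:n_train + n_valid]
        test += level_rows[n_train + n_valid:]
    return train, valid, test
```

Because each target level is split separately, a rare event keeps the same share of each partition that it has in the full data.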
The Time Series tool converts transactional data to time series data. Transactional data is time-stamped data that is collected over time at no particular frequency. By contrast, time series data is time-stamped data that is summarized over time at a specific frequency. You might have many suppliers and many customers, as well as transaction data that is associated with both. The size of each set of transactions can be very large, which makes many traditional data mining tasks difficult. By condensing the information into a time series, you can discover trends and seasonal variations in customer and supplier habits that might not be visible in transactional data.
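The transactional-to-time-series conversion described above amounts to summarizing time-stamped records at a chosen frequency. A minimal Python sketch (not the Time Series tool itself; the function and field names are invented for illustration):

```python
from collections import Counter
from datetime import date

def to_time_series(transactions, freq_key):
    """Summarize time-stamped transactions at a fixed frequency.

    `transactions` is a list of (timestamp, amount) pairs collected at
    no particular frequency. `freq_key(timestamp)` maps each stamp to
    a period -- for example, (year, month) for a monthly series.
    Returns the per-period totals, sorted by period.
    """
    totals = Counter()
    for stamp, amount in transactions:
        totals[freq_key(stamp)] += amount   # accumulate within the period
    return sorted(totals.items())
```

For example, two January purchases and one February purchase condense to a two-point monthly series, on which trend and seasonality are far easier to see than in the raw transactions.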
SEMMA - Explore Tab
• Association
• Cluster
• DMDB
• Graph Explore
• Market Basket
• MultiPlot
• Path Analysis
• SOM/Kohonen
• StatExplore
• Variable Clustering
• Variable Selection
The Association tool enables you to perform association discovery to identify items that tend to occur together within the data. For example, if a customer buys a loaf of bread, how likely is the customer to also buy a gallon of milk? This type of discovery is also known as market basket analysis. The tool also enables you to perform sequence discovery if a time stamp variable (a sequence variable) is present in the data set. This enables you to take into account the ordering of the relationships among items.

The Cluster tool enables you to segment your data; that is, it enables you to identify data observations that are similar in some way. Observations that are similar tend to be in the same cluster, and observations that are different tend to be in different clusters. The cluster identifier for each observation can be passed to subsequent tools in the diagram.

The DMDB tool creates a data mining database that provides summary statistics and factor-level information for class and interval variables in the imported data set.

The Graph Explore tool is an advanced visualization tool that enables you to explore large volumes of data graphically to uncover patterns and trends and to reveal extreme values in the database. The tool creates a run-time sample of the input data source. You use the Graph Explore node to interactively explore and analyze your data using graphs. Your exploratory graphs are persisted when the Graph Explore Results window is closed. When you reopen the Graph Explore Results window, the persisted graphs are re-created.

The experimental Market Basket tool performs association rule mining over transaction data in conjunction with item taxonomy. Transaction data contains sales transaction records with details about items bought by customers. Market basket analysis uses the information from the transaction data to give you insight about which products tend to be purchased together.
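The bread-and-milk question above is answered by two standard association measures: support (how often two items appear in the same basket) and confidence (given one item, how often the other appears too). A hedged Python sketch of these measures for two-item rules, purely to illustrate the arithmetic behind association discovery (the function name is invented; the Association tool computes much more):

```python
from itertools import combinations
from collections import Counter

def pair_rules(baskets, min_support=0.0):
    """Support and confidence for two-item association rules.

    `baskets` is a list of item sets, one per transaction. For each
    rule A -> B: support = share of baskets containing both items,
    confidence = count(A and B) / count(A).
    """
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = set(basket)
        item_count.update(items)
        for a, b in combinations(sorted(items), 2):
            pair_count[(a, b)] += 1
    rules = []
    for (a, b), joint in pair_count.items():
        support = joint / n
        if support < min_support:
            continue
        # emit the rule in both directions; confidence differs by direction
        rules.append((a, b, support, joint / item_count[a]))
        rules.append((b, a, support, joint / item_count[b]))
    return rules
```

If two of four baskets contain both bread and milk, and bread appears in three baskets, then the rule bread -> milk has support 0.5 and confidence 2/3.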
The MultiPlot tool is a visualization tool that enables you to explore large volumes of data graphically. The MultiPlot tool automatically creates bar charts and scatter plots for the input and target variables. The code created by this tool can be used to create graphs in a batch environment.

The Path Analysis tool enables you to analyze Web log data to determine the paths that visitors take as they navigate through a Web site. You can also use the tool to perform sequence analysis.
The SOM/Kohonen tool performs unsupervised learning by using Kohonen vector quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs with Nadaraya-Watson or local-linear smoothing. Kohonen VQ is a clustering method, whereas SOMs are primarily dimension-reduction methods. For cluster analysis, the Clustering tool is recommended instead of Kohonen VQ or SOMs.

The StatExplore tool is a multipurpose tool used to examine variable distributions and statistics in your data sets. The tool generates summarization statistics. You can use the StatExplore tool to do the following:
• select variables for analysis, for profiling clusters, and for predictive models
• compute standard univariate distribution statistics
• compute standard bivariate statistics by class target and class segment
• compute correlation statistics for interval variables by interval input and target

The Variable Clustering tool is useful for data reduction, such as choosing the best variables or cluster components for analysis. Variable clustering removes collinearity, decreases variable redundancy, and helps to reveal the underlying structure of the input variables in a data set.

The Variable Selection tool enables you to evaluate the importance of input variables in predicting or classifying the target variable. To select the important inputs, the tool uses either an R-squared or a chi-squared selection criterion. The R-squared criterion enables you to remove variables in hierarchies, remove variables that have large percentages of missing values, and remove class variables that are based on the number of unique values. The variables that are not related to the target are set to a status of rejected. Although rejected variables are passed to subsequent tools in the process flow diagram, these variables are not used as model inputs by more detailed modeling tools, such as the Neural Network and Decision Tree tools. You can reassign the input model status to rejected variables.
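The core idea of an R-squared selection criterion is to measure each input's one-at-a-time worth against the target and reject inputs whose worth falls below a cutoff. A simplified Python sketch of that screening step, under the assumption of interval inputs and target (the function names and the 0.05 cutoff are illustrative; the Variable Selection tool's actual criterion is richer, handling hierarchies, interactions, and class variables):

```python
def r_squared(x, y):
    """Squared Pearson correlation of one (non-constant) input with
    the target: a simple one-at-a-time worth measure."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def screen_inputs(inputs, target, cutoff=0.05):
    """Rank inputs by R-squared with the target and mark those below
    the cutoff as rejected. Rejected inputs would still be passed
    downstream, just not used by the modeling tools."""
    worth = {name: r_squared(col, target) for name, col in inputs.items()}
    return {name: ("input" if w >= cutoff else "rejected")
            for name, w in worth.items()}
```

An input that tracks the target keeps the input role; one that merely oscillates around its mean is set to rejected.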
SEMMA - Modify Tab
• Drop
• Impute
• Interactive Binning
• Principal Components
• Replacement
• Rules Builder
• Transform Variables
The Drop tool is used to remove variables from scored data sets. You can remove all variables with the role type that you specify, or you can manually specify individual variables to drop. For example, you could remove all hidden, rejected, and residual variables from your exported data set, or you could remove only a few variables that you identify yourself.

The Impute tool enables you to replace values for observations that have missing values. You can replace missing values for interval variables with the mean, median, midrange, mid-minimum spacing, or distribution-based replacement, or you can use a replacement M-estimator such as Tukey's biweight, Huber's, or Andrew's wave. You can also estimate the replacement values for each interval input by using a tree-based imputation method. Missing values for class variables can be replaced with the most frequently occurring value, distribution-based replacement, tree-based imputation, or a constant.

The Interactive Binning tool is an interactive grouping tool that you use to model nonlinear functions of multiple modes of continuous distributions. The tool computes initial bins by quantiles; you can then interactively split and combine the initial bins. You use the Interactive Binning node to create bins, buckets, or classes of all input variables, both class and interval. You can create bins to reduce the number of unique levels as well as to attempt to improve the predictive power of each input.

The Principal Components tool calculates eigenvalues and eigenvectors from the uncorrected covariance matrix, corrected covariance matrix, or the correlation matrix of input variables. Principal components are calculated from the eigenvectors and are usually treated as the new set of input variables for successor modeling tools. A principal components analysis is useful for data interpretation and data dimension reduction.
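Two of the simplest replacement strategies mentioned above, the median for interval variables and the most frequent level for class variables, can be sketched in a few lines of Python (an illustration of the idea only; the Impute tool offers many more methods, and the function name is invented):

```python
from statistics import median
from collections import Counter

def impute(values, method="median"):
    """Fill in missing values (represented here as None) in one column.

    "median" suits interval variables; "mode" (the most frequently
    occurring level) suits class variables.
    """
    observed = [v for v in values if v is not None]
    if method == "median":
        fill = median(observed)
    elif method == "mode":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]
```

In practice the replacement statistic is computed on the training data and then applied unchanged to the validation and test data, so that the holdout sets do not leak into the fit.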
The Replacement tool enables you to reassign and consolidate levels of categorical inputs. This can improve the performance of predictive models.

The Rules Builder tool opens the Rules Builder window so that you can create ad hoc sets of rules with user-definable outcomes. You can interactively define the values of the outcome variable and the paths to the outcome. This is useful in ad hoc rule creation, such as applying logic for posterior probabilities and scorecard values.
The Transform Variables tool enables you to create new variables that are transformations of existing variables in your data. Transformations are useful when you want to improve the fit of a model to the data. For example, transformations can be used to stabilize variances, remove nonlinearity, improve additivity, and correct nonnormality in variables. The Transform Variables tool supports various transformation methods. The available methods depend on the type and the role of a variable.
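One of the most common transformations for a right-skewed input is the log. The sketch below illustrates the effect by measuring sample skewness before and after; this is a conceptual Python illustration, not the Transform Variables tool's code, and the offset of 1 (log(x + 1), which keeps zero values defined) is one convention among several.

```python
import math

def skewness(values):
    """Sample skewness (third standardized moment): positive for a
    long right tail, near zero for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

def log_transform(values, offset=1.0):
    """log(x + offset): compresses large values, which pulls in the
    long right tail of a skewed input."""
    return [math.log(v + offset) for v in values]
```

Applied to an input with one extreme value, the transformation leaves the ordering of observations intact but reduces the skewness, which often improves the fit of models that assume roughly symmetric inputs.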
SEMMA - Model Tab
• AutoNeural
• Decision Tree
• Dmine Regression
• DMNeural
• Ensemble
• Gradient Boosting
• Least Angle Regression
• MBR
• Model Import
• Neural Network
• Partial Least Squares
• Regression
• Rule Induction
• TwoStage
The AutoNeural tool can be used to automatically configure a neural network. It conducts limited searches for a better network configuration.

The Decision Tree tool enables you to perform multiway splitting of your database based on nominal, ordinal, and continuous variables. The tool supports both automatic and interactive training. When you run the Decision Tree tool in automatic mode, it automatically ranks the input variables based on the strength of their contributions to the tree. This ranking can be used to select variables for use in subsequent modeling. In addition, dummy variables can be generated for use in subsequent modeling. You can override any automatic step with the option to define a splitting rule and prune explicit nodes or subtrees. Interactive training enables you to explore and evaluate a large set of trees as you develop them.

The Dmine Regression tool performs a regression analysis on data sets that have a binary or interval level target variable. The Dmine Regression tool computes a forward stepwise least squares regression. In each step, an independent variable is selected that contributes maximally to the model R-square value. The tool can compute all two-way interactions of classification variables, and it can also use AOV16 variables to identify nonlinear relationships between interval variables and the target variable. In addition, the tool can use group variables to reduce the number of levels of classification variables.

Note: If you want to create a regression model on data that contains a nominal or ordinal target, then you would use the Regression tool.

The DMNeural tool is another modeling tool that you can use to fit a nonlinear model. The nonlinear model uses transformed principal components as inputs to predict a binary or an interval target variable.
The Ensemble tool creates new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor models. The new model is then used to score new data. One common ensemble approach is to use multiple modeling methods, such as a neural network and a decision tree, to obtain separate models from the same training data set. The component models from the two complementary modeling methods are integrated by the Ensemble tool to form the final model solution. It is important to note that the ensemble model can be more accurate than the individual models only if the individual models disagree with one another. You should always compare the model performance of the ensemble model with that of the individual models. You can compare models using the Model Comparison tool.

The Gradient Boosting tool uses a partitioning algorithm described in "Greedy Function Approximation: A Gradient Boosting Machine" and "Stochastic Gradient Boosting" by Jerome Friedman. A partitioning algorithm searches for an optimal partition of the data defined in terms of the values of a single variable. The optimality criterion depends on how another variable, the target, is distributed into the partition segments. When the target values are more similar within the segments, the worth of the partition is greater. Most partitioning algorithms further partition each segment in a process called recursive partitioning. The partitions are then combined to create a predictive model. The model is evaluated by goodness-of-fit statistics defined in terms of the target variable. These statistics are different from the measure of worth of an individual partition. A good model might result from many mediocre partitions.

The Least Angle Regression (LARS) tool can be used for both input variable selection and model fitting. When used for variable selection, the LAR algorithm chooses input variables in a continuous fashion that is similar to forward selection.
The basis of variable selection is the magnitude of the candidate inputs' estimated coefficients as they grow from zero to the least squares estimate. Either a LARS or LASSO algorithm can be used for model fitting. See Efron et al. (2004) and Hastie, Tibshirani, and Friedman (2001) for further details.

The Memory-Based Reasoning (MBR) tool is a modeling tool that uses a k-nearest neighbor algorithm to categorize or predict observations. The k-nearest neighbor algorithm takes a data set and a probe, where each observation in the data set consists of a set of variables and the probe has one value for each variable. The distance between an observation and the probe is calculated. The k observations that have the smallest distances to the probe are the k-nearest neighbors to that probe. In SAS Enterprise Miner, the k-nearest neighbors are determined by the Euclidean distance between an observation and the probe. Based on the target values of the k-nearest neighbors, each of the k-nearest neighbors votes on the target value for a probe. The votes are the posterior probabilities for the class target variable.

The Model Import tool imports and assesses a model that was not created by one of the SAS Enterprise Miner modeling nodes. You can then use the Assessment node to compare the user-defined models with models that you developed with a SAS Enterprise Miner modeling node. This process is called integrated assessment.

The Neural Network tool enables you to construct, train, and validate multilayer feedforward neural networks. In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network tool supports many variations of this general form.

The Partial Least Squares tool models continuous and binary targets based on the SAS/STAT PLS procedure.
The Partial Least Squares node produces DATA step score code and standard predictive model assessment results.
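The k-nearest neighbor vote that underlies the MBR tool is compact enough to sketch directly. The Python below is a conceptual illustration under simplifying assumptions (interval inputs only, Euclidean distance, no distance weighting); the function name is invented and this is not MBR's implementation.

```python
import math
from collections import Counter

def knn_classify(data, probe, k=3):
    """k-nearest-neighbor vote for a class target.

    Each observation in `data` is (inputs, target_level); `probe` is a
    vector with one value per input variable. The k observations
    closest to the probe in Euclidean distance vote, and the vote
    shares serve as posterior probabilities for the target levels.
    """
    def dist(x):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, probe)))
    neighbors = sorted(data, key=lambda obs: dist(obs[0]))[:k]
    votes = Counter(target for _, target in neighbors)
    return {level: count / k for level, count in votes.items()}
```

A probe deep inside one class gets a posterior of 1.0 for that class; a probe near the boundary between classes gets a split vote such as 2/3 versus 1/3.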
The Regression tool enables you to fit both linear and logistic regression models to your data. You can use continuous, ordinal, and binary target variables. You can use both continuous and discrete variables as inputs. The tool supports the stepwise, forward, and backward selection methods. The interface enables you to create higher-order modeling terms such as polynomial terms and interactions.

The Rule Induction tool enables you to improve the classification of rare events in your modeling data. The Rule Induction tool creates a Rule Induction model that uses split techniques to remove the largest pure split node from the data. Rule induction also creates binary models for each level of a target variable and ranks the levels from the rarest event to the most common.

The TwoStage tool enables you to model a class target and an interval target. The interval target variable is usually the value that is associated with a level of the class target. For example, the binary variable PURCHASE is a class target that has two levels, Yes and No, and the interval variable AMOUNT can be the value target that represents the amount of money that a customer spends on the purchase.
SEMMA - Assess Tab
• Cutoff
• Decisions
• Model Comparison
• Segment Profile
• Score
The Cutoff tool provides tabular and graphical information to assist users in determining appropriate probability cutoff point(s) for decision making with binary target models. The establishment of a cutoff decision point entails the risk of generating false positives and false negatives, but an appropriate use of the Cutoff node can help minimize those risks.

The Decisions tool enables you to define target profiles for a target that produces optimal decisions. The decisions are made using a user-specified decision matrix and output from a subsequent modeling procedure.

The Model Comparison tool provides a common framework for comparing models and predictions from any of the modeling tools. The comparison is based on the expected and actual profits or losses that would result from implementing the model. The tool produces several charts that help to describe the usefulness of the model, such as lift charts and profit/loss charts.

The Segment Profile tool enables you to examine segmented or clustered data and identify factors that differentiate data segments from the population. The tool generates various reports that aid in exploring and comparing the distribution of these factors within the segments and population.

The Score tool enables you to manage, edit, export, and execute scoring code that is generated from a trained model. Scoring is the generation of predicted values for a data set that might not contain a target variable. The Score tool generates and manages scoring formulas in the form of a single SAS DATA step, which can be used in most SAS environments even without the presence of SAS Enterprise Miner. The Score tool can also generate C score code and Java score code.
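Conceptually, exported score code is a self-contained block of assignment statements that turns a new record's inputs into a predicted probability, with the fitted coefficients baked in. The sketch below shows that idea in Python rather than as a SAS DATA step; the input names and coefficients are invented for illustration and do not come from any real model.

```python
import math

def score(income, age):
    """What exported score code for a logistic model boils down to:
    plain assignments that apply frozen, fitted coefficients.
    The coefficients below are hypothetical, for illustration only.
    """
    linpred = -2.0 + 0.03 * income + 0.01 * age   # hypothetical linear predictor
    p_event = 1.0 / (1.0 + math.exp(-linpred))    # logistic link -> probability
    return p_event
```

Because the function depends only on its inputs and embedded constants, it can run anywhere, which is the same property that lets DATA step, C, or Java score code execute outside SAS Enterprise Miner.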
Beyond SEMMA — Utility Tab

• Control Point
• End Groups
• Metadata
• Reporter
• SAS Code
• Start Groups
The Control Point tool enables you to establish a control point to reduce the number of connections that are made in process flow diagrams. For example, suppose that three Input Data Source tools are to be connected to three modeling tools. If no Control Point tool is used, then nine connections are required to connect all of the Input Data Source tools to all of the modeling tools. However, if a Control Point tool is used, only six connections are required.

The End Groups tool is used only in conjunction with the Start Groups tool. The End Groups node acts as a boundary marker that defines the end-of-group processing operations in a process flow diagram. Group processing operations are performed on the portion of the process flow diagram that exists between the Start Groups node and the End Groups node.

The Metadata tool enables you to modify the columns metadata information at some point in your process flow diagram. You can modify attributes such as roles, measurement levels, and order.

The Reporter tool uses SAS Output Delivery System (ODS) capability to create a single PDF or RTF file that contains information about the open process flow diagram. The PDF or RTF documents can be viewed and saved directly and are included in SAS Enterprise Miner report package files.

The SAS Code tool enables you to incorporate new or existing SAS code into process flow diagrams. The ability to write SAS code enables you to include additional SAS procedures in your data mining analysis. You can also use a SAS DATA step to create customized scoring code, to conditionally process data, and to concatenate or merge existing data sets. The tool provides a macro facility to dynamically reference the data sets used for training, validation, testing, or scoring, and variables, such as input, target, and predict variables. After you run the SAS Code tool, the results and the data sets can then be exported for use by subsequent tools in the diagram.
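As a sketch of how the SAS Code tool's macro facility might be used, consider the following hypothetical node program. The macro variable names &EM_IMPORT_DATA and &EM_EXPORT_TRAIN are assumed here to be the node's references to its imported and exported training tables; treat the whole fragment as an illustration rather than canonical node code:

```sas
/* Hypothetical SAS Code node program (a sketch, not node-generated code). */
data &EM_EXPORT_TRAIN;            /* table the node passes downstream */
   set &EM_IMPORT_DATA;           /* table the node received upstream */
   /* derive a custom input, guarding against nonpositive values */
   if GiftAvgAll > 0 then LogGiftAvgAll = log(GiftAvgAll);
   else LogGiftAvgAll = .;
run;
```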
The Start Groups tool is useful when your data can be segmented or grouped, and you want to process the grouped data in different ways. The Start Groups node uses BY-group processing as a method to process observations from one or more data sources that are grouped or ordered by values of one or more common variables. BY variables identify the variable or variables by which the data source is indexed, and BY statements process data and order output according to the BY-group values.
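BY-group processing itself is ordinary SAS: sort by the grouping variable(s), then add a BY statement to the analysis step. A minimal sketch, using hypothetical table names:

```sas
/* BY-group processing sketch: one analysis per group value. */
proc sort data=work.donors out=work.donors_sorted;
   by DemGender;                  /* grouping variable */
run;

proc means data=work.donors_sorted n mean;
   by DemGender;                  /* repeat the analysis for each BY group */
   var GiftAvgAll;
run;
```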
Credit Scoring Tab (Optional)

• Credit Exchange
• Interactive Grouping
• Reject Inference
• Scorecard
The optional Credit Scoring tab provides functionality related to credit scoring.

The Credit Scoring for SAS Enterprise Miner solution is not included with the base version of SAS Enterprise Miner. If your site did not license Credit Scoring for SAS Enterprise Miner, the Credit Scoring tab and its associated tools do not appear in your SAS Enterprise Miner software.
The Credit Exchange tool enables you to exchange the data that is created in SAS Enterprise Miner with the SAS Credit Risk Management solution.

The Interactive Grouping tool creates groupings, or classes, of all input variables. (This includes both class and interval input variables.) You can create groupings in order to reduce the number of unique levels as well as attempt to improve the predictive power of each input. Along with creating group levels for each input, the Interactive Grouping tool creates Weight of Evidence (WOE) values.

The Reject Inference tool uses the model that was built using the accepted applications to score the rejected applications in the retained data. The observations in the rejected data set are classified as inferred "goods" and inferred "bads." The inferred observations are added to the Accepts data set that contains the actual "good" and "bad" records, forming an augmented data set. This augmented data set then serves as the input data set of a second credit-scoring modeling run. During the second modeling run, attribute classification is readjusted and the regression coefficients are recalculated to compensate for the data set augmentation.

The Scorecard tool enables you to rescale the logit scores of binary prediction models to fall within a specified range.
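Rescaling a logit score to scorecard points is conventionally done with a points-to-double-the-odds transformation. The following sketch shows that arithmetic; the base score, base odds, and PDO values are illustrative choices, not Scorecard node defaults:

```sas
/* Illustrative scorecard scaling of a logit (log-odds) score. */
data work.scorecard;
   set work.modeled;                 /* assumed to contain logit = log(odds) */
   pdo       = 20;                   /* points that double the odds (assumed) */
   base      = 600;                  /* score assigned at the base odds       */
   base_odds = 50;                   /* odds of "good" at the base score      */
   factor = pdo / log(2);
   offset = base - factor*log(base_odds);
   score  = offset + factor*logit;   /* final scorecard points */
run;
```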
(A process diagram here depicts the applied analytics cycle. The recoverable task labels are: Define analytic objective; Select cases; Extract input data; Validate input data; Repair input data; Transform input data; Apply analysis; Gather results; Assess observed results; Generate deployment methods; Integrate deployment; Refine analytic objective.)
SAS Enterprise Miner Analytic Strengths

• Pattern Discovery
• Predictive Modeling
The analytic strengths of SAS Enterprise Miner lie in the realm traditionally known as data mining. There are a great number of analytic methods identified with data mining, but they usually fall into two broad categories: pattern discovery and predictive modeling (Hand 2005). SAS Enterprise Miner provides a variety of pattern discovery and predictive modeling tools.

Chapters 2 through 7 demonstrate SAS Enterprise Miner's predictive modeling capabilities. In those chapters, you use a rich collection of modeling tools to create, evaluate, and improve predictions from prepared data. Special topics in predictive modeling are featured in Chapter 9.

Chapter 8 introduces some of SAS Enterprise Miner's pattern discovery tools. In this chapter, you see how to use SAS Enterprise Miner to graphically evaluate and transform prepared data. You use SAS Enterprise Miner's tools to cluster and segment an analysis population. You can also choose to try SAS Enterprise Miner's market basket analysis and sequence analysis capabilities.
Applied Analytics Case Studies

• Bank usage segmentation
• Web services associations
• Credit risk scoring
• University enrollment prediction
Appendix A further illustrates SAS Enterprise Miner's analytic capabilities with case studies drawn from real-world business applications.

• A consumer bank sought to segment its customers based on historic usage patterns. Segmentation was to be used for improving contact strategies in the Marketing Department.
• A radio station developed a Web site to provide such services to its audience as podcasts, news streams, music streams, archives, and live Web music performances. The station tracked usage of these services by URL. Analysts at the station wanted to see whether any unusual patterns existed in the combinations of services selected.
• A bank sought to use performance on an in-house subprime credit product to create an updated risk model. The risk model was to be combined with other factors to make future credit decisions.
• The administration of a large private university requested that the Office of Enrollment Management and the Office of Institutional Research work together to identify prospective students who would most likely enroll as freshmen.
Chapter 2  Accessing and Assaying Prepared Data

2.1  Introduction  2-3

2.2  Creating a SAS Enterprise Miner Project, Library, and Diagram  2-5
     Demonstration: Creating a SAS Enterprise Miner Project  2-6
     Demonstration: Creating a SAS Library  2-10
     Demonstration: Creating a SAS Enterprise Miner Diagram  2-13
     Exercises  2-14

2.3  Defining a Data Source  2-15
     Demonstration: Defining a Data Source  2-18
     Exercises  2-35

2.4  Exploring a Data Source  2-36
     Demonstration: Exploring Source Data  2-37
     Demonstration: Changing the Explore Window Sampling Defaults  2-64
     Exercises  2-66
     Demonstration: Modifying and Correcting Source Data  2-67

2.5  Chapter Summary  2-80
2.1 Introduction

Analysis Element Organization

• Projects
• Libraries and Diagrams
• Process Flows
• Nodes
Analyses in SAS Enterprise Miner start by defining a project. A project is a container for a set of related analyses. After a project is defined, you define libraries and diagrams. A library is a collection of data sources accessible by the SAS Foundation Server. A diagram holds analyses for one or more data sources. Process flows are the specific steps you use in your analysis of a data source. Each step in a process flow is indicated by a node. The nodes represent the SAS Enterprise Miner tools that are available to the user.
A core set of tools is available to all SAS Enterprise Miner users. Additional tools can be added by licensing additional SAS products or by creating SAS Enterprise Miner extensions.
Before you can create a process flow, you need to define the following items:
• a SAS Enterprise Miner project to contain the data sources, diagrams, and model packages
• one or more SAS libraries inside your project, linking SAS Enterprise Miner to analysis data
• a diagram workspace to display the steps of your analysis
A SAS administrator can define these for you in advance of using SAS Enterprise Miner, or you can define them yourself.
The demonstrations show you how to create the items in the list above.
Behind the scenes (on the SAS Foundation Server where the project resides), the organization is more complicated. Fortunately, the SAS Enterprise Miner client shields you from this complexity. At the top of the hierarchy is the SAS Enterprise Miner project, corresponding to a directory on (or accessible by) the SAS Foundation. When a project is defined, four subdirectories are automatically created within the project directory: Datasources, Reports, System, and Workspaces. The SAS Enterprise Miner client, via the SAS Foundation, handles the writing, reading, and maintenance of these directories.
In both the personal workstation and SAS Enterprise Miner client configurations, all references to computing resources such as LIBNAME statements and access to SAS Foundation technologies must be made from the selected SAS Foundation Server's perspective. This is important because data and directories that are visible to the client machine might not be visible to the server.
Projects are defined to contain diagrams, the next level of the SAS Enterprise Miner organizational hierarchy. Diagrams usually pertain to a single analysis theme or project. When a diagram is defined, a new subdirectory is created in the Workspaces directory of the corresponding project. Each diagram is independent, and no information can be passed from one diagram to another.

Libraries are defined using a LIBNAME statement. You can create a library using the SAS Enterprise Miner Library Wizard, using SAS Management Console, or by using the Start-Up Code window when you define a project. Any data source compatible with a LIBNAME statement can provide data for SAS Enterprise Miner.

Specific analyses in SAS Enterprise Miner occur in process flows. A process flow is a sequence of tasks or nodes connected by arrows in the user interface, and it defines the order of analysis. The organization of a process flow is contained in a file, EM DGRAPH, which is stored in the diagram directory. Each node in the diagram corresponds to a separate subdirectory in the diagram directory. Information in one process flow can be sent to another by connecting the two process flows.
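Whichever route you choose, the result is equivalent to submitting a one-line LIBNAME statement. For example, the course-data library defined later in this chapter corresponds to:

```sas
/* Assigns the AAEM61 libref to the course-data directory.
   Adjust the path if your data is installed elsewhere. */
libname AAEM61 "C:\workshop\winsas\aaem61";
```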
2.2 Creating a SAS Enterprise Miner Project, Library, and Diagram

Your first task when you start an analysis is creating a SAS Enterprise Miner project, data library, and diagram. Often these are set up by a SAS administrator (or reused from a previous analysis), but knowing how to set up these items yourself is fundamental to learning about SAS Enterprise Miner. Before attempting the processes described in the following demonstrations, you need to log on to your computer, and start and log on to the SAS Enterprise Miner client program.
Creating a SAS Enterprise Miner Project

A SAS Enterprise Miner project contains materials related to a particular analysis task. These materials include analysis process flows, intermediate analysis data sets, and analysis results. To define a project, you must specify a project name and the location of the project on the SAS Foundation Server. Follow the steps below to create a new SAS Enterprise Miner project.
1. Select File ⇒ New ⇒ Project from the main menu. The Create New Project wizard opens at Step 1 of 4, Select SAS Server.
   In this configuration of SAS Enterprise Miner, the only server available for processing is the host server listed in the wizard (SASApp — Logical Workspace Server).

2. Select Next >.

3. Name the project. Step 2 of the Create New Project wizard is used to specify the following information:
   • the name of the project you are creating
   • the location of the project

4. Type a project name, for example, My Project, in the Name field.
   The path specified by the SAS Server Directory field is the physical location where the project folder will be created.

5. Select Next >.
   If you have an existing project directory with the same name and location as specified, this project will be added to the list of available projects in SAS Enterprise Miner. This technique can be used to import a project created by another installation of SAS Enterprise Miner.

6. Select a location for the project's metadata.
   The SAS folder, My Folder, is in a WebDAV directory. This is where the metadata associated with the project is stored. This folder can be accessed and modified using SAS Management Console.

7. Select Next >. Information about your project is summarized in Step 4.

8. To finish defining the project, select Finish.

The SAS Enterprise Miner client application opens the project that you created.
Creating a SAS Library

A SAS library connects SAS Enterprise Miner with the raw data sources, which are the basis of your analysis. A library can link to a directory on the SAS Foundation Server, a relational database, or even an Excel workbook. To define a library, you need to know the name and location of the data structure that you want to link with SAS Enterprise Miner, in addition to any associated options, such as user names and passwords. Follow the steps below to create a new SAS library.

1. Select File ⇒ New ⇒ Library… from the main menu. The Library Wizard opens at Step 1 of 3, Select Action, with the Create New Library action selected.

2. Select Next >. The Library Wizard is updated to show Step 2 of 3, Create or Modify.

3. Type AAEM61 in the Name field.

4. Type the path C:\workshop\winsas\aaem61 in the Path field.
   If your data is installed in a different directory, specify this directory in the Path field.

5. Select Next >. The Library Wizard window is updated to show Step 3 of 3, Confirm Action. The Confirm Action window shows the name, type, and path of the created SAS library.

6. Select Finish. All data available in the SAS library can now be used by SAS Enterprise Miner.
Creating a SAS Enterprise Miner Diagram

A SAS Enterprise Miner diagram workspace contains and displays the steps involved in your analysis. To define a diagram, you need only specify its name. Follow the steps below to create a new SAS Enterprise Miner diagram workspace.

1. Select File ⇒ New ⇒ Diagram… from the main menu.

2. Type the name Predictive Analysis in the Diagram Name field and select OK. SAS Enterprise Miner creates an analysis workspace window labeled Predictive Analysis.

You use the Predictive Analysis window to create process flow diagrams.
1. Creating a Project, Defining a Library, and Creating an Analysis Diagram
   a. Use the steps demonstrated on pages 2-6 to 2-8 to create a project for your SAS Enterprise Miner analyses on your computer.
   b. Use the steps demonstrated on pages 2-10 to 2-12 to define a SAS library to access the course data from your project.
      Your instructor will provide the path to the data if it is different than that specified in these notes.
   c. Use the steps demonstrated on page 2-13 to create an analysis diagram in your project.
2.3 Defining a Data Source

Defining a Data Source

• Select table.
• Define variable roles.
• Define measurement levels.
• Define table role.

(The slide depicts the analysis data table being selected from the SAS Foundation Server libraries.)
After you define a new project and diagram, your next analysis task in SAS Enterprise Miner is usually to create an analysis data source. A data source is a link between an existing SAS table and SAS Enterprise Miner. To define a data source, you need to select the analysis table and define metadata appropriate to your analysis task. The selected table must be visible to the SAS Foundation Server via a predefined SAS library. Any table found in a SAS library can be used, including those stored in formats external to SAS, such as tables from a relational database. (To define a SAS library to such tables might require SAS/ACCESS product licenses.)

The metadata definition serves three primary purposes for a selected data set. It informs SAS Enterprise Miner of the following:
• the analysis role of each variable
• the measurement level of each variable
• the analysis role of the data set

The analysis role of each variable tells SAS Enterprise Miner the purpose of the variable in the current analysis. The measurement level of each variable distinguishes continuous numeric variables from categorical variables. The analysis role of the data set tells SAS Enterprise Miner how to use the selected data set in the analysis. All of this information must be defined in the context of the analysis at hand. Other (optional) metadata definitions are specific to certain types of analyses. These are discussed in the context of those analyses.
Any data source that is defined for use in SAS Enterprise Miner should be largely ready for analysis. Usually, most of the data preparation work is completed before a data source is defined. It is possible (and useful for documentation purposes) to write the completed data preparation process in a SAS Code node embedded in SAS Enterprise Miner. Those details are outside the scope of this discussion.
Charity Direct Mail Demonstration

Analysis goal: A veterans' organization seeks continued contributions from lapsing donors. Use lapsing-donor responses from an earlier campaign to predict future lapsing-donor responses.
To demonstrate defining a data source, and later, using SAS Enterprise Miner's predictive modeling tools, consider the following specific analysis example:

A national veterans' organization seeks to better target its solicitations for donation. By only soliciting the most likely donors, less money will be spent on solicitation efforts and more money will be available for charitable concerns. Solicitations involve sending a small gift to an individual and include a request for a donation. Gifts to donors include mailing labels and greeting cards.

The organization has more than 3.5 million individuals in its mailing database. These individuals are classified by their response behaviors to previous solicitation efforts. Of particular interest is the class of individuals identified as lapsing donors. These individuals made their most recent donation between 12 and 24 months ago. The organization seeks to rank its lapsing donors based on their responses to a greeting card mailing sent in June of 1997. (The charity calls this the 97NK Campaign.) With this ranking, a decision can be made to either solicit or ignore a lapsing individual in the June 1998 campaign.
Analysis data:
• extracted from previous year's campaign
• sample balances response/non-response rate
• actual response rate approximately 5%
The source of this data is the Association for Computing Machinery's (ACM) 1998 KDD-Cup competition. The data set and other details of the competition are publicly available at the UCI KDD Archive at kdd.ics.uci.edu. For model development, the data were sampled to balance the response and non-response rates. (The reason and consequences of this action are discussed in later chapters.) In the original campaign, the response rate was approximately 5%.

The following demonstrations show how to access the 97NK campaign data in SAS Enterprise Miner. The process is divided into three parts:
• specifying the source data
• setting the columns metadata
• finalizing the data source specification
Defining a Data Source

Specifying Source Data

A data source links SAS Enterprise Miner to an existing analysis table. To specify a data source, you need to define a SAS library and know the name of the table that you will link to SAS Enterprise Miner. Follow these steps to specify a data source.

1. Select File ⇒ New ⇒ Data Source… from the main menu. The Data Source Wizard opens at Step 1 of 7, Metadata Source.

The Data Source Wizard guides you through a seven-step process to create a SAS Enterprise Miner data source. Step 1 tells SAS Enterprise Miner where to look for initial metadata values. The default and typical choice is the SAS table that you will link to in the next step.
2.3 Defining a Data Source 2.
Select Next > to use a SAS
219
table (the common choice) as the source for the metadata.
The Data Source Wizard continues to Step 2 of 7 Select a SAS Table. S f Data Source Wizard —Step 2 of 7 Select a SAS Table
Select a SAS table
Browse.
Table:
•
< Back
3.
Cancel
In this step, select the SAS table that you want to make available to SAS Enterprise Miner. You can either type the library name and SAS table name as lihname.
taJblename or select the SAS table
from a list. 4.
Select Browse... to choose a SAS table from the libraries that are visible to the SAS Foundation Server. The Select a SAS
Table window opens.
One of the libraries listed is named AAEM61, which is the library name defined in the Library Wizard.

5. Double-click the Aaem61 library. The panel on the right is updated to show the contents of the AAEM61 library.

6. Select the Pva97nk SAS table.
7. Select OK. The Select a SAS Table window closes and the selected table, AAEM61.PVA97NK, appears in the Table field.

8. Select Next >. The Data Source Wizard proceeds to Step 3 of 7, Table Information.
The Table Properties panel lists the table name (AAEM61.PVA97NK), the member type and data set type, the engine, the number of variables, the number of observations, and the created and modified dates.
This step of the Data Source Wizard provides basic information about the selected table. The SAS table PVA97NK is used in this chapter and subsequent chapters to demonstrate the predictive modeling tools of SAS Enterprise Miner. As seen in the Data Source Wizard — Step 3 of 7 Table Information window, the table contains 9,686 cases and 28 variables.
Defining Column Metadata

With a data set specified, your next task is to set the column metadata. To do this, you need to know the modeling role and proper measurement level of each variable in the source data set. Follow these steps to define the column metadata:

1. Select Next >. The Data Source Wizard proceeds to Step 4 of 7, Metadata Advisor Options. Use the Basic setting to set the initial measurement levels and roles based on the variable attributes. Use the Advanced setting to set the initial measurement levels and roles based on both the variable attributes and distributions.

This step of the Data Source Wizard starts the metadata definition process. SAS Enterprise Miner assigns initial values to the metadata based on characteristics of the selected SAS table. The Basic setting assigns initial values to the metadata based on variable attributes such as the variable name, data type, and assigned SAS format. The Advanced setting assigns initial values to the metadata in the same way as the Basic setting, but it also assesses the distribution of each variable to better determine the appropriate measurement level.

2. Select Next > to use the Basic setting.
2.3 Defining a Data Source
223
The Data Source Wizard proceeds to Step 5 of 7 Column Metadata.

[Screenshot: Data Source Wizard, Step 5 of 7 Column Metadata. With the Basic setting, every variable except ID (role ID) and TargetB (role Target) is assigned the role Input. Measurement levels are Interval for numeric variables and Nominal for character variables.]
The Data Source Wizard displays its best guess for the metadata assignments. This guess is based on the name and data type of each variable. The correct values for model role and measurement level are found in the PVA97NK metadata table on the next page. A comparison of the currently assigned metadata to that in the PVA97NK metadata table shows several discrepancies. While the assigned modeling roles are mostly correct, the assigned measurement levels for several variables are in error. It is possible to improve the default metadata assignments by using the Advanced option in the Metadata Advisor.

3. Select < Back in the Data Source Wizard. This returns you to Step 4 of 7 Metadata Advisor Options.

4. Select the Advanced option.
PVA97NK Metadata Table

Name              Model Role  Measurement Level  Description
DemAge            Input       Interval           Age
DemCluster        Input       Nominal            Demographic Cluster
DemGender         Input       Nominal            Gender
DemHomeOwner      Input       Binary             Home Owner
DemMedHomeValue   Input       Interval           Median Home Value Region
DemMedIncome      Input       Interval           Median Income Region
DemPctVeterans    Input       Interval           Percent Veterans Region
GiftAvg36         Input       Interval           Gift Amount Average 36 Months
GiftAvgAll        Input       Interval           Gift Amount Average All Months
GiftAvgCard36     Input       Interval           Gift Amount Average Card 36 Months
GiftAvgLast       Input       Interval           Gift Amount Last
GiftCnt36         Input       Interval           Gift Count 36 Months
GiftCntAll        Input       Interval           Gift Count All Months
GiftCntCard36     Input       Interval           Gift Count Card 36 Months
GiftCntCardAll    Input       Interval           Gift Count Card All Months
GiftTimeFirst     Input       Interval           Time Since First Gift
GiftTimeLast      Input       Interval           Time Since Last Gift
ID                ID          Nominal            Control Number
PromCnt12         Input       Interval           Promotion Count 12 Months
PromCnt36         Input       Interval           Promotion Count 36 Months
PromCntAll        Input       Interval           Promotion Count All Months
PromCntCard12     Input       Interval           Promotion Count Card 12 Months
PromCntCard36     Input       Interval           Promotion Count Card 36 Months
PromCntCardAll    Input       Interval           Promotion Count Card All Months
StatusCat96NK     Input       Nominal            Status Category 96NK
StatusCatStarAll  Input       Binary             Status Category Star All Months
TargetB           Target      Binary             Target Gift Flag
TargetD           Rejected    Interval           Target Gift Amount
5. Select Next > to use the Advanced setting. The Data Source Wizard again proceeds to Step 5 of 7 Column Metadata.
[Screenshot: Data Source Wizard, Step 5 of 7 Column Metadata, after the Advanced setting. DemCluster now has the role Rejected, DemHomeOwner has the level Binary, and several numeric inputs have the level Nominal.]
While many of the default metadata settings are correct, several items need to be changed. For example, the DemCluster variable is rejected (for having too many distinct values), and several numeric inputs have their measurement level set to Nominal instead of Interval (for having too few distinct values). To avoid the time-consuming task of making metadata adjustments, go back to the previous Data Source Wizard step and customize the Metadata Advisor.

6. Select < Back. You return to the Metadata Advisor Options window.
7. Select Customize.... The Advanced Advisor Options dialog box opens.
[Screenshot: Advanced Advisor Options dialog box]
  Missing Percentage Threshold               50
  Reject Vars with Excessive Missing Values  Yes
  Class Levels Count Threshold               20
  Detect Class Levels                        Yes
  Reject Levels Count Threshold              20
  Reject Vars with Excessive Class Values    Yes

  Missing Percentage Threshold: Specify a maximum percentage of missing values for variables to be rejected. The default value is 50.
Using the default Advanced options, the Metadata Advisor can do the following:
• reject variables with an excessive number of missing values (default=50%)
• detect the number of class levels of numeric variables and assign a measurement level of Nominal to those with class counts below the selected threshold (default=20)
• detect the number of class levels of character variables and assign a role of Rejected to those with class counts above the selected threshold (default=20)
Note: In the PVA97NK table, there are several numeric variables with fewer than 20 distinct values that should not be treated as nominal. Similarly, there is one class variable with more than 20 levels that should not be rejected. To avoid changing many metadata values in the next step of the Data Source Wizard, you should alter these defaults.
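The three default rules above can be restated as a small Python sketch. This is an illustration of the rules, not the SAS Enterprise Miner implementation; the function and argument names are invented.

```python
def advise(values, is_numeric, missing_pct_threshold=50,
           class_levels_threshold=20, reject_levels_threshold=20):
    """Apply the three default Advanced Advisor rules to one variable."""
    n = len(values)
    missing = sum(v is None for v in values)
    if 100.0 * missing / n > missing_pct_threshold:
        return ("Rejected", None)              # excessive missing values
    levels = len(set(v for v in values if v is not None))
    if is_numeric and levels < class_levels_threshold:
        return ("Input", "Nominal")            # numeric, but few class levels
    if not is_numeric and levels > reject_levels_threshold:
        return ("Rejected", None)              # high-cardinality class variable
    return ("Input", "Interval" if is_numeric else "Nominal")
```

With these defaults, a numeric count variable holding only a handful of distinct values is marked Nominal, and a character variable such as DemCluster, with many levels, is rejected, which is exactly the behavior the customization in the next steps works around.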
8. Type 2 as the Class Levels Count Threshold value so that only binary numeric variables are treated as categorical variables.

[Screenshot: Advanced Advisor Options dialog box with Class Levels Count Threshold set to 2]
  Class Levels Count Threshold: If "Detect class levels"=Yes, interval variables with less than the number specified for this property will be marked as NOMINAL. The default value is 20.
9. Type 100 as the Reject Levels Count Threshold value so that only character variables with more than 100 distinct values are rejected.

Note: Be sure to press ENTER after you type the number 100. Otherwise, the value might not be registered in the field.

[Screenshot: Advanced Advisor Options dialog box with Class Levels Count Threshold set to 2 and Reject Levels Count Threshold set to 100]
  Reject Levels Count Threshold: Specify a maximum number of levels for a class variable before being marked REJECTED. The default value is 20.

10. Select OK to close the Advanced Advisor Options dialog box.
11. Select Next > to proceed to Step 5 of the Data Source Wizard.

[Screenshot: Data Source Wizard, Step 5 of 7 Column Metadata, after customizing the Metadata Advisor. The roles and measurement levels now match the PVA97NK metadata table, except that TargetB and TargetD both have the role Target.]
A comparison of the Column Metadata table to the table at the beginning of the demonstration shows that most of the metadata is correctly defined. SAS Enterprise Miner correctly inferred the model roles for the non-input variables by their names. The measurement levels are correctly defined by using the Advanced Metadata Advisor.

The analysis of the PVA97NK data in this course focuses on the TargetB variable, so the TargetD variable should be rejected.
12. Select Role ⇨ Rejected for TargetD.

[Screenshot: Data Source Wizard, Step 5 of 7 Column Metadata, with the role of TargetD changed to Rejected]
In summary, Step 5 of 7 Column Metadata is usually the most time-consuming of the Data Source Wizard steps. You can use the following tips to reduce the amount of time required to define metadata for SAS Enterprise Miner predictive modeling data sets:
• Only include variables that you intend to use in the modeling process in your raw data source.
• For variables that are not inputs, use variable names that start with the intended role. For example, an ID variable should start with ID and a target variable should start with Target.
• Inputs that are to have a nominal measurement level should have a character data type.
• Inputs that are to be interval must have a numeric data type.
• Customize the Metadata Advisor to have a Class Level Count set equal to 2 and a Reject Levels Count set equal to a number greater than the maximum cardinality (level count) of your nominal inputs.
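The naming-convention tip can be made concrete with a toy helper. This is hypothetical Python, not anything SAS Enterprise Miner runs, but it shows why the convention lets non-input roles be inferred mechanically from variable names.

```python
def role_from_name(name):
    """Infer a model role from a variable-name prefix (illustrative only)."""
    lowered = name.lower()
    if lowered.startswith("id"):
        return "ID"
    if lowered.startswith("target"):
        return "Target"
    return "Input"   # everything else is assumed to be a model input
```

Applied to the PVA97NK names, this recovers the roles in the metadata table: ID maps to ID, TargetB and TargetD map to Target, and the demographic, gift, and promotion variables map to Input.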
Finalizing the Data Source Specification

Follow these steps to complete the data source specification process:

1. Select Next > to proceed to Decision Configuration.

Note: The Data Source Wizard gained an extra step due to the presence of a categorical (binary, ordinal, or nominal) target variable.

[Screenshot: Data Source Wizard, Step 6 of 9 Decision Configuration]
  Decision Processing: Do you want to build models based on the values of the decisions? If you answer yes, you may enter information on the cost or profit of each possible decision, prior probability, and cost function. The data will be scanned for the distributions of the target variables.
  ( ) Yes   (•) No
When you define a predictive modeling data set, it is important to properly configure decision processing. In fact, obtaining meaningful models often requires using these options. The PVA97NK table was structured so that reasonable models are produced without specifying decision processing. However, this might not be the case for data sources that you will encounter outside this course. Because you need to understand how to set these options, a detailed discussion of decision processing is provided in Chapter 6, "Model Assessment."

Note: Do not select Yes here because that changes the default settings for subsequent analysis steps and yields results that diverge from those in the course notes.
2. Select Next >. By skipping the decision processing step, you reach the next step of the Data Source Wizard.

[Screenshot: Data Source Wizard, Step 7 of 8 Data Source Attributes]
  You may change the name and the role, and can specify a population segment identifier for the data source to be created.
  Name: PVA97NK    Role: Raw    Segment: (blank)
This penultimate step enables you to set a role for the data source and add descriptive comments about the data source definition. For the upcoming analysis, a table role of Raw is acceptable.
3. The final step in the Data Source Wizard provides summary details about the data table that you created. Select Finish.

[Screenshot: Data Source Wizard, Step 8 of 8 Summary]
  Metadata Completed.  Library: AAEM61  Table: PVA97NK  Role: Raw

  Role      Level     Count
  ID        Nominal       1
  Input     Binary        2
  Input     Interval     20
  Input     Nominal       3
  Rejected  Interval      1
  Target    Binary        1
The PVA97NK data source is added to the Data Sources entry in the Project panel.

[Screenshot: SAS Enterprise Miner Project panel with the PVA97NK data source listed under Data Sources, and its table properties (library AAEM61, table type DATA, 9686 observations, 28 columns, role Raw) shown in the Properties panel]

4. Select the PVA97NK data source to obtain table properties in the SAS Enterprise Miner Properties panel.
Exercises
2. Creating a SAS Enterprise Miner Data Source Use the steps demonstrated on pages 218 to 234 to create a SAS Enterprise Miner data source from the PVA97NK data.
2.4 Exploring a Data Source

As stated in Chapter 1 and noted in Section 2.3, the task of data assembly largely occurs outside of SAS Enterprise Miner. However, it is quite worthwhile to explore and validate your data's content. By assaying the prepared data, you substantially reduce the chances of erroneous results in your analysis, and you can gain insights graphically into associations between variables. In this exploration, you should look for sampling errors, unexpected or unusual data values, and interesting variable associations. The next demonstrations illustrate SAS Enterprise Miner tools that are useful for data validation.
Exploring Source Data

SAS Enterprise Miner can construct interactive plots to help you explore your data. This demonstration shows the basic features of the Explore window. These include the following:
• opening the Explore window
• changing the Explore window sample size
• creating a histogram for a single variable
• changing graph properties for a histogram
• changing chart axes
• adding a missing bin to a histogram
• adding plots to the Explore window
• exploring variable associations
Opening the Explore Window

There are several ways to access the Explore window. Use these steps to open the Explore window through the Project panel.

1. Open the Data Sources folder in the Project panel and right-click the data source of interest. The Data Source Option menu appears.

[Screenshot: Project panel with the Data Source Option menu open, showing Rename, Duplicate, Delete, Edit Variables..., Refresh Metadata, Edit Decisions..., and Explore...]
2. Select Explore... from the Data Source Option menu. If you are using an installation of SAS Enterprise Miner with factory default settings, the Large Data Constraint alert window opens.

    Row limit of 2,000 observations reached.

By default, a maximum of 2000 observations are transferred from the SAS Foundation Server to the SAS Enterprise Miner client. This represents about a fifth of the PVA97NK table.

3. Select OK to close the Large Data Constraint alert window. The Explore - AAEM61.PVA97NK window opens.

[Screenshot: Explore - AAEM61.PVA97NK window. The Sample Properties show Rows 9686, Columns 28, Sample Method Top, Fetch Size Default, Fetched Rows 2000, and Random Seed 12345; the fetched rows appear in a data table below.]
The Explore window features a 2000-observation sample from the PVA97NK data source. Sample properties are shown in the top half of the window, and a data table is shown in the bottom half.
Changing the Explore Sample Size

The Sample Method property indicates that the sample is drawn from the top (first 2000 rows) of the data set. Use these steps to change the sampling properties in the Explore window.

Note: Although selecting a sample through this method is quick to execute, fetching the top rows of a table might not produce a representative sample of the table.

1. Left-click the Sample Method value field. The Option menu lists two choices: Top (the current setting) and Random.

2. Select Random from the Option menu.

[Screenshot: Explore - AAEM61.PVA97NK window with the Sample Method property set to Random]
3. Select Actions ⇨ Apply Sample Properties from the Explore window menu. A new, random sample of 2000 observations is made. This 2000-row sample now has distributional properties that are similar to the original 9686-observation table. This gives you an idea about the general characteristics of the variables. If your goal is to examine the data for potential problems, it is wise to examine the entire data set. SAS Enterprise Miner enables you to increase the sample transferred to the client (up to a maximum of 30,000 observations). See the SAS Enterprise Miner Help file to learn how to increase this maximum value.
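The difference between the Top and Random sample methods is easy to see in ordinary code. The sketch below uses only Python's standard library, with a seed of 12345 mirroring the Random Seed shown in the Sample Properties; the integer row stand-ins are invented for illustration.

```python
import random

rows = list(range(9686))                 # stand-ins for the 9,686 PVA97NK rows

top_sample = rows[:2000]                 # Sample Method = Top: first 2000 rows
rng = random.Random(12345)               # Random Seed property
random_sample = rng.sample(rows, 2000)   # Sample Method = Random: 2000 drawn
                                         # without replacement from the whole table
```

If the table happens to be sorted (by age, by gift amount, and so on), the Top sample can be badly unrepresentative, while the random sample draws from the entire table.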
4. Select the Fetch Size property and select Max from the Option menu.

5. Select Actions ⇨ Apply Sample Properties. Because there are fewer than 30,000 observations, the entire PVA97NK table is transferred to the SAS Enterprise Miner client machine, as indicated by the Fetched Rows field.

[Screenshot: Explore - AAEM61.PVA97NK window with Fetch Size Max and Fetched Rows 9686]
Creating a Histogram for a Single Variable

While you can use the Explore window to browse a data set, its primary purpose is to create statistical analysis plots. Use these steps to create a histogram in the Explore window.

1. Select Actions ⇨ Plot from the Explore window menu. The Chart wizard opens to the Select a Chart Type step.

[Screenshot: Select a Chart Type step, listing Scatter, Line, Histogram, Density, Box, Tables, Matrix, Lattice, Parallel Axis, and Constellation chart types]

The Chart wizard enables the construction of a multitude of analysis charts. This demonstration focuses on histograms.
2. Select Histogram.

[Screenshot: Select a Chart Type step with Histogram selected. The description states that a histogram provides statistical relations between X values, which are grouped as bins or discrete values.]

Histograms are useful for exploring the distribution of values in a variable.
3. Select Next >. The Chart wizard proceeds to the next step, Select Chart Roles. A message at the top of the window reads "Missing required roles: X."

[Screenshot: Select Chart Roles step, listing the data set variables with their types, descriptions, and formats]

To draw a histogram, one variable must be selected to have the role X.
4. Select Role ⇨ X for the DEMAGE variable.

[Screenshot: Select Chart Roles step with the role X assigned to DEMAGE]

The Chart wizard is ready to make a histogram of the DEMAGE variable.
5. Select Finish. The Explore window is filled with a histogram of the DEMAGE variable.

Note: Variable descriptions, rather than variable names, are used to label the axes of plots in the Explore window.

[Screenshot: Explore window histogram of Age, with ten bins spanning 0.0 to 87.0]

Axes in Explore window plots are chosen to range from the minimum to the maximum values of the plotted variable. Here you can see that Age has a minimum value of 0 and a maximum value of 87. The mode occurs in the ninth bin, which ranges between about 70 and 78. Frequency tells you that there are about 1400 observations in this range.
Changing the Graph Properties for a Histogram

By default, a histogram in SAS Enterprise Miner has 10 bins and is scaled to show the entire range of data. Use these steps to change the number of bins in a histogram and change the range of the axes. While the default bin size is sufficient to show the general shape of a variable's distribution, it is sometimes useful to increase the number of bins to improve the histogram's resolution.

1. Right-click in the data area of the Age histogram and select Graph Properties... from the Option menu. The Properties-Histogram window opens.
[Screenshot: Properties-Histogram window, Histogram tab. Bin options include Show Missing Bin, Bin Axis, End Labels, Bin to Data Range, Boundary (Upper), Number of X Bins (10), and Number of Y Bins (10); color options include Show Outline and Gradient.]
This window enables you to change the appearance of your charts. For histograms, the most important appearance property (at least in a statistical sense) is the number of bins.
2. Type 87 into the Number of X Bins field.

[Screenshot: Properties-Histogram window with Number of X Bins set to 87]
Because Age is integer-valued and the original distribution plot had a maximum of 87, there will be one bin per possible Age value.
3. Select OK. The Explore window reopens and shows many more bins in the Age histogram.

[Screenshot: Explore window histogram of Age with 87 bins, horizontal axis labeled 0 through 87 in steps of 3]
With the increase in resolution, unusual features become apparent in the Age variable. For example, there are unexpected spikes in the histogram at 10-year intervals, starting at Age=7. Also, you must question the veracity of ages below 18 for donors to the charity.
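Spikes like these (age "heaping" at certain values) can also be screened for programmatically. The sketch below uses made-up counts purely to illustrate the idea: it flags an age whose count is well above the average of its neighbors, with a factor of 2 as an arbitrary illustrative threshold.

```python
from collections import Counter

# Invented ages with a deliberate spike at 37 (not the PVA97NK data).
ages = [30] * 10 + [31] * 11 + [32] * 9 + [37] * 40 + [38] * 10

counts = Counter(ages)

def is_spike(age, factor=2.0):
    """True if this age's count far exceeds the average of its neighbors."""
    neighbors = [counts.get(age - 1, 0), counts.get(age + 1, 0)]
    baseline = sum(neighbors) / 2 or 1       # avoid dividing by zero
    return counts.get(age, 0) > factor * baseline
```

Run over every observed age, such a check surfaces the same 10-year-interval anomalies that the high-resolution histogram reveals visually.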
Changing Chart Axes

A very useful feature of the Explore window is the ability to zoom in on data of interest. Use the following steps to change chart axes in the Explore window:

1. Position your cursor under the histogram until the cursor appears as a magnifying glass.

2. Clicking with your mouse and dragging the cursor to the right magnifies the horizontal axis of the histogram.

[Screenshot: Age histogram magnified to show ages 32 through 60]
3. Position your cursor below the histogram but above the Age axis label. A horizontal scroll bar appears.

[Screenshot: Age histogram with a horizontal scroll bar, showing ages 33 through 64]
4. Click and drag the scroll bar to the left to translate the horizontal axis to the left.

[Screenshot: Age histogram translated to show ages 0 through 36]
5. Position your cursor to the left of the histogram to make a similar adjustment to the vertical axis of the histogram.

[Screenshot: Age histogram with the vertical axis magnified, showing frequencies up to about 40 for ages 0 through 34]

At this resolution, you can see that there are approximately 40 observations with Age less than 8. A review of the data preparation process might help you determine whether these values are valid.
6. Select the yellow triangle icon in the lower right corner of the Explore window to reset the axes to their original ranges.

[Screenshot: Age histogram restored to its full range, ages 0 through 87]
Adding a "Missing" Bin to a Histogram

Not all observations appear in the histogram for Age. There are many observations with missing values for this variable. Follow these steps to add a missing value bin to the Age histogram:

1. Right-click on the graph and select Graph Properties... from the Option menu.

2. Check the Show Missing Bin option.

[Screenshot: Properties-Histogram window with Show Missing Bin checked and Number of X Bins still set to 87]
3. Select OK. The Age histogram is modified to show a missing value bin.

[Screenshot: Age histogram with a missing value bin at the far left containing roughly 2400 observations]

With the missing value bin added, it is easy to see that nearly a quarter of the observations are missing an Age value.
Adding Plots to the Explore Window

You can add other plots to the Explore window. Follow these steps to add a pie chart of the target variable:

1. Select Actions ⇨ Plot from the Explore window menu. The Chart wizard opens to the Select a Chart Type step.

2. Scroll down in the chart list and select a Pie chart.

[Screenshot: Select a Chart Type step with Pie selected. The description reads that a pie chart shows each category's contribution to a sum; other listed types include Lattice, Parallel Axis, Constellation, 3D Charts, Contour, Bar, Needle, and Vector.]
3. Select Next >. The Chart wizard continues to the Select Chart Roles step.

[Screenshot: Select Chart Roles step with the message "Missing required roles: Category."]

The message at the top of the Select Chart Roles window states that a variable must be assigned the Category role.
4. Scroll the variable list and select Role ⇨ Category for the TARGETB variable.

5. Select Finish. The Explore window displays a pie chart of the TARGETB variable.

[Screenshot: Pie chart of TARGETB, split into two equal halves]

The chart shows an equal number of cases for TARGETB=0 (top) and TARGETB=1 (bottom).
6. Select Window ⇒ Tile to simultaneously view all subwindows of the Explore window.
The tiled display shows the sample properties, the Age histogram, the TargetB pie chart, and the data table at the same time.
Exploring Variable Associations

All elements of the Explore window are connected. When you select a bar in one histogram, for example, the corresponding observations are also selected in the data table and other plots. Follow these steps to use this feature to explore variable associations:

1. Double-click the Age histogram title bar so that it fills the Explore window.
2. Click and drag a rectangle in the Age histogram to select cases with Age in excess of 70 years.
The selected cases are crosshatched. (The vertical axis is rescaled to show the selection better.)
3. Double-click the Age histogram title bar. The tile display is restored.
Notice that part of the TargetB pie chart is selected. This selection shows the relative proportion of observations with Age greater than 70 that do and do not donate. Because the arc on the TARGETB=1 segment is slightly thicker, it appears that there are slightly more donors than nondonors in this Age selection.
4. Double-click the TargetB pie chart title bar to confirm this observation.
5. Close the Explore window to return to the SAS Enterprise Miner client interface.
Changing the Explore Window Sampling Defaults
In the previous demonstration, you were alerted to the fact that only a sample of data was initially selected for use in the Explore window. 
A dialog box displayed the message "Row limit of 2,000 observations reached."
Follow these steps to change the preference settings of SAS Enterprise Miner to use a random sample or all of the data source data in the Explore window:

1. Select Options ⇒ Preferences… from the main menu. The Preferences window opens.
The window lists properties that include Sample Method, Fetch Size, and Random Seed under Interactive Sampling.
2. Select Sample Method ⇒ Random.
The random sampling method improves on the default method (which fetches rows from the top of the data set) by guaranteeing that the Explore window data is representative of the original data source. The only negative aspect is an increase in processing time for extremely large data sources.

3. Select Fetch Size ⇒ Max.
The Max fetch size enables a larger sample of data to be extracted for use in the Explore window. If you use these settings, the Explore window uses the entire data set or a random sample of up to 30,000 observations (whichever is smaller).
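Outside of SAS Enterprise Miner, the fetch behavior described above — use the whole table when it fits, otherwise draw a seeded random sample of up to a fixed maximum — can be sketched in a few lines of Python. This is an illustration only; the 30,000-row cap, the seed 12345, and the 9,686-row PVA97NK count come from the settings above, but the function itself is not the product's implementation.

```python
import random

def fetch_sample(rows, max_fetch=30000, seed=12345):
    """Return every row if the table fits within the fetch size;
    otherwise draw a seeded (repeatable) random sample of max_fetch rows."""
    if len(rows) <= max_fetch:
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, max_fetch)

table = list(range(9686))            # PVA97NK has 9,686 rows, below the cap
print(len(fetch_sample(table)))      # -> 9686: the entire data set is used
```

Because the seed is fixed, repeated fetches of the same large table return the same sample, which is what makes exploration results reproducible.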
Exercises
3. Using the Explore Window to Study Variable Distribution

Use the Explore window to study the distribution of the variable Median Income Region (DemMedIncome). Answer the following questions:

a. What is unusual about the distribution of this variable?
b. What could cause this anomaly to occur?
c. What do you think should be done to rectify the situation?
Modifying and Correcting Source Data
In the previous exercise, the DemMedIncome variable was seen to have an unusual spike at 0. This phenomenon often occurs in data extracted from a relational database table, where 0 or another number is used as a substitute for the value missing or unknown. Clearly, having zero income is considerably different from having an unknown income. To use the income variable properly in a predictive model, this discrepancy must be addressed.

This demonstration shows you how to replace a placeholder value for missing with a true missing value indicator. In this way, SAS Enterprise Miner tools can correctly respond to the true, but unknown, value. SAS Enterprise Miner includes several tools that you can use to modify the source data for your analysis. The following demonstrations show how to use the Replacement node to modify incorrect or improper values for a variable.
Process Flow Setup

Use the following steps to set up the process flow that will modify the DemMedIncome variable:

1. Drag the PVA97NK data source to the Predictive Analysis workspace window.
2. Select the Modify tab to access the Modify tool group.
3. Drag the Replacement tool from the Tools Palette into the Predictive Analysis workspace window.
4. Connect the PVA97NK data source to the Replacement node by clicking near the right side of the PVA97NK node and dragging an arrow to the left side of the Replacement node.
You created a process flow, which is the method that SAS Enterprise Miner uses to carry out analyses. The process flow, at this point, reads the raw PVA97NK data and replaces the unwanted values of the observations. You must, however, specify which variables have unwanted values and what the correct values are. To do this, you must change the settings of the Replacement node.
Changing the Replacement Node Properties

Use the following steps to modify the default settings of the Replacement node:

1. Select the Replacement node and examine the Properties panel.
The Properties panel displays the analysis methods used by the node when it is run. By default, the node replaces all interval variables whose values are more than three standard deviations from the variable mean.

Note: You can control the number of standard deviations by selecting the Cutoff Values property.
In this demonstration, you want to replace the value for DemMedIncome only when it equals zero. Thus, you need to change the default setting.

2. Select the Default Limits Method property and select None from the Options menu.
You want to replace improper values with missing values. To do this, you need to change the Replacement Value property.

3. Select the Replacement Value property and select Missing from the Options menu.
You are now ready to specify the variables that you want to replace.

4. Select the Interval Variables: Replacement Editor ellipsis (…) from the Replacement node Properties panel.
Note: Be careful to open the Replacement Editor for Interval Variables, not for Class Variables.
The Interactive Replacement Interval Filter window opens.
f " Mining
~~ Label Name
DemAge DemMedHomeValue DemMedlncome DemPcTVeterans GiftAvg36 GifiAvqAII Gift.a.vciCarrJ36 GirtAvgLast GiftCnt36 GiftCntAII GiftCntCard36 GiftCntCardAII GiftTirneFirst GlftTimeLast PromCntl 2 PrornCnt36 PromCntAll PromCntCardl 2 PromCntCard36 PromCntCardAII farqetD
Use Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default DefauH
<l Generate Summary
Report
to
Mo 4o 4o . MO Mo Mo Mo Mo Mo Mo No No No Mo No No No NO MO Mo
F Statistics
P Basic Limit Method
Lower Limit
Upper Limit 
default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default Default
•
• • • •
D
b
£
b P P P P
b • ' • • • •
b P b P
•
D 0 D D
•
to
 b
_J OK
Cancel
274
5.
Chapter 2 Accessing and Assaying Prepared Data
Select User Specified as the Limit Method value for DemMedlncome. BE Interactive Replacement Interval Filter :olumns:
6. Type 1 as the Lower Limit value for DemMedIncome.
If you use this specification, any DemMedIncome values that fall below the lower limit of 1 are set to missing. All other values of this variable do not change.

7. Select OK to close the Interactive Replacement Interval Filter window.
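The rule configured above — values below a lower limit become missing, everything else is untouched — can be sketched as a minimal Python function. The income values below are invented for illustration; this is not the Replacement node's code, only the same rule expressed directly.

```python
import math

def replace_below_limit(values, lower_limit=1):
    """Mimic the configured Replacement rule: values below the lower
    limit become missing (NaN); all other values are unchanged."""
    return [float('nan') if v < lower_limit else v for v in values]

incomes = [38750, 0, 71509, 0, 92514]   # invented DemMedIncome-style values
cleaned = replace_below_limit(incomes)
# Display NaN as a dot, the way SAS shows a missing numeric value:
print([v if not math.isnan(v) else '.' for v in cleaned])
# -> [38750, '.', 71509, '.', 92514]
```

As in the demonstration, the zero placeholders become missing while every legitimate income passes through unchanged.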
Running the Analysis and Viewing the Results

Use these steps to run the process flow that you created.

1. Right-click the Replacement node and select Run from the menu. A Confirmation window appears and asks you to verify the run action ("Do you want to run this path?").
2. Select Yes to close the Confirmation window. A small animation in the lower right corner of each node indicates analysis activity in the node. The Run Status window opens when the process flow run is complete.
3. Select Results… to review the analysis outcome. The Results - Node: Replacement Diagram: Predictive Analysis window opens.
The Total Replacement Counts window shows that 2357 observations were modified by the Replacement node. The Interval Variables window summarizes the replacement that was conducted. The Output window provides more or less the same information as the Total Replacement Counts window and the Interval Variables window, but it is presented as a static text file.

4. Close the Results window.
Examining Exported Data

In a SAS Enterprise Miner process flow diagram, each node takes in data, analyzes it, creates a result, and exports a possibly modified version of the imported data. While the report gives the analysis results in the abstract, it is good practice to see the actual effects of an analysis step on the exported data. This enables you to validate your expectations at each step of the analysis. Use these steps to examine the data exported from the Replacement node.

1. Select the Replacement node in your process flow diagram.
2. Select the Exported Data ellipsis (…) from the Properties panel.
The Exported Data - Replacement window appears.
This window lists the types of data sets (TRAIN, VALIDATE, TEST, SCORE, TRANSACTION, and DOCUMENT) that can be exported from a SAS Enterprise Miner process flow node. As indicated, only a TRAIN data set exists at this stage of the analysis.
3. Select the TRAIN table from the Exported Data - Replacement window.
4. Select Explore… to open the Explore window again.
5. Scroll completely to the right in the data table.
A new column is added to the analysis data: Replacement: DemMedIncome. Notice that the values of this variable match the Median Income Region variable, except when the original variable's value equals zero. The replaced zero value is shown by a dot (.), which indicates a missing value.

6. Close the Explore and Exported Data windows to complete this part of the analysis.
2.5 Chapter Summary

Data Access Tools Review

• Link existing analysis data sets to SAS Enterprise Miner.
• Set variable metadata.
• Explore variable distribution characteristics.
• Remove unwanted cases from analysis data.
Analyses in SAS Enterprise Miner are organized hierarchically into projects, data libraries, diagrams, process flows, and analysis nodes. This chapter demonstrated the basics of creating each of these elements. Most process flows begin with a data source node. To access a data source, you must define a SAS library. After a library is defined, a data source is created by linking a SAS table and associated metadata to SAS Enterprise Miner. After a data source is defined, you can assay the underlying cases using SAS Enterprise Miner Explore tools. Care should be taken to ensure that the sample of the data source that you explore is representative of the original data source. You can use the Replacement node to modify variables that were incorrectly prepared. After all necessary modifications, the data is ready for subsequent analysis.
Chapter 3 Introduction to Predictive Modeling: Decision Trees

3.1 Introduction
    Demonstration: Creating Training and Validation Data
3.2 Cultivating Decision Trees
    Demonstration: Constructing a Decision Tree Predictive Model
3.3 Optimizing the Complexity of Decision Trees
    Demonstration: Assessing a Decision Tree
3.4 Understanding Additional Diagnostic Tools (Self-Study)
    Demonstration: Understanding Additional Plots and Tables (Optional)
3.5 Autonomous Tree Growth Options (Self-Study)
    Exercises
3.6 Chapter Summary
3.7 Solutions
    Solutions to Exercises
    Solutions to Student Activities (Polls/Quizzes)
3.1 Introduction
Predictive Modeling: The Essence of Data Mining

"Most of the big payoff [in data mining] has been in predictive modeling." - Herb Edelstein
Predictive modeling has a long history of success in the field of data mining. Models are built from historical event records and are used to predict future occurrences of these events. The methods have great potential for monetary gain. Herb Edelstein states in a 1997 interview (Beck 1997): "I also want to stress that data mining is successfully applied to a wide range of problems. The beer-and-diapers type of problem is known as association discovery or market basket analysis, that is, describing something that has happened in existing data. This should be differentiated from predictive techniques. Most of the big payoff stuff has been in predictive modeling."

Predicting the future has obvious financial benefit, but, for several reasons, the ability of predictive models to do so should not be overstated. First, the models depend on a property known statistically as stationarity, meaning that the statistical properties of the process that generates the data do not change over time. Unfortunately, many processes that generate events of interest are not stationary enough to enable meaningful predictions. Second, predictive models are often directed at predicting events that occur rarely and, in the aggregate, often look somewhat random. Even the best predictive model can only extract rough trends from these inherently noisy processes.
Predictive Modeling Applications

Applications of predictive modeling are limited only by your imagination. However, the models created by SAS Enterprise Miner typically apply to one of the following categories:
• Database marketing applications include offer response, up-sell, cross-sell, and attrition models.
• Financial risk management models attempt to predict monetary events such as credit default, loan prepayment, and insurance claims.
• Fraud detection methods attempt to detect or impede illegal activity involving financial transactions.
• Process monitoring applications detect deviations from the norm in manufacturing, financial, and security processes.
• Pattern detection models are used in applications ranging from handwriting analysis to medical diagnostics.
Predictive Modeling: Training Data

Training data case: categorical or numeric input and target measurements
To begin using SAS Enterprise Miner for predictive modeling, you should be familiar with standard terminology. Predictive modeling (also known as supervised prediction or supervised learning) starts with a training data set. The observations in a training data set are known as training cases (also known as examples, instances, or records). The variables are called inputs (also known as predictors, features, explanatory variables, or independent variables) and targets (also known as a response, outcome, or dependent variable). For a given case, the inputs reflect your state of knowledge before measuring the target.

The measurement scale of the inputs and the target can vary. The inputs and the target can be numeric variables, such as income. They can be nominal variables, such as occupation. They are often binary variables, such as a positive or negative response concerning home ownership.
Predictive Model

Predictive model: a concise representation of the input and target association
The purpose of the training data is to generate a predictive model. The predictive model is a concise representation of the association between the inputs and the target variables.
Predictions: output of the predictive model given a set of input measurements
The outputs of the predictive model are called predictions. Predictions represent your best guess for the target given a set of input measurements. The predictions are based on the associations learned from the training data by the predictive model.
Modeling Essentials

• Predict new cases.
• Select useful inputs.
• Optimize complexity.
Predictive models are widely used and come in many varieties. Any model, however, must perform three essential tasks:
• provide a rule to transform a measurement into a prediction
• have a means of choosing useful inputs from a potentially vast number of candidates
• be able to adjust its complexity to compensate for noisy training data

The following slides illustrate the general principles involved in each of these essentials. Later sections of the course detail the approach used by specific modeling tools.
Modeling Essentials: Predict New Cases

The process of predicting new cases is examined first.
Three Prediction Types

inputs ⇒ prediction (decisions, rankings, estimates)
The training data is used to construct a model (rule) that relates the input variables to the original target variable. The predictions can be categorized into three distinct types:
• decisions
• rankings
• estimates
Decision Predictions

A predictive model uses input measurements to make the best decision for each case.
The simplest type of prediction is the decision. Decisions usually are associated with some type of action (such as classifying a case as a donor or a nondonor). For this reason, decisions are also known as classifications. Decision prediction examples include handwriting recognition, fraud detection, and direct mail solicitation.

Decision predictions usually relate to a categorical target variable. For this reason, they are identified as primary, secondary, and tertiary in correspondence with the levels of the target. By default, model assessment in SAS Enterprise Miner assumes decision predictions when the target variable has a categorical measurement level (binary, nominal, or ordinal).
Ranking Predictions

A predictive model uses input measurements to optimally rank each case.
Ranking predictions order cases based on the input variables' relationships with the target variable. Using the training data, the prediction model attempts to rank high-value cases higher than low-value cases. It is assumed that a similar pattern exists in the scoring data so that high-value cases receive high scores. The actual scores produced are inconsequential; only the relative order is important. The most common example of a ranking prediction is a credit score.

Ranking predictions can be transformed into decision predictions by making the primary decision for cases above a certain threshold and making secondary and tertiary decisions for cases below correspondingly lower thresholds. In credit scoring, cases with a credit score above 700 can be called good risks; those with a score between 600 and 700 can be intermediate risks; and those below 600 can be considered poor risks.
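The threshold rule described above can be sketched in a few lines of Python. The 700 and 600 cutoffs come from the credit-scoring example; the exact handling of scores that fall on a boundary is an assumption made for the sketch.

```python
def classify_risk(score, good_cutoff=700, intermediate_cutoff=600):
    """Turn a ranking (a credit score) into a decision using thresholds."""
    if score > good_cutoff:
        return "good"
    if score >= intermediate_cutoff:   # boundary handling is an assumption
        return "intermediate"
    return "poor"

print([classify_risk(s) for s in [720, 520, 470, 630]])
# -> ['good', 'poor', 'poor', 'intermediate']
```

Note that only the order of the scores relative to the cutoffs matters; rescaling every score by the same monotonic transformation (and the cutoffs with them) would produce identical decisions.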
Estimate Predictions

A predictive model uses input measurements to optimally estimate the target value.
Estimate predictions approximate the expected value of the target, conditioned on the input values. For cases with numeric targets, this number can be thought of as the average value of the target for all cases having the observed input measurements. For cases with categorical targets, this number might equal the probability of a particular target outcome.

Prediction estimates are most commonly used when their values are integrated into a mathematical expression. An example is two-stage modeling, where the probability of an event is combined with an estimate of profit or loss to form an estimate of unconditional expected profit or loss. Prediction estimates are also useful when you are not certain of the ultimate application of the model. Estimate predictions can be transformed into both decision and ranking predictions. When in doubt, use this option. Most SAS Enterprise Miner modeling tools can be configured to produce estimate predictions.
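The two-stage idea mentioned above — combining an estimated event probability with a profit figure — can be illustrated with a toy calculation. The formula and the dollar amounts here are invented for illustration; they are not Enterprise Miner's own profit computation.

```python
def expected_profit(p_event, profit_if_event, contact_cost):
    """Two-stage style estimate: event probability times the conditional
    profit, minus a fixed contact cost. (Illustrative formula only.)"""
    return p_event * profit_if_event - contact_cost

# A 5% response probability on a $100 order with a $2 mailing cost:
print(expected_profit(0.05, 100.0, 2.0))   # -> 3.0
```

Because the probability estimate enters a larger expression like this, an accurate estimate matters here in a way that it would not for a pure decision or ranking prediction.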
Modeling Essentials: Predict Review

To summarize the first model essential, when a model predicts a new case, the model generates a decision, a rank, or an estimate.
3.01 Quiz

Match the predictive modeling application to the decision type.

Modeling Applications: Loss Reserving, Risk Profiling, Credit Scoring, Fraud Detection, Revenue Forecasting, Voice Recognition

Decision Types: A. Decision, B. Ranking, C. Estimate

The quiz shows some predictive modeling examples and the decision types.
Modeling Essentials
Select useful inputs.
Consider the task of selecting useful inputs.
The Curse of Dimensionality
(figure: the same eight points shown in 1-D, 2-D, and 3-D space)
The dimension of a problem refers to the number of input variables (more accurately, degrees of freedom) that are available for creating a prediction. Data mining problems are often massive in dimension. The curse of dimensionality refers to the exponential increase in data required to densely populate space as the dimension increases. For example, the eight points fill the one-dimensional space but become more separated as the dimension increases. In a 100-dimensional space, they would be similar to distant galaxies. The curse of dimensionality limits your practical ability to fit a flexible model to noisy data (real data) when there are a large number of input variables. A densely populated input space is required to fit highly complex models. When you assess how much data is available for data mining, you must consider the dimension of the problem.
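A quick standard-library experiment illustrates the effect: the average nearest-neighbor distance among a fixed number of random points grows sharply with the dimension of the space. The point count and the dimensions tried are arbitrary choices for this sketch.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, dim, seed=0):
    """Average distance from each of n_points uniform random points
    in the unit hypercube to its nearest neighbor."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Distance to the closest other point.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

# Eight points become increasingly isolated as the dimension grows.
for d in (1, 2, 3, 100):
    print(d, round(mean_nearest_neighbor_distance(8, d), 3))
```

With only eight points, the separation in 100 dimensions dwarfs that in one dimension, mirroring the "distant galaxies" picture above.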
3.02 Quiz
How many inputs are in your modeling data? Mark your selection in the polling area.
Input Reduction
(figure: examples of input redundancy and irrelevancy)
Input selection, that is, reducing the number of inputs, is the obvious way to thwart the curse of dimensionality. Unfortunately, reducing the dimension is also an easy way to disregard important information. The two principal reasons for eliminating a variable are redundancy and irrelevancy.
Input Reduction - Redundancy
Input x2 has the same information as input x1.
A redundant input does not give any new information that was not already explained by other inputs. In the example above, knowing the value of input x1 gives you a good idea of the value of x2. For decision tree models, the modeling algorithm makes input redundancy a relatively minor issue. For other modeling tools, input redundancy requires more elaborate methods to mitigate the problem.
Input Reduction - Irrelevancy
Predictions change with input x4 but much less with input x3.
An irrelevant input does not provide information about the target. In the example above, predictions change with input x4 but not with input x3. For decision tree models, the modeling algorithm automatically ignores irrelevant inputs. Other modeling methods must be modified or rely on additional tools to properly deal with irrelevant inputs.
Modeling Essentials - Select Review
Select useful inputs: eradicate redundancies and irrelevancies.
To thwart the curse of dimensionality, a model must select useful inputs. You can do this by eradicating redundant and irrelevant inputs (or, more positively, selecting an independent set of inputs that are correlated with the target).
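One simple way to operationalize this screen is with pairwise correlations: drop inputs that are barely correlated with the target (irrelevant), then keep only one of any pair of highly correlated inputs (redundant). The thresholds and function names here are illustrative for this sketch; this is not how SAS Enterprise Miner selects inputs internally.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def select_inputs(inputs, target, relevance=0.1, redundancy=0.9):
    """Keep inputs correlated with the target, dropping near-duplicates.

    inputs: dict of name -> list of values. Thresholds are illustrative.
    """
    # Irrelevancy screen: drop inputs barely related to the target.
    relevant = {k: v for k, v in inputs.items()
                if abs(pearson(v, target)) >= relevance}
    # Redundancy screen: keep the first of any highly correlated pair.
    kept = {}
    for name, values in relevant.items():
        if all(abs(pearson(values, kv)) < redundancy for kv in kept.values()):
            kept[name] = values
    return list(kept)
```

A linear correlation screen is only a sketch; it can miss nonlinear relevance, which is one reason tree-based input selection (described later) is attractive.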
Modeling Essentials
Optimize complexity.
Finally, consider the task of optimizing model complexity.
Model Complexity
Fitting a model to data requires searching through the space of possible models. Constructing a model with good generalization requires choosing the right complexity. Selecting model complexity involves a tradeoff between bias and variance. An insufficiently complex model might not be flexible enough, which can lead to underfitting, that is, systematically missing the signal (high bias). A naive modeler might assume that the most complex model should always outperform the others, but this is not the case. An overly complex model might be too flexible, which can lead to overfitting, that is, accommodating nuances of the random noise in the particular sample (high variance). A model with the right amount of flexibility gives the best generalization.
Data Partitioning
Partition available data into training and validation sets.
A quote from the popular introductory data analysis textbook Data Analysis and Regression by Mosteller and Tukey (1977) captures the difficulty of properly estimating model complexity: "Testing the procedure on the data that gave it birth is almost certain to overestimate performance, for the optimizing process that chose it from among many possible procedures will have made the greatest use of any and all idiosyncrasies of those particular data...." In predictive modeling, the standard strategy for honest assessment of model performance is data splitting. A portion is used for fitting the model, that is, the training data set. The remaining data is separated for empirical validation. The validation data set is used for monitoring and tuning the model to improve its generalization. The tuning process usually involves selecting among models of different types and complexities. The tuning process optimizes the selected model on the validation data.
Because the validation data is used to select from a set of related models, reported performance will, on the average, be overstated. Consequently, a further holdout sample is needed for a final, unbiased assessment. The test data set has only one use, that is, to give a final honest estimate of generalization. Consequently, cases in the test set must be treated in the same way that new data would be treated. The cases cannot be involved in any way in the determination of the fitted prediction model. In practice, many analysts see no need for a final honest assessment of generalization. An optimal model is chosen using the validation data, and the model assessment measured on the validation data is reported as an upper bound on the performance expected when the model is deployed.
With small or moderate data sets, data splitting is inefficient; the reduced sample size can severely degrade the fit of the model. Computer-intensive methods, such as the cross validation and bootstrap methods, were developed so that all the data can be used for both fitting and honest assessment. However, data mining usually has the luxury of massive data sets.
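A stratified 50/50 split like the one built in the demonstration later in this section can be sketched as follows. The case layout (binary target as the last element of each case) and the seed are assumptions of this sketch, not the tool's internal representation.

```python
import random

def partition(cases, train_frac=0.5, seed=12345):
    """Randomly split cases into training and validation sets,
    stratified on the binary target (last element of each case),
    so both partitions keep roughly the same target proportions."""
    rng = random.Random(seed)
    train, validate = [], []
    for level in (0, 1):
        group = [c for c in cases if c[-1] == level]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        validate.extend(group[cut:])
    return train, validate
```

Splitting within each target level, rather than over the whole data set, is what keeps the zero/one proportions nearly identical across partitions.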
Predictive Model Sequence
Create a sequence of models with increasing complexity.
In SAS Enterprise Miner, the strategy for choosing model complexity is to build a sequence of similar models using the training component only. The models are usually ordered with increasing complexity.
Model Performance Assessment
Rate model performance using validation data.
(chart: validation assessment versus model complexity)
As Mosteller and Tukey caution in the earlier quotation, using performance on the training data set usually leads to selecting a model that is too complex. (The classic example is selecting linear regression models based on R square.) To avoid this problem, SAS Enterprise Miner selects the model from the sequence based on validation data performance.
Model Selection
Select the simplest model with the highest validation assessment.
(chart: validation assessment versus model complexity)
In keeping with Ockham's razor, the best model is the simplest model with the highest validation performance.
Modeling Essentials - Optimize Review
Optimize complexity: tune models with validation data.
To avoid overfitting (and underfitting) a predictive model, you should calibrate its performance with an independent validation sample. You can perform the calibration by partitioning the data into training and validation samples. In Chapter 2, you assigned roles and measurement levels to the PVA97NK data set and created an analysis diagram and a simple process flow. In the following demonstration, you partition the raw data into training and validation components. This sets the stage for using the Decision Tree tool to generate a predictive model.
Creating Training and Validation Data
A critical step in prediction is choosing among competing models. Given a set of training data, you can easily generate models that very accurately predict a target value from a set of input values. Unfortunately, these predictions might be accurate only for the training data itself. Attempts to generalize the predictions from the training data to an independent, but similarly distributed, sample can result in substantial accuracy reductions. To avoid this pitfall, SAS Enterprise Miner is designed to use a validation data set as a means of independently gauging model performance. Typically, the validation data set is created by partitioning the raw analysis data. Cases selected for training are used to build the model, and cases selected for validation are used to tune and compare models.
Note: SAS Enterprise Miner also enables the creation of a third partition, named the test data set. The test data set is meant for obtaining unbiased estimates of model performance from a single selected model.
This demonstration assumes that you completed the demonstrations in Chapter 2. Your SAS Enterprise Miner workspace should appear as shown below:
(screenshot: the Predictive Analysis diagram with the PVA97NK node connected to the Replacement node)
Follow these steps to partition the raw data into training and validation sets.
1. Select the Sample tool tab. The Data Partition tool is the second from the left.
2. Drag a Data Partition tool into the diagram workspace, next to the Replacement node.
3. Connect the Replacement node to the Data Partition node.
4. Select the Data Partition node and examine its Properties panel.
   Partitioning Method: Default
   Random Seed: 12345
   Data Set Allocations:
     Training: 40.0
     Validation: 30.0
     Test: 30.0
Use the Properties panel to select the fraction of data devoted to the Training, Validation, and Test partitions. By default, less than half the available data is used for generating the predictive models.
Note: There is a tradeoff in various partitioning strategies. More data devoted to training results in more stable predictive models, but less stable model assessments (and vice versa). Also, the Test partition is used only for calculating fit statistics after the modeling and model selection is complete. Many analysts regard this as wasteful of data. A typical partitioning compromise foregoes a Test partition and places an equal number of cases in Training and Validation.
5. Type 50 as the Training value in the Data Partition node.
6. Type 50 as the Validation value in the Data Partition node.
7. Type 0 as the Test value in the Data Partition node.
   Data Set Allocations:
     Training: 50.0
     Validation: 50.0
     Test: 0.0
Note: With smaller raw data sets, model stability can become an important issue. In this case, increasing the number of cases devoted to the Training partition is a reasonable course of action.
8. Right-click the Data Partition node and select Run.
9. Select Yes in the Confirmation window. SAS Enterprise Miner runs the Data Partition node.
Note: Only the Data Partition node should run, because it is the only "new" element in the process flow. In general, SAS Enterprise Miner only runs elements of diagrams that are new or that are affected by changes earlier in the process flow.
When the Data Partition process is complete, a Run Status window opens.
10. Select Results... in the Run Status window to view the results. The Results - Node: Data Partition window provides a basic metadata summary of the raw data that feeds the node and, more importantly, a frequency table that shows the distribution of the target variable, TargetB, in the raw, training, and validation data sets.
Summary Statistics for Class Targets

Data=DATA
Variable  Formatted Value  Frequency Count  Percent
TargetB   0                4843             50.0000
TargetB   1                4843             50.0000

Data=TRAIN
Variable  Formatted Value  Frequency Count  Percent
TargetB   0                2422             50.0103
TargetB   1                2421             49.9897

Data=VALIDATE
Variable  Formatted Value  Frequency Count  Percent
TargetB   0                2421             49.9897
TargetB   1                2422             50.0103
Note: The Partition node attempts to maintain the proportion of zeros and ones in the training and validation partitions. The proportions are not exactly the same due to an odd number of zero and one cases in the raw data.
The analysis data was selected and partitioned. You are ready to build predictive models.
3.2 Cultivating Decision Trees
Predictive Modeling Tools - Primary
Many predictive modeling tools are found in SAS Enterprise Miner. The tools can be loosely grouped into three categories: primary, specialty, and multiple models. The primary tools interface with the three most commonly used predictive modeling methods: decision tree, regression, and neural network.
Predictive Modeling Tools - Specialty
The specialty tools are either specializations of the primary tools or tools designed to solve specific problem types.
Predictive Modeling Tools - Multiple
Multiple model tools are used to combine or produce more than one predictive model. The Ensemble tool is briefly described in a later chapter. The Two Stage tool is discussed in the Advanced Predictive Modeling Using SAS® Enterprise Miner™ course.
Model Essentials - Decision Trees
Predict new cases: prediction rules
Select useful inputs: split search
Optimize complexity: pruning
Decision trees provide an excellent introduction to predictive modeling. Decision trees, similar to all modeling methods described in this course, address each of the modeling essentials described in the introduction. Cases are scored using prediction rules. A split-search algorithm facilitates input selection. Model complexity is addressed by pruning. The following simple prediction problem illustrates each of these model essentials.
Simple Prediction Illustration
Predict dot color for each x1 and x2.
(figure: scatter plot of cases in the unit square)
Consider a data set with two inputs and a binary target. The inputs, x1 and x2, locate the case in the unit square. The target outcome is represented by a color; yellow is primary and blue is secondary. The analysis goal is to predict the outcome based on the location in the unit square.
Decision Tree Prediction Rules
Decision trees use rules involving the values of the input variables to predict cases. The rules are arranged hierarchically in a tree-like structure with nodes connected by lines. The nodes represent decision rules, and the lines order the rules. The first rule, at the base (top) of the tree, is called the root node. Subsequent rules are called interior nodes. Nodes with only one connection are called leaf nodes.
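Scoring with such rules amounts to walking one path of nested conditions from the root to a leaf. The split points and leaf values below are illustrative; only the root split on x2 at 0.63 appears in the surrounding example, and the rest are assumptions of this sketch.

```python
def score_case(x1, x2):
    """Apply a decision tree's prediction rules to one case.

    Each leaf returns both a decision (classify as yellow or blue)
    and an estimate (the leaf's primary-target proportion).
    """
    if x2 < 0.63:                 # root node rule
        if x1 < 0.52:             # interior node rule (assumed value)
            return {"decision": "blue", "estimate": 0.30}
        return {"decision": "yellow", "estimate": 0.70}
    return {"decision": "yellow", "estimate": 0.56}
```

Because every case falls through the conditions to exactly one leaf, the rules partition the unit square into rectangular regions, one per leaf.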
Decision Tree Prediction Rules
To score a new case, examine the input values and apply the rules defined by the decision tree.
Decision Tree Prediction Rules
(figure: a scored case follows the rules to a leaf; Predict: yellow, Estimate = 0.70)
The input values of a new case eventually lead to a single leaf in the tree. A tree leaf provides a decision (for example, classify as yellow) and an estimate (for example, the primary-target proportion).
Model Essentials - Decision Trees
Select useful inputs: split search
To select useful inputs, trees use a split-search algorithm. Decision trees confront the curse of dimensionality by ignoring irrelevant inputs.
Note: Curiously, trees have no built-in method for ignoring redundant inputs. Because trees can be trained quickly and have simple structure, this is usually not an issue for model creation. However, it can be an issue for model deployment, in that trees might somewhat arbitrarily select from a set of correlated inputs. To avoid this problem, you must use an algorithm that is external to the tree to manage input redundancy.
Decision Tree Split Search
(figure: candidate left/right split points for input x1 along the unit interval)
Understanding the default algorithm for building trees enables you to better use the Tree tool and interpret your results. The description presented here assumes a binary target, but the algorithm for interval targets is similar. (The algorithm for categorical targets with more than two outcomes is more complicated and is not discussed.) The first part of the algorithm is called the split search. The split search starts by selecting an input for partitioning the available training data. If the measurement scale of the selected input is interval, each unique value serves as a potential split point for the data. If the input is categorical, the average value of the target is taken within each categorical input level. The averages serve the same role as the unique interval input values in the discussion that follows. For a selected input and fixed split point, two groups are generated. Cases with input values less than the split point are said to branch left. Cases with input values greater than the split point are said to branch right. The groups, combined with the target outcomes, form a 2x2 contingency table with columns specifying branch direction (left or right) and rows specifying target value (0 or 1). A Pearson chi-squared statistic is used to quantify the independence of counts in the table's columns. Large values for the chi-squared statistic suggest that the proportion of zeros and ones in the left branch is different than the proportion in the right branch. A large difference in outcome proportions indicates a good split. Because the Pearson chi-squared statistic can be applied to the case of multiway splits and multi-outcome targets, the statistic is converted to a probability value, or p-value. The p-value indicates the likelihood of obtaining the observed value of the statistic assuming identical target proportions in each branch direction. For large data sets, these p-values can be very close to zero.
For this reason, the quality of a split is reported by logworth = -log(chi-squared p-value). At least one logworth must exceed a threshold for a split to occur with that input. By default, this threshold corresponds to a chi-squared p-value of 0.20, or a logworth of approximately 0.7.
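The logworth of a single candidate split can be computed directly from the 2x2 branch-by-target table. This sketch uses the closed form of the 1-degree-of-freedom chi-squared tail probability (via `math.erfc`) so that only the standard library is needed; it illustrates the calculation described above, not SAS's internal code.

```python
import math

def split_logworth(left0, left1, right0, right1):
    """-log10(chi-squared p-value) for a 2x2 table of
    branch (left/right) by target (0/1) counts."""
    n = left0 + left1 + right0 + right1
    observed = [left0, left1, right0, right1]
    expected = [
        (left0 + left1) * (left0 + right0) / n,
        (left0 + left1) * (left1 + right1) / n,
        (right0 + right1) * (left0 + right0) / n,
        (right0 + right1) * (left1 + right1) / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return -math.log10(p_value)
```

Identical branch proportions give a logworth of zero, and the default threshold of 0.7 is just -log10(0.20).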
Decision Tree Split Search
The best split for an input is the split that yields the highest logworth.
Additional Split-Search Details (Self-Study)
There are several peripheral factors that make the split search somewhat more complicated than what is described above. First, the tree algorithm settings disallow certain partitions of the data. Settings, such as the minimum number of observations required for a split search and the minimum number of observations in a leaf, force a minimum number of cases in a split partition. This minimum number of cases reduces the number of potential partitions for each input in the split search. Second, when you calculate the independence of columns in a contingency table, it is possible to obtain significant (large) values of the chi-squared statistic even when there are no differences in the outcome proportions between split branches. As the number of possible split points increases, the likelihood of obtaining significant values also increases. In this way, an input with a multitude of unique input values has a greater chance of accidentally having a large logworth than an input with only a few distinct input values. Statisticians face a similar problem when they combine the results from multiple statistical tests. As the number of tests increases, the chance of a false positive result likewise increases. To maintain overall confidence in the statistical findings, statisticians inflate the p-values of each test by a factor equal to the number of tests being conducted. If an inflated p-value shows a significant result, then the significance of the overall results is assured. This type of p-value adjustment is known as a Bonferroni correction. Because each split point corresponds to a statistical test, Bonferroni corrections are automatically applied to the logworth calculations for an input.
These corrections, called Kass adjustments (named after the inventor of the default tree algorithm used in SAS Enterprise Miner), penalize inputs with many split points by reducing the logworth of a split by an amount equal to the log of the number of distinct input values. This is equivalent to the Bonferroni correction because subtracting this constant from logworth is equivalent to multiplying the corresponding chi-squared p-value by the number of split points. The adjustment enables a fairer comparison of inputs with many and few levels later in the split-search algorithm. Third, for inputs with missing values, two sets of Kass-adjusted logworths are actually generated. For the first set, cases with missing input values are included in the left branch of the contingency table and logworths are calculated. For the second set of logworths, missing value cases are moved to the right branch. The best split is then selected from the set of possible splits with the missing values in the left and right branches, respectively.
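The Kass adjustment itself is a one-line penalty. This is a sketch of the correction as described above, with an illustrative function name, not SAS's internal code.

```python
import math

def kass_adjusted_logworth(raw_logworth, n_split_points):
    """Bonferroni-style penalty for an input with many candidate splits.

    Subtracting log10(m) from the logworth is the same as multiplying
    the underlying chi-squared p-value by m, the number of candidate
    split points tested for the input.
    """
    return raw_logworth - math.log10(n_split_points)
```

For example, an input with a raw logworth of 3.0 but 100 candidate split points is adjusted down to 1.0, allowing a fairer comparison with an input that has only a handful of levels.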
Decision Tree Split Search
Repeat for input x2.
The partitioning process is repeated for every input in the training data. Inputs whose adjusted logworth fails to exceed the threshold are excluded from consideration.
Decision Tree Split Search
(figure: best split for input x2 at split point 0.63; bottom branch 54%/46%, top branch 35%/65%; max logworth(x2) = 4.92)
Again, the optimal split for the input is the one that maximizes the logworth function.
Decision Tree Split Search
Compare partition logworth ratings:
left branch 53%/47%, right branch 42%/58%; max logworth(x1) = 0.95
bottom branch 54%/46%, top branch 35%/65%; max logworth(x2) = 4.92
After you determine the best split for every input, the tree algorithm compares each best split's corresponding logworth. The split with the highest adjusted logworth is deemed best.
Decision Tree Split Search
The training data is partitioned using the best split rule.
Decision Tree Split Search
Repeat the process in each subset.
(figure: the unit square partitioned at x2 = 0.63)
Decision Tree Split Search
(figure: split search within a leaf; left branch 61%/39%, right branch 55%/45%; max logworth(x1) = 5.72)
Decision Tree Split Search
(figure: a candidate split whose adjusted logworth is negative)
The logworth of the split is negative. This might seem surprising, but it results from several adjustments made to the logworth calculation. (The Kass adjustment was described previously. Another, called the depth adjustment, is outlined in the following self-study section.)
Decision Tree Split Search
(figure: logworths compared within the leaf; left branch 61%/39%, right branch 55%/45%; max logworth(x1) = 5.72, max logworth(x2) = 2.01)
The split search continues within each leaf. Logworths are compared as before.
Additional Split-Search Details (Self-Study)
Because the significance of secondary and subsequent splits depends on the significance of the previous splits, the algorithm again faces a multiple comparison problem. To compensate for this problem, the algorithm increases the threshold by an amount related to the number of splits above the current split. For binary splits, the threshold is increased by log10(2) x d, or approximately 0.3d, where d is the depth of the split on the decision tree. By increasing the threshold for each depth (or equivalently decreasing the logworths), the tree algorithm makes it increasingly easy for an input's splits to be excluded from consideration.
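The depth adjustment can be sketched the same way; the function name and the base threshold passed in are illustrative, not SAS internals.

```python
import math

def depth_adjusted_threshold(base_threshold, depth):
    """Threshold a split's logworth must exceed at a given tree depth.

    For binary splits the threshold grows by log10(2) * depth,
    roughly 0.3 per level, compensating for the multiple
    comparisons implied by the splits above the current one.
    """
    return base_threshold + math.log10(2) * depth
```

Starting from the default threshold of about 0.7 at the root, a split three levels down must clear roughly 1.6, which is how deep splits with modest raw logworths end up excluded (or even reported as negative after the logworths are decreased instead).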
Decision Tree Split Search
The data is partitioned according to the best split, which creates a second partition rule. The process repeats in each leaf until there are no more splits whose adjusted logworth exceeds the depth-adjusted thresholds. This process completes the split-search portion of the tree algorithm.
Decision Tree Split Search
Repeat to form a maximal tree.
The resulting partition of the input space is known as the maximal tree. Development of the maximal tree is based exclusively on statistical measures of split worth on the training data. It is likely that the maximal tree will fail to generalize well on an independent set of validation data.
Model Essentials - Decision Trees
Select useful inputs: split search
The split-search algorithm can easily cut through the curse of dimensionality and select useful modeling inputs. Because it is simply an algorithm, many variations are possible. (Some variations are discussed at the end of this chapter.) The following demonstration illustrates the use of the split-search algorithm to construct a decision tree model.
Constructing a Decision Tree Predictive Model
This four-part demonstration illustrates the interactive construction of a decision tree model and builds on the process flows of previous demonstrations.
Preparing for Interactive Tree Construction
Follow these steps to prepare the Decision Tree tool for interactive tree construction.
1. Select the Model tab. The Decision Tree tool is second from the left.
2. Drag a Decision Tree tool into the diagram workspace. Place the node next to the Data Partition node.
3. Connect the Data Partition node to the Decision Tree node.
The Decision Tree tool can build predictive models autonomously or interactively. To build a model autonomously, simply run the Decision Tree node. Building models interactively, however, is more informative and is helpful when you first learn about the node, and even when you are more expert at predictive modeling.
4. Select Interactive from the Decision Tree node's Properties panel. The Train properties include the following splitting-rule settings:
   Interval Criterion: ProbF
   Nominal Criterion: ProbChisq
   Ordinal Criterion: Entropy
   Significance Level: 0.2
   Missing Values: Use in search
   Use Input Once: No
   Maximum Branch: 2
   Maximum Depth: 6
   Minimum Categorical Size: 5
   Leaf Size: 5
The SAS Enterprise Miner Interactive Decision Tree application opens.
(screenshot: Tree View showing the root node; N in node: 4843, Target: TargetB, 0 (%): 50%, 1 (%): 50%)
Creating a Splitting Rule
Decision tree models involve recursive partitioning of the training data in an attempt to isolate concentrations of cases with identical target values. The blue box in the Tree window represents the unpartitioned training data. The statistics show the distribution of TargetB.
Use the following steps to create an initial splitting rule:
1. Right-click the purple box and select Split Node... from the menu. The Split Node 1 dialog box opens, listing the candidate inputs, their Log(p) values, and the number of branches:

   Variable            Log(p)   Branches
   GiftCnt36           18.44    2
   GiftAvgCard36       17.24    2
   GiftAvgLast         15.02    2
   GiftAvg36           14.63    2
   GiftAvgAll          13.95    2
   GiftCntCard36       12.74    2
   GiftCntCardAll      11.91    2
   StatusCatStarAll    11.85    2
   StatusCat96NK        9.52    2
   GiftCntAll           9.28    2
   PromCntCard36        8.73    2
   GiftTimeFirst        6.69    2
   GiftTimeLast         5.70    2
   PromCntAll           5.49    2
   PromCntCardAll       4.59    2
   DemMedHomeValue      4.46    2
   PromCntCard12        3.87    2
   PromCnt36            3.35    2
   PromCnt12            2.61    2
   DemPctVeterans       1.99    2
   DemAge               1.51    2
   REP_DemMedIncome     1.47    2
   DemHomeOwner         0.40    2
The Split Node dialog box shows the relative value, Log(p) or logworth, of partitioning the training data using the indicated input. As the logworth increases, the partition better isolates cases with identical target values. Gift Count 36 Months has the highest logworth, followed by Gift Amount Average Card 36 Months and Gift Amount Last. You can choose to split the data on the selected input or edit the rule for more information.
2. Select Edit Rule.... The GiftCnt36 - Interval Splitting Rule dialog box opens. The dialog box shows the target variable (TargetB), the missing-value assignment (a specific branch), and the branches:
   Branch 1: < 2.5
   Branch 2: >= 2.5
This dialog box shows how the training data is partitioned using the input Gift Count 36 Months. Two branches are created. The first branch contains cases with a 36-month gift count less than 2.5, and the second branch contains cases with a 36-month gift count greater than or equal to 2.5. In other words, cases that have a 36-month gift count of zero, one, or two branch left, and three or more branch right. In addition, any cases with a missing or unknown 36-month gift count are placed in the first branch.
T h e interactive tree assigns any cases with missing values to the branch that maximizes the purity or logworth of the split by default. For G i f tCnt36, this rule assigns cases with missing values to branch 2.
Chapter 3 Introduction to Predictive Modeling: Decision Trees
3. Select Apply. The GiftCnt36 - Interval Splitting Rule dialog box remains open, and the Tree View window is updated.

(Screenshot: Tree View. The root node contains 4843 cases, 50% TargetB=0 and 50% TargetB=1. The split on Gift Count 36 Months creates a < 2.5 branch with 2283 cases (57% / 43%) and a >= 2.5 or Missing branch with 2560 cases (44% / 56%).)

The training data is partitioned into two subsets. The first subset, corresponding to cases with a 36-month gift count less than 2.5, has a higher-than-average concentration of TargetB=0 cases. The second subset, corresponding to cases with a 36-month gift count greater than or equal to 2.5, has a higher-than-average concentration of TargetB=1 cases. The second branch has slightly more cases than the first, which is indicated by the N in node field. The partition of the data also represents your first (nontrivial) predictive model. This model assigns to all cases in the left branch a predicted TargetB value equal to 0.43 and to all cases in the right branch a predicted TargetB value equal to 0.56.

In general, decision tree predictive models assign all cases in a leaf the same predicted target value. For binary targets, this equals the proportion of cases in the leaf with the target variable's primary outcome (usually the target=1 outcome).
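The leaf-assignment rule in the paragraph above can be sketched as follows. The leaf labels and cases are hypothetical, and this is a stand-in for the tree's bookkeeping, not SAS code:

```python
from collections import defaultdict

def leaf_estimates(cases):
    """cases: (leaf_id, target) pairs with target in {0, 1}.
    Returns each leaf's predicted value: the proportion of target=1 cases."""
    tally = defaultdict(lambda: [0, 0])  # leaf -> [n_cases, n_target_1]
    for leaf, target in cases:
        tally[leaf][0] += 1
        tally[leaf][1] += target
    return {leaf: n1 / n for leaf, (n, n1) in tally.items()}

# Two hypothetical leaves: every case in a leaf receives the same estimate.
cases = [("left", 0), ("left", 0), ("left", 1),
         ("right", 1), ("right", 1), ("right", 0), ("right", 1)]
print(leaf_estimates(cases))  # proportions: left 1/3, right 3/4
```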
Adding More Splits

Use the following steps to interactively add more splits to the decision tree:

1. Select the lower left partition. The Split Node dialog box is updated to show the possible partition variables and their respective logworths.

(Screenshot: Split Node dialog for the selected node. Median Home Value Region has the highest Log(p), 3.45, followed by Percent Veterans Region at 2.38.)

The input with the highest logworth is Median Home Value Region.
2. Select Edit Rule.... The DemMedHomeValue - Interval Splitting Rule dialog box opens.

(Screenshot: DemMedHomeValue - Interval Splitting Rule dialog. Missing values are assigned to branch 2, and two branches are defined at split point 67350.)

Branch 1 contains all cases with a median home value less than 67350. Branch 2 contains all cases with a median home value greater than or equal to 67350.
3. Select Apply. The Tree View window is updated to show the additional split.

(Screenshot: Tree View. The < 2.5 node is now split on Median Home Value Region: the < 67350 branch contains 846 cases (63% TargetB=0 / 37% TargetB=1), and the >= 67350 or Missing branch contains 1437 cases (53% / 47%).)

Both the left and right leaves contain a lower-than-average proportion of cases with TargetB=1.
4. Repeat the process for the branch that corresponds to cases with Gift Count 36 Months in excess of 2.5.

(Screenshot: Tree View. The >= 2.5 or Missing node is split on Gift Amount Last: the < 7.5 branch contains 657 cases (36% TargetB=0 / 64% TargetB=1), and the >= 7.5 or Missing branch contains 1903 cases (47% / 53%).)

The right-branch cases are split using the Gift Amount Last input. This time, both branches have a higher-than-average concentration of TargetB=1 cases.
It is important to remember that the logworth of each split reported above and the predicted TargetB values are generated using the training data only. The main focus is on selecting useful input variables for the first predictive model. A diminishing marginal usefulness of input variables is expected, given the split-search discussion at the beginning of this chapter. This prompts the question: Where do you stop splitting? The validation data provides the necessary information to answer this question.
Changing a Splitting Rule

The bottom left split on Median Home Value at 67350 was found to be optimal in a statistical sense, but it might seem strange to someone simply looking at the tree. (Why not split at the more socially acceptable value of 70000?) You can use the Splitting Rule window to edit where a split occurs. Use the following steps to define your own splitting rule for an input.

1. Select the node above the label Median Home Value Region.

(Screenshot: Tree View with the node above the Median Home Value Region split selected.)
2. Right-click this node and select Split Node... from the menu. The Split Node window opens.

(Screenshot: Split Node dialog. Median Home Value Region again has the highest Log(p), 3.36.)
3. Select Edit Rule.... The DemMedHomeValue - Interval Splitting Rule dialog box opens.

(Screenshot: DemMedHomeValue - Interval Splitting Rule dialog with split point 67350 and missing values assigned to branch 2.)
4. Type 70000 in the New split point field.

(Screenshot: the dialog with 70000 entered in the New split point field; the existing branches still split at 67350.)
5. Select Add Branch. The Interval Splitting Rule dialog box shows three branches.

(Screenshot: three branches; branch 1: < 67350, branch 2: < 70000, branch 3: >= 70000.)
6. Select Branch 1 by highlighting the row.

(Screenshot: branch 1 highlighted in the Branches table.)
7. Select Remove Branch.

(Screenshot: two branches remain, both at split point 70000: branch 1 for values < 70000 and branch 2 for values >= 70000.)

The split point for the partition is moved to 70000.

In general, to change a decision tree split point, add the new split point first and then remove the unwanted split point.
Select OK to close the Interval Splitting Rule dialog box and to update the Tree View window.
(Screenshot: updated Tree View. The Median Home Value Region split now reads < 70000 or Missing (922 cases, 62% TargetB=0 / 38% TargetB=1) and >= 70000 (1361 cases, 54% / 46%); the Gift Amount Last split is unchanged (657 cases at 36% / 64% and 1903 cases at 47% / 53%).)
Overriding the statistically optimal split point slightly changed the concentration of cases in both nodes.
Creating the Maximal Tree

As demonstrated thus far, the process of building the decision tree model is one of carefully considering and selecting each split sequentially. There are other, more automated ways to grow a tree. Follow these steps to automatically grow a tree with the SAS Enterprise Miner Interactive Tree tool:

1. Select the root node of the tree.

2. Right-click and select Train Node from the menu. The tree is grown until stopping rules prohibit further growth.

Stopping rules were discussed previously and in the Self-Study section at the end of this chapter.

It is difficult to see the entire tree without scrolling. However, you can zoom into the Tree View window to examine the basic shape of the tree.

3. Select Options ⇨ Zoom ⇨ 50% from the main menu. The tree view is scaled to provide the general shape of the maximal tree.
(Screenshot: Tree View zoomed to 50%, showing the overall shape of the maximal tree.)
Plots and tables contained in the Interactive Tree tool provide a preliminary assessment of the maximal tree.
4. Select View ⇨ Assessment Plot.

(Screenshot: Assessment Plot of Proportion Misclassified versus Number of Leaves.)

While the majority of the improvement in fit occurs over the first few splits, the maximal, fifteen-leaf tree generates a lower misclassification rate than any of its simpler predecessors. The plot seems to indicate that the maximal tree is preferred for assigning predictions to cases. However, this plot is misleading. Recall that the basis of the plot is the training data. Using the same sample of data both to evaluate input variable usefulness and to assess model performance commonly leads to overfit models. This is the mistake that the Mosteller and Tukey quotation cautioned you about.

You created a predictive model that assigns one of 15 predicted target values to each case. Your next task is to investigate how well this model generalizes to another similar, but independent, sample: the validation data.

Select File ⇨ Exit to close the Interactive Tree application. A dialog box opens and asks whether you want to save the changes to the default location, EMWS2.TREE_BROWSETREE.

Select Yes. The Interactive Tree results are now available from the Decision Tree node in the SAS Enterprise Miner flow.
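The warning about judging a model on the data that built it can be reproduced with a toy experiment. Everything below is hypothetical and unrelated to the PVA97NK donor data: the input x is pure noise, yet a model that memorizes the training cases looks perfect on them and is useless on fresh cases.

```python
import random

random.seed(1)
# Noise-only input: x carries no information about the binary target.
train = [(random.random(), random.randint(0, 1)) for _ in range(200)]
valid = [(random.random(), random.randint(0, 1)) for _ in range(200)]

def predict(x, data):
    """Maximal-complexity model: return the target of the nearest training case."""
    return min(data, key=lambda case: abs(case[0] - x))[1]

def misclassification(sample, data):
    return sum(predict(x, data) != y for x, y in sample) / len(sample)

print(misclassification(train, train))  # 0.0 -- perfectly memorized
print(misclassification(valid, train))  # near 0.5 -- no better than chance
```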
3.3 Optimizing the Complexity of Decision Trees
(Slide: Model Essentials - Decision Trees. Optimize complexity: pruning.)
The maximal tree represents the most complicated model you are willing to construct from a set of training data. To avoid potential overfitting, many predictive modeling procedures offer some mechanism for adjusting model complexity. For decision trees, this process is known as pruning.
(Slide: Predictive Model Sequence. Create a sequence of models with increasing complexity.)
The general principle underlying complexity optimization is creating a sequence of related models of increasing complexity from training data and using validation data to select the optimal model.
(Slides: The Maximal Tree. A maximal tree is the most complex model in the sequence.)

The maximal tree that you generated earlier represents the most complex model in the sequence.
(Slide: Pruning One Split. The next model in the sequence is formed by pruning one split from the maximal tree.)

The predictive model sequence consists of subtrees obtained from the maximal tree by removing splits. The first batch of subtrees is obtained by pruning (that is, removing) one split from the maximal tree. This process results in several models, each with one less leaf than the maximal tree.
(Slide: Pruning One Split. Each subtree's predictive performance is rated on validation data.)
The subtrees are compared using a model rating statistic calculated on validation data. (Choices for model fit statistics are discussed later.)
(Slide: Pruning One Split. The subtree with the highest validation assessment is selected.)

The subtree with the best validation assessment statistic is selected to represent a particular level of model complexity.
(Slide: Pruning Two Splits. Similarly, this is done for subsequent models.)
A similar process is followed to obtain the next batch of subtrees.
(Slide: Pruning Two Splits. Prune two splits from the maximal tree.)
This time the subtrees are formed by removing two splits from the maximal tree.
(Slide: Pruning Two Splits. Rate each subtree using validation assessment.)

Once again, the predictive rating of each subtree is compared using the validation data.
(Slide: Pruning Two Splits. Select the subtree with the best assessment rating.)

The subtree with the highest predictive rating is chosen to represent the given level of model complexity.
(Slide: Subsequent Pruning. Continue pruning until all subtrees are considered.)

The process is repeated until all levels of complexity are represented by subtrees.
(Slide: Selecting the Best Tree. Compare validation assessment between tree complexities.)
Consider the validation assessment rating for each of the subtrees.
(Slide: Selecting the Best Tree. Which subtree should be selected as the best model?)
Which of the sequence of subtrees is the best model?
(Slide: Validation Assessment. Choose the simplest model with the highest validation assessment.)

In SAS Enterprise Miner, the simplest model with the highest validation assessment is considered best.
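The selection rule is easy to state in code. The (leaves, validation accuracy) pairs below are hypothetical, chosen only to illustrate the tie-break toward the simpler model:

```python
# Hypothetical pruning sequence: (number of leaves, validation accuracy).
sequence = [(1, 0.500), (2, 0.545), (3, 0.552), (4, 0.561),
            (5, 0.561), (8, 0.558), (15, 0.540)]

best_score = max(score for _, score in sequence)
# Among ties at the best validation score, prefer the fewest leaves.
best_leaves = min(leaves for leaves, score in sequence if score == best_score)
print(best_leaves)  # 4 -- the 4-leaf tree ties the 5-leaf tree and is simpler
```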
(Slide: Validation Assessment. What are appropriate validation assessment ratings?)

Of course, you must finally define what is meant by validation assessment ratings.
(Slide: Assessment Statistics. Ratings depend on target measurement and prediction type.)

Two factors determine the appropriate validation assessment rating or, more properly, assessment statistic:
• the target measurement scale
• the prediction type

An appropriate statistic for a binary target might not make sense for an interval target. Similarly, models tuned for decision predictions might make poor estimate predictions.
(Slide: Binary Targets. Primary outcome, secondary outcome.)

For this discussion, a binary target is assumed, with a primary outcome (target=1) and a secondary outcome (target=0). The appropriate assessment measure for interval targets is noted below.
(Slide: Binary Target Predictions. Decisions, rankings, estimates.)

There are different summary statistics corresponding to each of the three prediction types:
• decisions
• rankings
• estimates

The statistic that you should use to judge a model depends on the type of prediction that you want.
(Slide: Decision Optimization. Decisions.)

Consider decision predictions first. With a binary target, you will typically consider two decision types:
• the primary decision, corresponding to the primary outcome
• the secondary decision, corresponding to the secondary outcome
(Slide: Decision Optimization - Accuracy. True positive, true negative. Maximize accuracy: agreement between outcome and prediction.)

Matching the primary decision with the primary outcome yields a correct decision called a true positive. Likewise, matching the secondary decision to the secondary outcome yields a correct decision called a true negative. Decision predictions can be rated by their accuracy, that is, the proportion of agreement between prediction and outcome.
(Slide: Decision Optimization - Misclassification. False negative, false positive. Minimize misclassification: disagreement between outcome and prediction.)
Mismatching the secondary decision with the primary outcome yields an incorrect decision called a false negative. Likewise, mismatching the primary decision to the secondary outcome yields an incorrect decision called a false positive. Decision predictions can be rated by their misclassification, the proportion of disagreement between prediction and outcome.
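The four decision outcomes, and the two ratings built from them, can be sketched directly (the label vectors are hypothetical):

```python
def decision_stats(actual, predicted):
    """Counts of the four decision outcomes for a binary target,
    summarized as accuracy and misclassification."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
    n = len(actual)
    return {"accuracy": (tp + tn) / n, "misclassification": (fp + fn) / n}

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(decision_stats(actual, predicted))  # accuracy 0.75, misclassification 0.25
```

The two ratings are complementary: accuracy plus misclassification always equals 1.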
(Slide: Ranking Optimization.)

Consider ranking predictions for binary targets. With ranking predictions, a score is assigned to each case. The basic idea is to rank the cases based on their likelihood of being a primary or secondary outcome. Likely primary outcomes receive high scores, and likely secondary outcomes receive low scores.
(Slide: Ranking Optimization - Concordance. target=0 → low score, target=1 → high score. Maximize concordance: proper ordering of primary and secondary outcomes.)

When a pair of primary and secondary cases is correctly ordered, the pair is said to be in concordance. Ranking predictions can be rated by their degree of concordance, that is, the proportion of such pairs whose scores are correctly ordered.
(Slide: Ranking Optimization - Discordance. target=0 → high score, target=1 → low score. Minimize discordance: improper ordering of primary and secondary outcomes.)

When a pair of primary and secondary cases is incorrectly ordered, the pair is said to be in discordance. Ranking predictions can be rated by their degree of discordance, that is, the proportion of such pairs whose scores are incorrectly ordered.
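Concordance and discordance can be computed by enumerating every primary-secondary pair of cases. This brute-force sketch uses hypothetical scores; tied pairs count toward neither proportion:

```python
def concordance(actual, score):
    """Proportions of concordant and discordant primary-secondary pairs."""
    ones  = [s for a, s in zip(actual, score) if a == 1]  # primary-case scores
    zeros = [s for a, s in zip(actual, score) if a == 0]  # secondary-case scores
    pairs = len(ones) * len(zeros)
    conc = sum(s1 > s0 for s1 in ones for s0 in zeros) / pairs
    disc = sum(s1 < s0 for s1 in ones for s0 in zeros) / pairs
    return conc, disc

actual = [1, 1, 0, 0, 1]
score  = [0.9, 0.4, 0.3, 0.7, 0.8]
print(concordance(actual, score))  # 5 of 6 pairs concordant, 1 of 6 discordant
```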
(Slide: Estimate Optimization. Estimates.)
Finally, consider estimate predictions. For a binary target, estimate predictions are the probability of the primary outcome for each case. Primary outcome cases should have a high predicted probability; secondary outcome cases should have a low predicted probability.
(Slide: Estimate Optimization - Squared Error. (target - estimate)². Minimize squared error: squared difference between target and prediction.)

The squared difference between target and estimate is called the squared error. Averaged over all cases, squared error is a fundamental statistical measure of model performance:

Average square error \( = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \)

where, for the i-th case, \(y_i\) is the actual target value and \(\hat{y}_i\) is the predicted target value. When calculated in an unbiased fashion, the average square error is related to the amount of bias in a predictive model. A model with a lower average square error is less biased than a model with a higher average square error.

Because SAS Enterprise Miner enables the target variable to have more than two outcomes, the actual formula used to calculate average square error is slightly different (but equivalent, for a binary target). Suppose the target variable has L outcomes given by (C_1, C_2, ..., C_L). Then

Average square error \( = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} (y_{ij} - p_{ij})^2 \)

where, for the i-th case, \(y_{ij}\) equals 1 when the actual target value is C_j (and 0 otherwise), and \(p_{ij}\) is the predicted probability for the j-th class.
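For a binary target, the first formula reduces to a one-line computation over actual outcomes and predicted probabilities (hypothetical values below):

```python
def average_square_error(actual, estimate):
    """Binary-target ASE: actual outcomes in {0, 1}, estimate = P(target=1)."""
    return sum((a - p) ** 2 for a, p in zip(actual, estimate)) / len(actual)

actual   = [1, 0, 1, 0]
estimate = [0.8, 0.3, 0.6, 0.1]
print(average_square_error(actual, estimate))  # 0.075 (up to float rounding)
```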
(Slide: Complexity Optimization - Summary.)

decisions → accuracy / misclassification
rankings → concordance / discordance
estimates → squared error

In summary, decisions require high accuracy or low misclassification, rankings require high concordance or low discordance, and estimates require low (average) squared error.
Assessing a Decision Tree

Use the following steps to perform an assessment of the maximal tree on the validation sample.

1. Select the Decision Tree node. In the Decision Tree node's properties, change Use Frozen Tree from No to Yes.
(Screenshot: SAS Enterprise Miner workspace with the Decision Tree node's Properties panel; Use Frozen Tree is set to Yes.)
The Frozen Tree property prevents the maximal tree from being changed by other property settings when the flow is run.
2. Right-click the Decision Tree node and run it. Select Results....

(Screenshot: Run completed dialog with OK and Results... buttons.)
The Results window opens.

(Screenshot: Results window for the Decision Tree node, showing a leaf statistics chart, the fit statistics table, and a variable importance table headed by Gift Count 36 Months.)

Fit Statistics:

Statistic  Statistics Label             Train     Validation
_NOBS_     Sum of Frequencies           4843      4843
_SUMW_     Sum of Case Weights          9686      9686
_MISC_     Misclassification Rate       0.403425  0.432377
_MAX_      Maximum Absolute Error       0.923571  0.928571
_SSE_      Sum of Squared Errors        2293.347  2377.757
_ASE_      Average Squared Error        0.236769  0.245484
_RMSE_     Root Average Squared Error   0.486599  0.495463
_DIV_      Divisor for ASE              9686      9686
_DFT_      Total Degrees of Freedom     4843      4843

The Results window contains a variety of diagnostic plots and tables, including a cumulative lift chart, a tree map, and a table of fit statistics. The diagnostic tools shown in the results vary with the measurement level of the target variable. Some of the diagnostic tools contained in the results shown above are described in the following Self-Study section.
The saved tree diagram is in the bottom left corner of the Results window. You can verify this by maximizing the tree plot.
(Screenshot: the maximized tree plot of the saved maximal tree.)
4. The main focus of this part of the demonstration is on assessing the performance of a 15-leaf tree, created with the Interactive Decision Tree tool, on the validation sample. Notice that several of the diagnostics shown in the results of the Decision Tree node use the validation data as a basis. Select View ⇨ Model ⇨ Iteration Plot.

(Screenshot: Iteration Plot of Average Square Error versus Number of Leaves for the Train and Validation samples.)

The plot shows the average square error corresponding to each subtree as the data is sequentially split. This plot is similar to the one generated with the Interactive Decision Tree tool, and it confirms suspicions about the optimality of the 15-leaf tree. The performance on the training sample becomes monotonically better as the tree becomes more complex. However, the performance on the validation sample improves only up to a tree of approximately four or five leaves, and then diminishes as model complexity increases.

The validation performance shows evidence of model overfitting. Over the range of one to approximately four leaves, the precision of the model improves with the increase in complexity. A marginal increase in complexity over this range results in better accommodation of the systematic variation, or signal, in the data. Precision diminishes as complexity increases past this range; the additional complexity accommodates idiosyncrasies in the training sample, and the model extrapolates less well.
Both axes of the Iteration Plot, shown above, require some explanation.
Average square error: The formula for average square error was given previously. For any decision tree in the sequence, the necessary inputs for calculating this statistic are the actual target value and the prediction. Recall that predictions generated by trees are the proportion of cases with the primary outcome (TargetB=1) in each terminal leaf. This value is calculated for both the training and validation data sets.

Number of leaves: Starting with the four-leaf tree that you constructed earlier, it is possible to create a sequence of simpler trees by removing partitions. For example, a three-leaf tree can be constructed from your four-leaf tree by removing either the left partition (involving Median Home Value) or the right partition (involving Gift Amount Last). Removing either split might affect the generated average square error. The subtree with the lowest average square error (measured, by default, on the validation data) is the one that remains. To create simpler trees, the process is repeated by, once more, starting with the four-leaf tree that you constructed and removing more leaves. For example, the two-leaf tree is created from the four-leaf tree by removing two splits, and so on.
5. To further explore validation performance, select the arrow in the upper left corner of the Iteration Plot, and switch the assessment statistic to Misclassification Rate.

[Iteration Plot: Misclassification Rate versus Number of Leaves for the training and validation data]
The validation performance under Misclassification Rate is similar to the performance under Average Square Error. The optimal tree appears to have, roughly, four or five leaves.

Proportion misclassified: The name decision tree model comes from the decision that is made at the leaf, or terminal node. In SAS Enterprise Miner, this decision, by default, is a classification based on the predicted target value. A predicted target value in excess of 0.5 results in a primary outcome classification, and a predicted target value less than 0.5 results in a secondary outcome classification. Within a leaf, the predicted target value decides the classification for all cases, regardless of each case's actual outcome. Clearly, this classification is inaccurate for all cases that are not in the assigned class. The fraction of cases for which the wrong decision (classification) was made is the proportion misclassified. This value is calculated for both the training and validation data sets.

6. The current Decision Tree node will be renamed and saved for reference. Close the Decision Tree Results window.
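The classification rule and proportion misclassified described above can be sketched as follows. The 0.5 cutoff matches the default described in the text; the predictions and outcomes themselves are hypothetical.

```python
# Proportion misclassified under the default 0.5 cutoff: every case in a
# leaf receives the same decision, so any case whose actual outcome
# disagrees with the leaf's decision is misclassified.

def misclassification_rate(actuals, predictions, cutoff=0.5):
    decisions = [1 if p > cutoff else 0 for p in predictions]
    wrong = sum(d != y for d, y in zip(decisions, actuals))
    return wrong / len(actuals)

# Four hypothetical cases: predicted leaf proportions and actual outcomes.
preds   = [0.80, 0.80, 0.30, 0.30]
actuals = [1,    0,    0,    1]

print(misclassification_rate(actuals, preds))  # 2 of 4 cases misclassified -> 0.5
```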
7. Right-click the Decision Tree node and select Rename. Name the node Maximal Tree.
The current process flow should look similar to the following:

[Process flow diagram: the Data Partition node connected to the Maximal Tree node]
The next step is to generate the optimal tree using the default Decision Tree properties in SAS Enterprise Miner.
Pruning a Decision Tree

The optimal tree in the sequence can be identified using tools contained in the Decision Tree node. The relevant properties are listed under the Subtree heading.

1. Drag another Decision Tree node from the Model tab into the diagram. Connect it to the Data Partition node so that your diagram looks similar to what is shown below.

[Screen shot: SAS Enterprise Miner workspace with the Data Partition node connected to both the Maximal Tree node and the new Decision Tree node]
2. Select the new Decision Tree node and scroll down in the Decision Tree properties to the Subtree section. The main tree pruning properties are listed under the Subtree heading.

[Properties panel, Subtree section: Method = Assessment, Number of Leaves = 1, Assessment Measure = Decision, Assessment Fraction = 0.25]

The default method used to prune the maximal tree is Assessment. This means that algorithms in SAS Enterprise Miner choose the best tree in the sequence based on some optimality measure.
Alternative Method options are Largest and N. The Largest option provides an autonomous way to generate the maximal tree. The N option generates a tree with N leaves. The maximal tree is an upper bound on N.
The Assessment Measure property specifies the optimality measure used to select the best tree in the sequence. The default measure is Decision. The default measure varies with the measurement level of the target variable and other settings in SAS Enterprise Miner.
Metadata plays an important role in how SAS Enterprise Miner functions. Recall that a binary target variable was selected for the project. Based on this, SAS Enterprise Miner assumes that you want a tree that is optimized for making the best decisions (as opposed to the best rankings or best probability estimates). That is, under the current project settings, SAS Enterprise Miner chooses the tree with the lowest misclassification rate on the validation sample by default.
3. Right-click the new Decision Tree node and select Run.

4. After the Decision Tree node runs, select Results....

[Run Status dialog: Run completed, Diagram: Predictive Analysis, with OK and Results... buttons]
5. In the Results window, select View ⇒ Model ⇒ Iteration Plot. The Iteration Plot shows model performance under Average Square Error by default. However, this is not the criterion used to select the optimal tree.
6. Change the basis of the plot to Misclassification Rate.

[Iteration Plot: Average Squared Error versus Number of Leaves for the training and validation data, with a menu of alternative statistics such as Sum of Square Errors, Maximum Absolute Error, and Subtree Assessment]
The five-leaf tree has the lowest associated misclassification rate on the validation sample.

[Iteration Plot: Misclassification Rate versus Number of Leaves for the training and validation data]
Notice that the first part of the process in generating the optimal tree is very similar to the process followed in the interactive Decision Tree tool. The maximal, 15-leaf tree is generated. The maximal tree is then sequentially pruned so that the sequence consists of the best 15-leaf tree, the best 14-leaf tree, and so on. Best is defined at any point in the sequence, for example eight, as the eight-leaf tree with the lowest validation misclassification rate among all candidate eight-leaf trees. The tree in the sequence with the lowest overall validation misclassification rate is selected.
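The two-stage selection logic described above can be sketched as follows. The candidate misclassification rates are hypothetical; the point is the structure of the choice: best subtree at each leaf count, then the best size overall.

```python
# Hypothetical validation misclassification rates for candidate subtrees,
# keyed by leaf count. Each list holds the candidates of that size.
candidates = {
    2: [0.470, 0.468],
    3: [0.455, 0.460, 0.452],
    4: [0.440, 0.438],
    5: [0.431, 0.435],
    6: [0.433],
}

# Stage 1: the best subtree at each size is the one with the lowest rate.
best_by_size = {leaves: min(rates) for leaves, rates in candidates.items()}

# Stage 2: the selected tree is the size with the lowest overall rate.
best_leaves = min(best_by_size, key=best_by_size.get)

print(best_leaves, best_by_size[best_leaves])  # -> 5 0.431
```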
The optimal tree generated by the Decision Tree node is found in the bottom left corner of the Results window.
[Tree diagram: the optimal five-leaf tree, splitting on Gift Count 36 Months, Gift Amount Last, Time Since Last Gift, and Gift Amount Average Card 36 Months, with training and validation outcome percentages in each node]

7. Close the Results window of the Decision Tree node.
Alternative Assessment Measures

The Decision Tree generated above is optimized for decisions, that is, to make the best classification of cases into donors and non-donors. What if, instead of classification, the primary objective of the project is to assess the probability of donation? The discussion above describes how the different assessment measures correspond to the optimization of different modeling objectives. Generate a tree that is optimized for generating probability estimates. The main purpose is to explore how different modeling objectives can change the optimal model specification.

1. Navigate to the Model tab. Drag a new Decision Tree node into the process flow.

[Screen shot: SAS Enterprise Miner workspace with a new Decision Tree node added to the Predictive Analysis diagram]

2. Right-click the new Decision Tree node and select Rename.
3. Name the new Decision Tree node Probability Tree.

[Rename dialog: Node Name = Probability Tree]
4. Connect the Probability Tree node to the Data Partition node. Your diagram should look similar to the following:

[Process flow diagram: the Probability Tree node connected to the Data Partition node]
5. Recall that an appropriate assessment measure for generating optimal probability estimates is Average Square Error.
6. Scroll down in the properties of the Probability Tree node to Subtree.

7. Change the Assessment Measure property to Average Square Error.

[Properties panel, Subtree section: Method = Assessment, Assessment Measure = Average Square Error]

8. Run the Probability Tree node.

9. After it runs, select Results....
396
Chapter 3 Introduction to Predictive Modeling: Decision Trees
10. In the Results window, select View ⇒ Model ⇒ Iteration Plot.

[Iteration Plot: Average Squared Error versus Number of Leaves for the training and validation data]
A five-leaf tree is also optimal under the Validation Average Square Error criterion. However, viewing the Tree plot reveals that, although the optimal Decision (Misclassification) Tree and the optimal Probability Tree have the same number of leaves, they are not identical trees.
11. Maximize the Tree plot at the bottom left corner of the Results window. The tree shown below is a different five-leaf tree than the tree optimized under Validation Misclassification.

[Tree diagram: the optimal Probability Tree, splitting on Gift Count 36 Months, Median Home Value Region, Gift Amount Last, and Time Since Last Gift, with training and validation outcome percentages in each node]
It is important to remember how the assessment measures work in model selection. In the first step, cultivating (or growing or splitting) the tree, the important measure is logworth. Splits of the data are made based on logworth to isolate subregions of the input space with high proportions of donors and non-donors. Splitting continues until a boundary associated with a stopping rule is reached. This process generates the maximal tree. The maximal tree is then sequentially pruned. Each subtree in the pruning sequence is chosen based on the selected assessment measure. For example, the eight-leaf tree in the Decision (Misclassification) Tree pruning sequence is the eight-leaf tree with the lowest misclassification rate on the validation sample among all candidate eight-leaf trees. The eight-leaf tree in the Probability Tree pruning sequence is the eight-leaf tree with the lowest average square error on the validation sample among all candidate eight-leaf trees. This is how the change in assessment measure can generate a change in the optimal specification.
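A small sketch makes the point concrete: the same two candidate trees, scored on the same validation cases, can rank differently under misclassification rate and under average square error. All numbers below are hypothetical.

```python
# Hypothetical validation cases and predicted primary-outcome proportions
# from two candidate trees.
actuals = [1, 0, 1, 0]
tree_a  = [0.55, 0.45, 0.55, 0.45]  # barely on the right side of 0.5 every time
tree_b  = [0.95, 0.05, 0.55, 0.55]  # sharper estimates, but one wrong decision

def misclassification(actual, pred, cutoff=0.5):
    return sum((1 if p > cutoff else 0) != y for y, p in zip(actual, pred)) / len(actual)

def average_square_error(actual, pred):
    return sum((y - p) ** 2 for y, p in zip(actual, pred)) / len(actual)

# Tree A wins under misclassification; tree B wins under average square error.
print(misclassification(actuals, tree_a), misclassification(actuals, tree_b))
print(average_square_error(actuals, tree_a), average_square_error(actuals, tree_b))
```

This is why the Decision Tree node, pruned under Decision, and the Probability Tree, pruned under Average Square Error, can arrive at different five-leaf trees.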
In order to minimize clutter in the diagram, the Maximal Tree node is deleted and does not appear in the notes for the remainder of the course.
3.4 Understanding Additional Diagnostic Tools (Self-Study)
Several additional plots and tables contained in the Decision Tree node Results window are designed to assist in interpreting and evaluating a decision tree. The most useful are demonstrated on the next several pages.
Understanding Additional Plots and Tables (Optional)

Tree Map

The Tree Map window is intended to be used in conjunction with the Tree window to gauge the relative size of each leaf.

1. Select the small rectangle in the lower right corner of the Tree Map.
The lower right leaf in the tree is also selected.

[Tree Map and Tree windows: the rectangle selected in the Tree Map highlights the corresponding leaf in the Tree window]

The size of the small rectangle in the Tree Map indicates the fraction of the training data present in the corresponding leaf. It appears that only a small fraction of the training data finds its way to this leaf.
Leaf Statistic Bar Chart

[Bar chart: training and validation percentages by Leaf Index]

This window compares the blue predicted outcome percentages in the left bars (from the training data) to the red observed outcome percentages in the right bars (from the validation data). This plot reveals how well training data response rates are reflected in the validation data. Ideally, the bars should be of the same height. Differences in bar heights are usually the result of small case counts in the corresponding leaf.
Variable Importance

The following steps open the Variable Importance window. Select View ⇒ Model ⇒ Variable Importance. The Variable Importance window opens.

[Variable Importance table: for each input, the number of splitting rules, the importance, the validation importance, and the ratio of validation to training importance. Gift Count 36 Months has importance 1.0, followed by Gift Amount Last (0.524), Median Home Value (0.503), and Time Since Last Gift (0.481); the remaining inputs have importance 0.]
The Variable Importance window provides insight into the importance of inputs in the decision tree. The magnitude of the importance statistic relates to the amount of variability in the target explained by the corresponding input, relative to the input at the top of the table. For example, Gift Amount Last explains 52.4% of the variability explained by Gift Count 36 Months.
The Score Rankings Overlay Plot

[Score Rankings Overlay plot: cumulative lift versus percentile for the TRAIN and VALIDATE data]
The Score Rankings Overlay plot presents what is commonly called a cumulative lift chart. Cases in the training and validation data are ranked based on decreasing predicted target values. A fraction of the ranked data is selected (given by the decile value). The proportion of cases with the primary target value in this fraction is compared to the proportion of cases with the primary target value overall (the ratio is the cumulative lift value). A useful model shows high lift in low deciles in both the training and validation data. For example, the Score Rankings plot shows that in the top 20% of cases (ranked by predicted probability), the training and validation lifts are approximately 1.26. This means that a case in this top 20% is about 26% more likely to have the primary outcome than a randomly selected case.
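The cumulative lift calculation behind this plot can be sketched as follows. The cases and predictions below are hypothetical, and the depth is expressed as a fraction rather than a decile.

```python
# Cumulative lift: rank cases by predicted probability (descending), take the
# top fraction, and compare its response rate to the overall response rate.

def cumulative_lift(actuals, predictions, depth=0.20):
    ranked = sorted(zip(predictions, actuals), reverse=True)
    n_top = max(1, int(round(len(ranked) * depth)))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate

preds   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [1,   1,   0,   1,   1,   0,   0,   1,   0,   0]

# Top 20% (2 of 10 cases) are both responders: rate 1.0 vs overall 0.5.
print(cumulative_lift(actuals, preds))  # -> 2.0
```

A lift of 1.0 at full depth is guaranteed by construction, which is why the curves in the plot converge to 1 at the 100th percentile.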
Fit Statistics

[Fit Statistics table for target TargetB]

Statistic   Label                        Train       Validation
_NOBS_      Sum of Frequencies           4843        4843
_SUMW_      Sum of Case Weights          9686        9686
_MISC_      Misclassification Rate       0.420607    0.42804
_MAX_       Maximum Absolute Error       0.643836    0.643836
_SSE_       Sum of Squared Errors        2351.152    2357.331
_ASE_       Average Squared Error        0.242737    0.243375
_RASE_      Root Average Squared Error   0.492684    0.493331
_DIV_       Divisor for ASE              9686        9686
_DFT_       Total Degrees of Freedom     4843
The Fit Statistics window is used to compare various models built within SAS Enterprise Miner. The misclassification and average square error statistics are of most interest for this analysis.
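The statistics in the table are related to one another, and a quick check using the Train column confirms the relationships ASE = SSE / DIV and RASE = √ASE.

```python
import math

# Train-column values from the Fit Statistics table above.
sse = 2351.152   # _SSE_: sum of squared errors
div = 9686       # _DIV_: divisor for ASE (note 9686 = 2 x 4843 cases)

ase = sse / div          # reproduces _ASE_  = 0.242737
rase = math.sqrt(ase)    # reproduces _RASE_ = 0.492684

print(round(ase, 6), round(rase, 6))
```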
The Output Window

[Output window excerpt]

    Training Output

    Variable Summary

    Role       Measurement Level   Frequency Count
    INPUT      BINARY               2
    INPUT      INTERVAL            20
    INPUT      NOMINAL              3
    REJECTED   INTERVAL             1
    TARGET     BINARY               1
    TARGET     INTERVAL             1
The Output window provides information generated by the SAS procedures that are used to generate the analysis results. For the decision tree, this information includes variable importance, a tree leaf report, model fit statistics, classification information, and score rankings.
3.5 Autonomous Tree Growth Options (Self-Study)
Autonomous Decision Tree Defaults

[Slide: default settings. Maximum Branches: 2; Splitting Rule Criterion: logworth; Subtree Method: average profit; Tree Size Options: Bonferroni Adjust, Split Adjust, Maximum Depth, Leaf Size]
The behavior of the tree algorithm in SAS Enterprise Miner is governed by many parameters that can be divided into four groups:

• the number of splits to create at each partitioning opportunity
• the metric used to compare different splits
• the method used to prune the tree model
• the rules used to stop the autonomous tree growing process

The defaults for these parameters generally yield good results for an initial prediction.
Tree Variations: Maximum Branches

[Slide: multiway splits complicate the split search, trade height for width, and require heuristic shortcuts]

SAS Enterprise Miner accommodates a multitude of variations in the default tree algorithm. The first involves the use of multiway splits instead of binary splits. Theoretically, there is no clear advantage in doing this. Any multiway split can be obtained using a sequence of binary splits. The primary change is cosmetic. Trees with multiway splits tend to be wider than trees with only binary splits. The inclusion of multiway splits complicates the split-search algorithm. A simple linear search becomes a search whose complexity increases geometrically in the number of splits allowed from a leaf. To combat this complexity explosion, the Tree tool in SAS Enterprise Miner resorts to heuristic search strategies.
Tree Variations: Maximum Branches

[Properties panel: the Maximum Branch property sets the maximum branches in a split; the Exhaustive property sets the exhaustive search size limit]

Two fields in the Properties panel affect the number of splits in a tree. The Maximum Branch property sets an upper limit on the number of branches emanating from a node. When this number is greater than the default of 2, the number of possible splits rapidly increases. To save computation time, a limit is set in the Exhaustive property as to how many possible splits are explicitly examined. When this number is exceeded, a heuristic algorithm is used in place of the exhaustive search described above. The heuristic algorithm alternately merges branches and reassigns consolidated groups of observations to different branches. The process stops when a binary split is reached. Among all candidate splits considered, the one with the best worth is chosen.

The heuristic algorithm initially assigns each consolidated group of observations to a different branch, even if the number of such branches is more than the limit allowed in the final split. At each merge step, the two branches that degrade the worth of the partition the least are merged. After the two branches are merged, the algorithm considers reassigning consolidated groups of observations to different branches. Each consolidated group is considered in turn, and the process stops when no group is reassigned.
Tree Variations: Splitting Rule Criterion

[Slide: alternative criteria yield similar splits, can grow enormous trees, and favor many-level inputs]

In addition to changing the number of splits, you can also change how the splits are evaluated in the split-search phase of the tree algorithm. For categorical targets, SAS Enterprise Miner offers three separate split-worth criteria. Changing from the chi-squared default criterion typically yields similar splits if the number of distinct levels in each input is similar. If not, the other split methods tend to favor inputs with more levels due to the multiple comparison problem discussed above. You can also cause the chi-squared method to favor inputs with more levels by turning off the Bonferroni adjustments. Because the Gini reduction and entropy reduction criteria lack the significance threshold feature of the chi-squared criterion, they tend to grow enormous trees. Pruning and selecting a tree complexity based on validation profit limit this problem to some extent.
Tree Variations: Splitting Rule Criterion

[Properties panel: Interval Criterion (ProbF or Variance), Nominal Criterion (ProbChisq, Entropy, or Gini), Ordinal Criterion (Entropy or Gini), and the logworth adjustment options]

A total of five choices in SAS Enterprise Miner can evaluate split worth. Three (chi-squared logworth, entropy, and Gini) are used for categorical targets, and the remaining two (variance and ProbF logworth) are reserved for interval targets. Both chi-squared and ProbF logworths are adjusted (by default) for multiple comparisons. It is possible to deactivate this adjustment.

The split worth for the entropy, Gini, and variance options is calculated as shown below. Let a set of cases S be partitioned into p subsets S_1, ..., S_p so that S = S_1 ∪ ... ∪ S_p. Let the number of cases in S equal N and the number of cases in each subset S_i equal n_i. Then the worth of a particular partition of S is given by the following:

    worth = I(S) - Σ_i w_i · I(S_i)

where w_i = n_i/N (the proportion of cases in subset S_i), and for the specified split-worth measure, I(·) has the following value:

    Entropy:   I(S) = - Σ_classes p_class · log2(p_class)
    Gini:      I(S) = 1 - Σ_classes (p_class)^2
    Variance:  I(S) = (1/N) Σ_cases (Y_case - Ȳ)^2

Each worth statistic measures the change in I(S) from node to branches. In the variance calculation, Ȳ is the average of the target values in the node containing Y_case.
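The impurity measures and the worth formula above can be sketched directly. The node counts below are hypothetical: a parent node with 60 and 40 cases in the two classes, split into branches of 70 and 30 cases.

```python
import math

def entropy(class_counts):
    n = sum(class_counts)
    return -sum(c / n * math.log2(c / n) for c in class_counts if c > 0)

def gini(class_counts):
    n = sum(class_counts)
    return 1 - sum((c / n) ** 2 for c in class_counts)

def split_worth(impurity, parent, branches):
    """worth = I(parent) - sum of (branch fraction) * I(branch)."""
    n = sum(parent)
    return impurity(parent) - sum(sum(b) / n * impurity(b) for b in branches)

parent = [60, 40]
branches = [[55, 15], [5, 25]]  # left branch: 70 cases, right branch: 30 cases

print(round(split_worth(entropy, parent, branches), 4))
print(round(split_worth(gini, parent, branches), 4))
```

Both measures reward the same thing, branches that are purer than the parent, but on different scales, so they can rank competing splits differently.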
Tree Variations: Subtree Method

[Slide: the Subtree method controls model complexity]

SAS Enterprise Miner features two settings for regulating the complexity of the subtree: the pruning method and the assessment measure. By changing the model assessment measure to Average Square Error, you can construct what is known as a class probability tree. It can be shown that this choice minimizes the imprecision of the tree. Analysts sometimes use this model assessment measure to select inputs for a flexible predictive model such as a neural network. You can deactivate pruning entirely by changing the subtree method to Largest. You can also specify that the tree be built with a fixed number of leaves.
Tree Variations: Subtree Method

[Properties panel: Method options are Assessment, Largest, and N; Assessment Measure options are Decision, Average Square Error, Misclassification, and Lift]

The pruning options are controlled by two properties: Method and Assessment Measure.
Tree Variations: Tree Size Options

[Slide: tree size options avoid orphan nodes, control sensitivity, and allow large trees]

The family of adjustments that you modify most often when building trees are the rules that limit the growth of the tree. Changing the minimum number of observations required for a split search and the minimum number of observations in a leaf prevents the creation of leaves with only one or a handful of cases. Changing the significance level and the maximum depth allows for larger trees that can be more sensitive to complex input and target associations. The growth of the tree is still limited by the depth adjustment made to the threshold.
Tree Variations: Tree Size Options

[Properties panel: logworth threshold (Significance Level), maximum tree depth, minimum leaf size, and the threshold depth adjustment]

Changing the logworth threshold changes the minimum logworth required for a split to be considered by the tree algorithm (when using the chi-squared or ProbF measures). Increasing the minimum leaf size avoids orphan nodes. For large data sets, you might want to increase the maximum leaf setting to obtain additional modeling resolution. If you want big trees and insist on using the chi-squared split-worth criterion, deactivate the Split Adjustment option.
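The logworth quantity referenced throughout can be sketched for a single candidate split of a binary target. This is an illustrative computation only: the counts and candidate-split count are hypothetical, and it relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal.

```python
import math
from statistics import NormalDist

def chisq_statistic(table):
    """Pearson chi-squared statistic for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    expected = [[row1 * col1 / n, row1 * col2 / n],
                [row2 * col1 / n, row2 * col2 / n]]
    obs = [[a, b], [c, d]]
    return sum((obs[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))

def logworth(table, n_candidate_splits=1):
    """logworth = -log10(p-value), optionally Bonferroni-adjusted."""
    chisq = chisq_statistic(table)
    # For 1 degree of freedom, P(chi2 > x) = 2 * (1 - Phi(sqrt(x))).
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chisq)))
    p_value = min(1.0, p_value * n_candidate_splits)  # Bonferroni adjustment
    return -math.log10(p_value)

# Branch counts of (primary, secondary) outcomes for a hypothetical split.
table = [[60, 40], [45, 55]]
print(round(logworth(table), 2), round(logworth(table, n_candidate_splits=20), 2))
```

Note how the Bonferroni adjustment multiplies the p-value by the number of candidate splits, lowering the logworth; this is the penalty for the multiple comparison problem discussed earlier.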
Exercises

1. Initial Data Exploration

A supermarket is offering a new line of organic products. The supermarket's management wants to determine which customers are likely to purchase these products. The supermarket has a customer loyalty program. As an initial buyer incentive plan, the supermarket provided coupons for the organic products to all of the loyalty program participants and collected data that includes whether these customers purchased any of the organic products.

The ORGANICS data set contains 13 variables and over 22,000 observations. The variables in the data set are shown below with the appropriate roles and levels:

Name              Model Role   Measurement Level   Description
ID                ID           Nominal             Customer loyalty identification number
DemAffl           Input        Interval            Affluence grade on a scale from 1 to 30
DemAge            Input        Interval            Age, in years
DemCluster        Rejected     Nominal             Type of residential neighborhood
DemClusterGroup   Input        Nominal             Neighborhood group
DemGender         Input        Nominal             M = male, F = female, U = unknown
DemRegion         Input        Nominal             Geographic region
DemTVReg          Input        Nominal             Television region
PromClass         Input        Nominal             Loyalty status: tin, silver, gold, or platinum
PromSpend         Input        Interval            Total amount spent
PromTime          Input        Interval            Time as loyalty card member
TargetBuy         Target       Binary              Organics purchased? 1 = Yes, 0 = No
TargetAmt         Rejected     Interval            Number of organic products purchased
Although two target variables are listed, these exercises concentrate on the binary variable TargetBuy.

a. Create a new diagram named Organics.
b. Define the data set AAEM61.ORGANICS as a data source for the project.

   1) Set the roles for the analysis variables as shown above.
   2) Examine the distribution of the target variable. What is the proportion of individuals who purchased organic products?
   3) The variable DemClusterGroup contains collapsed levels of the variable DemCluster. Presume that, based on previous experience, you believe that DemClusterGroup is sufficient for this type of modeling effort. Set the model role for DemCluster to Rejected.
   4) As noted above, only TargetBuy will be used for this analysis and should have a role of Target. Can TargetAmt be used as an input for a model used to predict TargetBuy? Why or why not?
   5) Finish the ORGANICS data source definition.

c. Add the AAEM61.ORGANICS data source to the Organics diagram workspace.

d. Add a Data Partition node to the diagram and connect it to the data source node. Assign 50% of the data for training and 50% for validation.

e. Add a Decision Tree node to the workspace and connect it to the Data Partition node.

f. Create a decision tree model autonomously. Use average square error as the model assessment statistic.

   1) How many leaves are in the optimal tree?
   2) Which variable was used for the first split? What were the competing splits for this first split?

g. Add a second Decision Tree node to the diagram and connect it to the Data Partition node.

   1) In the Properties panel of the new Decision Tree node, change the maximum number of branches from a node to 3 to allow for three-way splits.
   2) Create a decision tree model using average square error as the model assessment statistic.
   3) How many leaves are in the optimal tree?

h. Based on average square error, which of the decision tree models appears to be better?
Chapter 3 Introduction to Predictive Modeling: Decision Trees
3.6 Chapter Summary

Predictive modeling fulfills some of the promise of data mining. Past observation of target events coupled with past input measurements enables some degree of assistance for the prediction of future target event values, conditioned on current input measurements. All of this depends on the stationarity of the process being modeled. A predictive model must perform three essential tasks:
1. It must be able to systematically predict new cases, based on input measurements. These predictions might be decisions, rankings, or estimates.
2. A predictive model must be able to thwart the curse of dimensionality and successfully select a limited number of useful inputs. This is accomplished by systematically ignoring irrelevant and redundant inputs.
3. A predictive model must adjust its complexity to avoid overfitting or underfitting. In SAS Enterprise Miner, this adjustment is accomplished by splitting the raw data into a training data set for building the model and a validation data set for gauging model performance. The performance measure used to gauge performance depends on the prediction type. Decisions can be evaluated by accuracy or misclassification, rankings by concordance or discordance, and estimates by average square error.
Decision trees are a simple but potentially useful example of a predictive model. They score new cases by creating prediction rules. They select useful inputs via a split-search algorithm. They optimize complexity by building and pruning a maximal tree. A decision tree model can be built interactively, automatically, or autonomously. For autonomous models, the Decision Tree Properties panel provides options to control the size and shape of the final tree.
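The "prediction rules" idea can be made concrete in a few lines of code. The following sketch scores a case with hand-written nested rules; the split variables, thresholds, and leaf estimates are hypothetical illustrations, not the tree fitted in this chapter:

```python
def score_tree(age, affluence_grade):
    """Score one case with nested prediction rules.

    Returns an estimate of P(TargetBuy = 1). The splits and leaf
    estimates below are illustrative assumptions only.
    """
    if age < 44.5:                 # hypothetical first split
        if affluence_grade >= 10:  # hypothetical second split
            return 0.55            # leaf estimate
        return 0.35
    return 0.15


print(score_tree(30, 12))  # -> 0.55
```

Scoring a new case is just a walk from the root to one leaf, which is why tree predictions deploy so easily.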
Decision Tree Tools Review
• Partition raw data into training and validation sets.
• Interactively grow trees using the Tree Desktop application. You can control rules, selected inputs, and tree complexity.
• Autonomously grow decision trees based on property settings. Settings include branch count, split rule criterion, subtree method, and tree size options.
3.7 Solutions
Solutions to Exercises

1. Initial Data Exploration
a. Create a new diagram named Organics.
1) Select File ⇒ New ⇒ Diagram.... The Create New Diagram window opens.
2) Type Organics in the Diagram Name field.
3) Select OK.
b. Define the data set AAEM61.ORGANICS as a data source for the project.
1) Set the model role for the target variable.
2) Examine the distribution of the target variable. What is the proportion of individuals who purchased organic products?
a) Select File ⇒ New ⇒ Data Source.... The Data Source Wizard window opens at Step 1, Metadata Source, with Source set to SAS Table.
b) Select Next >. The wizard proceeds to Step 2.
c) Type AAEM61.ORGANICS in the Table field.
d) Select Next >. The wizard proceeds to Step 3.
Step 3, Table Information, shows the table properties: AAEM61.ORGANICS is a DATA table with 13 variables and 22,223 observations.
e) Select Next >. The wizard proceeds to Step 4, Metadata Advisor Options. Use the basic setting to set the initial measurement levels and roles based on the variable attributes. Use the advanced setting to set the initial measurement levels and roles based on both the variable attributes and distributions.
f) Select the Advanced radio button and select Customize.... The Advanced Advisor Options window opens.
g) Type 2 as the Class Levels Count Threshold value. (If Detect Class Levels is Yes, interval variables with fewer distinct values than this threshold are marked as NOMINAL; the default value is 20.)
h) Select OK. The Advanced Advisor Options window closes and you are returned to Step 4 of the Data Source Wizard.
i) Select Next >. The wizard proceeds to Step 5. By customizing the Advanced Metadata Advisor, most of the roles and levels are correctly set.
j) Select Role ⇒ Rejected for TargetAmt. The column metadata is now the following:

   Name             Role      Level
   DemAffl          Input     Interval
   DemAge           Input     Interval
   DemCluster       Rejected  Nominal
   DemClusterGroup  Input     Nominal
   DemGender        Input     Nominal
   DemReg           Input     Nominal
   DemTVReg         Input     Nominal
   ID               ID        Nominal
   PromClass        Input     Nominal
   PromSpend        Input     Interval
   PromTime         Input     Interval
   TargetAmt        Rejected  Interval
   TargetBuy        Target    Binary
k) Select TargetBuy and select Explore. The Explore window opens, showing the sample properties (22,223 rows and 13 columns of AAEM61.ORGANICS) and a bar chart of TargetBuy, the organics purchase indicator.
The proportion of individuals who purchased organic products appears to be 25%.
l) Close the Explore window.
3) The variable DemClusterGroup contains collapsed levels of the variable DemCluster. Presume that, based on previous experience, you believe that DemClusterGroup is sufficient for this type of modeling effort. Set the model role for DemCluster to Rejected.
This is already done using the Advanced Metadata Advisor. Otherwise, select Role ⇒ Rejected for DemCluster.
4) As noted above, only TargetBuy will be used for this analysis and should have a role of Target. Can TargetAmt be used as an input for a model used to predict TargetBuy? Why or why not?
No, using TargetAmt as an input is not possible. It is measured at the same time as TargetBuy and therefore has no causal relationship to TargetBuy.
5) Finish the Organics data source definition.
a) Select Next >. The wizard proceeds to Step 6.
Step 6, Decision Configuration, asks whether you want to build models based on the values of the decisions. If you answer Yes, you can enter information on the cost or profit of each possible decision, the prior probabilities, and a cost function; the data is scanned for the distributions of the target variables.
b) Select Next >. The wizard proceeds to Step 7, Data Source Attributes, where you can change the name and the role, and specify a population segment identifier. The name is ORGANICS and the role is Raw.
c) Select Next >. The wizard proceeds to Step 8.
Step 8 summarizes the metadata for the AAEM61.ORGANICS table.
d) Select Finish. The wizard closes and the Organics data source is ready for use in the Project Panel.
c. Add the AAEM61.ORGANICS data source to the Organics diagram workspace.
d. Add a Data Partition node to the diagram and connect it to the Data Source node. Assign 50% of the data for training and 50% for validation.
1) Type 50 as the Training and Validation values under Data Set Allocations.
2) Type 0 as the Test value.
e. Add a Decision Tree node to the workspace and connect it to the Data Partition node.
f. Create a decision tree model autonomously. Use average square error as the model assessment statistic.
• Select Average Square Error as the Assessment Measure property.
• Right-click the Decision Tree node and select Run from the Option menu. Select Yes in the Confirmation window.
1) How many leaves are in the optimal tree?
a) When the Decision Tree node run finishes, select Results... from the Run Status window. The Results window opens.
The easiest way to determine the number of leaves in your tree is via the iteration plot.
b) Select View ⇒ Model ⇒ Iteration Plot from the Results window menu. The Iteration Plot window opens, plotting the training and validation average squared error against the number of leaves.
Using average square error as the assessment measure results in a tree with 29 leaves.
2) Which variable was used for the first split? What were the competing splits for this first split?
These questions are best answered using interactive training.
a) Close the Results window for the Decision Tree model.
b) Select the Interactive ellipsis (...) from the Decision Tree node's Properties panel.
The SAS Enterprise Miner Tree window opens.
c) Right-click the root node and select Split Node... from the Option menu. The Split Node 1 window opens with information that answers the two questions.
Target Variable: TargetBuy

   Variable    Variable Description  Log(p)  Branches
   DemAffl     Affluence Grade       200.19  2
   DemGender   Gender                133.14  2
   PromSpend   Total Spend            32.97  2
   PromClass   Loyalty Status         23.33  2

Age is used for the first split. Competing splits are Affluence Grade, Gender, Total Spend, and Loyalty Status.
g. Add a second Decision Tree node to the diagram and connect it to the Data Partition node.
1) In the Properties panel of the new Decision Tree node, change the maximum number of branches from a node to 3 to allow for three-way splits.
2) Create a decision tree model again using average square error as the model assessment statistic.
3) How many leaves are in the optimal tree?
Based on the discussion above, the optimal tree with three-way splits has 33 leaves.
h. Based on average square error, which of the decision tree models appears to be better?
1) Select the first Decision Tree node.
2) Right-click and select Results... from the Option menu. The Results window opens.
3) Examine the Average Squared Error row of the Fit Statistics window.

   Target     Statistic  Label                  Train     Validation
   TargetBuy  _ASE_      Average Squared Error  0.132861  0.132773

4) Close the Results window.
5) Repeat the process for the Decision Tree (2) model.

   Target     Statistic  Label                  Train     Validation
   TargetBuy  _ASE_      Average Squared Error  0.133009  0.132662

It appears that the second tree with three-way splits has the lower validation average square error and is the better model.
Solutions to Student Activities (Polls/Quizzes)
3.01 Quiz - Correct Answer
Match the predictive modeling application to the decision type (A. Decision, B. Ranking, C. Estimate).

   Modeling Application   Decision Type
   Loss Reserving         C (Estimate)
   Risk Profiling         B (Ranking)
   Credit Scoring         B (Ranking)
   Fraud Detection        A (Decision)
   Revenue Forecasting    C (Estimate)
   Voice Recognition      A (Decision)
Chapter 4 Introduction to Predictive Modeling: Regressions

4.1 Introduction ............................................................... 4-3
    Demonstration: Managing Missing Values ................................. 4-18
    Demonstration: Running the Regression Node ............................. 4-25
4.2 Selecting Regression Inputs ............................................ 4-29
    Demonstration: Selecting Inputs ........................................ 4-33
4.3 Optimizing Regression Complexity ....................................... 4-38
    Demonstration: Optimizing Complexity ................................... 4-40
4.4 Interpreting Regression Models ......................................... 4-49
    Demonstration: Interpreting a Regression Model ......................... 4-51
4.5 Transforming Inputs .................................................... 4-52
    Demonstration: Transforming Inputs ..................................... 4-56
4.6 Categorical Inputs ..................................................... 4-66
    Demonstration: Recoding Categorical Inputs ............................. 4-69
4.7 Polynomial Regressions (Self-Study) .................................... 4-76
    Demonstration: Adding Polynomial Regression Terms Selectively .......... 4-78
    Demonstration: Adding Polynomial Regression Terms Autonomously (Self-Study) 4-85
    Exercises .............................................................. 4-88
4.8 Chapter Summary ........................................................ 4-89
4.9 Solutions .............................................................. 4-91
    Solutions to Exercises ................................................. 4-91
    Solutions to Student Activities (Polls/Quizzes) ........................ 4-102
4.1 Introduction

Model Essentials - Regressions
• Predict new cases: prediction formula
• Select useful inputs: sequential selection
• Optimize complexity: best model from sequence
Regressions offer a different approach to prediction compared to decision trees. Regressions, as parametric models, assume a specific association structure between inputs and target. By contrast, trees, as predictive algorithms, do not assume any association structure; they simply seek to isolate concentrations of cases with like-valued target measurements. The regression approach to the model essentials in SAS Enterprise Miner is outlined over the following pages. Cases are scored using a simple mathematical prediction formula. One of several heuristic sequential selection techniques is used to pick from a collection of possible inputs and creates a series of models with increasing complexity. Fit statistics calculated from validation data select the best model from the sequence.
Regressions predict cases using a mathematical equation involving values of the input variables.
Linear Regression Prediction Formula

   ŷ = ŵ0 + ŵ1 x1 + ŵ2 x2

where ŷ is the prediction estimate, ŵ0 is the intercept estimate, and ŵ1 and ŵ2 are parameter estimates for the input measurements x1 and x2. The intercept and parameter estimates are chosen to minimize the squared error function over the training data:

   Σi (yi - ŷi)²

In standard linear regression, a prediction estimate for the target variable is formed from a simple linear combination of the inputs. The intercept centers the range of predictions, and the remaining parameter estimates determine the trend strength (or slope) between each input and the target. The simple structure of the model forces changes in predicted values to occur in only a single direction (a vector in the space of inputs with elements equal to the parameter estimates). Intercept and parameter estimates are chosen to minimize the squared error between the predicted and observed target values (least squares estimation). The prediction estimates can be viewed as a linear approximation to the expected (average) value of a target conditioned on observed input values. Linear regressions are usually deployed for targets with an interval measurement scale.
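As a minimal sketch of the two pieces above, the prediction formula and the squared error function that least squares minimizes (the candidate weights and toy training cases are assumptions, not output from SAS Enterprise Miner):

```python
def predict(w0, w1, w2, x1, x2):
    """Prediction formula: a linear combination of the inputs."""
    return w0 + w1 * x1 + w2 * x2


def squared_error(params, cases):
    """Sum of (observed - predicted)^2 over the training cases."""
    w0, w1, w2 = params
    return sum((y - predict(w0, w1, w2, x1, x2)) ** 2
               for x1, x2, y in cases)


# Toy training data: (x1, x2, y) triples.
cases = [(1.0, 2.0, 5.1), (2.0, 0.0, 2.9), (0.0, 1.0, 2.0)]

# Least squares estimation chooses the parameters with the smallest
# squared error; here we simply compare two candidate parameter vectors.
print(squared_error((1.0, 1.0, 1.5), cases))  # close fit, small error
print(squared_error((0.0, 0.0, 0.0), cases))  # poor fit, large error
```

An optimizer (or the closed-form normal equations) would search over all parameter vectors for the one with the smallest squared error.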
Logistic Regression Prediction Formula

   log( p̂ / (1 - p̂) ) = ŵ0 + ŵ1 x1 + ŵ2 x2

The left side, the log of the odds of the primary outcome, is called the logit score.

Logistic regressions are closely related to linear regressions. In logistic regression, the expected value of the target is transformed by a link function to restrict its value to the unit interval. In this way, model predictions can be viewed as primary outcome probabilities. A linear combination of the inputs generates a logit score, the log of the odds of primary outcome, in contrast to the linear regression's direct prediction of the target. If your interest is ranking predictions, linear and logistic regressions yield virtually identical results.
Logit Link Function

   log( p̂ / (1 - p̂) ) = ŵ0 + ŵ1 x1 + ŵ2 x2 = logit(p̂)

The logit link function transforms probabilities (between 0 and 1) to logit scores (between -∞ and +∞). For binary prediction, any monotonic function that maps the unit interval to the real number line can be considered as a link. The logit link function is one of the most common. Its popularity is due, in part, to the interpretability of the model.

To obtain prediction estimates, the logit equation is solved for p̂:

   p̂ = 1 / (1 + e^(-logit(p̂)))
The predictions can be decisions, rankings, or estimates. The logit equation produces a ranking or logit score. To get a decision, you need a threshold. The easiest way to get a meaningful threshold is to convert the prediction ranking to a prediction estimate. You can obtain a prediction estimate using a straightforward transformation of the logit score, the logistic function. The logistic function is simply the inverse of the logit function. You can obtain the logistic function by solving the logit equation for p̂.
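Both transformations are one-liners; a small sketch to make the inverse relationship explicit:

```python
import math


def logit(p):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def logistic(logit_score):
    """Logistic function, the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-logit_score))


# A logit score of 0 corresponds to a probability of 0.5, and the
# logistic function undoes the logit link.
print(logistic(0.0))                              # -> 0.5
print(abs(logistic(logit(0.25)) - 0.25) < 1e-12)  # -> True
```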
To demonstrate the properties of a logistic regression model, consider the two-color prediction problem introduced in Chapter 3. As before, the goal is to predict the target color, based on the location in the unit square. To make use of the prediction formulation, you need estimates of the intercept and other model parameters.
Simple Prediction Illustration - Regressions

   logit(p̂) = ŵ0 + ŵ1 x1 + ŵ2 x2
   p̂ = 1 / (1 + e^(-logit(p̂)))

Find the parameter estimates by maximizing the log-likelihood function over the training data:

   Σ log(p̂i) over primary outcome cases  +  Σ log(1 - p̂i) over secondary outcome cases

The presence of the logit link function complicates parameter estimation. Least squares estimation is abandoned in favor of maximum likelihood estimation. The likelihood function is the joint probability density of the data treated as a function of the parameters. The maximum likelihood estimates are the values of the parameters that maximize the probability of obtaining the training sample.
Simple Prediction Illustration - Regressions

   logit(p̂) = -0.81 + 0.92 x1 + 1.11 x2
   p̂ = 1 / (1 + e^(-logit(p̂)))

Using the maximum likelihood estimates, the prediction formula assigns a logit score to each (x1, x2).

Parameter estimates are obtained by maximum likelihood estimation. These estimates can be used in the logit and logistic equations to obtain predictions. The plot on the right shows the prediction estimates from the logistic equation. One of the attractions of a standard logistic regression model is the simplicity of its predictions. The contours are simple straight lines. (In higher dimensions, they would be hyperplanes.)
4.01 Multiple Choice Poll
What is the logistic regression prediction for the indicated point?
a. 0.243
b. 0.56
c. yellow
d. It depends . . .

   logit(p̂) = -0.81 + 0.92 x1 + 1.11 x2

To score a new case, the values of the inputs are plugged into the logit or logistic equation. This action creates a logit score or prediction estimate. Typically, if the prediction estimate is greater than 0.5 (or equivalently, the logit score is positive), cases are usually classified to the primary outcome. (This assumes an equal misclassification cost.) The answer to the question posed is, of course, it depends.
• Answer A, the logit score, is reasonable if the goal is ranking.
• Answer B, the prediction estimate from the logistic equation, is appropriate if the goal is estimation.
• Answer C, a classification, is a good choice if the goal is deciding dot color.
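All three prediction types fall out of one scoring function. The weights come from the slide; the scored point is an assumption chosen to land near the poll's answer values:

```python
import math

W0, W1, W2 = -0.81, 0.92, 1.11  # maximum likelihood estimates from the slide


def score(x1, x2, threshold=0.5):
    """Return (logit score, probability estimate, decision) for one case."""
    logit_score = W0 + W1 * x1 + W2 * x2             # ranking
    estimate = 1.0 / (1.0 + math.exp(-logit_score))  # estimate
    decision = 1 if estimate > threshold else 0      # decision
    return logit_score, estimate, decision


# A hypothetical point in the unit square:
logit_score, estimate, decision = score(0.6, 0.45)
print(round(estimate, 2), decision)  # -> 0.56 1
```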
Regressions: Beyond the Prediction Formula
• Manage missing values.
• Interpret the model.
• Handle extreme or unusual values.
• Use nonnumeric inputs.
• Account for nonlinearities.

While the prediction formula would seem to be the final word in scoring a new case with a regression model, there are actually several additional issues that must be addressed.
• What should be done when one of the input values used in the prediction formula is missing? You might be tempted to simply treat the missing value as zero and skip the term involving the missing value. While this approach can generate a prediction, this prediction is usually biased beyond reason.
• How do you interpret the logistic regression model? Certain inputs influence the prediction more than others. A means to quantify input importance is needed.
• How do you score cases with unusual values? Regression models make their best predictions for cases near the centers of the input distributions. If an input can have (on rare occasion) extreme or outlying values, the regression should respond appropriately.
• What value should be used in the prediction formula when the input is not a number? Categorical inputs are common in predictive modeling. They did not present a problem for the rule-based predictions of decision trees, but regression predictions come from algebraic formulas that require numeric inputs. (You cannot multiply marital status by a number.) A method to include nonnumeric data in regression is needed.
• What happens when the relationship between the inputs and the target (or rather, the logit of the target) is not a straight line? It is preferable to be able to build regression models in the presence of nonlinear (and even nonadditive) input-target associations.
The above issues affect both model construction and model deployment. The first of these, handling missing values, is dealt with immediately. The remaining issues are addressed, in turn, at the end of this chapter.
Missing Values and Regression Modeling

Problem 1: Training data cases with missing values on inputs used by a regression model are ignored.

Missing values present two distinct problems. The first relates to model construction. The default method for treating missing values in most regression tools in SAS Enterprise Miner is complete-case analysis. In complete-case analysis, only those cases without any missing values are used in the analysis.
Missing Values and Regression Modeling

Consequence: Missing values can significantly reduce your amount of training data for regression modeling!

Even a smattering of missing values can cause an enormous loss of data in high dimensions. For instance, suppose that each of the k input variables is missing at random with probability α. In this situation, the expected proportion of complete cases is as follows:

   (1 - α)^k

Therefore, a 1% probability of missing (α = .01) for 100 inputs leaves only 37% of the data for analysis, 200 leaves 13%, and 400 leaves 2%. If the "missingness" were increased to 5% (α = .05), then less than 1% of the data would be available with 100 inputs.
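The arithmetic is easy to verify; a quick check of the complete-case fractions quoted above:

```python
def complete_case_fraction(alpha, k):
    """Expected fraction of complete cases when each of k inputs is
    missing independently with probability alpha."""
    return (1.0 - alpha) ** k


# 1% missingness: 100 inputs leave ~37%, 200 leave ~13%, 400 leave ~2%.
for k in (100, 200, 400):
    print(k, round(complete_case_fraction(0.01, k), 2))

# 5% missingness with 100 inputs leaves less than 1% of the data.
print(complete_case_fraction(0.05, 100) < 0.01)  # -> True
```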
Missing Values and the Prediction Formula

   logit(p̂) = -0.81 + 0.92 x1 + 1.11 x2
   Predict: (x1, x2) = (0.3, ?)  ⇒  logit(p̂) = ?

Problem 2: Prediction formulas cannot score cases with missing values.

The second missing value problem relates to model deployment, or using the prediction formula. How would a model built on the complete cases score a new case if it had a missing value? To decline to score new incomplete cases would be practical only if there were a very small number of missing values.
Missing Value Issues
• Problem 1: Training data cases with missing values on inputs used by a regression model are ignored.
• Problem 2: Prediction formulas cannot score cases with missing values.

A remedy is needed for the two problems of missing values. The appropriate remedy depends on the reason for the missing values.
Missing Value Causes
• Non-applicable measurement
• No match on merge
• Non-disclosed measurement

Missing values arise for a variety of reasons. For example, the time since last donation to a card campaign is meaningless if someone did not donate to a card campaign. In the PVA97NK data set, several demographic inputs have missing values in unison. The probable cause was no address match for the donor. Finally, certain information, such as an individual's total wealth, is closely guarded and is often not disclosed.
Missing Value Remedies
• Synthetic distribution
• Estimation
The primary remedy for missing values in regression models is a missing value replacement strategy. Missing value replacement strategies fall into one of two categories.

Synthetic distribution methods use a one-size-fits-all approach to handle missing values. Any case with a missing input measurement has the missing value replaced with a fixed number. The net effect is to modify an input's distribution to include a point mass at the selected fixed number. The location of the point mass in synthetic distribution methods is not arbitrary. Ideally, it should be chosen to have minimal impact on the magnitude of an input's association with the target. With many modeling methods, this can be achieved by locating the point mass at the input's mean value.

Estimation methods eschew the one-size-fits-all approach and provide tailored imputations for each case with missing values. This is done by viewing the missing value problem as a prediction problem. That is, you can train a model to predict an input's value from other inputs. Then, when an input's value is unknown, you can use this model to predict or estimate the unknown missing value. This approach is best suited for missing values that result from a lack of knowledge, that is, no-match or non-disclosure, but it is not appropriate for not-applicable missing values.

Because predicted response might be different for cases with a missing input value, a binary imputation indicator variable is often added to the training data. Adding this variable enables a model to adjust its predictions in the situation where "missingness" itself is correlated with the target.
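A minimal sketch of the synthetic-distribution remedy, mean replacement plus an imputation indicator, using None for a missing value (the column values are made up; in the demonstration that follows, the Impute node performs this kind of replacement):

```python
def impute_with_indicator(values):
    """Mean-impute one interval input and build its missing indicator.

    The replacement value is the mean of the nonmissing values; the
    indicator is 1 where the original value was missing, 0 otherwise.
    """
    nonmissing = [v for v in values if v is not None]
    mean = sum(nonmissing) / len(nonmissing)
    imputed = [mean if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator


ages = [53, None, 58, None, 39, 50]  # hypothetical Age column
imputed_ages, m_age = impute_with_indicator(ages)
print(imputed_ages)  # -> [53, 50.0, 58, 50.0, 39, 50]
print(m_age)         # -> [0, 1, 0, 1, 0, 0]
```

Both the imputed column and the indicator column would then be offered to the regression as inputs.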
Managing Missing Values

The demonstrations in this chapter build on the demonstrations of Chapters 2 and 3. At this point, the process flow diagram contains the data source, the Data Partition node, and the Decision Tree models.

Data Assessment

Continue the analysis at the Data Partition node. As discussed above, regression requires that a case have a complete set of input values for both training and scoring. Follow these steps to examine the data status after the partition.
1. Select the Data Partition node.
2. Select the Exported Data ellipsis (...) from the Data Partition node property sheet. The Exported Data window opens.
3. Select the TRAIN data port and select Explore....
The TRAIN and VALIDATE data sets exist; because no data was allocated to Test, there is no TEST data set.

There are several inputs with a noticeable frequency of missing values, for example, Age and the replaced value of median income.
(The Explore window shows a sample of EMWS.Part_TRAIN, 4,843 fetched rows and 13 columns, in which missing values are visible for inputs such as Gender, Age, and median income.)
Chapter 4 Introduction to Predictive Modeling: Regressions
There are several ways to proceed:
• Do nothing. If there are very few cases with missing values, this is a viable option. The difficulty with this approach comes when the model must predict a new case that contains a missing value. Omitting the missing term from the parametric equation usually produces an extremely biased prediction.
• Impute a synthetic value for the missing value. For example, if an interval input contains a missing value, replace the missing value with the mean of the nonmissing values for the input. This eliminates the incomplete-case problem but modifies the input's distribution, which can bias the model predictions. Making missing value imputation part of the modeling process allays the modified-distribution concern. Any modifications made to the training data are also made to the validation data and the remainder of the modeling population. A model trained with the modified training data will not be biased if the same modifications are made to any other data set that the model might encounter (and the data has a similar pattern of missing values).
• Create a missing indicator for each input in the data set. Cases often contain missing values for a reason. If the reason for the missing value is in some way related to the target variable, useful predictive information is lost. The missing indicator is 1 when the corresponding input is missing and 0 otherwise. Each missing indicator becomes an input to the model. This enables modeling of the association between the target and a missing value on an input.
4. Close the Explore and Exported Data windows.
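As a concrete sketch of the imputation and missing-indicator strategies, the following plain-Python fragment mimics what the Impute node does, borrowing the IMP_ and M_ naming used in the course. The function and toy data are illustrative only, not part of SAS Enterprise Miner:

```python
from statistics import mean

def impute_with_indicators(rows, col, impute_value=None):
    """Mean-impute one interval input and add a 0/1 missing indicator.

    rows: list of dicts holding one case each. When impute_value is
    None, the mean is learned from these (training) rows; otherwise
    the supplied training-time value is reused, so that any data set
    the model later scores receives the same modification.
    """
    if impute_value is None:  # training pass: learn the replacement value
        impute_value = mean(r[col] for r in rows if r[col] is not None)
    for r in rows:
        r["M_" + col] = 1 if r[col] is None else 0                    # missing indicator
        r["IMP_" + col] = impute_value if r[col] is None else r[col]  # synthetic value
    return impute_value

train = [{"DemAge": 53}, {"DemAge": None}, {"DemAge": 58}, {"DemAge": None}]
age_mean = impute_with_indicators(train, "DemAge")              # learn on training data
score = [{"DemAge": None}, {"DemAge": 39}]
impute_with_indicators(score, "DemAge", impute_value=age_mean)  # reuse on new cases
```

Because the training-time mean (55.5 here) is reused at scoring time, a new case with a missing DemAge receives IMP_DemAge = 55.5 and M_DemAge = 1 instead of breaking the parametric equation.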
Imputation

To address missing values in the PVA97NK data set, use the following steps to impute synthetic data values and create missing value indicators:
1. Select the Modify tab.
2. Drag an Impute tool into the diagram workspace.
3. Connect the Data Partition node to the Impute node. (In the diagram, the Decision Tree modeling nodes are repositioned for clarity.)
4. Select the Impute node and examine the Properties panel.
The defaults of the Impute node are as follows:
• For interval inputs, replace any missing values with the mean of the nonmissing values.
• For categorical inputs, replace any missing values with the most frequent category.
These are acceptable default values and are used throughout the rest of the course.
With these settings, each input with missing values generates a new input. The new input, named IMP_original_input_name, will have missing values replaced by a synthetic value and nonmissing values copied from the original input.
Missing Indicators

Use the following steps to create missing value indicators. The settings for missing value indicators are found in the Score property group.
1. Select Indicator Variables ⇒ Type ⇒ Unique.
2. Select Indicator Variables ⇒ Role ⇒ Input.
With these settings, new inputs named M_original_input_name will be added to the training data to indicate the synthetic data values.
Imputation Results

Run the Impute node and review the Results window. Three inputs had missing values.

Variable Name     Impute Method  Imputed Variable      Indicator Variable    Impute Value  Role   Measurement Level  Label           Number of Missing for TRAIN
DemAge            MEAN           IMP_DemAge            M_DemAge              59.26291      INPUT  INTERVAL           Age             1203
GiftAvgCard36     MEAN           IMP_GiftAvgCard36     M_GiftAvgCard36       14.2049       INPUT  INTERVAL           Gift Amount...  910
REP_DemMedIncome  MEAN           IMP_REP_DemMedIncome  M_REP_DemMedIncome    53570.85      INPUT  INTERVAL           Replaceme...    1191
With all of the missing values imputed, the entire training data set is available for building the logistic regression model. In addition, a method is in place for scoring new cases with missing values. (See Chapter 7.)
Running the Regression Node
There are several tools in SAS Enterprise Miner to fit regression or regression-like models. By far, the most commonly used (and, arguably, the most useful) is the simply named Regression tool. Use the following steps to build a simple regression model.
1. Select the Model tab.
2. Drag a Regression tool into the diagram workspace.
3. Connect the Impute node to the Regression node.
The Regression node can create several types of regression models, including linear and logistic. The default regression type is determined by the target's measurement level.
4. Run the Regression node and view the results. The Results - Node: Regression window opens.
5. Maximize the Output window by double-clicking its title bar. The initial lines of the Output window summarize the roles of variables used (or not) by the Regression node.

Variable Summary

ROLE      LEVEL     COUNT
INPUT     BINARY    5
INPUT     INTERVAL  20
INPUT     NOMINAL   3
REJECTED  INTERVAL  2
TARGET    BINARY    1

The fit model has 28 inputs that predict a binary target.
Ignore the output related to model events and predicted and decision variables. The next lines give more information about the model, including the training data set name, target variable name, number of target categories, and most importantly, the number of model parameters.

Model Information

Training Data Set            EMWS2.IMPT_TRAIN.VIEW
DMDB Catalog                 WORK.REG_DMDB
Target Variable              TargetB (Target Gift Flag)
Target Measurement Level     Ordinal
Number of Target Categories  2
Error                        MBernoulli
Link Function                Logit
Number of Model Parameters   86
Number of Observations       4843
Based on the introductory material about logistic regression that is presented above, you might expect the number of model parameters to equal the number of input variables. This ignores the fact that a single nominal input (for example, DemCluster) can generate scores of model parameters. Next, consider the maximum likelihood procedure, overall model fit, and the Type 3 Analysis of Effects. The Type 3 Analysis tests the statistical significance of adding the indicated input to a model that already contains the other listed inputs. A value near 0 in the Pr > ChiSq column indicates a significant input; a value near 1 indicates an extraneous input.

Type 3 Analysis of Effects
(The table lists a Wald chi-square statistic and p-value for each model effect. For example, DemCluster, with 53 degrees of freedom, has chi-square 58.9098 and p-value 0.2682; DemMedHomeValue has chi-square 5.2502 and p-value 0.0219; GiftTimeLast has chi-square 21.5351 and p-value <.0001; and StatusCat96NK, with 5 degrees of freedom, has chi-square 11.3434 and p-value 0.0450.)
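The 86 parameters reported above can be reconstructed from the variable summary and the degrees of freedom in the Type 3 analysis. This is a back-of-the-envelope check in Python: each interval or binary input contributes one parameter, and a k-level nominal input contributes k-1 design variables; the nominal counts assumed here (DemCluster 53, StatusCat96NK 5, DemGender 2) are taken from the DF column of the output:

```python
# One parameter per interval or binary input, plus the intercept;
# each nominal input contributes (levels - 1) design variables,
# which equals its DF in the Type 3 analysis.
interval_inputs = 20
binary_inputs = 5
nominal_df = {"DemCluster": 53, "StatusCat96NK": 5, "DemGender": 2}

parameters = 1 + interval_inputs + binary_inputs + sum(nominal_df.values())
print(parameters)  # 86, matching the Model Information output
```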
The statistical significance measures range from <.0001 (highly significant) to 0.9593 (highly dubious). Results such as this suggest that certain inputs can be dropped without affecting the predictive prowess of the model.
6. Restore the Output window to its original size by double-clicking its title bar. Maximize the Fit Statistics window.
Fit Statistics (excerpt, target TargetB)

Statistic  Label                   Train     Validation
_AIC_      Akaike's Inf. Crit.     6633.2
_ASE_      Average Squared Error   0.237268  0.24381
_AVERR_    Average Error Function  0.667066  0.680861
_MISC_     Misclassification Rate  0.411522  0.431964
If the decision predictions are of interest, model fit can be judged by misclassification. If estimate predictions are the focus, model fit can be assessed by average squared error. There appears to be some discrepancy between the values of these two statistics in the train and validation data. This indicates a possible overfit of the model. It can be mitigated by using an input selection procedure.
7. Close the Results window.
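Both assessment statistics are straightforward to compute from a model's predicted probabilities. The sketch below uses toy values, not the course data. (For a binary target, averaging the squared error over both target levels, the SAS convention behind the _DIV_ = 9,686 divisor for 4,843 cases, gives the same number as this per-case average, because the two levels contribute identical squared errors.)

```python
def assess(targets, p_hat, cutoff=0.5):
    """Average squared error and misclassification rate for a binary target.

    targets: 0/1 outcomes; p_hat: predicted event probabilities.
    """
    n = len(targets)
    ase = sum((t - p) ** 2 for t, p in zip(targets, p_hat)) / n
    misc = sum((p >= cutoff) != (t == 1) for t, p in zip(targets, p_hat)) / n
    return ase, misc

# Toy predictions for four cases
ase, misc = assess([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.6])
print(round(ase, 4), misc)  # 0.2125 0.5
```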
4.2 Selecting Regression Inputs

Model Essentials - Regressions: Select useful inputs (sequential selection).

The second task that all predictive models should perform is input selection. One way to find the optimal set of inputs for a regression is simply to try every combination. Unfortunately, the number of models to consider using this approach increases exponentially in the number of available inputs. Such an exhaustive search is impractical for realistic prediction problems. An alternative to the exhaustive search is to restrict the search to a sequence of improving models. While this might not find the single best model, it is commonly used to find models with good predictive performance. The Regression node in SAS Enterprise Miner provides three sequential selection methods.
Sequential Selection - Forward
(The slide depicts input p-values compared against an entry cutoff, with inputs entering the model one at a time.)
Forward selection creates a sequence of models of increasing complexity. The sequence starts with the baseline model, a model predicting the overall average target value for all cases. The algorithm searches the set of one-input models and selects the model that most improves upon the baseline model. It then searches the set of two-input models that contain the input selected in the previous step and selects the model showing the most significant improvement. By adding a new input to those selected in the previous step, a nested sequence of increasingly complex models is generated. The sequence terminates when no significant improvement can be made. Improvement is quantified by the usual statistical measure of significance, the p-value. Adding terms in this nested fashion always increases a model's overall fit statistic. By calculating the change in the fit statistic and assuming that the change conforms to a chi-squared distribution, a significance probability, or p-value, can be calculated. A large fit statistic change (corresponding to a large chi-squared value) is unlikely to be due to chance. Therefore, a small p-value indicates a significant improvement. When no p-value is below a predetermined entry cutoff, the forward selection procedure terminates.
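In outline, forward selection is a greedy loop over the candidate inputs. The sketch below is illustrative Python, not the DMREG implementation; p_value stands in for fitting the candidate model and converting the fit statistic change to a chi-squared significance:

```python
def forward_select(inputs, p_value, entry_cutoff=0.05):
    """Greedy forward selection.

    inputs: candidate input names.
    p_value(selected, candidate): significance of adding `candidate`
        to a model that already contains the `selected` inputs.
    """
    selected, remaining = [], list(inputs)
    while remaining:
        # Find the one-step extension with the most significant improvement.
        best = min(remaining, key=lambda x: p_value(selected, x))
        if p_value(selected, best) >= entry_cutoff:
            break                      # no significant improvement: terminate
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy significance table (hypothetical values): DemAge never clears the cutoff.
toy = {"GiftCnt36": 0.0001, "GiftAvgAll": 0.003, "DemAge": 0.40}
print(forward_select(toy, lambda sel, x: toy[x]))  # ['GiftCnt36', 'GiftAvgAll']
```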
Sequential Selection - Backward
(The slide depicts input p-values compared against a stay cutoff, with inputs leaving the model one at a time.)
In contrast to forward selection, backward selection creates a sequence of models of decreasing complexity. The sequence starts with a saturated model, which is a model that contains all available inputs and, therefore, has the highest possible fit statistic. Inputs are sequentially removed from the model. At each step, the input chosen for removal least reduces the overall model fit statistic. This is equivalent to removing the input with the highest p-value. The sequence terminates when all remaining inputs have a p-value that is less than the predetermined stay cutoff.
Sequential Selection - Stepwise
(The slide depicts input p-values compared against both an entry cutoff and a stay cutoff.)
Stepwise selection combines elements from both the forward and backward selection procedures. The method begins in the same way as the forward procedure, sequentially adding inputs with the smallest p-value below the entry cutoff. After each input is added, however, the algorithm re-evaluates the statistical significance of all included inputs. If the p-value of any of the included inputs exceeds the stay cutoff, the input is removed from the model and re-entered into the pool of inputs that are available for inclusion in a subsequent step. The process terminates when all inputs available for inclusion in the model have p-values in excess of the entry cutoff and all inputs already included in the model have p-values below the stay cutoff.
Selecting Inputs
Implementing a sequential selection method in the Regression node requires a minor change to the Regression node settings.
1. Select Selection Model ⇒ Stepwise on the Regression node property sheet.
The Regression node is now configured to use stepwise selection to choose inputs for the model.
2. Run the Regression node and view the results.
3. Maximize the Output window.
4. Hold down the CTRL key and type G. The Go To Line window opens.
5. Type 79 in the Enter line number field and select OK.
The stepwise procedure starts with Step 0, an intercept-only regression model. The value of the intercept parameter is chosen so that the model predicts the overall target mean for every case. The parameter estimate and the training data target measurements are combined in an objective function. The objective function is determined by the model form and the error distribution of the target. The value of the objective function for the intercept-only model is compared to the values obtained in subsequent steps for more complex models. A large decrease in the objective function for the more complex model indicates a significantly better model.

Step 0: Intercept entered.

The DMREG Procedure
Newton-Raphson Ridge Optimization, Without Parameter Scaling
(The optimization output shows an objective function of 3356.9116922 for the intercept-only model; the ABSGCONV=0.00001 convergence criterion is satisfied after 0 iterations.)

Likelihood Ratio Test for Global Null Hypothesis: BETA=0

-2 Log Likelihood                        Likelihood Ratio
Intercept Only  Intercept & Covariates   Chi-Square  DF  Pr > ChiSq
6713.823        6713.823                 0.0000      0   .

Analysis of Maximum Likelihood Estimates

Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Standardized Estimate  Exp(Est)
Intercept  1   0.00041   0.0287          0.00             0.9885                             1.000
Step 1 adds one input to the intercept-only model. The input and corresponding parameter are chosen to produce the largest decrease in the objective function. To estimate the values of the model parameters, the modeling algorithm makes an initial guess for their values. The initial guess is combined with the training data measurements in the objective function. Based on statistical theory, the objective function is assumed to take its minimum value at the correct estimate for the parameters. The algorithm decides whether changing the values of the initial parameter estimates can decrease the value of the objective function. If so, the parameter estimates are changed to decrease the value of the objective function, and the process iterates. The algorithm continues iterating until changes in the parameter estimates fail to substantially decrease the value of the objective function.

Step 1: Effect GiftCnt36 entered.

The DMREG Procedure
Newton-Raphson Ridge Optimization, Without Parameter Scaling
(The optimization converges in three iterations, reducing the objective function from 3356.9116922 to 3315.473573; the GCONV=1E-6 convergence criterion is satisfied.)
The output next compares the model fit in Step 1 with the model fit in Step 0. The objective functions of both models are multiplied by 2 and differenced. The difference is assumed to have a chi-squared distribution with one degree of freedom. The hypothesis that the two models are identical is tested. A large value for the chi-squared statistic makes this hypothesis unlikely. The hypothesis test is summarized in the next lines.

Likelihood Ratio Test for Global Null Hypothesis: BETA=0

-2 Log Likelihood                        Likelihood Ratio
Intercept Only  Intercept & Covariates   Chi-Square  DF  Pr > ChiSq
6713.823        6630.947                 82.8762     1   <.0001
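The likelihood ratio statistic can be checked directly from the two reported objective function values. Here the objective function is the negative log likelihood, so -2 log L is twice the objective; for one degree of freedom, the chi-squared tail probability reduces to a complementary error function:

```python
import math

obj_step0 = 3356.9116922   # objective function, intercept-only model
obj_step1 = 3315.473573    # objective function, model with GiftCnt36

# Likelihood ratio chi-square: difference of the -2 log likelihoods
chi_sq = 2 * (obj_step0 - obj_step1)

# Pr(X > x) for a chi-squared variable with 1 DF
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(round(chi_sq, 4))    # 82.8762, matching the output above
print(p_value < 0.0001)    # True: reported as <.0001
```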
The output summarizes an analysis of the statistical significance of individual model effects. For the one-input model, this is similar to the global significance test above.

Type 3 Analysis of Effects

Effect     DF  Wald Chi-Square  Pr > ChiSq
GiftCnt36  1   79.4757          <.0001
Finally, an analysis of individual parameter estimates is made. (The standardized estimates and the odds ratios merit special attention and are discussed in the next section of this chapter.)

Analysis of Maximum Likelihood Estimates

Parameter  DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq  Standardized Estimate  Exp(Est)
Intercept  1   -0.3956   0.0526          56.53            <.0001                             0.673
GiftCnt36  1   0.1250    0.0140          79.48            <.0001      0.1474                 1.133

Odds Ratio Estimates

Effect     Point Estimate
GiftCnt36  1.133
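The Exp(Est) column, and with it the odds ratio, is simply the exponential of the parameter estimate. A quick check of the GiftCnt36 row:

```python
import math

estimate = 0.1250                  # GiftCnt36 parameter estimate above
odds_ratio = math.exp(estimate)

# Each additional recent gift multiplies the odds of donation by about
# 1.133, a 13.3% increase per unit change in GiftCnt36.
print(round(odds_ratio, 3))        # 1.133
```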
The standardized estimates present the effect of the input on the log-odds of donation. The values are standardized to be independent of the input's unit of measure. This provides a means of ranking the importance of inputs in the model. The odds ratio estimates indicate by what factor the odds of donation increase for each unit change in the associated input. Combined with knowledge of the range of the input, this provides an excellent way to judge the practical (as opposed to the statistical) importance of an input in the model.

The stepwise selection process continues for eight steps. After the eighth step, neither adding nor removing inputs from the model significantly changes the model fit statistic. At this point the Output window provides a summary of the stepwise procedure.
6. Go to line 850 to view the stepwise summary. The summary shows the step in which each input was added and the statistical significance of each input in the final eight-input model.

NOTE: No (additional) effects met the 0.05 significance level for entry into the model.
Score
Wald
DF
In
ChiSquare
ChiSquare
1 1
1 2 3
81.6807 23.2884 16.9872
<.0001 <.0001 <.0001
Effect Step
Entered
Selection
Pr > C h i S q
1 2 3
GiftCnt36
4
GiftAvgAll
1
StatusCat96NK
5
4 5
14.8514
5
18.2293
0.0001 0.0027
6
DemPctVeterans
1
6
7.4187
0.0065
7
M_GiftAvgCard36
1
7
7.1729
8
M_DemAge
1
8
4.6501
0.0074 0.0311
GiftTimeLast DemMedHomeValue
1
The default selection criterion selects the model from Step 8 as the model with optimal complexity. As the next section shows, this might not be the optimal model, based on the fit statistic that is appropriate for your analysis objective.

The selected model, based on the CHOOSE=NONE criterion, is the model trained in Step 8. It consists of the following effects:

Intercept DemMedHomeValue DemPctVeterans GiftAvgAll GiftCnt36 GiftTimeLast M_DemAge M_GiftAvgCard36 StatusCat96NK
For convenience, the output from Step 8 is repeated. An excerpt from the analysis of individual parameter estimates is shown below.

Analysis of Maximum Likelihood Estimates

Parameter           DF  Estimate   Standard Error  Wald Chi-Square  Pr > ChiSq  Standardized Estimate  Exp(Est)
Intercept           1   0.2727     0.2024          1.82             0.1779                             1.314
DemMedHomeValue     1   1.385E-6   3.009E-7        21.18            <.0001      0.0763                 1.000
DemPctVeterans      1   0.00658    0.00261         6.38             0.0115      0.0412                 1.007
GiftAvgAll          1   -0.0136    0.00444         9.33             0.0023      -0.0608                0.987
GiftCnt36           1   0.0587     0.0187          9.79             0.0018      0.0692                 1.060
GiftTimeLast        1   -0.0376    0.00770         23.90            <.0001      -0.0837                0.963
M_DemAge         0  1   0.0741     0.0344          4.65             0.0311                             1.077
M_GiftAvgCard36  0  1   0.1112     0.0411          7.30             0.0069                             1.118
StatusCat96NK    A  1   -0.0880    0.0927          0.90             0.3423                             0.916
StatusCat96NK    E  1   0.4974     0.1818          7.48             0.0062                             1.644
StatusCat96NK    F  1   -0.4570    0.1303          12.30            0.0005                             0.633
StatusCat96NK    L  1   0.1456     0.3735          0.15             0.6966                             1.157
StatusCat96NK    N  1   -0.1206    0.1323          0.83             0.3621                             0.886
The parameter with the largest standardized estimate (in absolute value) is GiftTimeLast.
7. Restore the Output window and maximize the Fit Statistics window.

Fit Statistics (excerpt, target TargetB)

Statistic  Label                   Train     Validation
_AIC_      Akaike's Inf. Crit.     6563.093
_ASE_      Average Squared Error   0.240919  0.242336
_AVERR_    Average Error Function  0.674901  0.678517
_MISC_     Misclassification Rate  0.421433  0.42453
The simpler model improves on both the validation misclassification and average squared error measures o f model performance.
4.3 Optimizing Regression Complexity

Model Essentials - Regressions: Optimize complexity (best model from sequence).
Regression complexity is optimized by choosing the optimal model in the sequential selection sequence.
Model Fit versus Complexity
(The slide shows a model fit statistic evaluated at each step, 1 through 6, of the selection sequence.)
The process involves two steps. First, fit statistics are calculated for the models generated in each step o f the selection process. Both the training and validation data sets are used.
Select Model with Optimal Validation Fit
(The slide shows choosing the simplest model with the optimal validation fit statistic.)
Then, as with the decision tree in Chapter 3, the simplest model (that is, the one with the fewest inputs) with the optimal fit statistic is selected.
Optimizing Complexity
Iteration Plot

The following steps illustrate the use of the iteration plot in the Regression tool Results window. In the same manner as the decision tree, you can tune a regression model to give optimal performance on the validation data. The basic idea involves calculating a fit statistic for each step in the input selection procedure and selecting the step (and corresponding model) with the optimal fit statistic value. To avoid bias, of course, the fit statistic should be calculated on the validation data set.
1. Select View ⇒ Model ⇒ Iteration Plot. The Iteration Plot window opens.
(The iteration plot displays Train and Validation average squared error by model selection step.)

The Iteration Plot window shows (by default) average squared error (training and validation) from the model selected in each step of the stepwise selection process.

Surprisingly, this plot contradicts the naive assumption that a model fit statistic calculated on training data will always be better than the same statistic calculated on validation data. This concept, called the optimism principle, is correct only on the average, and usually manifests itself only when overly complex (overly flexible) models are considered. It is not uncommon for training and validation fit statistic plots to cross (possibly several times). These crossings illustrate unquantified variability in the fit statistics.

Apparently, the smallest average squared error occurs in Step 4, rather than in the final model, Step 8. If your analysis objective requires estimates as predictions, the model from Step 4 should provide slightly less biased ones.
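Picking the best step from an iteration plot amounts to taking the argmin of the validation statistic over the steps, preferring the earliest (simplest) step on ties. The values below are hypothetical, shaped like the plot just described:

```python
# Hypothetical validation average squared error by selection step
valid_ase = {1: 0.2468, 2: 0.2449, 3: 0.2441, 4: 0.2437,
             5: 0.2440, 6: 0.2442, 7: 0.2444, 8: 0.2446}

# min over steps in increasing order returns the simplest optimal step
best_step = min(sorted(valid_ase), key=lambda s: valid_ase[s])
print(best_step)  # 4
```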
2. Select Select Chart ⇒ Misclassification Rate.
(The iteration plot now shows Train and Validation misclassification rate by model selection step.)
The iteration plot shows that the model with the smallest misclassification rate occurs in Step 3. If your analysis objective requires decision predictions, the predictions from the Step 3 model are as accurate as the predictions from the final Step 8 model. The selection process stopped at Step 8 to limit the amount of time spent running the stepwise selection procedure. In Step 8, no more inputs had a chi-squared p-value below 0.05. The value 0.05 is a somewhat arbitrary holdover from the days of statistical tables. With the validation data available to gauge overfitting, it is possible to eliminate this restriction and obtain a richer pool of models to consider.
Full Model Selection

Use the following steps to build and evaluate a larger sequence of regression models:
1. Close the Results - Regression window.
2. Select Use Selection Default ⇒ No from the Regression node Properties panel.
3. Select Selection Options ⇒ ....
The Selection Options window opens.
Property                  Value
Sequential Order          No
Entry Significance Level  0.05
Stay Significance Level   0.05
Start Variable Number     0
Stop Variable Number      0
Force Candidate Effects   0
Hierarchy Effects         Class
Moving Effect Rule        None
Maximum Number of Steps   0
4. Type 1.0 as the Entry Significance Level value.
5. Type 0.5 as the Stay Significance Level value.
The Entry Significance value enables any input to enter the model. (The one chosen will have the smallest p-value.) The Stay Significance value keeps any input in the model with a p-value less than 0.5. This second choice is somewhat arbitrary. A smaller value can terminate the stepwise selection process earlier, while a larger value can maintain it longer. A Stay Significance of 1.0 forces stepwise to behave in the manner of a forward selection.
6. Run the Regression node and view the results.
7. Select View ⇒ Model ⇒ Iteration Plot. The Iteration Plot window opens.
(The plot shows Train and Validation average squared error across all selection steps.)
The iteration plot shows the smallest average squared errors occurring in Steps 4 or 12. There is a significant change in average squared error in Step 13, when the DemCluster input is added. Inclusion of this nonnumeric input improves training performance but hurts validation performance.
Select Select Chart ⇒ Misclassification Rate.
(The plot shows Train and Validation misclassification rate across all selection steps.)
The iteration plot shows that the smallest validation misclassification rates occur at Step 3. Notice that the change in the assessment statistic in Step 13 is much less pronounced.
Best Sequence Model

You can configure the Regression node to select the model with the smallest fit statistic (rather than the final stepwise selection iteration). This method is how SAS Enterprise Miner optimizes complexity for regression models.

1. Close the Results - Regression window.

2. If your predictions are decisions, use the following setting: Select Selection Criterion ⇨ Validation Misclassification. (Equivalently, you can select Validation Profit/Loss. The equivalence is demonstrated in Chapter 6.)

3. If your predictions are estimates (or rankings), use the following setting: Select Selection Criterion ⇨ Validation Error.

   [Property sheet: Selection Model = Stepwise, Selection Criterion = Validation Error, Use Selection Defaults = No.]

The continuing demonstration assumes validation error selection criteria.
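Under the hood, the best-sequence choice is just an argmin over the per-step validation fit statistic. A minimal Python sketch, with hypothetical per-step values:

```python
# Hypothetical per-step validation average squared errors; a real run
# reads these from the Regression node's iteration plot data.
valid_ase = [0.250, 0.247, 0.245, 0.244, 0.246, 0.248]

def best_step(valid_stats):
    """Return the 1-based selection step with the smallest validation statistic."""
    return min(range(len(valid_stats)), key=valid_stats.__getitem__) + 1

print(best_step(valid_ase))  # step 4 for these made-up values
```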
4. Run the Regression node and view the results.

5. Select View ⇨ Model ⇨ Iteration Plot.
   [Iteration plot: average squared error (train and validation) by model selection step number, with a vertical blue line at the optimal step.]

   The vertical blue line shows the model with the optimal validation error (Step 12).

6. Go to line 2690.

   [Output: Analysis of Maximum Likelihood Estimates, listing the DF, parameter estimate, standard error, Wald chi-square, p-value, standardized estimate, and Exp(Est) for each effect in the Step 12 model: Intercept, DemMedHomeValue, DemPctVeterans, GiftAvg36, GiftCnt36, GiftTimeLast, M_DemAge, M_GiftAvgCard36, PromCntCard12, StatusCat96NK (levels A, E, F, L, N), and StatusCatStarAll.]
While not all the p-values are less than 0.05, the model seems to have a better validation average squared error (and misclassification) than the model selected using the default Significance Level settings. In short, there is nothing sacred about 0.05. It is not unreasonable to override the defaults of the Regression node to enable selection from a richer collection of potential models. On the other hand, most of the reduction in the fit statistics occurs during inclusion of the first three inputs. If you seek a parsimonious model, it is reasonable to use a smaller value for the Stay Significance parameter.
4.4 Interpreting Regression Models

Beyond the Prediction Formula
Interpret the model.

After you build a model, you might be asked to interpret the results. Fortunately, regression models lend themselves to easy interpretation.
Odds Ratios and Doubling Amounts

Doubling amount: the input change required to double the odds. Δx_i = log(2)/w_i ⇒ odds × 2.

Odds ratio: the amount the odds change with a unit change in the input. Δx_i = 1 ⇒ odds × exp(w_i).
There are two equivalent ways to interpret a logistic regression model. Both relate changes in input measurements to changes in the odds of the primary outcome.
• An odds ratio expresses the increase in primary outcome odds associated with a unit change in an input. It is obtained by exponentiation of the parameter estimate of the input of interest.
• A doubling amount gives the amount of change required for doubling the primary outcome odds. It is equal to log(2) ≈ 0.69 divided by the parameter estimate of the input of interest.
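Both quantities are one-line computations. The following Python sketch (illustrative, not part of the course software) uses the GiftCnt36 parameter estimate of 0.0574 reported in the output above:

```python
import math

beta = 0.0574  # parameter estimate for GiftCnt36 from the regression output

odds_ratio = math.exp(beta)           # odds change per unit input change
doubling_amount = math.log(2) / beta  # input change that doubles the odds

print(round(odds_ratio, 3))       # 1.059: each extra gift raises the odds ~5.9%
print(round(doubling_amount, 1))  # 12.1: about 12 additional gifts double the odds
```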
If the predicted logit scores remain in the range -2 to +2, linear and logistic regression models of binary targets are virtually indistinguishable. Balanced stratified sampling (Chapter 6) often ensures this. Thus, the prevalence of balanced sampling in predictive modeling might, in fact, be a vestigial practice from a time when maximum likelihood estimation was computationally extravagant.
Interpreting a Regression Model

The following steps demonstrate how to interpret a model using odds ratios:

1. Go to line 2712 of the regression model output.

   Odds Ratio Estimates

                                        Point
   Effect                            Estimate
   DemMedHomeValue                      1.000
   DemPctVeterans                       1.007
   GiftAvg36                            0.990
   GiftCnt36                            1.059
   GiftTimeLast                         0.959
   M_DemAge           0 vs 1            1.155
   M_GiftAvgCard36    0 vs 1            1.253
   PromCntCard12                        0.963
   StatusCat96NK      A vs S            0.957
   StatusCat96NK      E vs S            1.481
   StatusCat96NK      F vs S            0.633
   StatusCat96NK      L vs S            1.179
   StatusCat96NK      N vs S            0.898
   StatusCatStarAll   0 vs 1            0.869
This output includes most of the situations that you will encounter when you build a regression model. For GiftAvg36, the odds ratio estimate equals 0.990. This means that for each additional dollar donated (on average) in the past 36 months, the odds of donation on the 97NK campaign change by a factor of 0.99, a 1% decrease. For GiftCnt36, the odds ratio estimate equals 1.059. This means that for each additional donation in the past 36 months, the odds of donation on the 97NK campaign change by a factor of 1.059, a 5.9% increase. For M_DemAge, the odds ratio (0 versus 1) estimate equals 1.155. This means that cases with a 0 value for M_DemAge are 1.155 times more likely to donate than cases with a 1 value for M_DemAge.
2. The unusual value of 1.000 for the DemMedHomeValue odds ratio has a simple explanation. Unit (that is, single-dollar) changes in home value do not change the odds of response by an amount captured in three significant digits. To obtain a more meaningful value for this input's effect on response odds, you can multiply the parameter estimate by 1000 and exponentiate the result. You then have the change in response odds based on 1000-dollar changes in median home value. Equivalently, you could use the Transform Variables node to replace DemMedHomeValue with DemMedHmVal1000=DemMedHomeValue/1000, and a unit increase on that new input would represent a $1000 increase in DemMedHomeValue.
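A quick Python check (illustrative) shows why the per-dollar odds ratio rounds to 1.000 and how the $1000 rescaling recovers a readable value, using the DemMedHomeValue estimate of 1.416E-6 from the output above:

```python
import math

beta = 1.416e-6  # DemMedHomeValue parameter estimate from the output

print(round(math.exp(beta), 3))         # 1.0 -- the per-dollar change is invisible
print(round(math.exp(1000 * beta), 3))  # 1.001 -- odds ratio per $1000 of home value
```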
Close the Results window.
4.5 Transforming Inputs

Beyond the Prediction Formula

Handle extreme or unusual values.
Classical regression analysis makes no assumptions about the distribution of inputs. The only assumption is that the expected value of the target (or some function thereof) is a linear combination of fixed input measurements. Why should you worry about extreme input distributions?
Extreme Distributions and Regressions

[Slide: on the original input scale, a scatter plot contrasts the true association with the standard regression fit; the skewed input distribution places high leverage points far from the bulk of the data.]
There are at least two compelling reasons.
• First, in most real-world applications, the relationship between expected target value and input value does not increase without bound. Rather, it typically tapers off to some horizontal asymptote. Standard regression models are unable to accommodate such a relationship.
• Second, as a point moves away from the overall mean of a distribution, the point has more influence, or leverage, on model fit. Models built on inputs with extreme distributions attempt to optimize fit for the most extreme points at the cost of fit for the bulk of the data, usually near the input mean. This can result in an exaggeration or an understatement of an input's association with the target.
Both of these phenomena are seen in the above slide.
The first concern can be addressed by abandoning standard regression models for more flexible modeling methods. Abandoning standard regression models is often done at the cost of model interpretability and, more importantly, failure to address the second concern of leverage.
Extreme Distributions and Regressions

[Slide: the skewed input on the original scale beside its regularized scale, where the distribution is more symmetric and the high leverage points are pulled in toward the bulk of the data.]
A simpler and, arguably, more effective approach transforms or regularizes offending inputs in order to eliminate extreme values.
Regularizing Input Transformations

[Slide: the skewed input with high leverage points on the original scale is transformed to a more symmetric distribution on the regularized scale.]

Then, a standard regression model can be accurately fit using the transformed input in place of the original input.
Regularizing Input Transformations

[Slide: on the regularized scale, the regularized estimate follows the true association closely; mapped back to the original input scale, it outperforms the standard regression fit.]
Often this can solve both problems mentioned above. This not only mitigates the influence of extreme cases, but also creates the desired asymptotic association between input and target on the original input scale.
Transforming Inputs

Regression models are sensitive to extreme or outlying values in the input space. Inputs with highly skewed or highly kurtotic distributions can be selected over inputs that yield better overall predictions. To avoid this problem, analysts often regularize the input distributions using a simple transformation. The benefit of this approach is improved model performance. The cost, of course, is increased difficulty in model interpretation. The Transform Variables tool enables you to easily apply standard transformations (in addition to the specialized ones seen in Chapter 9) to a set of inputs.
The Transform Variables Tool

Use the following steps to transform inputs with the Transform Variables tool:

1. Remove the connection between the Data Partition node and the Impute node.

2. Select the Modify tab.

3. Drag a Transform Variables tool into the diagram workspace.

4. Connect the Data Partition node to the Transform Variables node.

5. Connect the Transform Variables node to the Impute node.
6. Adjust the diagram icons for aesthetics. (So that you can see the entire diagram, the zoom level is reduced.)

   The Transform Variables node is placed before the Impute node to keep the imputed values at the average (or center of mass) of the model inputs.

7. Select the Variables … property of the Transform Variables node.
The Variables - Trans window opens.

   [Variables - Trans window: each input listed with its transformation Method (initially Default), Role, Level, Type, Order, and Label.]
8. Select all inputs with Gift in the name.
9. Select Explore.... The Explore window opens.

   [Explore window: sample histograms (4843 rows) for each of the Gift inputs.]

   The GiftAvg and GiftCnt inputs show some degree of skew in their distributions. The GiftTime inputs do not. To regularize the skewed distributions, use the log transformation. For these inputs, the order of magnitude of the underlying measure predicts the target rather than the measure itself.

10. Close the Explore window.
11. Deselect the two inputs with GiftTime in their names.
12. Select Method ⇨ Log for one of the remaining selected inputs. The selected method changes from Default to Log for the GiftAvg and GiftCnt inputs.

13. Select OK to close the Variables - Trans window.
14. Run the Transform Variables node and view the results.

15. Maximize the Output window and go to line 28.

    Input Name       Role   Level     Output Name         Level     Formula
    GiftAvg36        INPUT  INTERVAL  LOG_GiftAvg36       INTERVAL  log(GiftAvg36 + 1)
    GiftAvgAll       INPUT  INTERVAL  LOG_GiftAvgAll      INTERVAL  log(GiftAvgAll + 1)
    GiftAvgCard36    INPUT  INTERVAL  LOG_GiftAvgCard36   INTERVAL  log(GiftAvgCard36 + 1)
    GiftAvgLast     INPUT  INTERVAL  LOG_GiftAvgLast     INTERVAL  log(GiftAvgLast + 1)
    GiftCnt36        INPUT  INTERVAL  LOG_GiftCnt36       INTERVAL  log(GiftCnt36 + 1)
    GiftCntAll       INPUT  INTERVAL  LOG_GiftCntAll      INTERVAL  log(GiftCntAll + 1)
    GiftCntCard36    INPUT  INTERVAL  LOG_GiftCntCard36   INTERVAL  log(GiftCntCard36 + 1)
    GiftCntCardAll   INPUT  INTERVAL  LOG_GiftCntCardAll  INTERVAL  log(GiftCntCardAll + 1)
Notice the Formula column. While a log transformation was specified, the actual transformation used was log(input + 1). This default action of the Transform Variables tool avoids problems with 0-values of the underlying inputs.

16. Close the Transform Variables - Results window.
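A one-line Python analogue (illustrative) of the node's default shows why the shift matters:

```python
import math

def log_transform(x):
    """Transform Variables default for Log: log(input + 1), defined at zero."""
    return math.log(x + 1)  # equivalently math.log1p(x)

print(log_transform(0))  # 0.0 -- a zero gift count stays defined
# math.log(0) alone would raise ValueError: math domain error
```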
Regressions with Transformed Inputs

The following steps revisit the regression, using the transformed inputs:

1. Run the diagram from the Regression node and view the results.

2. Go to line 3754 of the Output window.

   [Output: Summary of Stepwise Selection. Steps 1-4 enter LOG_GiftCnt36 (score chi-square 95.0275), GiftTimeLast (21.1330), DemMedHomeValue (17.7373), and LOG_GiftAvgAll (21.7306), each with Pr > ChiSq < 0.0001. Later steps enter DemPctVeterans, StatusCat96NK, LOG_GiftCntCard36, M_DemAge, DemCluster, and the remaining effects with progressively larger p-values, through Step 21.]

   The selected model, based on the CHOOSE=VERROR criterion, is the model trained in Step 4. It consists of the following effects: Intercept, DemMedHomeValue, GiftTimeLast, LOG_GiftAvgAll, LOG_GiftCnt36.
The stepwise selection process took 21 steps, and the selected model came from Step 4. Notice that half of the selected inputs are log transformations of the original gift variables.
3. Go to line 3800 to view more statistics from the selected model.

   Analysis of Maximum Likelihood Estimates

                             Standard     Wald                 Standardized
   Parameter        DF  Estimate    Error  Chi-Square  Pr > ChiSq  Estimate  Exp(Est)
   Intercept         1    0.8251   0.2921        7.98      0.0047              2.282
   DemMedHomeValue   1  1.448E-6 3.002E-7       23.26      <.0001    0.0798    1.000
   GiftTimeLast      1   -0.0341  0.00756       20.33      <.0001   -0.0758    0.966
   LOG_GiftAvgAll    1   -0.3469   0.0747       21.58      <.0001   -0.0895    0.707
   LOG_GiftCnt36     1    0.3736   0.0728       26.34      <.0001    0.0998    1.453

   Odds Ratio Estimates

   Effect             Point Estimate
   DemMedHomeValue             1.000
   GiftTimeLast                0.966
   LOG_GiftAvgAll              0.707
   LOG_GiftCnt36               1.453
4. Select View ⇨ Model ⇨ Iteration Plot.

   [Iteration plot: average squared error (train and validation) by model selection step number, with a vertical line at Step 4.]

   The selected model (based on minimum validation error) occurs in Step 4. The value of average squared error for this model is slightly lower than that for the model with the untransformed inputs.
5. Select Select Chart ⇨ Misclassification Rate.

   [Iteration plot: misclassification rate (train and validation) by model selection step number.]

   The misclassification rate with the transformed input model is nearly the same as that for the untransformed input model. The model with the lowest misclassification rate comes from Step 3. If you want to optimize on the misclassification rate, you must change this property in the Regression node's property sheet.

6. Close the Results window.
4.6 Categorical Inputs

Beyond the Prediction Formula

Use nonnumeric inputs.

Using nonnumeric or categorical inputs presents another problem for regressions. As was seen in the earlier demonstrations, inclusion of a categorical input with excessive levels can lead to overfitting.
To represent these nonnumeric inputs in a model, you must convert them to some sort of numeric values. This conversion is most commonly done by creating design variables (or dummy variables), with each design variable representing approximately one level of the categorical input. (The total number of design variables required is, in fact, one less than the number of levels.) A single categorical input can vastly increase a model's degrees of freedom, which, in turn, increases the chances of a model overfitting.
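A minimal Python sketch of this coding scheme (illustrative; SAS Enterprise Miner generates design variables internally), using the StatusCat96NK levels from this chapter:

```python
def dummy_code(levels, value):
    """Dummy coding: one indicator per level except the last.

    A categorical input with k levels needs k - 1 design variables;
    the reference level is coded as all zeros.
    """
    design_levels = levels[:-1]
    return [1 if value == lev else 0 for lev in design_levels]

levels = ["A", "S", "F", "N", "E", "L"]  # StatusCat96NK levels
print(dummy_code(levels, "A"))  # [1, 0, 0, 0, 0]
print(dummy_code(levels, "L"))  # [0, 0, 0, 0, 0] -- the reference level
```

Six levels produce five design variables, so this single input consumes five model degrees of freedom.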
Coding Consolidation

Level   D_ABCD   D_EF   D_GH
A       1        0      0
B       1        0      0
C       1        0      0
D       1        0      0
E       0        1      0
F       0        1      0
G       0        0      1
H       0        0      1
I       0        0      0

There are many remedies to this problem. One of the simplest remedies is to use domain knowledge to reduce the number of levels of the categorical input. In this way, level-groups are encoded in the model in place of the original levels.
Recoding Categorical Inputs

In Chapter 2, you used the Replacement tool to eliminate an inappropriate value in the median income input. This demonstration shows how to use the Replacement tool to facilitate combining input levels of a categorical input.

1. Remove the connection between the Transform Variables node and the Impute node.

2. Select the Modify tab.

3. Drag a Replacement tool into the diagram workspace.

4. Connect the Transform Variables node to the Replacement node.

5. Connect the Replacement node to the Impute node.
You need to change some of the node's default settings so that the replacements are limited to a single categorical input.

6. In the Interval Variables property group, select Default Limits Method ⇨ None.

7. In the Class Variables property group, select Replacement Editor ⇨ … from the Replacement node Properties panel.

The Replacement Editor opens.
[Replacement Editor: each class-input level listed with its Variable, Level, Frequency, Type, raw values, and an editable Replacement column.]

The categorical input Replacement Editor lists all levels of each binary, ordinal, and nominal input. You can use the Replacement column to reassign values to any of the levels. The input with the largest number of levels is DemCluster, which has so many levels that consolidating them using the Replacement Editor would be an arduous task. (Another, autonomous method for consolidating the levels of DemCluster is presented as a special topic in Chapter 8.) For this demonstration, you combine the levels of another input, StatusCat96NK.
8. Scroll the Replacement Editor to view the levels of StatusCat96NK.

   [Replacement Editor: StatusCat96NK levels and training frequencies — A (2911), S (1168), F (342), N (286), E (115), L (21), plus _UNKNOWN_.]

   The input has six levels, plus a level to represent unknown values (which do not occur in the training data). The levels of StatusCat96NK will be consolidated as follows:

   • Levels A and S (active and star donors) indicate consistent donors and are grouped into a single level, A.

   • Levels F and N (first-time and new donors) indicate new donors and are grouped into a single level, N.

   • Levels E and L (inactive and lapsing donors) indicate lapsing donors and are grouped into a single level, L.
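Outside of SAS Enterprise Miner, the same consolidation is a simple lookup. A Python sketch of the rules above (the function name is hypothetical):

```python
# Consolidation rules for StatusCat96NK: A/S -> A, F/N -> N, E/L -> L.
replacement = {"A": "A", "S": "A", "F": "N", "N": "N", "E": "L", "L": "L"}

def recode(value):
    """Map an original level to its consolidated level; unknowns pass through."""
    return replacement.get(value, value)

print(recode("S"))  # A: star donors join the active group
print(recode("E"))  # L: inactive donors join the lapsing group
```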
9. Type A as the Replacement level for StatusCat96NK levels A and S.

10. Type N as the Replacement level for StatusCat96NK levels F and N.

11. Type L as the Replacement level for StatusCat96NK levels L and E.

12. Select OK to close the Replacement Editor.
13. Run the Replacement node and view the results.

    [Results: the Total Replacement Counts table lists StatusCat96NK with 1625 and 1627 replacements in the two partitions.]

    The Total Replacement Counts window shows the number of replacements that occur in the training and validation data.
14. Select View ⇨ Model ⇨ Replaced Levels. The Replaced Levels window opens.

    Variable         Formatted Value   Replacement Value
    StatusCat96NK    A                 A
    StatusCat96NK    S                 A
    StatusCat96NK    F                 N
    StatusCat96NK    N                 N
    StatusCat96NK    E                 L
    StatusCat96NK    L                 L
The replaced level values are consistent with expectations.

15. Close the Results window.

16. Run the Regression node and view the results.

17. Go to line 3659 of the Output window.

    [Output: Summary of Stepwise Selection. Steps 1-5 again enter LOG_GiftCnt36, GiftTimeLast, DemMedHomeValue, LOG_GiftAvgAll, and DemPctVeterans; Step 6 enters REP_StatusCat96NK (DF 2, score chi-square 9.7073, Pr > ChiSq 0.0078); later steps add the remaining effects with progressively larger p-values, through Step 21.]

    The REP_StatusCat96NK input (created from the original StatusCat96NK input) is included in Step 6 of the stepwise selection process. The three-level input is represented by two degrees of freedom.

18. Close the Results window.
Chapter 4 Introduction to Predictive Modeling: Regressions
4.7 Polynomial Regressions (Self-Study)
Beyond the Prediction Formula

Account for nonlinearities.

The Regression tool assumes (by default) a linear and additive association between the inputs and the logit of the target. If the true association is more complicated, such an assumption might result in biased predictions. For decisions and rankings, this bias can (in some cases) be unimportant. For estimates, this bias appears as a higher value for the validation average squared error fit statistic.
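The validation average squared error referred to above is just the mean squared difference between the 0/1 target and the predicted probability. A minimal sketch, with hypothetical targets and predictions:

```python
def average_squared_error(targets, predictions):
    """Validation ASE: mean squared difference between the observed
    0/1 target and the model's predicted probability."""
    assert len(targets) == len(predictions)
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)

# A model whose probabilities are systematically biased scores worse:
y = [1, 1, 0, 0]
good = [0.9, 0.8, 0.2, 0.1]    # probabilities near the outcomes
biased = [0.5, 0.4, 0.2, 0.1]  # probabilities pulled low for the events
print(average_squared_error(y, good))    # 0.025
print(average_squared_error(y, biased) > average_squared_error(y, good))  # True
```

This is why bias from a mis-specified functional form shows up in the validation ASE even when decisions and rankings are unaffected.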
Standard Logistic Regression

[Plot: predicted probability surface over the unit square, x1 by x2.]

In the dot color problem, the (standard logistic regression) assumption that the concentration of yellow dots increases toward the upper right corner of the unit square seems to be suspect.
Polynomial Logistic Regression

When minimizing prediction bias is important, you can increase the flexibility of a regression model by adding polynomial combinations of the model inputs. This enables predictions to better match the true input/target association. It also increases the chances of overfitting while simultaneously reducing the interpretability of the predictions. Therefore, polynomial regression must be approached with some care. In SAS Enterprise Miner, adding polynomial terms can be done selectively or autonomously.
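As a rough sketch of what a degree-2 expansion adds, the plain-Python helpers below enumerate and compute the squared and interaction terms for a row of inputs. This only illustrates the idea; it is not SAS Enterprise Miner's implementation, and the helper names are invented for the example:

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree=2):
    """Name the terms a polynomial expansion would add: for degree 2,
    the squares and pairwise products (interactions) of the inputs."""
    terms = []
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

def expand(row, degree=2):
    """Append the polynomial combinations of the input values to a row.
    A logistic regression fit on the expanded row gains the flexibility
    described above, at the cost of more parameters to overfit with."""
    values = list(row)
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            prod = 1.0
            for i in combo:
                prod *= row[i]
            values.append(prod)
    return values

print(polynomial_terms(["x1", "x2"]))  # ['x1*x1', 'x1*x2', 'x2*x2']
print(expand([2.0, 3.0]))              # [2.0, 3.0, 4.0, 6.0, 9.0]
```

Note how quickly the term count grows with the number of inputs, which is why the course recommends adding terms selectively rather than expanding everything.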
Adding Polynomial Regression Terms Selectively

This demonstration shows how to use the Term Editor window to selectively add polynomial regression terms. You can modify the existing Regression node or add a new Regression node. If you add a new node, you must configure the Polynomial Regression node to perform the same tasks as the original. An alternative is to make a copy of the existing node.

1. Right-click the Regression node and select Copy from the menu.

2. Right-click the diagram workspace and select Paste from the menu. A new Regression node is added with the label Regression (2) to distinguish it from the existing one.

3. Select the Regression (2) node. The properties are identical to the existing node.

4. Rename the new regression node Polynomial Regression (2). The (2) is retained to help with model identification in later chapters.

5. Connect the Polynomial Regression (2) node to the Impute node.
[The Predictive Analysis diagram for PVA97NK now shows the Polynomial Regression (2) node connected after the Impute node.]
To add polynomial terms to the model, you use the Term Editor. To use the Term Editor, you need to enable User Terms.
6. Select User Terms ⇨ Yes in the Polynomial Regression (2) property panel.

[In the Train properties, Main Effects is Yes, Two-Factor Interactions is No, Polynomial Terms is No, Polynomial Degree is 2, and User Terms is now Yes.]
The Term Editor is now unlocked and can be used to add specific polynomial terms to the regression model.

7. Select the Term Editor ellipsis button in the Polynomial Regression properties panel.
The Terms window opens. It indicates Target: TargetB, lists the model inputs in the Variables panel (DemCluster, DemGender, DemHomeOwner, DemMedHomeValue, DemPctVeterans, and so on), shows an empty Term panel, and provides Save, OK, and Cancel buttons.
Interaction Terms

Suppose that you suspect an interaction between home value and time since last gift. (Perhaps a recent change in property values affected the donation patterns.)

1. Select DemMedHomeValue in the Variables panel of the Terms dialog box.

2. Select the Add button. The DemMedHomeValue input is added to the Term panel.
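Outside the Term Editor, an interaction term is simply a new input formed as the product of two existing inputs. The sketch below uses the two variables from this demonstration, with made-up donor values and an invented helper name:

```python
def add_interaction(rows, a, b, name=None):
    """Add an interaction column to each row: the product of two
    existing inputs, keyed by a combined name such as 'a*b'."""
    name = name or f"{a}*{b}"
    for row in rows:
        row[name] = row[a] * row[b]
    return rows

# Hypothetical donor records (home value in $1000s, months since last gift).
donors = [
    {"DemMedHomeValue": 120.0, "GiftTimeLast": 6},
    {"DemMedHomeValue": 250.0, "GiftTimeLast": 18},
]
add_interaction(donors, "DemMedHomeValue", "GiftTimeLast")
print(donors[0]["DemMedHomeValue*GiftTimeLast"])  # 720.0
```

Fitting a regression that includes this product column lets the effect of home value on the logit depend on the time since the last gift, which is exactly what the suspected interaction describes.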