PREDICTIVE MODELING USING REGRESSION
Linear and logistic regression. Explore data issues associated with regression. Discuss variable selection methods.
To understand the application of regression analysis in data mining (linear, nonlinear, and logistic)
To understand the key statistical measures of fit
To learn how to run and interpret regression analyses using SAS Enterprise Miner software
Linear Regression
We attempt to model the variation in a dependent variable as a linear combination of one or more independent variables.
SIMPLE LINEAR REGRESSION
Written in slope-intercept form: y = ax + b
x = independent variable
y = dependent variable (depends on x)
Constants a and b are computed via supervised learning by applying a statistical criterion to a dataset of known values for x and y.
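The statistical criterion referred to here is ordinary least squares. A minimal sketch of the fit in Python (Python is purely illustrative; the slides use Excel and SAS Enterprise Miner, and the function name is our own):

```python
# Least-squares estimates for y = a*x + b (illustrative sketch).
def fit_simple_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Criterion: minimize the sum of squared residuals, which yields
    # the familiar closed-form slope and intercept.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

a, b = fit_simple_linear([1, 2, 3, 4], [2, 4, 6, 8])
print(a, b)  # 2.0 0.0
```

On perfectly linear toy data the fit recovers the exact slope and intercept.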
Multiple regression: Y = b0 + b1 X1 + b2 X2
Y = expected value of y (outcome)
b0 = intercept term
X1, X2 = predictor variables
Example: Kwatts vs. Temp

Temp    Kwatts
59.2     9,730
61.9     9,750
55.1    10,180
66.2    10,230
52.1    10,800
69.9    11,160
46.8    12,530
76.8    13,910
79.7    15,110
79.3    15,690
80.2    17,020
83.3    17,880
Is the Relationship Linear?
[Figure: scatter plot of KWatts vs. Temp]
Example Results
Let X = Temp, Y = Kwatts. The fitted model is Y = 319.04 + 185.27 X.
[Figure: KWatts vs. Temp with the forecast (average) line overlaid]
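The fitted line can be reproduced directly from the table with ordinary least squares (a Python sketch; the slides obtain the same result with Excel or SAS Enterprise Miner):

```python
# Reproducing the Kwatts vs. Temp fit with ordinary least squares.
temps = [59.2, 61.9, 55.1, 66.2, 52.1, 69.9, 46.8, 76.8, 79.7, 79.3, 80.2, 83.3]
kwatts = [9730, 9750, 10180, 10230, 10800, 11160, 12530, 13910, 15110, 15690, 17020, 17880]

n = len(temps)
mx = sum(temps) / n
my = sum(kwatts) / n
slope = sum((x - mx) * (y - my) for x, y in zip(temps, kwatts)) / \
        sum((x - mx) ** 2 for x in temps)
intercept = my - slope * mx
print(f"Y = {intercept:.2f} + {slope:.2f} X")  # Y = 319.04 + 185.27 X
```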
Multiple Regression
Consider the following data relating family size and income to food expenditures:

Family  Food $  Income $  Family size
1        5.2     28       3
2        5.1     26       3
3        5.6     32       2
4        4.6     24       1
5       11.3     54       4
6        8.1     59       2
7        7.8     44       3
8        5.8     30       2
9        5.1     40       1
10      18.0     82       6
11       4.9     42       3
12      11.8     58       4
13       5.2     28       1
14       4.8     20       5
15       7.9     42       3
16       6.4     47       1
17      20.0    112       6
18      13.7     85       5
19       5.1     31       2
20       2.9     26       2
We can run this problem in Enterprise Miner using the same approach followed with the previous example. On our model field we have placed the data source called foodexpenditures, along with both Multiplot and StatExplore, found under the Explore tab above the model field. Highlight foodexpenditures, then in the left-hand panel under Training, find Variables and click the box to the right to open the variables. Change the role of family to rejected (it is just the number of the observation), change the role of food_ to target, set the level of income_, food_, and fam_size to interval, then click OK.
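Outside Enterprise Miner, the same multiple regression can be fitted by ordinary least squares. A sketch using numpy (numpy is our assumption here, not part of the slides; the coefficients are whatever the data imply):

```python
# Multiple regression food = b0 + b1*income + b2*fam_size, fitted by OLS.
import numpy as np

food = np.array([5.2, 5.1, 5.6, 4.6, 11.3, 8.1, 7.8, 5.8, 5.1, 18,
                 4.9, 11.8, 5.2, 4.8, 7.9, 6.4, 20, 13.7, 5.1, 2.9])
income = np.array([28, 26, 32, 24, 54, 59, 44, 30, 40, 82,
                   42, 58, 28, 20, 42, 47, 112, 85, 31, 26])
fam_size = np.array([3, 3, 2, 1, 4, 2, 3, 2, 1, 6, 3, 4, 1, 5, 3, 1, 6, 5, 2, 2])

# Design matrix with an intercept column.
X = np.column_stack([np.ones(len(food)), income, fam_size])
coef, *_ = np.linalg.lstsq(X, food, rcond=None)
fitted = X @ coef
r2 = 1 - ((food - fitted) ** 2).sum() / ((food - food.mean()) ** 2).sum()
print("b0, b1, b2 =", coef.round(3), " R^2 =", round(r2, 3))
```

Food expenditure rises with income, so the income coefficient comes out positive and the fit is strong.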
What happens in regression analysis when the target variable is binary?
There are many situations in which the target variable is binary. Some examples:
whether a customer will or will not receive credit
whether a customer will or will not respond to a promotion
whether a firm will go bankrupt in a year
whether a student will pass an exam!
Passing an Exam Data

Student id  Outcome  Study Hours
1           0         3
2           1        34
3           0        17
4           0         6
5           0        12
6           1        15
7           1        26
8           1        29
9           0        14
10          1        58
11          0         2
12          1        31
13          1        26
14          0        11
Running a linear regression to predict pass/don't pass as a function of hours of study provides a model that doesn't correctly model the data. The data are given in exampassing.xls.
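A quick way to see the problem (a Python sketch on the slide's data; the slides use Excel): fit the 0/1 outcome with least squares and inspect the fitted values.

```python
# Why a straight line fails for a binary target: least-squares fit
# of pass/fail (0/1) on study hours.
hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]
passed = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]

n = len(hours)
mx = sum(hours) / n
my = sum(passed) / n
b = sum((x - mx) * (y - my) for x, y in zip(hours, passed)) / \
    sum((x - mx) ** 2 for x in hours)
a = my - b * mx
preds = [a + b * x for x in hours]
# The line is unbounded, so fitted values can fall outside [0, 1]
# and cannot be read as probabilities (here the student with 58
# hours gets a "probability" well above 1).
print(min(preds), max(preds))
```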
Passing an Exam
[Figure: pass or don't pass (0/1) plotted against hours of study]
Logistic regression is similar to linear regression, with two main differences:
Y (outcome or response) is categorical: Yes/No, Approve/Reject, Responded/Did not respond
Y is expressed as a probability of being in either group.
Logistic Regression
Logistic regression is a nonlinear regression technique for problems having a binary outcome. The created regression equation limits the values of the output attribute to values between 0 and 1. This allows the output to represent a probability of class membership.

Logistic Regression Model: p(y = 1|x) = exp(ax + c) / [1 + exp(ax + c)]
e is the base of natural logarithms, often denoted exp.
Logistic regression: p = Prob(y=1|x) = exp(a+bx)/[1+exp(a+bx)], so 1-p = 1/[1+exp(a+bx)]
ln [p/(1-p)] = a + bx
where:
exp or e is the exponential function (e = 2.71828…)
ln is the natural logarithm (ln(e) = 1)
p is the probability that the event y occurs given x, and can range between 0 and 1
p/(1-p) is the "odds ratio"
ln[p/(1-p)] is the log odds ratio, or "logit"
all other components of the regression model are the same
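A small numeric check of these identities (a sketch; the coefficients a = -1 and b = 0.2 are illustrative, not from any fitted model):

```python
# Verify that the logit ln[p/(1-p)] recovers the linear predictor a + b*x.
import math

a, b = -1.0, 0.2   # illustrative coefficients
x = 10.0
p = math.exp(a + b * x) / (1 + math.exp(a + b * x))
odds = p / (1 - p)
logit = math.log(odds)
print(round(logit, 6))  # 1.0  (= -1 + 0.2*10)
```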
The odds ratio is frequently used. It is related to the probability of an event as follows: Odds Ratio = p/(1-p)
Probability of a firm going bankrupt = .25
Odds the firm will go bankrupt = .25/(1-.25) = 1/3, or 3 to 1 against
This is how sports books calculate odds (e.g., if the odds of Kelantan winning the Malaysian Cup are 2:1, the probability is 1/3).
ln [p/(1-p)] = a + bx means that as x increases by 1, the natural log of the odds ratio increases by b; equivalently, the odds ratio increases by a factor of exp(b).
The results show that the odds ratio = p/(1-p) = exp(-8.4962 + 0.4949x). For every additional hour of study, the odds ratio increases by a factor of exp(0.4949) = 1.640.
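The "multiply the odds by exp(b)" interpretation can be checked numerically (a sketch; the intercept's sign is our reconstruction, since low study hours should imply a low pass probability, but it cancels in the ratio anyway):

```python
# One extra hour of study multiplies the odds by exp(b).
import math

a, b = -8.4962, 0.4949          # coefficients from the slide's fitted model
odds = lambda x: math.exp(a + b * x)

factor = odds(21) / odds(20)    # ratio for one additional study hour
print(round(factor, 2))         # 1.64 = exp(0.4949)
```

The intercept a drops out of the ratio, so the multiplicative effect of one more hour is exp(b) regardless of a.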
Understanding Response Rate and Lift
To better understand the top left chart, change cumulative lift to cumulative % response. The observations are ranked by the predicted probability of response (highest to lowest) for each observation (from the fitted model).
Since the first 6 passes were correctly classified, the cumulative % response is 100% through the 40th percentile. At the 50th percentile the next observation with the highest predicted probability is a non-response, so the cumulative % response drops to 6/7, or 85.7%. The 8th-ranked observation, between the 55th and 60th percentile, is a positive response, so the cumulative % response is about 7/8, or 87.5%. Since there are no more positive responses after the 60th percentile, the cumulative response rate drops to 50% by the 100th percentile. The chart compares how well the cumulative ranked predictions lead to a match between actual and predicted responses.
Lift is the ratio of the actual response rate (passing) of the top n% of the ranked observations to the overall response rate; cumulative lift is defined likewise. At the 50th percentile, the cumulative % response is 85.7% and the base response rate is 50%, for a lift of 1.714.
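The cumulative % response and cumulative lift can be sketched in a few lines (Python is illustrative; the ranked ordering below follows the slide's description: six passes first, then a fail, then a pass, then only fails):

```python
# Cumulative % response and cumulative lift from ranked predictions.
ranked_outcomes = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # 14 students
base_rate = sum(ranked_outcomes) / len(ranked_outcomes)        # 7/14 = 0.5

cum_response, cum_lift = [], []
hits = 0
for i, y in enumerate(ranked_outcomes, start=1):
    hits += y
    rate = hits / i                 # response rate in the top i observations
    cum_response.append(rate)
    cum_lift.append(rate / base_rate)

# At the 7th-ranked case (the 50th percentile): 6/7 = 85.7%, lift = 1.714
print(round(cum_response[6], 3), round(cum_lift[6], 3))  # 0.857 1.714
```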
On the Properties Panel, click on Exported Data to see the predicted probabilities and responses for each observation, and compare them to the actual responses.
Logistic regression uses maximum likelihood (not the sum of squared errors) to estimate the model parameters. The results below show that the model is highly significant based on a chi-square test. The Wald chi-square statistic tests whether an individual effect is significant.
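A minimal sketch of what maximum-likelihood estimation does here (gradient ascent on the log-likelihood, in Python for illustration; SAS uses a more sophisticated optimizer, and the learning rate and iteration count below are our own choices):

```python
# Maximum-likelihood logistic fit of pass/fail on study hours,
# via plain gradient ascent on the log-likelihood.
import math

hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]
passed = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]

a, b = 0.0, 0.0
lr = 0.0005                      # illustrative step size
for _ in range(200_000):
    grad_a = grad_b = 0.0
    for x, y in zip(hours, passed):
        p = 1 / (1 + math.exp(-(a + b * x)))
        grad_a += y - p          # d(log-likelihood)/da
        grad_b += (y - p) * x    # d(log-likelihood)/db
    a += lr * grad_a
    b += lr * grad_b

# The slope should approach the reported coefficient for study hours.
print(round(a, 2), round(b, 2))
```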
Linear versus Logistic Regression
Linear Regression:
Target is an interval variable.
Input variables have any measurement level.
Predicted values are the mean of the target variable at the given values of the input variables.

Logistic Regression:
Target is a discrete (binary or ordinal) variable.
Input variables have any measurement level.
Predicted values are the probability of a particular level(s) of the target variable at the given values of the input variables.
Logistic Regression Assumptions
SAS EM for data preprocessing
Noisy data: duplicated records, incorrect values
Missing data: using the Replacement node
Using the Transform node: normalization, log-transformation, etc.
Data type conversion: use the Input Data Source node
Variable selection: stepwise regression, Variable Selection node
Instance selection: stratification, prior probability
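Two of these steps, mean imputation of missing values and a log transform, can be sketched outside SAS EM (Python for illustration; the toy income column and the choice of mean imputation are our assumptions):

```python
# Mean imputation (what the Replacement node does with a mean statistic)
# and a log transform (one Transform node option) on a toy column.
import math

income = [28, 26, None, 24, 54, None, 44, 30]   # None marks a missing value

# Impute missing values with the mean of the observed ones.
observed = [v for v in income if v is not None]
mean_income = sum(observed) / len(observed)
imputed = [v if v is not None else mean_income for v in income]

# Log-transformation to reduce right skew.
log_income = [math.log(v) for v in imputed]
print(imputed)
```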
Missing Values
[Diagram: input records with missing values shown as "?"]
Stepwise Selection Methods
Backward Selection
Stepwise Selection
REGRESSION IN ENTERPRISE MINER
SAS Enterprise Miner
Results can be obtained using Excel or using a data mining package such as SAS Enterprise Miner 5.3. Using SAS Enterprise Miner requires the following steps:
Convert your data (usually in an Excel file) into a SAS data file using SAS 9.1
Create a project in Enterprise Miner
Within the project:
Create a data source using your SAS data file
Create a diagram that includes a data node, a regression node, and a multiplot node for graphs
Run the model in the diagram and review the results
Examine missing value imputation. Examine transformations of data. Generate a regression model.
This demonstration illustrates imputing missing values, transforming data, and generating a regression model.