
PREDICTIVE MODELING USING REGRESSION


Objectives

• Describe linear and logistic regression.
• Explore data issues associated with regression.
• Discuss variable selection methods.


Learning Objectives

• To understand the application of regression analysis in data mining
  – Linear/nonlinear
  – Logistic (logit)
• To understand the key statistical measures of fit
• To learn how to run and interpret regression analyses using SAS Enterprise Miner software


Linear Regression

• We attempt to model the variation in a dependent variable as a linear combination of one or more independent variables.

• SIMPLE LINEAR REGRESSION
  – Written in slope-intercept form: y = ax + b
  – x = independent variable
  – y = dependent variable (depends on x)
  – Constants a and b are computed via supervised learning by applying a statistical criterion to a dataset of known values for x and y

• Multiple regression: Y = b0 + b1 X1 + b2 X2
  where Y is the expected value of the outcome, b0 is the intercept term, b1 and b2 are the coefficients, and X1 and X2 are the predictor variables.


Example: Kwatts vs. Temp

Temp    Kwatts
59.2     9,730
61.9     9,750
55.1    10,180
66.2    10,230
52.1    10,800
69.9    11,160
46.8    12,530
76.8    13,910
79.7    15,110
79.3    15,690
80.2    17,020
83.3    17,880


Is the Relationship Linear?

[Scatter plot: KWatts (0–20,000) versus Temp (40–90)]


Example Results

Let X = Temp, Y = Kwatts:

Y = 319.04 + 185.27 X

[Scatter plot: KWatts versus Temp with the forecast-average regression line overlaid]
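These coefficients can be checked outside Enterprise Miner. Below is a minimal sketch (my illustration, not part of the original slides) that applies the closed-form least-squares formulas to the table above; it reproduces the slide's fit to within rounding:

```python
import numpy as np

# Kwatts vs. Temp data from the table above
temp = np.array([59.2, 61.9, 55.1, 66.2, 52.1, 69.9,
                 46.8, 76.8, 79.7, 79.3, 80.2, 83.3])
kwatts = np.array([9730, 9750, 10180, 10230, 10800, 11160,
                   12530, 13910, 15110, 15690, 17020, 17880])

# Least-squares criterion: slope a = Sxy / Sxx, intercept b = mean(y) - a * mean(x)
a = np.sum((temp - temp.mean()) * (kwatts - kwatts.mean())) \
    / np.sum((temp - temp.mean()) ** 2)
b = kwatts.mean() - a * temp.mean()

print(f"Y = {b:.2f} + {a:.2f} X")   # approximately Y = 319.04 + 185.27 X
```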


Multiple Regression

Consider the following data relating family size and income to food expenditures:

family   food $   income $   family size
  1        5.2       28          3
  2        5.1       26          3
  3        5.6       32          2
  4        4.6       24          1
  5       11.3       54          4
  6        8.1       59          2
  7        7.8       44          3
  8        5.8       30          2
  9        5.1       40          1
 10       18         82          6
 11        4.9       42          3
 12       11.8       58          4
 13        5.2       28          1
 14        4.8       20          5
 15        7.9       42          3
 16        6.4       47          1
 17       20        112          6
 18       13.7       85          5
 19        5.1       31          2
 20        2.9       26          2


Multiple Regression

• We can run this problem in Enterprise Miner using the same approach followed with the previous example.
• On our model field we have placed the data source called foodexpenditures, and also both MultiPlot and StatExplore, found under the Explore tab above the model field.
• Highlight foodexpenditures, then in the left-hand panel under Training, find Variables and click on the box to the right to open up the variables.
• Change the role of family to Rejected (it is just the number of the observation), change the role of food_ to Target, and set the level of food_, income_, and fam_size to Interval, then click OK. (A sketch of the same fit outside Enterprise Miner follows this list.)
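For readers without Enterprise Miner, here is a hedged sketch (my code, not the course's) of the same multiple regression fit using numpy's least-squares solver on the table above:

```python
import numpy as np

# Food expenditure data from the table above
food = np.array([5.2, 5.1, 5.6, 4.6, 11.3, 8.1, 7.8, 5.8, 5.1, 18,
                 4.9, 11.8, 5.2, 4.8, 7.9, 6.4, 20, 13.7, 5.1, 2.9])
income = np.array([28, 26, 32, 24, 54, 59, 44, 30, 40, 82,
                   42, 58, 28, 20, 42, 47, 112, 85, 31, 26])
fam_size = np.array([3, 3, 2, 1, 4, 2, 3, 2, 1, 6,
                     3, 4, 1, 5, 3, 1, 6, 5, 2, 2])

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones(len(food)), income, fam_size])

# Solve food = b0 + b1*income + b2*fam_size by least squares
coef, *_ = np.linalg.lstsq(X, food, rcond=None)
b0, b1, b2 = coef
print(f"food = {b0:.3f} + {b1:.3f}*income + {b2:.3f}*fam_size")
```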


What happens in regression analysis when the target variable is binary?

There are many situations where the target variable is binary – some examples:
• whether a customer will or will not receive credit
• whether a customer will or will not respond to a promotion
• whether a firm will go bankrupt in a year
• whether a student will pass an exam!!!


Passing an Exam Data

Student id   Outcome   Study Hours
    1           0           3
    2           1          34
    3           0          17
    4           0           6
    5           0          12
    6           1          15
    7           1          26
    8           1          29
    9           0          14
   10           1          58
   11           0           2
   12           1          31
   13           1          26
   14           0          11


Running a linear regression to predict pass/don't pass as a function of hours of study produces a model that doesn't correctly describe the data. The data are given in exampassing.xls.

Passing an Exam

[Plot: actual and predicted pass/don't pass (y-axis 0–1.6) versus hours of study (0–70); the linear fit's predictions are not confined to 0 and 1]


Logistic Regression

Similar to linear regression, with two main differences:
• Y (outcome or response) is categorical
  – Yes/No
  – Approve/Reject
  – Responded/Did not respond
• The result is expressed as a probability of being in either group.


Logistic Regression

• A nonlinear regression technique for problems having a binary outcome. The regression equation limits the values of the output attribute to the range between 0 and 1, which allows the output to represent a probability of class membership.
• Logistic Regression Model:

  p(y = 1 | x) = exp(ax + c) / (1 + exp(ax + c))

• e is the base of natural logarithms, often denoted as exp.


Logistic regression

p = Prob(y = 1 | x) = exp(a + bx) / [1 + exp(a + bx)]
1 - p = 1 / [1 + exp(a + bx)]
ln[p/(1-p)] = a + bx

where:
• exp or e is the exponential function (e ≈ 2.71828)
• ln is the natural logarithm (ln(e) = 1)
• p is the probability that the event y occurs given x, and can range between 0 and 1
• p/(1-p) is the "odds ratio"
• ln[p/(1-p)] is the log odds ratio, or "logit"
• all other components of the regression model are the same
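As a quick numerical check of these identities (a sketch with arbitrary illustrative values for a and b, not from the course), converting a logit through the odds to a probability and back round-trips exactly:

```python
import math

a, b = -8.5, 0.5   # illustrative coefficients only
x = 20             # e.g., 20 hours of study

logit = a + b * x          # ln[p/(1-p)] = a + bx
odds = math.exp(logit)     # p/(1-p)
p = odds / (1 + odds)      # equals exp(a+bx) / [1 + exp(a+bx)]

# The three forms describe the same quantity
assert abs(math.log(p / (1 - p)) - logit) < 1e-12
print(f"logit={logit:.3f}, odds={odds:.3f}, p={p:.3f}")
```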


Odds Ratio

• Frequently used
• Related to the probability of an event as follows:
  Odds Ratio = p/(1-p)
• Example:
  – Probability of a firm going bankrupt = .25
  – Odds the firm will go bankrupt = .25/(1-.25) = 1/3, i.e., 3 to 1 against
  – This is how sports books calculate odds (e.g., if the odds against Kelantan winning the Malaysian Cup are 2:1, the probability is 1/3)
• ln[p/(1-p)] = a + bx means that as x increases by 1, the natural log of the odds ratio increases by b, or the odds ratio increases by a factor of exp(b)


The results show that the odds ratio = p/(1-p) = exp(-8.4962 + 0.4949x). For every additional hour of study, the odds ratio increases by a factor of exp(0.4949) = 1.640.
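For readers who want to reproduce these estimates outside Enterprise Miner, here is a hedged sketch using statsmodels (my code, not the course's); fitted by maximum likelihood on the exam data, the coefficients should come out close to those shown above:

```python
import numpy as np
import statsmodels.api as sm

# Passing-an-exam data from the earlier table
hours = np.array([3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11])
passed = np.array([0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0])

# Fit ln[p/(1-p)] = a + b*hours by maximum likelihood
model = sm.Logit(passed, sm.add_constant(hours)).fit(disp=0)
a, b = model.params
print(f"logit = {a:.4f} + {b:.4f} * hours")               # roughly -8.50 + 0.49*hours
print(f"odds multiplier per extra hour: {np.exp(b):.3f}")  # roughly 1.64
```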


Understanding Response Rate and Lift

To better understand the top left chart, change cumulative lift to cumulative % response. The observations are ranked by the predicted probability of response (highest to lowest) from the fitted model.


Understanding Response Rate and Lift

• Since the first 6 passes were correctly classified, the cumulative % response is 100% through the 40th percentile.
• At the 50th percentile, the next observation with the highest predicted probability is a non-response, so the cumulative % response drops to 6/7 or 85.7%.
• The 8th-ranked observation, between the 55th and 60th percentile, is a positive response, so the cumulative % response is 7/8 or 87.5%.
• Since there are no more positive responses after the 60th percentile, the cumulative response rate drops to 50% by the 100th percentile.
• The chart compares how well the cumulative ranked predictions lead to a match between actual and predicted responses.


Understanding Response Rate and Lift

• Lift is the ratio of the actual response rate (passing) of the top n% of the ranked observations to the overall response rate. Cumulative lift is defined likewise.
• At the 50th percentile, the cumulative % response is 85.7% and the base response rate is 50%, for a cumulative lift of 85.7/50 = 1.7142.
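The cumulative % response and lift figures above can be recomputed by hand. A sketch (mine, using the exam data; since the fitted probability increases with study hours, ranking by hours gives the same ordering as ranking by predicted probability):

```python
import numpy as np

hours = np.array([3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11])
passed = np.array([0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0])

# Rank from highest to lowest predicted probability of passing; because the
# fitted logit increases with hours, sorting by hours is equivalent
ranked = passed[np.argsort(-hours)]

base_rate = passed.mean()                                    # 7/14 = 50%
cum_response = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
cum_lift = cum_response / base_rate

# At the 50th percentile (top 7 of 14) this prints 85.7% and lift 1.714
for i, (r, l) in enumerate(zip(cum_response, cum_lift), start=1):
    print(f"top {i / len(ranked):4.0%}: cum response = {r:6.1%}, lift = {l:.3f}")
```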


On the Properties Panel, click on Exported Data to see the predicted probabilities and response for each observation, and compare them to the actual responses.


Logistic regression uses maximum likelihood (not the sum of squared errors) to estimate the model parameters. The results below show that the model is highly significant based on a chi-square test. The Wald chi-square statistic tests whether each individual effect is significant.


Linear versus Logistic Regression

Linear Regression:
• Target is an interval variable.
• Input variables have any measurement level.
• Predicted values are the mean of the target variable at the given values of the input variables.

Logistic Regression:
• Target is a discrete (binary or ordinal) variable.
• Input variables have any measurement level.
• Predicted values are the probability of a particular level(s) of the target variable at the given values of the input variables.


Logistic Regression Assumption

[Figure: the logit transformation]


SAS EM for Data Preprocessing

• Noisy data: duplicated records, incorrect values
• Missing data: using the Replacement node
• Using the Transform node: normalization, log transformation, etc.
• Data type conversion: use the Input Data Source node
• Variable selection: stepwise regression, Variable Selection node
• Instance selection: stratification, prior probability


Missing Values

[Diagram: a cases-by-inputs data matrix with scattered missing cells shown as "?"]
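In Enterprise Miner the Replacement node handles this in the GUI. As a rough stand-in (a sketch of one common strategy, mean imputation, not the node's exact behavior), the same idea in pandas:

```python
import numpy as np
import pandas as pd

# A small cases-by-inputs table with scattered missing cells (the "?" above)
df = pd.DataFrame({
    "income":   [28, np.nan, 32, 24, np.nan],
    "fam_size": [3, 3, np.nan, 1, 4],
})

# Replace each missing value with its column mean; categorical inputs
# would typically get the mode instead
imputed = df.fillna(df.mean(numeric_only=True))
print(imputed)
```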


Stepwise Selection Methods

• Forward Selection
• Backward Selection
• Stepwise Selection
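As an illustration of the first of these, here is a sketch of generic forward selection (not Enterprise Miner's exact algorithm, which also applies significance-based entry criteria): at each step, greedily add the input that most reduces the residual sum of squares.

```python
import numpy as np

def forward_selection(X, y, max_vars=None):
    """Greedy forward selection for a least-squares model.

    At each step, add the column of X whose inclusion most reduces the
    residual sum of squares. A production implementation would stop when
    no candidate passes an entry significance test (e.g., an F-test).
    """
    n, p = X.shape
    selected, remaining = [], list(range(p))
    max_vars = max_vars if max_vars is not None else p

    def rss(cols):
        # Least-squares fit on an intercept plus the chosen columns
        A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.sum((y - A @ beta) ** 2))

    while remaining and len(selected) < max_vars:
        best = min(remaining, key=lambda c: rss(selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, calling forward_selection on the food expenditure inputs (income and family size stacked as columns) would order the two variables by how much each reduces the unexplained variation in food spending.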


REGRESSION IN ENTERPRISE MINER


SAS Enterprise Miner

• Results can be obtained using Excel or a data mining package such as SAS Enterprise Miner 5.3.
• Using SAS Enterprise Miner requires the following steps:
  – Convert your data (usually in an Excel file) into a SAS data file using SAS 9.1
  – Create a project in Enterprise Miner
  – Within the project:
    · Create a data source using your SAS data file
    · Create a diagram that includes a data node, a regression node, and a multiplot node for graphs
    · Run the model in the diagram and review the results


Objectives

• Conduct missing value imputation.
• Examine transformations of data.
• Generate a regression model.


Demonstration

This demonstration illustrates imputing missing values, transforming data, and generating a regression model.

