Chapter 4 (Part 2)
Descriptive Statistics
Standardized Data; Percentiles, Quartiles and Box Plots; Grouped Data; Skewness and Kurtosis
McGraw-Hill/Irwin
Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.
Standardized Data: Chebyshev’s Theorem
• Developed by mathematicians Jules Bienaymé (1796–1878) and Pafnuty Chebyshev (1821–1894).
• For any population with mean μ and standard deviation σ, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 – 1/k²], for k > 1.
Standardized Data: Chebyshev’s Theorem
• For k = 2 standard deviations, 100[1 – 1/2²] = 75%.
• So, at least 75.0% will lie within μ ± 2σ.
• For k = 3 standard deviations, 100[1 – 1/3²] = 88.9%.
• So, at least 88.9% will lie within μ ± 3σ.
• Although applicable to any data set, these limits tend to be too wide to be useful.
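The Chebyshev bound can be checked with a few lines of Python. This is a minimal sketch; the function name `chebyshev_min_percent` is a hypothetical helper chosen for illustration, not something from the slides.

```python
def chebyshev_min_percent(k: float) -> float:
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative, so it says nothing useful.
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 100 * (1 - 1 / k**2)


for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_min_percent(k):.1f}% within k standard deviations")
```

The bound holds for any distribution, which is why it is so much looser than the percentages quoted for the normal distribution.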
Standardized Data: The Empirical Rule
• The normal or Gaussian distribution was named for Karl Gauss (1777–1855).
• The normal distribution is symmetric and is also known as the bell-shaped curve.
• The Empirical Rule states that for data from a normal distribution, we expect that for
  k = 1, about 68.26% will lie within μ ± 1σ
  k = 2, about 95.44% will lie within μ ± 2σ
  k = 3, about 99.73% will lie within μ ± 3σ
Standardized Data: The Empirical Rule
• Distance from the mean is measured in terms of the number of standard deviations.
• Note: no upper bound is given. Data values outside μ ± 3σ are rare.
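The Empirical Rule percentages can be reproduced from the normal distribution itself. The sketch below uses the standard identity P(|Z| < k) = erf(k/√2); the function name is a hypothetical helper. The exact values differ from the slide's 68.26% and 95.44% in the second decimal place, which is consistent with the slide values having been read from a rounded z table.

```python
import math


def normal_percent_within(k: float) -> float:
    """Percentage of a normal population lying within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k = {k}: about {normal_percent_within(k):.2f}%")
```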
Standardized Data: Example (Exam Scores)
• If 80 students take an exam, how many will score within 2 standard deviations of the mean?
• Assuming exam scores follow a normal distribution, the Empirical Rule states that about 95.44% will lie within μ ± 2σ, so 0.9544 × 80 ≈ 76 students will score within 2σ of μ.
• How many students will score more than 2 standard deviations from the mean?
Standardized Data: Unusual Observations
• Unusual observations are those that lie beyond μ ± 2σ.
• Outliers are observations that lie beyond μ ± 3σ.
Standardized Data: Unusual Observations
• For example, the P/E ratio data contain several large data values. Are they unusual or outliers?
   7  8  8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
  14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
  19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
  26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
Standardized Data: The Empirical Rule
• If the sample came from a normal distribution, then the Empirical Rule gives the following intervals:
  x̄ ± 1s = 22.72 ± 1(14.08) = (8.6, 36.8)
  x̄ ± 2s = 22.72 ± 2(14.08) = (−5.4, 50.9)
  x̄ ± 3s = 22.72 ± 3(14.08) = (−19.5, 65.0)
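Standardized Data: The Empirical Rule
• Are there any unusual values or outliers?
  7 8 . . . 48 55 68 91
[Figure: number line for the P/E data marked at −19.5, −5.4, 8.6, 22.72, 36.8, 50.9, and 65.0. Values beyond x̄ ± 2s are labeled "Unusual"; values beyond x̄ ± 3s are labeled "Outliers".]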
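The question can be answered mechanically by applying the ±2s and ±3s cutoffs to the 68 P/E ratios. This is a minimal sketch, assuming the sample mean and sample standard deviation are computed from the data; the `classify` helper and the label "typical" are hypothetical names introduced for illustration.

```python
import statistics

# The 68 sorted P/E ratios from the slides.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

mean = statistics.mean(pe)   # sample mean
s = statistics.stdev(pe)     # sample standard deviation (n - 1 divisor)


def classify(x: float) -> str:
    """Label a value by its distance from the mean in standard deviations."""
    z = abs(x - mean) / s
    if z > 3:
        return "outlier"
    if z > 2:
        return "unusual"
    return "typical"


unusual = [x for x in pe if classify(x) == "unusual"]
outliers = [x for x in pe if classify(x) == "outlier"]
print(f"mean = {mean:.2f}, s = {s:.2f}")
print("unusual:", unusual)
print("outliers:", outliers)
```

Under this rule, values between 2 and 3 standard deviations from the mean are reported as unusual and values beyond 3 standard deviations as outliers.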
Standardized Data: Defining a Standardized Variable
• A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
Standardization formula for a population:
  $z_i = \dfrac{x_i - \mu}{\sigma}$
Standardization formula for a sample:
  $z_i = \dfrac{x_i - \bar{x}}{s}$
Standardized Data: Defining a Standardized Variable
• z_i tells how far away the observation is from the mean, in standard deviations.
• For example, for the P/E data, the first value is x_1 = 7. The associated z value is
  $z_1 = \dfrac{x_1 - \bar{x}}{s} = \dfrac{7 - 22.72}{14.08} = -1.12$
Standardized Data: Defining a Standardized Variable
• A negative z value means the observation is below the mean.
• A positive z value means the observation is above the mean. For x_68 = 91,
  $z_{68} = \dfrac{x_{68} - \bar{x}}{s} = \dfrac{91 - 22.72}{14.08} = 4.85$
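The two worked z values can be reproduced directly from the sample standardization formula. A minimal sketch, using the rounded mean (22.72) and standard deviation (14.08) quoted on the slides; `standardize` is a hypothetical helper name.

```python
def standardize(x: float, mean: float, sd: float) -> float:
    """Return the z value: how many standard deviations x lies from the mean."""
    return (x - mean) / sd


x_bar, s = 22.72, 14.08  # rounded sample statistics for the P/E data, as given on the slides

print(round(standardize(7, x_bar, s), 2))   # smallest P/E ratio: negative, so below the mean
print(round(standardize(91, x_bar, s), 2))  # largest P/E ratio: positive, so above the mean
```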
Standardized Data: Defining a Standardized Variable
• Here are the standardized z values for the P/E data:
[Table of standardized z values not reproduced.]
• What do you conclude for these three values?
Standardized Data: Defining a Standardized Variable
• MegaStat calculates standardized values and also checks for outliers.
• In Excel, use =STANDARDIZE(x, Mean, StDev) to calculate the standardized z value for an observation x.
Standardized Data: Outliers
• What do we do with outliers in a data set?
• If an outlier is due to erroneous data, discard it.
• An outrageous observation (one completely outside of an expected range) is certainly invalid.
• Recognize unusual data points and outliers and their potential impact on your study.
• Research books and articles on how to handle outliers.
Standardized Data: Estimating Sigma
• For a normal distribution, essentially all values (about 99.73%) lie within a range of 6σ (from μ – 3σ to μ + 3σ).
• If you know the range R (high – low), you can estimate the standard deviation as σ ≈ R/6.
• Useful for approximating the standard deviation when only R is known.
• This estimate depends on the assumption of normality.
Percentiles and Quartiles: Percentiles
• Percentiles divide the sorted data into 100 groups.
• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
• Deciles divide the data into 10 groups.
• Quintiles divide the data into 5 groups.
• Quartiles divide the data into 4 groups.
Percentiles and Quartiles: Percentiles
• Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use the 5th, 25th, 50th, 75th and 90th percentiles).
• Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
• Percentiles are used in employee merit evaluation and salary benchmarking.
Percentiles and Quartiles: Quartiles
• Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
  ← Lower 25% → Q1 ← Second 25% → Q2 ← Third 25% → Q3 ← Upper 25% →
• The three values that separate the four groups are called Q1, Q2, and Q3, respectively.
Percentiles and Quartiles: Quartiles
• The second quartile Q2 is the median, an important indicator of central tendency.
  ← Lower 50% → Q2 ← Upper 50% →
• Q1 and Q3 measure dispersion, since the interquartile range Q3 – Q1 measures the degree of spread in the middle 50 percent of data values.
  ← Lower 25% → Q1 ← Middle 50% → Q3 ← Upper 25% →
Percentiles and Quartiles: Quartiles
• The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.
  ← Lower 25% → Q1 ← Second 25% → Q2 ← Third 25% → Q3 ← Upper 25% →
  For the first half of the data, 50% lie above and 50% lie below Q1.
  For the second half of the data, 50% lie above and 50% lie below Q3.
Percentiles and Quartiles: Quartiles
• Depending on n, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.
Percentiles and Quartiles: Method of Medians
• For small data sets, find quartiles using the method of medians:
  Step 1. Sort the observations.
  Step 2. Find the median Q2.
  Step 3. Find the median of the data values that lie below Q2.
  Step 4. Find the median of the data values that lie above Q2.
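The four steps above can be sketched in Python. One detail the steps leave open is how to split the data when n is odd or when values tie with Q2; the sketch assumes a split by position, with the middle observation excluded from both halves when n is odd. The function name `quartiles_by_medians` is hypothetical.

```python
import statistics


def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    xs = sorted(data)                 # Step 1: sort the observations
    n = len(xs)
    q2 = statistics.median(xs)        # Step 2: median of all values
    half = n // 2
    lower = xs[:half]                 # values positioned below the median
    # Assumption: for odd n, the middle observation belongs to neither half.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    q1 = statistics.median(lower)     # Step 3: median of the lower half
    q3 = statistics.median(upper)     # Step 4: median of the upper half
    return q1, q2, q3


# Two small data sets that appear later in these slides.
print(quartiles_by_medians([1, 2, 4, 4, 8, 8, 8, 8]))
print(quartiles_by_medians([0, 3, 3, 6, 6, 6, 10, 15]))
```

Both data sets yield the same three quartiles even though their values differ, which is the point made in the caution slide later in this part.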
Percentiles and Quartiles: Excel Quartiles
• Use Excel function =QUARTILE(Array, k) to return the kth quartile.
• Excel treats quartiles as a special case of percentiles. For example, to calculate Q3:
  =QUARTILE(Array, 3)
  =PERCENTILE(Array, 0.75)
• Excel calculates the quartile positions as:
  Position of Q1: 0.25n + 0.75
  Position of Q2: 0.50n + 0.50
  Position of Q3: 0.75n + 0.25
Percentiles and Quartiles: Example (P/E Ratios and Quartiles)
• Consider the following P/E ratios for 68 stocks in a portfolio.
   7  8  8 10 10 10 10 12 13 13 13 13 13 13 13 14 14
  14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19
  19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26
  26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
• Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).
Percentiles and Quartiles: Example (P/E Ratios and Quartiles)
• Using Excel’s method of interpolation, the quartile positions are:
  Quartile   Position formula            Interpolate between
  Q1         0.25(68) + 0.75 = 17.75     x17 and x18
  Q2         0.50(68) + 0.50 = 34.50     x34 and x35
  Q3         0.75(68) + 0.25 = 51.25     x51 and x52
Percentiles and Quartiles: Example (P/E Ratios and Quartiles)
• The quartiles are:
  Quartile      Formula
  First (Q1)    Q1 = x17 + 0.75 (x18 – x17) = 14 + 0.75 (14 – 14) = 14
  Second (Q2)   Q2 = x34 + 0.50 (x35 – x34) = 19 + 0.50 (19 – 19) = 19
  Third (Q3)    Q3 = x51 + 0.25 (x52 – x51) = 26 + 0.25 (26 – 26) = 26
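The position-and-interpolate calculation can be sketched as a small function. This follows the position formulas quoted on the slides (0.25n + 0.75, 0.50n + 0.50, 0.75n + 0.25) and is an illustration of that rule, not Excel's actual source code; the name `interpolated_quartile` is hypothetical.

```python
def interpolated_quartile(sorted_data, k):
    """k-th quartile (k = 1, 2, 3) by linear interpolation between sorted positions."""
    n = len(sorted_data)
    p = k / 4
    pos = p * n + (1 - p)             # 1-based position, e.g. 0.25n + 0.75 for Q1
    lo = int(pos)                     # whole part: index of the lower neighbor
    frac = pos - lo                   # fractional part: how far toward the upper neighbor
    x_lo = sorted_data[lo - 1]        # convert 1-based position to 0-based index
    x_hi = sorted_data[min(lo, n - 1)]
    return x_lo + frac * (x_hi - x_lo)


# The 68 sorted P/E ratios from the slides.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

quartiles = [interpolated_quartile(pe, k) for k in (1, 2, 3)]
print(quartiles)
```

Because the neighboring values are identical at each of the three positions, the interpolation weight has no effect for this data set.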
Percentiles and Quartiles: Example (P/E Ratios and Quartiles)
• So, to summarize:
  ← Lower 25% of P/E ratios → Q1 = 14 ← Second 25% of P/E ratios → Q2 = 19 ← Third 25% of P/E ratios → Q3 = 26 ← Upper 25% of P/E ratios →
• These quartiles express central tendency and dispersion. What is the interquartile range?
• Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.
Percentiles and Quartiles: Tip
• Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.
Percentiles and Quartiles: Caution
• Quartiles generally resist outliers.
• However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.
  Data set A: 1, 2, 4, 4, 8, 8, 8, 8      Q1 = 3, Q2 = 6, Q3 = 8
  Data set B: 0, 3, 3, 6, 6, 6, 10, 15    Q1 = 3, Q2 = 6, Q3 = 8
• Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.
Box Plots
• A useful tool of exploratory data analysis (EDA).
• Also called a box-and-whisker plot.
• Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax
• Consider the five-number summary for the 68 P/E ratios:
  Xmin   Q1   Q2   Q3   Xmax
  7      14   19   26   91
Box Plots
• The box plot is displayed visually, like this.
[Figure: box plot not reproduced.]
• A box plot shows central tendency, dispersion, and shape.
Box Plots: Fences and Unusual Data Values
• Use quartiles to detect unusual data points.
• The cutoff points are called fences and can be found using the following formulas:
                  Lower fence            Upper fence
  Inner fences:   Q1 – 1.5 (Q3 – Q1)     Q3 + 1.5 (Q3 – Q1)
  Outer fences:   Q1 – 3.0 (Q3 – Q1)     Q3 + 3.0 (Q3 – Q1)
• Values outside the inner fences are unusual, while those outside the outer fences are outliers.
Box Plots: Fences and Unusual Data Values
• For example, consider the P/E ratio data:
                 Inner fences                 Outer fences
  Lower fence:   14 – 1.5 (26 – 14) = −4      14 – 3.0 (26 – 14) = −22
  Upper fence:   26 + 1.5 (26 – 14) = +44     26 + 3.0 (26 – 14) = +62
• Ignore the lower fences since they are negative and P/E ratios are only positive.
Box Plots: Fences and Unusual Data Values
• Truncate the whisker at the fences and display unusual values and outliers as dots.
[Figure: box plot with the inner and outer fences marked and points labeled "Unusual" and "Outliers".]
• Based on these fences, there are three unusual P/E values and two outliers.
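The fence rule can be applied to the P/E data in a few lines. A minimal sketch, assuming the quartiles Q1 = 14 and Q3 = 26 from the earlier slides; variable names are chosen for illustration.

```python
# The 68 sorted P/E ratios from the slides.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

q1, q3 = 14, 26          # quartiles found earlier in the slides
iqr = q3 - q1            # interquartile range

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # (lower, upper) inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # (lower, upper) outer fences

# Outliers lie beyond an outer fence; unusual values lie beyond an inner
# fence but not beyond an outer fence.
outliers = [x for x in pe if x < outer[0] or x > outer[1]]
unusual = [x for x in pe if (x < inner[0] or x > inner[1]) and x not in outliers]

print("inner fences:", inner)
print("outer fences:", outer)
print("unusual:", unusual)
print("outliers:", outliers)
```

The fence rule flags more values than the ±2s and ±3s rule does, because the quartiles are not inflated by the extreme P/E ratios the way the standard deviation is.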
Percentiles and Quartiles: Midhinge
• The average of the first and third quartiles:
  $\text{Midhinge} = \dfrac{Q_1 + Q_3}{2}$
• The name “midhinge” derives from the idea that, if the “box” were folded in half, it would resemble a “hinge”.
Box Plots
[Figure: annotated box plot of a right-skewed data set, labeling the minimum, Q1, median (Q2), Q3, maximum, box, and whiskers. The center of the box is the midhinge.]
Correlation: Correlation Coefficient
• The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y.
Correlation: Correlation Coefficient
• Its range is −1 ≤ r ≤ +1.
• Excel’s formula is =CORREL(XData, YData).
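For readers working outside Excel, the sample correlation coefficient can be computed from its standard definition, r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² Σ(y − ȳ)²]. The sketch below is an illustration only: the formula is not printed on the slides, the function name `pearson_r` is hypothetical, and the data are made up.

```python
import math


def pearson_r(xs, ys):
    """Sample (Pearson) correlation coefficient for paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: an exactly linear increasing and decreasing relationship.
x = [1, 2, 3, 4]
print(pearson_r(x, [2, 4, 6, 8]))   # perfectly linear, positive slope
print(pearson_r(x, [8, 6, 4, 2]))   # perfectly linear, negative slope
```

The two extreme cases illustrate the endpoints of the range of r.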
Correlation: Correlation Coefficient
• Illustration of Correlation Coefficients
[Figure not reproduced.]
Correlation
• What is the nature of the relationship between square feet of shopping area and sales that is implied by the following correlation?
[Figure not reproduced.]
Grouped Data: Nature of Grouped Data
• Although some information is lost, grouped data are easier to display than raw data.
• When bin limits are given, the mean and standard deviation can be estimated.
• Accuracy of grouped estimates depends on
  - the number of bins
  - the distribution of data within bins
  - bin frequencies
Grouped Data: Mean and Standard Deviation
• Consider the frequency distribution for prices of Lipitor® for three cities:
[Table: frequency distribution not reproduced.]
• Where m_j = class midpoint, f_j = class frequency, k = number of classes, and n = sample size.
Grouped Data: Nature of Grouped Data
• Estimate the mean and standard deviation by
  $\bar{x} = \sum_{j=1}^{k} \dfrac{f_j m_j}{n} = \dfrac{3427.5}{47} = 72.92552$
  $s = \sqrt{\sum_{j=1}^{k} \dfrac{f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\dfrac{2091.48936}{47 - 1}} = 6.74293$
• Note: don’t round off too soon.
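The grouped-data formulas translate directly into code. Because the Lipitor frequency table is not reproduced here, the sketch below uses made-up midpoints and frequencies purely for illustration; `grouped_mean_sd` is a hypothetical helper name.

```python
import math


def grouped_mean_sd(midpoints, freqs):
    """Estimate the mean and sample standard deviation from class midpoints and frequencies."""
    n = sum(freqs)                                          # sample size
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    ss = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
    sd = math.sqrt(ss / (n - 1))
    return mean, sd


# Hypothetical bins (not the Lipitor data): midpoints 10, 20, 30 with frequencies 1, 2, 1.
mean, sd = grouped_mean_sd([10, 20, 30], [1, 2, 1])
cv = 100 * sd / mean                                        # coefficient of variation, in percent
print(f"mean = {mean:.5f}, s = {sd:.5f}, CV = {cv:.1f}%")
```

Full precision is carried through to the end and rounding happens only when printing, in line with the advice not to round off too soon.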
Grouped Data: Nature of Grouped Data
• Now estimate the coefficient of variation:
  $CV = 100\,(s / \bar{x}) = 100\,(6.74293 / 72.92552) = 9.2\%$
Grouped Data: Accuracy Issues
• How accurate are grouped estimates compared to ungrouped estimates?
• For the previous example, we can compare the grouped data statistics to the ungrouped data statistics.
Grouped Data: Accuracy Issues
• Accuracy tends to improve as the number of bins increases.
• If the first or last class is open-ended, there will be no class midpoint (no mean can be estimated).
• Assume a lower limit of zero for the first class when the data are nonnegative.
• You may be able to assume an upper limit for some variables (e.g., age).
• Median and quartiles may be estimated even with open-ended classes.
Skewness and Kurtosis: Skewness
• Generally, skewness may be indicated by looking at the sample histogram or by comparing the mean and median.
• This visual indicator is imprecise and does not take into consideration sample size n.
Skewness and Kurtosis: Skewness
• Skewness is a unit-free statistic.
• The coefficient can be used to compare two samples measured in different units, or one sample with a known reference distribution (e.g., the symmetric normal distribution).
• Calculate the sample’s skewness coefficient as:
  $\text{Skewness} = \dfrac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s} \right)^3$
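The skewness coefficient above can be computed directly from its definition. A minimal sketch; `sample_skewness` is a hypothetical helper name and the two small data sets are made up for illustration.

```python
import statistics


def sample_skewness(data):
    """Sample skewness: n / ((n-1)(n-2)) times the sum of cubed z values."""
    n = len(data)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)        # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


print(sample_skewness([1, 2, 3, 4, 5]))    # symmetric sample: coefficient is zero
print(sample_skewness([1, 2, 3, 4, 20]))   # one large value: positive (right) skew
```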
Skewness and Kurtosis: Skewness
• In Excel, go to Tools > Data Analysis > Descriptive Statistics or use the function =SKEW(array).
Skewness and Kurtosis: Skewness
• Consider the following table showing the 90% range for the sample skewness coefficient.
[Table not reproduced.]
Skewness and Kurtosis: Skewness
• Coefficients within the 90% range may be attributed to random variation.
Skewness and Kurtosis: Skewness (Figure 4.36)
• Coefficients outside the range suggest the sample came from a nonnormal population.
Skewness and Kurtosis: Skewness
• As n increases, the range of chance variation narrows.
Skewness and Kurtosis: Kurtosis
• Kurtosis is the relative length of the tails and the degree of concentration in the center.
• Consider three kurtosis prototype shapes.
[Figure: three prototype distribution shapes, one labeled "Heavier tails"; not reproduced.]
Skewness and Kurtosis: Kurtosis
• A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.
• Excel and MINITAB calculate kurtosis as:
  $\text{Kurtosis} = \dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s} \right)^4 - \dfrac{3(n-1)^2}{(n-2)(n-3)}$
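The kurtosis formula above can be sketched the same way. The function name `sample_kurtosis` is hypothetical and the data set is made up; the result for it was checked by hand against the formula.

```python
import statistics


def sample_kurtosis(data):
    """Sample excess kurtosis using the formula quoted for Excel and MINITAB."""
    n = len(data)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)        # sample standard deviation (n - 1 divisor)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * fourth - correction


# Evenly spaced values have lighter tails than a normal sample, so the coefficient is negative.
print(round(sample_kurtosis([1, 2, 3, 4, 5]), 4))
```

The subtracted term centers the statistic so that values near zero are consistent with a normal population, positive values indicate heavier tails, and negative values indicate lighter tails.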
Skewness and Kurtosis: Kurtosis
• Consider the following table of expected 90% range for the sample kurtosis coefficient.
[Table not reproduced.]
Skewness and Kurtosis: Kurtosis
• A sample coefficient within the ranges may be attributed to chance variation.
Skewness and Kurtosis: Kurtosis
• Coefficients outside the range would suggest the sample differs from a normal population.
Skewness and Kurtosis: Kurtosis
• As sample size increases, the chance range narrows.
• Inferences about kurtosis are risky for n < 50.
Applied Statistics in Business & Economics
End of Chapter 4B
Published on Sep 24, 2008
Powerpoint slides of Chap 4a of the Applied Statistics in Business and Economics