Computing a t-test for the Difference between Two Means of Paired Samples

Frequently, we are interested in comparing "before" and "after" scores for a group in order to evaluate the effectiveness of a training program or a teaching technique. Or, we may have administered a test to "matched pairs" of students. Matched pairs are formed by selecting pairs of subjects with identical characteristics and assigning one member of each pair to an experimental group and the other to a control group, in order to ensure the groups are equal. For either of these two situations (same group or matched groups), a t-test for the difference between two means for paired samples may be appropriate. We will discuss the assumptions necessary for its use before we look at a specific example.
Criteria for Selecting a t-Test for the Difference between Two Means

A t-test is selected for testing hypotheses regarding samples if certain assumptions can be made. Perhaps the most important of these is that the scores or measurements of the population(s) about which conclusions will be drawn (boys who want to throw the shot-put; outfielders; infielders) would form a normal or near-normal distribution. If it is known that the measurement on the population differs markedly from normal, a different test should be selected.

The t-test for the difference between two means for paired samples is used for test-retest situations with a single group. It is also used when two samples have been matched on some characteristic so that they are assumed to be "identical." The t-test for the difference between two means for independent samples is used when the scores of one sample are in no way dependent on the scores of the other sample. (See Table 4-5.)

Table 4-5 How to choose a t-test for the difference between two means

Test                            Use when                                    Assumption
t-test for independent samples  2 independent samples                       Normally distributed population(s)
t-test for paired samples       Test-retest of same sample, or              Normally distributed population(s)
                                2 samples matched on some characteristic
Suppose a coach advocates a weight training program for shot putters, because he believes that weight training will enhance their ability to throw the shot-put farther. To support his position, he conducts the following experiment. He selects 10 boys and measures how far each can throw the shot-put. After participating in a weight training program for six weeks, each boy is retested in the shot-put. The coach reasons that, if the difference between the "before" and "after" distances is statistically significant, he can safely conclude that a weight training program is beneficial to boys who wish to throw the shot-put. (Note that he will use statistics to draw a conclusion about the population of boys who wish to throw the shot-put from the actual experience of a small sample of boys he drew from that population.)

He assumes that, if the training program has no effect, the population's mean score on the test before training (M1) will be about equal to the population mean score on the test after training (M2). The null hypothesis (HO) is indicated by: HO: M1 = M2. If, however, the weight training program has the desired effect of increasing the distance of the shot-put, he expects M1 to be less than M2. The alternate hypothesis (HA) is denoted: HA: M1 < M2. The coach selects the .05 level of significance as the dividing line between chance differences and differences due to the training program.

The degrees of freedom for the t-test for the difference between two means for paired samples is one less than the number of people (N-1). In this example, df = 10-1 = 9. Once again we refer to the table in Appendix A. With 9df, the critical value of t at the .05 level of significance is 1.833. This means that 5% of the area under the t-curve for 9df falls above 1.833. Put another way, if the null hypothesis is true, the probability of getting a t value above 1.833 due to chance alone is 5%.

Figure 4-2 t-distribution for 9df showing critical value of t
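The statement that 5% of the area under the t-curve for 9df lies above 1.833 can be checked numerically. The following is only a minimal sketch: it integrates the standard t-density with Simpson's rule, and the function names t_density and area_above are illustrative choices, not anything taken from the text or from Appendix A.

```python
import math

def t_density(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def area_above(t_value, df, steps=2000):
    """Upper-tail area beyond t_value (t_value >= 0).

    Integrates the density from 0 to t_value with Simpson's rule
    (steps must be even) and subtracts the result from 0.5, the
    area of the upper half of the symmetric curve.
    """
    h = t_value / steps
    total = t_density(0, df) + t_density(t_value, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t

tail = area_above(1.833, 9)
print(f"area above 1.833 with 9 df: {tail:.4f}")
```

The printed area should come out very close to .05, which is what the critical value in the table is meant to guarantee.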
Steps in Performing a t-Test for the Difference between Two Means for Paired Samples

1. State the null hypothesis: HO: M1 = M2
2. State the alternate hypothesis: HA: M1 < M2
3. Fill in the subject, pretest (X1) and post-test (X2) columns.
4. Form the D column. Find the difference, X2 - X1, for each subject.
5. Form the D2 (D squared) column. Square each value in the D column.
6. Find Σ D2. Sum the D2 column.
7. Use the formula

   t = \frac{\sum D}{\sqrt{\dfrac{N \sum D^2 - (\sum D)^2}{N - 1}}}, \qquad df = N - 1

   to compute t, where Σ D is the sum of the D column. In this case, t = 5.46.
8. Make a rough sketch of the t-curve.
9. Determine the number of degrees of freedom. A t-test for paired samples has N-1 df. In this case, there are 9df.
10. Determine the critical value of t by using the table in Appendix A. The critical value is the t value which has 5% of the area above it. In this case, with 9df, that critical value is 1.833.
11. Shade the Region of Rejection. This is the area in the tail of the curve above the critical value. If HO is true, the probability of getting a t-value in the Region of Rejection by chance alone is 5%.
12. See if the computed t-value falls in the Region of Rejection. In this case, t = 5.46 does fall in the Region of Rejection. This means we reject HO and accept HA.
13. Write the conclusion. This includes the statement in statistical terms and its implication for this particular situation.
The computed t-value (t = 5.46) is larger than the critical value (t = 1.833, 9df at the .05 level) and therefore falls in the Region of Rejection. Thus, we reject HO and accept the alternate hypothesis that M1 < M2. Since the mean after training is significantly greater, we conclude that the weight training program was effective in increasing distance for shot putters.

Table 4-6 t-test for the difference between 2 means for paired samples
HO: M1 = M2
HA: M1 < M2

Subject   Distance Before (X1)   Distance After (X2)   D = X2 - X1   D2
1         20                     40                    20            400
2         30                     40                    10            100
3         35                     45                    10            100
4         30                     30                     0              0
5         40                     55                    15            225
6         50                     55                     5             25
7         40                     50                    10            100
8         35                     50                    15            225
9         25                     35                    10            100
10        45                     50                     5             25

N = 10    Σ X1 = 350             Σ X2 = 450            Σ D = 100     Σ D2 = 1300
          M1 = 35                M2 = 45

t = \frac{\sum D}{\sqrt{\dfrac{N \sum D^2 - (\sum D)^2}{N - 1}}} = \frac{100}{\sqrt{\dfrac{10(1300) - (100)^2}{9}}} = \frac{100}{\sqrt{333.33}} = \frac{100}{18.3} = 5.46

df = N - 1 = 9
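For readers who want to check the hand calculation, here is a short sketch of the same paired-samples computation in Python. It uses the ten before and after distances from Table 4-6 and the critical value of 1.833 quoted in the text; the variable names are illustrative only. Note that carrying full precision gives t of about 5.48, while the 5.46 in the text comes from rounding the square root to 18.3 before dividing.

```python
import math

# Distances from Table 4-6 (before and after six weeks of weight training)
before = [20, 30, 35, 30, 40, 50, 40, 35, 25, 45]   # X1
after = [40, 40, 45, 30, 55, 55, 50, 50, 35, 50]    # X2

n = len(before)
d = [x2 - x1 for x1, x2 in zip(before, after)]       # D column: X2 - X1
sum_d = sum(d)                                       # sum of D
sum_d2 = sum(v * v for v in d)                       # sum of D squared

# t = sum(D) / sqrt((N * sum(D^2) - (sum(D))^2) / (N - 1))
t = sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))
df = n - 1
critical_t = 1.833   # value quoted in the text: one-tailed, .05 level, 9 df

print(f"sum D = {sum_d}, sum D^2 = {sum_d2}, df = {df}")
print(f"t = {t:.2f}")
print("reject HO" if t > critical_t else "do not reject HO")
```

Either way, the computed t is far above the critical value, so the conclusion does not depend on the rounding.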
Other Tests of Significance

As previously mentioned, other tests of significance frequently appear in research literature. Among these are other t-tests, various types of analysis of variance (F-tests), analysis of covariance, and the critical ratio (z-test). The choice of test depends on the nature of the data collected and the use you wish to make of it. Associated with each test are certain assumptions about the population from which the sample is drawn. In order to conduct a t-test or an analysis of variance, for example, the researcher must assume that the population is normally distributed.

It is not the purpose of this text to teach the computational procedures for all of the other tests mentioned. The principle of hypothesis testing is basically the same for all of them. In each case, a null and an alternate hypothesis are formed, a statistic is computed, and the likelihood that the statistic would occur due to chance is determined. On the basis of that likelihood, the null hypothesis is accepted or rejected and a conclusion is drawn.

Two tests that I would like to discuss briefly are the analysis of variance and the analysis of covariance. Both of these tests appear quite often in the research literature, and therefore it would be judicious on your part to have a fundamental understanding of what they are and how they can be used. You don't want to be a hand calculator your whole life… do you? Don't answer that.
Analysis of Variance (F) (ANOVA)

We have noted that t-tests were employed to determine whether, after an experiment, the means of two random samples were too different to attribute to sampling error. The analysis of variance is a convenient way to determine whether the means of more than two random samples are too different to attribute to sampling error.

First of all, you might ask why an experiment should ever include more than two variations in condition. Two answers may be given to this question: (a) we may be interested in studying more than two conditions at a time; (b) the data obtained with two conditions may give quite a different answer to our experimental problem than would comparable data from three or more conditions.

Referring to our first point, suppose that an investigator is interested in the relative effectiveness of various methods of strength training. We can see that there might be three, four, or even more strength training methods which he would like to compare rather than only two. However (and this is our second point), if the experimenter studied only two strength training methods and found no differences in the results, he might conclude that strength training methods never differentially affect the students' performance when, in fact, some other strength training method which could have been included in the experiment would have affected performance significantly.

Let us consider another experiment in which more than two variations in experimental conditions would be desirable. If we wished to determine if steroids enhanced strength, we could give one group steroids and use the other group as a control group. The problem with this design is that there could be a placebo effect that could not be accounted for by the control group. Here too, we need more than two groups (for example, a third group given a placebo) in order to obtain an adequate picture of the relationship between our experimental variable and the behavior being observed.
Once it is agreed that experiments involving more than two groups or conditions are sometimes desirable, we are faced with the necessity of testing the significance of the differences among the sample means obtained with the several conditions. Our first thought would surely be to compute t-ratios and test the significance of the differences between means for all possible pairs of conditions. But this has certain disadvantages. For one thing, there may be a great number of pairs of conditions to be studied. If four methods of strength training were to be compared, there would be six different t-ratios to be found. This fact is shown in the following list of pairs of methods whose data would form the basis for individual t-tests.

pair 1   method A and method B
pair 2   method A and method C
pair 3   method A and method D
pair 4   method B and method C
pair 5   method B and method D
pair 6   method C and method D

As indicated, it would be possible to use a number of t-tests to determine the significance of the difference between four means, two at a time, but it would involve six separate tests. Just think how much fun you would have doing six t-tests. I will answer that for you…loads of fun.

Not only would a great deal of numerical work be required, but the use of separate t-tests for analyzing the results of an experiment based upon more than two groups would lead to results which could not easily be interpreted. If we obtain one significant t-value out of six, what does this mean? Had we performed a single t-test, we would have had a 5% probability of obtaining a t significant at the 5% level by random sampling alone, but with six t-tests our chances of having at least one of them significant because of random sampling are much larger. To carry this argument to its extreme (something I love doing), if a million t-tests were involved in an experiment, a single significant t-value would certainly be no indication that the conditions being varied had an effect. We would not be handicapped by the fact that increasing the number of t-ratios to be computed for an experiment increases the probability that a significant t will be found if we could state how this probability changes with the number of t-values which are to be computed. Unfortunately this cannot be done exactly, since the tests share data and are not independent of one another, and so our use of t leads us to results which cannot be evaluated very easily.
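To get a rough feel for how quickly the pairs and the chance of a spurious "significant" result pile up, consider the sketch below. It lists the six pairs for four methods and then computes the probability of at least one significant t out of six under the simplifying ASSUMPTION that the six tests are independent. They are not truly independent, because each group appears in three of the pairs, so the figure is only an illustration and not an exact result.

```python
from itertools import combinations

methods = ["A", "B", "C", "D"]

# Every way of choosing 2 of the 4 methods: the six t-tests in the list above
pairs = list(combinations(methods, 2))
for number, (first, second) in enumerate(pairs, start=1):
    print(f"pair {number}: method {first} and method {second}")

alpha = 0.05  # significance level used for each individual t-test

# ASSUMPTION: the tests are treated as independent (they are not, strictly).
# P(at least one significant by chance) = 1 - P(none significant)
familywise = 1 - (1 - alpha) ** len(pairs)
print(f"chance of at least one significant t in {len(pairs)} tests: {familywise:.3f}")
```

Under that assumption the chance is roughly one in four rather than one in twenty, which is the heart of the objection to running many separate t-tests.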
Consequently, a single test for the significance of the differences among all the means of the experiment seems necessary as a means of reducing computational effort and permitting a more correct determination of the significance of differences obtained in the experiment. An analysis of variance can do just that. How neat is that?

The question raised by the analysis of variance is whether the sample means differ from one another (among-groups variance) to a greater extent than the scores differ from their own sample means (within-group variance). If the variation of sample means from the grand mean is substantially greater than the variation of the individual scores from their sample means, the samples are different enough to reject the null hypothesis, that is, the sampling error explanation. If the among-groups variance is not substantially greater than the within-group variance, the samples are not significantly different and probably behave as random samples from the same population. In other words, the analysis of variance technique is simply a method which provides an objective criterion for deciding whether the variability between groups is large enough in comparison with the variability within groups to justify the inference that the means of the populations from which the different groups were drawn are not all the same.
F = \frac{\text{Variance among groups}}{\text{Variance within groups}}
Understand that the analysis of variance technique is appropriate for use with two or more groups. However, it is most frequently employed with three or more groups, since the t-test is available for experiments involving two groups.
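The ratio of among-groups variance to within-group variance can be illustrated with a small sketch. The three groups of scores below are HYPOTHETICAL, chosen only so the arithmetic is easy to follow, and the variable names are illustrative. The code uses the usual one-way analysis of variance quantities (sums of squares divided by their degrees of freedom), which this text does not work through by hand.

```python
# HYPOTHETICAL scores for three strength training methods (not from the text)
groups = {
    "method A": [4, 5, 6],
    "method B": [6, 7, 8],
    "method C": [8, 9, 10],
}

all_scores = [score for scores in groups.values() for score in scores]
grand_mean = sum(all_scores) / len(all_scores)

ss_among = 0.0    # variation of the sample means around the grand mean
ss_within = 0.0   # variation of the scores around their own sample mean
for scores in groups.values():
    group_mean = sum(scores) / len(scores)
    ss_among += len(scores) * (group_mean - grand_mean) ** 2
    ss_within += sum((score - group_mean) ** 2 for score in scores)

df_among = len(groups) - 1                 # number of groups minus 1
df_within = len(all_scores) - len(groups)  # total scores minus number of groups

variance_among = ss_among / df_among
variance_within = ss_within / df_within
f_ratio = variance_among / variance_within

print(f"variance among groups  = {variance_among:.2f}")
print(f"variance within groups = {variance_within:.2f}")
print(f"F = {f_ratio:.2f} with {df_among} and {df_within} df")
```

The computed F would then be compared with a critical value from an F table for the given degrees of freedom, in the same way the computed t was compared with the critical t.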
Analysis of Covariance

The analysis of covariance (ANCOVA) represents an extension of the analysis of variance, particularly useful when it is not possible to compare randomly selected samples or when the samples cannot be matched or blocked. In such cases, a pretest is administered to each group before the administration of the independent variable (experimental treatment). At the end of the experimental period, a post-test is given and the gain evaluated by a test of covariance.

What the analysis of covariance does is use a linear predictor, known as a linear regression equation, to predict what would be a normal progression from pretest to post-test. By doing this, it neutralizes the differences in the pretest scores. Then the analysis of variance is capable of testing the post-test scores even if the pretest scores are not equal. A neat little trick done with mirrors…NOT! In actuality, an analysis of covariance is really an analysis of variance with a covariate to control for variables that are beyond the experimenter's control, such as differences in pretest scores. In short, the adjusted means are the post-test means you would have expected if all of the groups in the study had the same pretest means. Consequently, the ANCOVA can analyze data collected from groups which were not initially equal. A neat little trick…done without the use of mirrors even.

Since analysis of variance and covariance are complex tests, only the basic purposes have been described. I know you are glad to hear that. Still, it is important that you understand these methods, when to use them and how to interpret them.
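The idea of adjusted means can be made concrete with a small sketch. The scores below are HYPOTHETICAL, and the code shows only the adjustment step, using the common textbook form (post-test mean minus the pooled within-group regression slope times the group's pretest deviation from the grand pretest mean). It is not a full analysis of covariance: no F-test is computed, and the names used are illustrative.

```python
# HYPOTHETICAL pretest and post-test scores for two groups that did not
# start out equal (group B has the higher pretest mean)
groups = {
    "A": {"pre": [10, 12, 14], "post": [20, 23, 26]},
    "B": {"pre": [14, 16, 18], "post": [24, 27, 30]},
}

def mean(values):
    return sum(values) / len(values)

# Pooled within-group slope of post-test on pretest (the linear predictor)
sxy = 0.0
sxx = 0.0
for g in groups.values():
    pre_mean, post_mean = mean(g["pre"]), mean(g["post"])
    sxy += sum((x - pre_mean) * (y - post_mean) for x, y in zip(g["pre"], g["post"]))
    sxx += sum((x - pre_mean) ** 2 for x in g["pre"])
slope = sxy / sxx

# Grand mean of all pretest scores, the common starting point
grand_pre_mean = mean([x for g in groups.values() for x in g["pre"]])

# Adjusted mean: the post-test mean expected had the group started at the grand pretest mean
adjusted = {
    name: mean(g["post"]) - slope * (mean(g["pre"]) - grand_pre_mean)
    for name, g in groups.items()
}

for name, g in groups.items():
    print(f"group {name}: raw post-test mean = {mean(g['post']):.1f}, "
          f"adjusted mean = {adjusted[name]:.1f}")
```

In this made-up example group B has the higher raw post-test mean, but once the head start on the pretest is neutralized, group A has the higher adjusted mean.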
Non-Parametric Tests

When we have measured two or more samples, we may determine the significance of the difference between or among their means by way of a t-test or of the F in an analysis of variance. These statistical techniques are examples of what are called parametric tests, since they require that we estimate from the sample data the value of at least one population characteristic (parameter) such as its SD. One of the assumptions we make in applying these parametric techniques to sample data is that the variable we have measured is normally distributed in the populations from which the samples were obtained. Parametric tests are what we have been talking about for the last couple of hours…I am assuming you are a sloooooooooow reader. If you are a fast reader, it's what we have been talking about for the last couple of minutes.

In some instances, little may be known about the distribution of the population, or it may be known that it differs markedly from a normal distribution. In such a situation a non-parametric, or distribution-free, statistical procedure may be more appropriate. Non-parametric tests may be used in a wider range of situations since they generally require fewer and more easily satisfied assumptions about the population. Quite obviously, non-parametric techniques become very valuable when the samples we have measured suggest that the assumptions underlying an otherwise appropriate parametric test cannot be met. Non-parametric tests, moreover, are not limited to measurement data; they can be used with other types of data such as sets of ranks or classified frequencies. For example, we might ask a coach to rank twenty of his athletes, from the best to the poorest, and then compare the ranks received by ten of the athletes who had been given a weight training course before being selected for
the team with the ranks of the ten athletes who had been given no special training. It is unnecessary to give an illustration of classified frequencies since these were the type of data with which we were dealing in the earlier chapter on chi-square. I should mention, however, that chi-square is itself an example of a non-parametric technique and that several non-parametric tests involve converting our raw data into classified frequencies and then computing a chi-square.

When we have measurement data for which it is permissible to apply a parametric test, we often find that it is more time consuming and computationally involved to compute the t-test, analysis of variance, or whatever, than a parallel non-parametric test. However, whenever possible we should continue to use parametric tests in preference to non-parametric ones for the following reason. If we apply a parametric and a non-parametric test to the same data, we usually find that the latter is less powerful in terms of rejecting the null hypothesis: other things being equal, a larger difference is needed between groups to reject the null hypothesis at a given significance level for a non-parametric than for a parametric test. In other words, parametric tests are generally more powerful than non-parametric tests. Non-parametric techniques, then, are reserved for situations in which we need the results of a significance test in a hurry or in which the nature of our data makes the use of a parametric test inappropriate. The latter occurs either because we are not dealing with measurement data or because our measurement data suggest that the assumptions underlying the use of parametric tests are not met.

Now I have some more good news for you. It is not the purpose of this text to teach the computational procedures for non-parametric tests. You might want to know that of the many non-parametric tests the three most commonly used are:

1. Chi-square test
2. Spearman rank-order correlation coefficient
3. Sign test

If you want to know how they are used and calculated, buy yourself a "freaken" statistics book because that is all you are getting from me.

There is another thing you need to understand: even handheld computers can process the most in-depth mathematical computations within moments. The question now is, why did I teach you how to do all those computational procedures by hand if a computer can do them in seconds? Well, let me start out by saying this: a lot of my students are absolutely amazed to find that 8 x 7 equals 56, and they are totally flabbergasted to find that 7 x 8 will give them the same answer. Now, I am not saying they are dumber than an ox, but I won't say they are any smarter either. Well, I want things to be a lot better for you. I want you to understand that addition is the inverse of subtraction and multiplication is the inverse of division. I also want you to know that 7 x 8 = 56 and 8 x 7 will give you the same answer, and I want you to know why. That is what all this is about…knowing WHY.

In fact, all this information I am throwing at you is about understanding and comprehension. Simply put, it's about understanding why you're doing what you're doing. A technical college will give you a formula and tell you that it works for most instances. A traditional college will show you why the formula works, how it works, and when it can be applied. The education you are to receive in a traditional college setting isn't about plugging elements into an equation. As indicated, a computer can do that efficiently without you. What we are doing here is about understanding the relationship between the elements that are entered into the equation. It's about understanding, for instance, that an analysis of covariance utilizes a linear predictor to forecast what would be a normal progression from pre-test to post-test means, and by so doing, it neutralizes
the differences in the pre-test scores, thus giving the analysis of variance the capability of testing the post-test scores even if the pretest scores are not equal. Again, it is understanding WHY those equations work the way they do so that you will be able to expand your thought processes to consider conditions that may not apply 95% of the time. In other words, I am hoping that through our efforts here you will develop a keener sense of cognitive clearance in dealing with statistical procedures and problems.

One last thought. Life is not always a plug-and-play component. To teach with that limited a mindset would be an injustice to you and pure laziness on my part. Like I said, I want the best for you and I am doing my best to see that you get the best. WORD!