Computing a t-test for the Difference between Two Means of Paired Samples

Frequently, we are interested in comparing "before" and "after" scores for a group in order to evaluate the effectiveness of a training program or a teaching technique. Or, we may have administered a test to "matched pairs" of students. In a matched-pairs design, you select pairs of subjects with identical characteristics and assign one member of each pair to an experimental group and the other to a control group in order to ensure the groups are equal. For either of these two situations (same group or matched groups), a t-test for the difference between two means for paired samples may be appropriate. We will discuss the assumptions necessary for its use before we look at a specific example.

Criteria for Selecting a t-Test for the Difference between Two Means

A t-test is selected for testing hypotheses regarding samples if certain assumptions can be made. Perhaps the most important of these is that the scores or measurements of the population(s) about which conclusions will be drawn (boys who want to throw the shot-put; outfielders; infielders) would form a normal or near-normal distribution. If it is known that the measurement on the population differs markedly from normal, a different test should be selected. The t-test for the difference between two means for paired samples is used for test-retest situations with a single group. It is also used when two samples have been matched on some characteristic so that they are assumed to be "identical." The t-test for the difference between two means for independent samples is used when the scores of one sample are in no way dependent on the scores of the other sample. (See Table 4-5.)

Table 4-5  How to choose a t-test for the difference between 2 means

t-test for independent samples:
  - 2 independent samples
  - Normally distributed population(s)

t-test for paired samples:
  - Test-retest of same sample
  - 2 samples matched on some characteristic
  - Normally distributed population(s)

Suppose a coach advocates a weight training program for shot-putters because he believes that weight training will enhance their ability to throw the shot-put farther. To support his position, he conducts the following experiment. He selects 10 boys and measures how far each can throw the shot-put. After participating in a weight training program for six weeks, each boy is retested in the shot-put. The coach reasons that, if the difference between the "before" and "after" distances is statistically significant, he can safely conclude that a weight training program is beneficial to boys who wish to throw the shot-put. (Note that he will use statistics to draw a conclusion about the population of boys who wish to throw the shot-put from the actual experience of a small sample of boys he drew from that population.) He assumes that, if the training program has no effect, the population's mean score on the test before training (M1) will be about equal to the population mean score on the test after training (M2). The null hypothesis (HO) is indicated by HO: M1 = M2. If, however, the weight training program has the desired effect of increasing the distance of the shot-put, he expects M1 to be less than M2. The alternate hypothesis (HA) is denoted HA: M1 < M2. The coach selects the .05 level of significance as the dividing line between chance differences and differences due to the training program.

The degrees of freedom for the t-test for the difference between two means for paired samples is one less than the number of people (N - 1). In this example, df = 10 - 1 = 9. Once again we refer to the table in Appendix A. With 9 df, the critical value of t at the .05 level of significance is 1.833. This means that 5% of the area under the t-curve for 9 df falls above 1.833. This can also be interpreted by saying that the probability of getting a t value above 1.833 due to chance alone is less than 5%.

Figure 4-2  t-distribution for 9 df showing critical value of t

Steps in Performing a t-Test for the Difference between Two Means for Paired Samples

1. State the null hypothesis - HO: M1 = M2
2. State the alternate hypothesis - HA: M1 < M2
3. Fill in the subject, pretest (X1) and post-test (X2) columns
4. Form the D column. Find the difference, X2 - X1, for each subject
5. Form the D² column. Square each value in the D column
6. Find Σ D². Sum the D² column
7. Use the formula

       t = Σ D / √[ (N·Σ D² − (Σ D)²) / (N − 1) ],   df = N − 1

   to compute t. In this case, t = 5.46
8. Make a rough sketch of the t-curve
9. Determine the number of degrees of freedom. A t-test for paired samples has N - 1 df. In this case, there are 9 df.
10. Determine the critical value of t by using the table in Appendix A. The critical value is the t value which has 5% of the area above it. In this case, with 9 df, that critical value is 1.833.
11. Shade the Region of Rejection. This is the area in the tail of the curve above the critical value. The probability of getting a t-value in the Region of Rejection is less than 5%.
12. See if the computed t-value falls in the Region of Rejection. In this case, t = 5.46 does fall in the Region of Rejection. This means we reject HO and accept HA.
13. Write the conclusion. This includes the statement in statistical terms and its implication for this particular situation.

The computed t-value is larger than the critical value (t = 1.833, 9 df at the .05 level) and therefore falls in the Region of Rejection. Thus, we reject HO and accept the alternate hypothesis that M1 < M2. Since the mean after training is significantly greater, we conclude that the weight training program was effective in increasing distance for shot-putters.

Table 4-6  t-test for the difference between 2 means for paired samples

HO: M1 = M2
HA: M1 < M2

Subject   Distance Before (X1)   Distance After (X2)     D      D²
   1              20                     40              20     400
   2              30                     40              10     100
   3              35                     45              10     100
   4              30                     30               0       0
   5              40                     55              15     225
   6              50                     55               5      25
   7              40                     50              10     100
   8              35                     50              15     225
   9              25                     35              10     100
  10              45                     50               5      25

N = 10    Σ X1 = 350    Σ X2 = 450    Σ D = 100    Σ D² = 1300
          M1 = 35       M2 = 45

t = Σ D / √[ (N·Σ D² − (Σ D)²) / (N − 1) ]
  = 100 / √[ (10·1300 − (100)²) / 9 ]
  = 100 / √333.33
  = 100 / 18.3
  = 5.46

df = N − 1 = 9
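The whole computation in Table 4-6 can be reproduced in a few lines of code. The Python sketch below is our addition, not part of the original lesson; it simply retraces steps 3 through 7 with the table's data.

```python
import math

# "Before" and "after" shot-put distances for the 10 boys (Table 4-6)
before = [20, 30, 35, 30, 40, 50, 40, 35, 25, 45]
after  = [40, 40, 45, 30, 55, 55, 50, 50, 35, 50]

n = len(before)
d = [x2 - x1 for x1, x2 in zip(before, after)]   # D column: X2 - X1
sum_d = sum(d)                                   # Σ D  = 100
sum_d2 = sum(x * x for x in d)                   # Σ D² = 1300

# t = Σ D / √[ (N·Σ D² − (Σ D)²) / (N − 1) ]
t = sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))
df = n - 1

print(round(t, 2), df)   # ≈ 5.48 with 9 df
```

Carrying full precision gives t ≈ 5.48; the text's 5.46 comes from rounding √333.33 to 18.3 partway through. Either way, t comfortably exceeds the critical value of 1.833.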

Other Tests of Significance

As previously mentioned, other tests of significance frequently appear in research literature. Among these are other t-tests, various types of analysis of variance (F-tests), analysis of covariance, and the critical ratio (t-test). The choice of test depends on the nature of the data collected and the use you wish to make of it. Associated with each test are certain assumptions about the population from which the sample is drawn. In order to conduct a t-test or an analysis of variance, for example, the researcher must assume that the population is normally distributed. It is not the purpose of this text to teach the computational procedures for all of the other tests mentioned. The principle of hypothesis testing is basically the same for all of them. In each case, a null and alternate hypothesis is formed, a statistic computed, and the likelihood that the statistic would occur due to chance determined. On the basis of that likelihood, the null hypothesis is accepted or rejected and a conclusion is drawn. Two tests that I would like to discuss briefly are the analysis of variance and the analysis of covariance. Both of these tests appear quite often in the research literature, and therefore it would be judicious on your part to have a fundamental understanding of what they are and how they can be used. You don't want to be a hand calculator your whole life... do you? Don't answer that.

Analysis of Variance (F) (ANOVA)

We have noted that t-tests were employed to determine whether, after an experiment, the means of two random samples were too different to attribute to sampling error. The analysis of variance is a convenient way to determine whether the means of more than two random samples are too different to attribute to sampling error. First of all, you might ask why an experiment should ever include more than two variations in condition. Two answers may be given to this question: (a) we may be interested in studying more than two conditions at a time; (b) the data obtained with two conditions may give quite a different answer to our experimental problem than would comparable data from three or more conditions. Referring to our first point, suppose that an investigator is interested in the relative effectiveness of various methods of strength training. We can see that there might be three, four, or even more strength training methods which he would like to compare rather than only two. However (and this is our second point), if the experimenter studied only two strength training methods and found no differences in the results, he might conclude that strength training methods never differentially affect the students' performance when, in fact, some other strength training method which could have been included in the experiment would have affected performance significantly. Let us consider another experiment in which more than two variations in experimental conditions would be desirable. If we wished to determine whether steroids enhanced strength, we could give one group steroids and use the other group as a control group. The problem with this design is that there could be a placebo effect that could not be accounted for by the control group; a third group receiving a placebo would be needed to rule it out. Here too, we need more than two groups in order to obtain an adequate picture of the relationship between our experimental variable and the behavior being observed.
Once it is agreed that experiments involving more than two groups or conditions are sometimes desirable, we are faced with the necessity of testing the significance of the differences among the means obtained with the several conditions. Our first thought would surely be to compute t-ratios and test the significance of the differences between means for all possible pairs of conditions. But this has certain disadvantages. For one thing, there may be a great number of pairs of conditions to be studied. If four methods of strength training were to be compared, there would be six different t-ratios to be found. This fact is shown in the following list of pairs of methods whose data would form the basis for individual t-tests.

pair 1   method A and method B
pair 2   method A and method C
pair 3   method A and method D
pair 4   method B and method C
pair 5   method B and method D
pair 6   method C and method D

As indicated, it would be possible to use a number of t-tests to determine the significance of the difference between four means, two at a time, but it would involve six separate tests. Just think how much fun you would have doing six t-tests. I will answer that for you... loads of fun. Not only would a great deal of numerical work be required, but the use of separate t-tests for analyzing the results of an experiment based upon more than two groups would lead to results which could not easily be interpreted. If we obtain one significant t-value out of six, what does this mean? Had we performed a single t-test, we would have had a 5% probability of obtaining a t significant at the 5% level by random sampling alone, but with six t-tests our chances of having at least one of them significant because of random sampling is much larger. To carry this argument to its extreme (something I love doing), if a million t-tests were involved in an experiment, a single significant t-value would certainly be no indication that the conditions being varied had an effect. We would not be handicapped by the fact that increasing the number of t-ratios to be computed for an experiment increases the probability that a significant t will be found if we could state how this probability changes with the number of t-values which are to be computed. Unfortunately this cannot be done easily, and so our use of t leads us to results which cannot be evaluated very easily.
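To put a rough number on that "much larger" chance, here is a short Python sketch (our addition, not the author's). It assumes the six tests are independent, which is not strictly true for pairwise comparisons sharing groups, but it illustrates the direction and approximate size of the problem.

```python
import math

k = 4                    # number of strength training methods
m = math.comb(k, 2)      # number of pairwise t-tests: C(4, 2) = 6

alpha = 0.05
# If the m tests were independent, the probability of at least one
# spuriously "significant" result by random sampling alone would be:
familywise = 1 - (1 - alpha) ** m

print(m, round(familywise, 3))   # 6 tests, ≈ 0.265
```

So instead of a 5% risk of a false positive, the coach running all six comparisons faces roughly a one-in-four risk, which is exactly why a single overall test such as the analysis of variance is preferred.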
Consequently, a single test for the significance of the differences among all the means of the experiment seems necessary as a means of reducing computational effort and permitting a more correct determination of the significance of differences obtained in the experiment. An analysis of variance can do just that. How neat is that? The question raised by the analysis of variance is whether the sample means differ from one another (among-groups variance) to a greater extent than the scores differ from their own sample means (within-group variance). If the variation of sample means from the grand mean is greater than the variation of the individual scores from their sample means, the samples are different enough to reject the null hypothesis (the sampling-error explanation). If the among-groups variance is not substantially greater than the within-group variance, the samples are not significantly different and probably behave as random samples from the same population. In other words, the analysis of variance technique is simply a method which provides an objective criterion for deciding whether the variability between groups is large enough in comparison with the variability within groups to justify the inference that the means of the populations from which the different groups were drawn are not all the same.

F = Variance among groups / Variance within groups
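The F ratio above can be computed from first principles in a few lines. The Python sketch below is our addition; the three groups and their scores are invented purely for illustration.

```python
# One-way ANOVA F ratio for three hypothetical strength training methods
# (the data are made up for illustration)
groups = [
    [4, 5, 6, 5],    # method A
    [7, 8, 9, 8],    # method B
    [5, 6, 7, 6],    # method C
]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n_total

# Among-groups variance: how far the group means sit from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within-group variance: how far the scores sit from their own group mean
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
ms_within = ss_within / (n_total - k)

F = ms_between / ms_within
print(round(F, 2))   # 14.0 for this made-up data
```

A large F like this one says the group means spread apart far more than the scores spread within their own groups, which is just the comparison described in the paragraph above; whether it is significant would still be checked against an F table with (k − 1) and (N − k) degrees of freedom.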

Understand that the analysis of variance technique is appropriate for use with two or more groups. However, it is most frequently employed with three or more groups, since the t-test is available for experiments involving two groups.

Analysis of Covariance

The analysis of covariance (ANCOVA) represents an extension of the analysis of variance that is particularly useful when it is not possible to compare randomly selected samples or when the samples cannot be matched or blocked. In such cases, a pretest is administered to each group before the administration of the independent variable (experimental treatment). At the end of the experimental period, a post-test is given and the gain evaluated by an analysis of covariance. What the analysis of covariance does is use a linear predictor, known as a linear regression equation, to predict what would be a normal progression. By doing this, it neutralizes the differences in the pre-test scores. The analysis of variance is then capable of testing the post-test scores even if the pretest scores are not equal. A neat little trick done with mirrors... NOT! In actuality, an analysis of covariance is really an analysis of variance with a covariate used to control variables that are beyond control, such as pre-test differences. In short, the adjusted means are the post-test means you would have expected if all of the groups in the study had the same pretest means. Consequently, the ANCOVA can analyze data collected from groups which were not initially equal. A neat little trick... done without the use of mirrors, even. Since analysis of variance and covariance are complex tests, only the basic purposes have been described. I know you are glad to hear that. Still, it is important that you understand these methods, when to use them, and how to interpret them.
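The "adjusted means" idea can be sketched concretely. The Python below is our addition with invented pretest/post-test numbers; it is only the adjusted-means piece of ANCOVA, not the full test. The adjustment uses the standard form: adjusted mean = post-test mean − b·(pretest mean − grand pretest mean), where b is the pooled within-group regression slope of post-test on pretest.

```python
# Adjusted means for two hypothetical groups (all numbers invented)
pre  = {"treatment": [10, 12, 14, 16], "control": [14, 16, 18, 20]}
post = {"treatment": [20, 23, 26, 29], "control": [22, 25, 28, 31]}

def mean(xs):
    return sum(xs) / len(xs)

# Pooled within-group slope of post-test on pretest
num = den = 0.0
for g in pre:
    mx, my = mean(pre[g]), mean(post[g])
    num += sum((x - mx) * (y - my) for x, y in zip(pre[g], post[g]))
    den += sum((x - mx) ** 2 for x in pre[g])
b = num / den

grand_pre = mean([x for g in pre for x in pre[g]])

# Adjusted mean: the post-test mean we would expect if the group
# had started at the grand pretest mean
adj = {g: mean(post[g]) - b * (mean(pre[g]) - grand_pre) for g in pre}
for g, a in adj.items():
    print(g, a)   # treatment 27.5, control 23.5
```

Notice the reversal: the control group has the higher raw post-test mean (26.5 vs. 24.5), but once its head start on the pretest is neutralized, the treatment group comes out ahead. That correction for unequal starting points is exactly what ANCOVA buys you.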

Non-Parametric Tests

When we have measured two or more samples, we may determine the significance of the difference between or among their means by way of a t-test or of the F in an analysis of variance. These statistical techniques are examples of what are called parametric tests, since they require that we estimate from the sample data the value of at least one population characteristic (parameter), such as its SD. One of the assumptions we make in applying these parametric techniques to sample data is that the variable we have measured is normally distributed in the populations from which the samples were obtained. Parametric tests are what we have been talking about for the last couple of hours... I am assuming you are a sloooooooooow reader. If you are a fast reader, it's what we have been talking about for the last couple of minutes. In some instances, little may be known about the distribution of the population, or it may be known that it differs markedly from a normal distribution. In such a situation a non-parametric, or distribution-free, statistical procedure may be more appropriate. Non-parametric tests may be used in a wider range of situations since they generally require fewer and more easily satisfied assumptions about the population. Quite obviously, non-parametric techniques become very valuable when the samples we have measured suggest that the assumptions underlying an otherwise appropriate parametric test cannot be met. The application of non-parametric tests, however, is not limited to measurement data but can be used with other types of data, such as sets of ranks or classified frequencies. For example, we might ask a coach to rank twenty of his athletes, from the best to the poorest, and then compare the ranks received by ten of the athletes who had been given a weight training course before being selected for

the differences in the pre-test scores, thus giving the analysis of variance the capability of testing the post-test scores even if the pretest scores are not equal. Again, it is understanding WHY those equations work the way they do that will enable you to expand your thought processes to consider conditions that may not apply 95% of the time. In other words, I am hoping that through our efforts here you will develop a keener sense of cognitive clearance in dealing with statistical procedures and problems. One last thought. Life is not always a plug-and-play component. To teach with that limited a mindset would be an injustice to you and pure laziness on my part. Like I said, I want the best for you, and I am doing my best to see that you get the best. WORD!
