TM-5-2


Reliability Reliability may be thought of as the repeatability of test results. If a test is reliable, a student's scores should not differ markedly when he takes the same test again. Some measurements must be exactly the same each time to be considered reliable, while other measurements are allowed more leeway. For example, a scale reading would not be acceptable at all if it were not the same each time a specified weight was placed on the scale. On the other hand, a student could not be expected to obtain the exact same score on a golf test involving the hitting of balls into numbered circles. In the latter example, too many factors are involved that tend to reduce the probability of identical scores. As noted earlier, reliability and validity are interrelated. In fact, reliability is a necessary part of validity, because if consistent measurements cannot be achieved, the test cannot be considered valid. To illustrate this further, suppose that a physical education teacher wishes to measure leg strength. Using a dynamometer and belt arrangement, he is able to obtain an accurate measure of force exerted by the legs. However, the exact angle at which the leg extension test is performed is difficult to determine. Consequently, although the teacher obtains an accurate (and valid) measure of a particular student's leg lift at a specified angle, he may find considerable variation in his scores on subsequent tests because of the difficulty in establishing the correct angle, resulting in a low reliability coefficient. Furthermore, when he tests a group of students for leg strength, his inability to establish the same angle for everyone results in an invalid measure of the group's leg strength at the desired position. In short, in order for a test to be valid, it has to be reliable. However, a test can be highly reliable but not valid. You know why…right? Don’t make me slap you! Because for a test to be valid it has to be measuring what it is supposed to measure.
Just because every time you give a test you get the same answer, meaning it is reliable, does not mean it is measuring what it is supposed to measure. Conversely, if the test is not reliable it can’t be valid, because you are obviously getting different results every time you administer it.
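To put an actual number on reliability, testers commonly correlate the scores from two administrations of the same test; a test-retest (Pearson) correlation near 1.0 means high repeatability. The scores below are invented purely for illustration, so treat this as a sketch of the arithmetic, not a prescribed procedure.

```python
# Test-retest reliability sketch: the same students take the same test twice,
# and the reliability coefficient is estimated as the Pearson correlation
# between the two sets of scores. All scores here are hypothetical.
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical golf-test scores (balls hit into numbered circles) on two days
day1 = [12, 18, 9, 15, 21, 7]
day2 = [13, 17, 10, 14, 22, 8]

print(f"test-retest reliability coefficient: {pearson_r(day1, day2):.2f}")
```

The closer the coefficient is to 1.0, the more repeatable the test; scores that scatter with little relationship between the two days drive the coefficient toward zero.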

Objectivity This is a simple one. For a test to have high objectivity, no matter who administers the test you are going to get the same results. Objectivity of a test pertains primarily to the clarity of the directions for administering and scoring the test. As indicated, high objectivity is obtained when different teachers give a test to the same individuals and obtain approximately the same results. Naturally, the assumption is that the testers are equally competent. Objectivity is a part of reliability, or a form of it. Testing and scoring are two major sources of potential problems regarding the reliability of any test. As with any testing situation, the competency of the tester and the skill and care with which he or she administers the test are determining factors in obtaining reliable and thus valid results. Objectivity is dependent to a large extent on how complete and clear the test instructions are. For example, if a tester were measuring an individual's leg strength on a squat, and the directions are not clear as to the depth of the squat…parallel to the floor or an inch below parallel…there is increased likelihood for error. Moreover, if procedures for scoring are not standardized, some testers, for example in a sprint test, might record scores to the nearest half second, others to the nearest tenth of a second, and so on. Generally, the more sophisticated the test, the more objectivity presents a problem. For example, to evaluate body fat, Lange skinfold calipers are often used. This device requires a trained individual to use it. At the world championships my body fat was evaluated by four different testers. They got scores ranging from 2.3% all the way up to 7%. Every one of those testers came up with a different rating. Now, what does that say about the objectivity of that test? My body fat was then measured with a BOD POD by the same four researchers independently. Every one of them got the same answer…4.1%. What does that tell you about these two instruments from the standpoint of objectivity? Simple: Lange skinfold calipers have a very low objectivity (actually it is rather high when trained administrators use them) and the BOD POD has a very high objectivity when measuring body fat.
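The caliper-versus-BOD-POD story is really a statement about the spread of scores across testers. Here is a quick sketch of that spread; note that only the 2.3% and 7% caliper endpoints and the 4.1% BOD POD value come from the story, while the two middle caliper readings are invented to round out the example.

```python
# Objectivity sketch: one subject measured independently by four testers with
# two instruments. The smaller the spread across testers, the higher the
# objectivity of the instrument.
from statistics import pstdev

# Only the 2.3/7.0 endpoints and the 4.1 BOD POD value come from the anecdote;
# the middle caliper readings are hypothetical.
caliper_readings = [2.3, 4.0, 5.5, 7.0]
bod_pod_readings = [4.1, 4.1, 4.1, 4.1]

for name, readings in (("skinfold calipers", caliper_readings),
                       ("BOD POD", bod_pod_readings)):
    spread = max(readings) - min(readings)
    print(f"{name}: range = {spread:.1f} points, SD = {pstdev(readings):.2f}")
```

A range of zero for the BOD POD versus nearly five percentage points for the calipers is the difference between high and low objectivity in numbers.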

Norms Norms are values considered representative of a specified population. A test that has accompanying norms is definitely desirable. Norms provide information that enables the student and teacher to interpret the student's score in relation to the scores made by other individuals in the same population. An understanding of what constitutes the "same population" is necessary for intelligent use of norm tables. Norms are usually based on age, grade, height, and weight, or various combinations of these characteristics. In norm tables for physical performance there are separate scales for boys and girls; in written tests this distinction is usually not made. The important factor is that norm tables are interpreted in light of the specific group from which the norms were compiled. For example, a standing broad jump score of 8 ft would not be very impressive if it was accomplished by a college athlete, whereas it would be an outstanding achievement if it was performed by a 10-year-old boy. To evaluate performance in relation to a set of norms, one must first evaluate the adequacy of the norms. Several factors should be considered. 1. The number of subjects used in establishing the norms should be sufficiently large. Although sheer numbers do not guarantee accuracy, in general the larger the sample, the more likely it is to approximate the population. 2. The norms should represent the performance of the population for which the test was devised. It would not be appropriate to compile norms from a select group (such as physical education majors) to represent all college students in a physical performance test. Why would I say that? Obviously because physical education majors should be in better shape (at least you would hope so), so they don’t represent the, quote, “normal” population in our society. Similarly, the user of norms should not evaluate the performance of his or her students on the basis of norms designed for a different population. 3.
The geographic distribution that norms represent should be taken into account. Considerable variation in performance is often found among students in different geographic locations. For the most part, local norms are of more value to the teacher than are national norms. 4. The clarity of the directions for test administration and scoring is definitely involved in the evaluation of the accompanying norms. If the testing and scoring procedures that the teacher uses are not identical to those employed by the testers who compiled the norms, the norms are…well, worthless. 5. Norms are only temporary and must be revised periodically. Certain traits, characteristics, and abilities of children today differ from those of children a number of years ago. Consequently, the date on which the norms were established should be considered and weighed accordingly.
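The whole point of a norm table is to turn a raw score into a relative standing, and one common way to express that standing is a percentile rank. The sketch below uses an invented set of standing-broad-jump norms, not real published values, just to show the arithmetic.

```python
# Norms sketch: interpreting one student's raw score as a percentile rank
# against a norm group drawn from the same population (age, sex, and so on).
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring at or below the given score."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)

# Hypothetical standing-broad-jump distances (inches) for 10-year-old boys
norms = [48, 52, 55, 57, 58, 60, 61, 63, 65, 70]

print(percentile_rank(60, norms))  # a 60-inch jump ties or beats 60% of the group
```

The same 60-inch jump scored against norms for college athletes would land at a far lower percentile, which is exactly why a score must be interpreted against norms for its own population.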


Additional criteria for the selection and evaluation of tests In addition to the basic concepts of validity, reliability, objectivity, and norms, several other features should be considerations in the selection and evaluation of tests. Nelson and Johnson, the authors of Practical Measurement for Evaluation in Physical Education, suggest you ask yourself a number of questions before you decide to use a test. The following are some of the questions they propose you ask yourself before selecting a particular test as part of the evaluation program. 1. Is the Test Easy to Administer? Before you select a test you should consider the amount of time, equipment, space, and the number of testers needed to administer the test. How feasible a test is can even extend to the attitude that the tester, and certainly the students, have toward the test. For example, you might not want to administer a test that is so rigorous that some students might actually become ill and others suffer from soreness for days afterward. In such a case, the test could have high validity, reliability, and objectivity coefficients; it could possess any number of desirable test characteristics, such as economy of time, space, or equipment; but it would still be unacceptable if you half kill your students. Don’t laugh, because there are tests out there that can do just that, and there are physical education instructors who love giving those tests…the old “no pain, no gain” concept. Understand that your objective in testing is to evaluate fitness, not to present a life-threatening situation. You are looking for results; you are not trying to induce a myocardial infarction. Remember also that the students have to be motivated to take the test. If the test looks life-threatening rather than life-rewarding, how many students do you think are going to go all out?
Thus, ease of administration encompasses the entire realm of administrative considerations, including attitude in relation to the contribution that the test can make to the program. 2. Does the Test Require Expensive Equipment? This is another question you need to ask yourself…can you afford to give the test? Most likely you won’t have to ask yourself that question. Your school administrator will ask it for you. If you are teaching at one of those rich preppie schools where even the janitors wear Tommy Hilfiger shirts with matching slacks and Rolex watches, you probably have unlimited resources and can use elaborate instruments, machines, and electronic devices to measure human performance with great precision. In reality, if your physical education budget includes food stamps, you will probably be limited as to the type of test you can afford. Such a budget rarely permits the purchase of expensive equipment that is only germane to a specific test. Some tests must be excluded for exactly this reason. One of the outstanding features of the AAHPERD Youth Fitness Test is that it requires almost no equipment. Occasionally, a teacher may have to compromise to some extent by selecting a test having less accuracy than another test but requiring less equipment. This is not desirable, but it is acceptable as long as the test is valid. 3. Can the Test Be Administered in a Relatively Short Time? A perpetual problem that confronts most physical educators is the numerous encroachments on class time. Recognizing this, most authorities recommend that tests and measurement programs consume no more than 10% of the total instructional time. Therefore, any single test battery must be evaluated in terms of economy of time as well as of money. This often presents the teacher with a dilemma. For a test to meet the demands of validity and reliability, a sufficient number of trials must be given, which may consume more time than the teacher wishes to spend.


Attempts to compromise, by reducing the number of test items or trials or both, can result in a serious loss in validity and reliability, thereby reducing the intended worth of the test. The problem is compounded in short activity units, in which time for instruction and practice is at a premium. 4. Can the Test Be Used as a Drill During Practice Sessions? This is not the most desirable situation, but since Nelson and Johnson mentioned it I stuck it in here. Okay, okay, I stuck it in here because I get paid by the word and I thought this was an easy way to make a little extra money. Anywho, although this feature of a test is not always desirable, its presence can offer a partial solution to the problem of economy of time as well as other test criteria. For example, the more familiar students are with a test, the less time the teacher needs to explain the directions and scoring procedures. In addition, practice of the test serves to reduce the effects of insight into the nature of the test, which may cause a rather pronounced improvement in scores in the middle or later portions of the test. If the test is a measure of skills that represent the actual abilities required in the activity, its use as a form of practice would seem to be logical and desirable. This line of reasoning is based on the principle of content validity. In other words, students should be tested on what they have practiced. On the other hand, if the test skill is artificial and the student practices it more than the actual activity, the test then loses its validity. 5. Does the Test Require Several Trained Testers? Some tests are so sophisticated that not only do you need high-tech equipment, but you need high-tech people to administer it. That is not going to go over real well with an administrator who has a food stamp budget to follow.
Also, since some test batteries contain a number of individual test items, in the interest of time it is almost imperative that more than one person be called on to administer the test. Furthermore, some test items require more skill and experience to administer than others (the Balke treadmill test, for example), and thus training and practice are necessary. If more than one person is to give the same test, standardized directions and objectivity coefficients should be established. The utilization of several testers requires considerable planning and organization. Naturally, arrangements must be made to have the testers available at the proper time. Pretest meetings are usually required, and various other details of coordination need to be accomplished. Thus far it would appear that tests that call for several testers are undesirable. At times, however, tests of this nature are of immense value. One such instance is when large-scale comprehensive evaluation is advocated, as is sometimes the case for physical fitness testing at various times during the school year. Generally, placement and screening tests are most effectively administered in this way. To summarize, whenever the abilities that are to be measured necessitate the use of several test items, or when a particular test item requires a specialist to administer it, the use of trained testers is not only expedient but also necessary. On the other hand, teachers who must evaluate students in various activities ordinarily do not have other staff members or trained assistants available and must bear this in mind in the selection of tests. One alternative is to use students as testers. While the use of students as testers may be cost-effective and may take some burden off the teacher, their use is not always feasible or desirable. One reason is that they cheat a lot when it comes to tests…DON’T GIVE ME THAT LOOK, YOU KNOW DARN WELL WHAT I AM TALKING ABOUT…you are a student, aren’t you?


6. Can the Test Be Easily and Objectively Scored? Pay attention here; this is real important. I know I have mentioned objectivity and validity so many times already that it is starting to sound like a mantra. Nevertheless, it calls for further comment. Specifically, one should consider whether (a) the test requires another person to act as an opponent, a thrower, a server, a spotter, and so forth, (b) the students can test and score themselves during practice sessions, and (c) the scores adequately distinguish among different levels of skill. The first consideration, concerning the involvement of another individual in the performance of the student, represents somewhat of a paradox in the construction of a performance test. In most sports, the skill of the opponent has a direct bearing on an individual's performance. In activities such as tennis, badminton, handball, volleyball, baseball, and football, to name but a few, the quality of the performance of a player is relative to the skill of the opponent. In other sports, such as golf, archery, and bowling, a person's performance can be immediately assessed by score alone (playing conditions being equal for everyone, of course). Therefore, in many activities a performance test simply cannot take into account the influence that is rendered by the competitive situation. Recognizing this restriction, the makers of performance tests have attempted to isolate the skills that are involved and to measure them independently. However, if the isolated skills call for the services of another individual, acting either as an opponent or as a teammate, then objectivity is reduced…a not-so-good thing. To illustrate, in an attempt to duplicate the actual activity, it may be desirable to have someone pass a basketball to the person being tested on jump shots, or pitch to a student being tested on batting, or run a pass pattern to test an individual's skill at passing a football.
In these cases it is obvious that the skill of the helping individual could greatly affect the student's score. This is not to say that tests of this type are inferior. On the contrary, if the helper is sufficiently skilled and his performance is constant for each subject, this method of evaluation can be efficient and valid. The helper might well be the teacher, but problems arise in the planning and organizing of a testing arrangement such as this, including a sufficient number of trials and fatigue on the part of the teacher. In the effort to avoid the influence of another person and to preserve high objectivity, there is danger of an artificial situation being created. A student hitting a softball from a batting tee, or bouncing the ball himself before stroking it in tennis, is an example of a situation not found in the actual game condition. This, then, is the paradox inherent in performance test construction…scientific precision versus game-like conditions. A second consideration in scoring a test is whether students can test and score themselves. Did I mention that students cheat a lot on tests? Although this feature of a test is not always applicable, it is usually of considerable importance when the test is to be used as a teaching aid. For instance, a wall volley drill in tennis might be employed as a regularly scheduled exercise and rainy-day activity for development of proficiency in the basic strokes. It also might be one of the measuring devices used in evaluation. In this situation, students could benefit from self-testing throughout the course if they are provided with a record of progress, while at the same time the practice blocks the possible negative effects of their confronting a unique test situation at the end of the unit. A third aspect of scoring relates to the precision with which test scores can differentiate among persons of different abilities.
This consideration overlaps with so many other characteristics of testing that it will only be mentioned briefly here. Tests that stress speed sometimes encourage poor form. Some wall volley tests are examples of this, in that a higher score may be achieved if the person stands very close to the wall and uses mainly wrist action rather than the desired stroking movement. Trust me, students are notorious for figuring out ways to beat tests. They are not so much interested in evaluating their skill as they are in getting a good grade. When you see this type of behavior evolving, go to the student immediately and kill them…did I say that? In some tests, the units of measurement are not fine enough to reflect various levels of ability. In other words, the variance in scoring is too small. A classic example is an agility test in which students score a point each time they cross a center line as they run from one side of the court to the other. Because of the distance involved and the limited opportunity for earning points, most of the scores fall within a range of about 3 points. Other examples include test items scored on a pass-or-fail basis, tests using targets with widely spaced point values, and sprinting tests. 7. Is the Test Challenging and Meaningful? Of vital importance to the success of any testing program is the attitude with which the students approach the tests. Generally speaking, most students like to be tested. Consequently, the challenge of the task, the information that is derived, and the curiosity and competitive nature of the individual are some of the factors that produce a favorable testing situation. On the other hand, students can learn to dread tests for any number of reasons. The student may be made to feel inferior and the object of ridicule by the test, the tester, or other students in the class. Unlike most tests, tests in physical education are performed in groups where everyone in that group gets to see not only your effort but your results. This can be a not-so-pretty sight at times, especially for kids who are…well…motor…should I say dysfunctional…how about motor challenged…okay, let me give it to you straight…motor morons. Now, there, I said it. Not everyone is cut out to be Michael Jordan. If I am not mistaken, I think only one guy was cut out like that.
Still, you don’t want a student to feel like a motor moron, even when, in fact, he is one. In short, you don’t want to embarrass anyone…especially Fat Freddy. Hell, you don’t even want to embarrass Tom Terrific if you can help it. Of course, the physical educator should seek to capitalize on the motivating properties that are generally inherent in physical performance tests. The tests and the conditions in which they are given should be carefully considered with regard to student enjoyment. This is a time when the physical education teacher has an excellent opportunity to establish a favorable teacher-student relationship through encouragement and individual attention. 8. The test itself must be challenging. The test must offer sufficient latitude in scoring to accommodate large differences in ability…you know, variance in the scores. A disadvantage of a test such as pull-ups (which is NOT a valid test anywho, because the test is neither a relative nor an absolute measure of strength) is that almost always one or more students in the class cannot do a single pull-up and therefore receives a score of zero. Yep! Fat Freddy! Similarly, a test should allow opportunity to record improvement. Referring again to pull-ups, a student may have improved during the course to the point where he can just about pull himself up to chin level, which for him might represent a significant gain in strength, but he still receives a zero. At the other extreme are tests that have a performance ceiling, which make no provision for better scores after a particular level has been attained. Related closely to the challenging aspect of a test is the degree to which it is meaningful. Performance tests should involve the actual skills that are used in the activity, and the skills should be measured as much as possible in game-like situations. A golf test in which the student is asked to hit a golf ball while running backwards with his eyes closed and reciting Mary Had a Little Lamb can hardly be considered realistic. It is funny watching that, but it may not be measuring exactly what you want it to measure. The teacher would have difficulty convincing students that their ability to play golf was being assessed in such an un-game-like situation. Although that is a “flashing statement of brilliance of the absolute obvious,” some physical education instructors are so brilliant it isn’t that obvious to them. Test selection, then, calls for coupling of scientific considerations with common sense. There is no substitute for good judgment. The physical educator must be constantly alert to the needs and interests of the students as these pertain to tests and measurements as well as to the program under study. Amen…Word! Darn, I am glad that is over with!

Summary Effective teachers will use both assessment and evaluation techniques regularly to improve student learning and to guide instruction. So, does all this talk about validity, reliability, and objectivity mean you need to conduct statistical analyses on your classroom quizzes? No, it doesn't. (Although you may, on occasion, want to ask one of your peers to verify the content validity of your major assessments.) However, you should be aware of the basic tenets of validity, reliability, and objectivity as you construct your classroom assessments, and you should be able to help students interpret scores on the tests they take.

