
Glossary of Technical Terms
The following definitions have been adapted from the Standards for Educational and Psychological Testing (AERA, APA, and NCME 2014) and the 2014 ETS Standards for Quality and Fairness (ETS 2014).
Accessibility. A test is accessible when its design permits as many students as possible to demonstrate what they know and can do, without being impeded by test or item characteristics that are irrelevant to the knowledge or skill domain being assessed.
Accommodations. Changes to a test’s format or administration conditions to address the needs of particular students (for example, extra testing time for students for whom the language of testing is not the language they speak at home). These changes should not alter the knowledge or skill that the test measures or the comparability of student scores.
Achievement, proficiency, performance levels. Descriptions of what students know and can do organized into categories on a continuum aligned with content standards (for example, basic, proficient, advanced).
Adaptation, test adaptation. Changes to the original test format or its administration to increase accessibility for students who would otherwise face barriers unrelated to the knowledge domain being assessed; depending on its nature, an adaptation may or may not affect test score interpretation. Also, changes made to a test as part of the translation and contextualization process for a particular linguistic and cultural group.
Adaptive test. A test, typically administered on a computer, in which easier or more difficult items are presented depending on a student’s correct or incorrect responses to previous items.
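As a purely illustrative sketch (not drawn from the source, and far simpler than an operational engine, which would use item response theory for both item selection and scoring), the following Python snippet selects each next item to match a running ability estimate and adjusts the estimate after every simulated response:

```python
import math
import random

def run_adaptive_test(item_difficulties, true_ability, n_items=5):
    """Toy adaptive test with a simulated student: a correct response
    raises the ability estimate (so a harder item is chosen next),
    an incorrect response lowers it."""
    ability, step = 0.0, 1.0
    remaining = list(item_difficulties)
    for _ in range(n_items):
        # Present the item whose difficulty best matches the current estimate.
        item = min(remaining, key=lambda d: abs(d - ability))
        remaining.remove(item)
        # Simulate the response with a Rasch-style probability of success.
        correct = random.random() < 1 / (1 + math.exp(item - true_ability))
        ability += step if correct else -step
        step *= 0.7  # take smaller steps as evidence accumulates
    return ability

bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # invented item difficulties
print(round(run_adaptive_test(bank, true_ability=0.8), 2))
```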
Alignment. The degree to which the content and cognitive demands of test items match the targeted content and cognitive demands described in the test specifications.
Alternate assessments, tests. Tests designed to assess the performance of students who are unable to participate in the regular assessment, even with accommodations. Also, alternate forms or editions of the same test that measure the same knowledge and skills at the same difficulty level but with different items or tasks.
Comparability, score comparability. The extent to which scores from two or more tests are comparable. The degree of score comparability depends on the type of linking procedure used.
Content standard. A statement of content and skills that students are expected to learn in a subject matter area, often by a particular grade or upon completion of a particular level of schooling.
Criterion-referenced score interpretation. Test score interpretation in relation to a criterion domain. A common example is the use of cut scores and proficiency levels to describe what students with different test scores know and are able to do in a subject area.
Cut score. A point on a score scale above which students are classified differently from those below it. Score interpretation and results reporting differ for students above and below the cut score (for example, pass versus fail, basic versus proficient).
Equating. The statistical process of expressing scores from two or more alternative test forms on a common score scale.
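One of the simplest procedures is mean-sigma linear equating, sketched below in Python with invented score data (operational programs typically use equipercentile or item-response-theory methods instead):

```python
import statistics

def linear_equate(score_x, form_x_scores, form_y_scores):
    """Mean-sigma equating: map a form X score onto the form Y scale
    by matching the means and standard deviations of the two forms."""
    mx, sx = statistics.mean(form_x_scores), statistics.stdev(form_x_scores)
    my, sy = statistics.mean(form_y_scores), statistics.stdev(form_y_scores)
    # A score z standard deviations above the form X mean maps to the
    # score z standard deviations above the form Y mean.
    return my + sy * (score_x - mx) / sx

# Invented raw scores from two randomly equivalent groups of students
form_x = [18, 22, 25, 27, 30, 33, 35]
form_y = [20, 24, 28, 30, 33, 36, 39]
print(round(linear_equate(25, form_x, form_y), 1))
```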
Fairness. A test is fair when any differences in performance between subgroups of students are derived from construct-relevant sources of variance. That is, construct-irrelevant contextual or individual characteristics should not systematically affect test scores that one or more subgroups of students obtain. Group differences in performance do not necessarily make a test unfair, because the groups may differ on the knowledge domain being assessed.
Linking, score linking. Procedure for expressing scores from different tests in a comparable way. Linking methods range from statistical equating to the judgment of subject matter experts.
Norm-referenced score interpretation. Score interpretation based on comparing a student’s performance with the score distribution of a reference group (also known as the norm group). For instance, a student’s score can be described in terms of how far it is from the average for a national sample of students taking the same assessment.
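To make this concrete (the norm sample below is invented), the following Python sketch reports a score both as a distance from the norm-group mean in standard deviation units and as a percentile rank:

```python
import statistics

def norm_referenced_summary(score, norm_scores):
    """Describe a score relative to a reference (norm) group."""
    mean, sd = statistics.mean(norm_scores), statistics.stdev(norm_scores)
    z = (score - mean) / sd  # distance from the mean, in SD units
    pct = 100 * sum(s < score for s in norm_scores) / len(norm_scores)
    return z, pct

norms = [480, 495, 500, 510, 520, 530, 545, 560]  # invented norm sample
z, pct = norm_referenced_summary(530, norms)
print(f"{z:+.2f} SD from the norm mean; above {pct:.0f}% of the norm group")
```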
Reliability, precision. The extent to which test scores are free of random measurement error; the likely consistency of the attained test scores across assessment administrations, use of alternative test forms, or scoring by different raters.
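One widely used index of internal consistency is Cronbach's alpha; the Python sketch below computes it from a small invented matrix of item scores, purely as an illustration:

```python
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha from a students-by-items matrix of item scores:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(item_scores[0])  # number of items
    item_vars = [statistics.variance(col) for col in zip(*item_scores)]
    total_var = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented responses: 5 students by 4 right/wrong items
data = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
print(round(cronbach_alpha(data), 2))
```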
Scale score. A score produced by transforming raw test scores into a different metric to facilitate interpretation.
Scaling. The process of transforming raw test scores into scale scores.
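As a minimal sketch (the scale constants and raw-score statistics are invented for illustration), a common linear form of scaling places raw scores on a reporting scale with a chosen mean and standard deviation:

```python
def to_scale_score(raw, raw_mean, raw_sd, scale_mean=500, scale_sd=100):
    """Linearly transform a raw score onto a reporting scale, here one
    centered at 500 with standard deviation 100 (a common convention)."""
    return scale_mean + scale_sd * (raw - raw_mean) / raw_sd

# A raw score of 34 on a test with raw mean 28 and raw standard deviation 6
print(to_scale_score(34, raw_mean=28, raw_sd=6))  # -> 600.0
```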
Scoring rubric. Established criteria, including rules, principles, and examples, used in scoring open-ended items and performance tasks. The scoring rubric should include rules for and examples of each score level.
Standard-setting. Methods used to determine cut scores on a test and to map test scores onto discrete proficiency levels. Standard-setting normally requires the judgment of subject matter experts and, in some cases, information about test properties and the distribution of test scores.
Standardization. Set of procedures and protocols to be followed in the test development and administration process to ensure consistency in testing conditions for all students. Standardization is necessary for the fair comparison of students’ test scores. Exceptions to standardization may occur when students require accommodations to take the test.
Test specifications. Documentation of the purpose and intended uses of a test and of the test’s content, format, length, psychometric characteristics of the items and test overall, delivery mode, administration, scoring, and score reporting.
Universal design. An approach to test development and administration that aims to make a test accessible to all of its intended students.
Validity. Extent to which interpretations of scores and actions taken on the basis of these scores are appropriate and justified by evidence and theory. Validity refers to how test scores are interpreted and used rather than to the test itself.
Vertical scaling. Procedure to express scores comparably when underlying tests differ in difficulty. Vertical scaling is commonly used to report results from tests administered to students in different grades on the same scale.
References
AERA (American Educational Research Association), APA (American Psychological Association), and NCME (National Council on Measurement in Education). 2014. Standards for Educational and Psychological Testing. Washington, DC: AERA.
ETS (Educational Testing Service). 2014. 2014 ETS Standards for Quality and Fairness. Princeton, NJ: ETS.
To improve their education systems, countries around the world have increasingly initiated national large-scale assessment programs or participated in international or regional large-scale assessment studies for the first time. Well-constructed large-scale assessments can provide credible information on student achievement levels, which, in turn, can promote better resource allocation to schools, stronger education service delivery, and improved learning outcomes. The World Bank developed this Primer on Large-Scale Assessments of Educational Achievement as a first-stop resource for those wanting to understand how to design, administer, analyze, and use the results from these assessments of student achievement. The book addresses frequently asked questions from people working on large-scale assessment projects and those interested in making informed decisions about them. Each chapter introduces a stage in the assessment process and offers advice, guidelines, and country examples. This book also reports on emerging trends in large-scale assessment and provides updated information on regional and international large-scale assessment programs.
DIRK HASTEDT, Executive Director of the International Association for the Evaluation of Educational Achievement (IEA) “A special feature of the publication is that it not only gives an overview of technical specifications, but also includes examples from around the world on how countries are conducting large-scale assessments, what they found, and how the results were used. With this perspective, the Primer on Large-Scale Assessments of Educational Achievement is an excellent and easy-to-read publication to get a comprehensive overview of large-scale assessments and how and why they are conducted.”
SILVIA MONTOYA, Director of UNESCO Institute for Statistics (UNESCO UIS) “If you are responsible for learning assessment in a country and are searching for a comprehensive, yet readable, guide on large-scale assessment, this is your book. Extremely well structured and written, this primer is easy to follow, and makes points clearly and concisely. It is an excellent resource that explores the steps for a good large-scale assessment with examples from all international large-scale assessment programs.”
ANDREAS SCHLEICHER, Director for the Directorate of Education and Skills and Special Advisor on Education Policy to the Organization for Economic Co-operation and Development’s (OECD) Secretary-General “Many countries have joined international educational assessments to benchmark quality, equity, and efficiency in their education systems. But what does it take to design and implement those efforts well and to draw value from them to help students learn better, teachers teach better, and schools work more effectively? This Primer on Large-Scale Assessments of Educational Achievement helps policy makers and their technical teams to find answers to these questions.”
ANDREI VOLKOV, Director of the Institute for Public Strategy, Moscow School of Management SKOLKOVO “In 2008, when the Russia Education Aid for Development (READ) Program was launched, we determined its main goal was the improvement of the quality of basic education. Today, the READ Program keeps setting trends as the largest Russian initiative promoting educational assessment. Approaches developed within the READ Program, from building institutional and expert capacity to influencing educational reforms, have proven their efficacy in many countries. The Primer on Large-Scale Assessments of Educational Achievement brings together in a practical format the best experience and case studies in conducting assessments under the READ Program. An especially important feature of the book is an integrated capacity building component, which makes it a practical tutorial ready for use in different cultural contexts. Through this book, we hope that our collective experience gathered during READ will be widely shared, bringing us closer to achievement of the Sustainable Development Goal on Education.”
ISBN 978-1-4648-1659-8