
Assessment of Competencies in Higher Education Editors Sigrid Blömeke Jan-Eric Gustafsson Richard J. Shavelson

Volume 223 / Number 1 / 2015


In our globalized, knowledge-based societies with increased demands for competencies in the workforce, higher education institutions must ensure that their graduates have the competencies they need to succeed. Research on assessment of such competencies is only just beginning. The papers presented here are high-quality studies that integrate theory and methods to provide readers with an overview of the current state of research.

Sigrid Blömeke, Jan-Eric Gustafsson, and Richard J. Shavelson (Editors)

Contents include: Beyond Dichotomies: Competence Viewed as a Continuum Sigrid Blömeke, Jan-Eric Gustafsson, and Richard J. Shavelson

The Relationship of Mathematical Competence and Mathematics Anxiety: An Application of Latent State-Trait Theory Lars Jenßen, Simone Dunekacke, Michael Eid, and Sigrid Blömeke

Zeitschrift für Psychologie

Modeling the Competencies of Prospective Business and Economics Teachers: Professional Knowledge in Accounting Kathleen Schnick-Vollmer, Stefanie Berger, Franziska Bouley, Sabine Fritsch, Bernhard Schmitz, Jürgen Seifried, and Eveline Wuttke

Validating Test Score Interpretations by Cross-National Comparison: Comparing the Results of Students From Japan and Germany on an American Test of Economic Knowledge in Higher Education Manuel Förster, Olga Zlatkin-Troitschanskaia, Sebastian Brückner, Roland Happ, Ronald K. Hambleton, William B. Walstad, Tadayoshi Asano, and Michio Yamaoka

www.hogrefe.com/journals/zfp


Assessment of Mathematical Competencies and Epistemic Cognition of Preservice Teachers Benjamin Rott, Timo Leuders, and Elmar Stahl

Founded by Hermann Ebbinghaus and Arthur König in 1890 Volume 223 / Number 1 / 2015 ISSN-L 2151-2604 • ISSN-Print 2190-8370 • ISSN-Online 2151-2604

Scientific Reasoning in Higher Education: Constructing and Evaluating the Criterion-Related Validity of an Assessment of Preservice Science Teachers’ Competencies Stefan Hartmann, Annette Upmeier zu Belzen, Dirk Krüger, and Hans Anand Pant

Editor-in-Chief Bernd Leplow

Assessing Professional Vision in Teacher Candidates: Approaches to Validating the Observer Extended Research Tool Kathleen Stürmer and Tina Seidel Opinion: Gaining Substantial New Insights Into University Students’ Self-Regulated Learning Competencies: How Can We Succeed?

ISBN 978-0-88937-473-7


Associate Editors Edgar Erdfelder · Herta Flor · Dieter Frey Friedrich W. Hesse · Heinz Holling · Christiane Spiel


The Zeitschrift für Psychologie, founded in 1890, is the oldest psychology journal in Europe and the second oldest in the world. One of the founding editors was Hermann Ebbinghaus. Since 2007, it has appeared in English and is devoted to publishing topical issues that provide convenient state-of-the-art compilations of research in psychology, each covering an area of current interest. The Zeitschrift für Psychologie is available as a journal in print and online by annual subscription, and the different topical compendia are also available as individual titles by ISBN.

Zeitschrift für Psychologie, Volume 223, No. 1, 2015

Editor-in-Chief

Bernd Leplow, Institute of Psychology, University of Halle-Wittenberg, Brandbergweg 23, D-06120 Halle, Germany, Tel. +49 345 552-4358(9), Fax +49 345 552-7218, E-mail bernd.leplow@psych.uni-halle.de

Associate Editors

Edgar Erdfelder, Mannheim, Germany; Herta Flor, Mannheim, Germany; Dieter Frey, Munich, Germany; Friedrich W. Hesse, Tübingen, Germany; Heinz Holling, Münster, Germany; Christiane Spiel, Vienna, Austria

Editorial Board

G. M. Bente, Cologne, Germany; D. Dörner, Bamberg, Germany; N. Foreman, London, UK; J. Funke, Heidelberg, Germany; W. Greve, Hildesheim, Germany; W. Hacker, Dresden, Germany; R. Hartsuiker, Ghent, Belgium; J. Hellbrück, Eichstätt-Ingolstadt, Germany; R. Hübner, Konstanz, Germany; A. Jacobs, Berlin, Germany; M. Jerusalem, Berlin, Germany; A. Kruse, Heidelberg, Germany; W. Miltner, Jena, Germany; T. Moffitt, London, UK; A. Molinsky, Waltham, MA, USA; H. Moosbrugger, Frankfurt/Main, Germany; W. Schneider, Würzburg, Germany; B. Schyns, Durham, UK; B. Six, Halle, Germany; P. K. Smith, London, UK; W. Sommer, Berlin, Germany; A. von Eye, Vienna, Austria; K. Wiemer-Hastings, DeKalb, IL, USA

Publisher

Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 99950-0, Fax +49 551 99950-425, E-mail publishing@hogrefe.com
North America: Hogrefe Publishing, 38 Chauncy Street, Suite 1002, Boston, MA 02111, USA, Tel. +1 866 823-4726, Fax +1 617 354-6875, E-mail customerservice@hogrefe-publishing.com

Production

Christina Sarembe, Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 99950-424, Fax +49 551 99950-425, E-mail publishing@hogrefe.com

Subscriptions

Hogrefe Publishing, Herbert-Quandt-Str. 4, 37081 Göttingen, Germany, Tel. +49 551 99950-900, Fax +49 551 99950-998

Advertising/Inserts

Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 99950-423, Fax +49 551 99950-425, E-mail marketing@hogrefe.com

ISSN

ISSN-L 2151-2604, ISSN-Print 2190-8370, ISSN-Online 2151-2604

Copyright Information

© 2015 Hogrefe Publishing. This journal as well as the individual contributions and illustrations contained within it are protected under international copyright law. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without prior written permission from the publisher. All rights, including translation rights, reserved.

Publication

Published in 4 topical issues per annual volume.

Subscription Prices

Calendar year subscriptions only. Rates for 2015: Institutions US $364.00 / €266.00 / £213.00; Individuals US $195.00 / €139.00 / £111.00 (all plus US $16.00 / €12.00 / £10.00 shipping & handling; €6.00 in Germany). Single issue US $49.00 / €34.95 / £27.90 (plus shipping & handling).

Payment

Payment may be made by check, international money order, or credit card, to Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany. US and Canadian subscriptions can also be ordered from Hogrefe Publishing, 38 Chauncy Street, Suite 1002, Boston, MA 02111, USA.

Electronic Full Text

The full text of Zeitschrift für Psychologie is available online at www.psyjournals.com and in PsycARTICLES™.

Abstracting Services

Abstracted/indexed in Current Contents/Social and Behavioral Sciences (CC/S&BS), Social Sciences Citation Index (SSCI), Research Alert, PsycINFO, PASCAL, PsycLit, IBZ, IBR, ERIH, and PSYNDEX. Impact Factor (2013): 1.036




Library of Congress Cataloging in Publication is available via the Library of Congress Marc Database under the LC Control Number 2014958932

Cover image © Robert Kneschke – fotolia.com

© 2015 Hogrefe Publishing

PUBLISHING OFFICES

USA: Hogrefe Publishing, 38 Chauncy Street, Suite 1002, Boston, MA 02111, Phone (866) 823-4726, Fax (617) 354-6875, E-mail customerservice@hogrefe.com

EUROPE: Hogrefe Publishing GmbH, Merkelstr. 3, 37085 Göttingen, Germany, Phone +49 551 99950-0, Fax +49 551 99950-425, E-mail publishing@hogrefe.com

SALES & DISTRIBUTION

USA: Hogrefe Publishing, Customer Services Department, 30 Amberwood Parkway, Ashland, OH 44805, Phone (800) 228-3749, Fax (419) 281-6883, E-mail customerservice@hogrefe.com

UK: Hogrefe Publishing, c/o Marston Book Services Ltd., 160 Eastern Ave., Milton Park, Abingdon, OX14 4SB, UK, Phone +44 1235 465577, Fax +44 1235 465556, E-mail direct.orders@marston.co.uk

EUROPE: Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Phone +49 551 99950-0, Fax +49 551 99950-425, E-mail publishing@hogrefe.com

OTHER OFFICES

CANADA: Hogrefe Publishing, 660 Eglinton Ave. East, Suite 119-514, Toronto, Ontario M4G 2K2

SWITZERLAND: Hogrefe Publishing, Länggass-Strasse 76, CH-3000 Bern 9

Hogrefe Publishing: Incorporated and registered in the Commonwealth of Massachusetts, USA, and in Göttingen, Lower Saxony, Germany

No part of this publication may be reproduced, stored in a retrieval system or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the publisher.

ISBN 978-0-88937-473-7


Contents

Editorial

Approaches to Competence Measurement in Higher Education (p. 1)
Sigrid Blömeke, Jan-Eric Gustafsson, and Richard J. Shavelson

Review Article

Beyond Dichotomies: Competence Viewed as a Continuum (p. 3)
Sigrid Blömeke, Jan-Eric Gustafsson, and Richard J. Shavelson

Original Articles

Validating Test Score Interpretations by Cross-National Comparison: Comparing the Results of Students From Japan and Germany on an American Test of Economic Knowledge in Higher Education (p. 14)
Manuel Förster, Olga Zlatkin-Troitschanskaia, Sebastian Brückner, Roland Happ, Ronald K. Hambleton, William B. Walstad, Tadayoshi Asano, and Michio Yamaoka

Modeling the Competencies of Prospective Business and Economics Teachers: Professional Knowledge in Accounting (p. 24)
Kathleen Schnick-Vollmer, Stefanie Berger, Franziska Bouley, Sabine Fritsch, Bernhard Schmitz, Jürgen Seifried, and Eveline Wuttke

The Relationship of Mathematical Competence and Mathematics Anxiety: An Application of Latent State-Trait Theory (p. 31)
Lars Jenßen, Simone Dunekacke, Michael Eid, and Sigrid Blömeke

Assessment of Mathematical Competencies and Epistemic Cognition of Preservice Teachers (p. 39)
Benjamin Rott, Timo Leuders, and Elmar Stahl

Scientific Reasoning in Higher Education: Constructing and Evaluating the Criterion-Related Validity of an Assessment of Preservice Science Teachers’ Competencies (p. 47)
Stefan Hartmann, Annette Upmeier zu Belzen, Dirk Krüger, and Hans Anand Pant

Assessing Professional Vision in Teacher Candidates: Approaches to Validating the Observer Extended Research Tool (p. 54)
Kathleen Stürmer and Tina Seidel

Opinion

Gaining Substantial New Insights Into University Students’ Self-Regulated Learning Competencies: How Can We Succeed? (p. 64)
Barbara Schober, Julia Klug, Gregor Jöstl, Christiane Spiel, Markus Dresel, Gabriele Steuer, Bernhard Schmitz, and Albert Ziegler

Call for Papers

“Neural Plasticity in Rehabilitation and Psychotherapy – New Perspectives and Findings”: A Topical Issue of the Zeitschrift für Psychologie (p. 66)
Guest Editors: Wolfgang H. R. Miltner and Otto W. Witte


Editorial

Approaches to Competence Measurement in Higher Education

Sigrid Blömeke¹, Jan-Eric Gustafsson¹,², and Richard J. Shavelson³,⁴

¹Centre for Educational Measurement (CEMO), University of Oslo, Norway, ²Department of Education and Special Education, University of Gothenburg, Sweden, ³SK Partners LLC, Menlo Park, CA, USA, ⁴Graduate School of Education, Stanford University, CA, USA

In a globalized knowledge-based society with increased demands for high levels of workforce competence, information about higher education’s capacity to develop competencies is essential. However, before such information can be obtained, competencies need to be assessed. Research of this type is only just beginning. The objective of this topical issue on the “Assessment of competencies in higher education” is to present high-quality studies that integrate theory and methods to provide readers with an overview of the current state of research on competence measurement.

A major challenge of this type of research is to assess competencies – as the latent cognitive and affective-motivational traits underpinning domain-specific performance in varying (job) situations – reliably and validly, given the heterogeneous and changing nature of labor markets, competencies needed, and the inter- and intra-national diversity of higher education systems, institutions, programs, and processes. The research presented in this topical issue combines subject specialists with methodological experts to overcome these challenges.

Two papers cover the field of economics: Zlatkin-Troitschanskaia et al. (2015) examine measurement invariance of a US economic knowledge test in Germany and Japan, and Schnick-Vollmer et al. (2015) develop an assessment to examine professional knowledge in economics and business education. Three papers cover the fields of mathematics and science, including assessments of both cognition and affect: Jenßen, Dunekacke, Eid, and Blömeke (2015) apply a latent state-trait model to test the stability of mathematics anxiety and mathematics content knowledge across different measurements; Hartmann, Upmeier zu Belzen, Krüger, and Pant (2015) develop an assessment to examine student science teachers’ inquiry skills; and Rott, Leuders, and Stahl (2015) shape our understanding of mathematics teachers’ cognition by assessing knowledge and epistemic beliefs. The final paper by Stürmer and Seidel (2015) is on teachers’ professional vision assessed with a video-based tool.

To frame these papers, Blömeke, Gustafsson, and Shavelson (2015) review the current state of research on the assessment of competencies in higher education. They clarify fundamental conceptual and methodological issues, showing that “controversies” are built on dichotomies that are not useful: for example, competence as an underlying trait versus competence as real-world performance, or classical test theory versus item response theory. An opinion paper (Schober et al., 2015) concludes the topical issue; the authors were invited to contribute an innovative view on self-regulated learning in university students.

References

Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: Competence viewed as a continuum. Zeitschrift für Psychologie, 223, 3–13. doi: 10.1027/2151-2604/a000194

Förster, M., Zlatkin-Troitschanskaia, O., Brückner, S., Happ, R., Hambleton, R., Walstad, W. B., . . . Yamaoka, M. (2015). Validating test score interpretations by cross-national comparison: Comparing the results of students from Japan and Germany on an American test of economic knowledge in higher education. Zeitschrift für Psychologie, 223, 14–23. doi: 10.1027/2151-2604/a000195

Hartmann, S., Upmeier zu Belzen, A., Krüger, D., & Pant, H. A. (2015). Scientific reasoning in higher education: Constructing and evaluating the criterion-related validity of an assessment of preservice science teachers’ competencies. Zeitschrift für Psychologie, 223, 47–53. doi: 10.1027/2151-2604/a000199

Jenßen, L., Dunekacke, S., Eid, M., & Blömeke, S. (2015). The relationship of mathematical competence and mathematics anxiety: An application of latent state-trait theory. Zeitschrift für Psychologie, 223, 31–38. doi: 10.1027/2151-2604/a000197

Rott, B., Leuders, T., & Stahl, E. (2015). Assessment of mathematical competencies and epistemic cognition of preservice teachers. Zeitschrift für Psychologie, 223, 39–46. doi: 10.1027/2151-2604/a000198

Schnick-Vollmer, K., Berger, S., Bouley, F., Fritsch, S., Schmitz, B., Seifried, J., & Wuttke, E. (2015). Modeling the competencies of prospective business and economics teachers: Professional knowledge in accounting. Zeitschrift für Psychologie, 223, 24–30. doi: 10.1027/2151-2604/a000196

Schober, B., Klug, J., Jöstl, G., Spiel, C., Dresel, M., Steuer, G., . . . Ziegler, A. (2015). Gaining substantial new insights into university students’ self-regulated learning competencies: How can we succeed? Zeitschrift für Psychologie, 223, 64–65. doi: 10.1027/2151-2604/a000201

Stürmer, K., & Seidel, T. (2015). Assessing professional vision in teacher candidates: Approaches to validating the Observer Extended Research Tool. Zeitschrift für Psychologie, 223, 54–63. doi: 10.1027/2151-2604/a000200

Zeitschrift für Psychologie 2015; Vol. 223(1):1–2 DOI: 10.1027/2151-2604/a000193


Sigrid Blömeke, University of Oslo, Faculty of Education, Centre for Educational Measurement (CEMO), Niels Henrik Abels hus, Moltke Moes vei 35, 0318 Oslo, Norway, Tel. +47 464 18755, E-mail sigribl@cemo.uio.no



Review Article

Beyond Dichotomies: Competence Viewed as a Continuum

Sigrid Blömeke¹, Jan-Eric Gustafsson¹,², and Richard J. Shavelson³,⁴

¹Centre for Educational Measurement (CEMO), University of Oslo, Norway, ²Department of Education and Special Education, University of Gothenburg, Sweden, ³SK Partners LLC, Menlo Park, CA, USA, ⁴Graduate School of Education, Stanford University, CA, USA

Abstract. In this paper, the state of research on the assessment of competencies in higher education is reviewed. Fundamental conceptual and methodological issues are clarified by showing that current controversies are built on misleading dichotomies. By systematically sketching conceptual controversies, competing competence definitions are unpacked (analytic/trait vs. holistic/real-world performance) and commonplaces are identified. Disagreements are also highlighted. Similarly, competing statistical approaches to assessing competencies, namely item response theory (latent trait) versus generalizability theory (sampling error variance), are unpacked. The resulting framework moves beyond dichotomies and shows how the different approaches complement each other. Competence is viewed along a continuum from traits that underlie perception, interpretation, and decision-making skills, which in turn give rise to observed behavior in real-world situations. Statistical approaches are also viewed along a continuum from linear to nonlinear models that serve different purposes. Item response theory (IRT) models may be used for scaling item responses and modeling structural relations, and generalizability theory (GT) models pinpoint sources of measurement error variance, thereby enabling the design of reliable measurements. The proposed framework suggests multiple new research studies and may serve as a “grand” structural model.

Keywords: competencies, ability, competence assessment, cognition, modeling

Zeitschrift für Psychologie 2015; Vol. 223(1):3–13 DOI: 10.1027/2151-2604/a000194

In our Call for Papers for this topical issue of the Zeitschrift für Psychologie (ZfP) we blithely said that the “assessment of competence development during the course of higher education presents a substantive and methodological challenge. The challenge is to define, and model competence – as the latent cognitive and affective-motivational underpinning of domain-specific performance in varying situations – in a reliable and valid way” (Blömeke, Gustafsson, & Shavelson, 2013). We now say “blithely” because at the time we thought that defining “what was meant by the term ‘competencies’ seemed . . . to be an easier task [than measuring competencies].” We should have known better. For it is well known that definitions are important and contested. Moreover, once defined, constraints are placed on the measurement of competence – what constitutes a task eliciting competence and what doesn’t, what is an allowable measurement procedure and what isn’t, what is a reasonable approach to scaling and what isn’t, etc.? This reality has been brought home not only in the diversity of definitions and measurement approaches represented in this topical issue of ZfP but also in the debates and deliberations in the literature – and among the editors!

The call actually set us up for controversy with a phrase so commonly seen in the measurement literature – “Competencies are conceptualized as complex ability constructs that are context-specific, . . . and closely related to real life” (Koeppen, Hartig, Klieme, & Leutner, 2008, p. 61) – that we editors left it unchallenged until we tried to unpack it and rephrased it as “the latent cognitive and affective-motivational underpinning of domain-specific performance in varying situations.” To be sure, cognition and affect-motivation are latent traits (i.e., human constructions of unobserved processes); they cannot be directly observed but have to be inferred from observable behavior. However, this definition only provides a starting point to address the conceptual and methodological challenges involved in assessing competencies acquired in higher education.

Conceptually, in a first interpretation, the “complex ability” part of the definition is stressed and competence is analytically divided into several cognitive and affective-motivational traits (or resources; Schoenfeld, 2010), each to be measured reliably and validly. The validity of an interpretation that such a measurement taps competence could then be established by, for example, testing whether the trait structure was as hypothesized, or whether the measurement predicted performance in a “criterion situation.” The correlation between competence and performance might vary across different situations but we would expect it to be positive and of substantial magnitude. Many of the papers included in this topical issue fit well within this analytic tradition. Zlatkin-Troitschanskaia et al. (2015), for example, examine the content knowledge in micro- and macroeconomics acquired during higher education in Germany and Japan with a paper-and-pencil test validated for cross-country comparisons.

A second interpretation focuses on the “real-life” part of the definition and thus on observed behavior in context. Competence itself, then, is assumed to involve a multitude of cognitive abilities and affect-motivation states that are ever changing throughout the duration of the performance. In this case, the goal is to get measures as “closely related” to criterion performance as possible. Perhaps the closest a measurement can get to criterion performance is to sample real-world tasks and observe performance on them. What is to be measured, then, is behavior in real-life situations, recognizing that no two people might use the exact same competence profile to carry out the behavior. Some of the papers included in this topical issue fit within this more holistic tradition.

Methodologically, we note that the long-standing measurement traditions based on classical test theory (CTT) certainly provide useful tools to approach technical issues in assessment of competences. But factor analysis and other classical methods were developed to solve other measurement problems than those encountered when assessing domain-specific performance. Thus, with only a few exceptions (e.g., generalizability theory [GT] as developed by Cronbach, Gleser, Nanda, & Rajaratnam, 1972; see also Shavelson & Webb, 1991; Brennan, 2001) much of CTT focuses on reliable assessment of individual differences on single characteristics in norm-referenced contexts. But assessment of competences often requires criterion-referenced decisions, such as whether particular levels of competence have been reached (Berry, Clark, & McClure, 2011). Furthermore, as pointed out above, in competence assessments a multitude of characteristics is to be taken into account at the same time, and the profile, that is, how these characteristics are related to each other within a person, is often of strong interest.

For such purposes latent trait and mixed models – out of which item response theory (IRT) models are the most prominent ones – seem to hold promise, in particular because they make it possible to investigate the nature of scales (Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003). If the latent variables are, in addition, categorical (mixture models; McLachlan & Peel, 2000), a person-oriented approach to competence profiles can be explored. The intent is to capture unobserved heterogeneity of profiles in subpopulations. These approaches therefore open up a wide range of possibilities in the field of educational measurement.

A specific methodological challenge in the context of competence assessments, though, is that reliability requirements typically imply a large number of items, which leads to selected-response assessments that can be quickly administered and scored. However, assessment of domain-specific competence in higher education does not necessarily lend itself to such approaches because validity considerations call for tapping “real-life” performance at some point. Achieving sufficient reliability and generalizability in the assessments is challenging given the complexities of higher education competencies.
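The link between reliability requirements and the number of items can be made concrete with the Spearman-Brown prophecy formula from classical test theory. The following is a minimal sketch, assuming parallel (interchangeable) items or tasks; the function names and the example values are illustrative assumptions and are not taken from any study in this issue.

```python
import math


def spearman_brown(rho: float, k: float) -> float:
    """Projected reliability when a test with reliability rho is
    lengthened by a factor k (assumes parallel items)."""
    return k * rho / (1.0 + (k - 1.0) * rho)


def lengthening_factor(rho: float, target: float) -> float:
    """Factor by which test length must grow to reach the target
    reliability; obtained by solving the Spearman-Brown formula for k."""
    if not (0.0 < rho < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1.0 - rho) / (rho * (1.0 - target))


def tasks_needed(n_tasks: int, rho: float, target: float) -> int:
    """Smallest whole number of tasks expected to reach the target.
    Rounding before ceil guards against floating-point noise."""
    return math.ceil(round(n_tasks * lengthening_factor(rho, target), 9))


if __name__ == "__main__":
    # Hypothetical case: 5 performance tasks with reliability .50,
    # and a target reliability of .80.
    print(tasks_needed(5, 0.50, 0.80))
```

Under these assumed values the required number of tasks grows fourfold, which illustrates why a small set of time-consuming performance tasks rarely reaches conventional reliability targets.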

Overview

We have purposely characterized the definition and measurement of competence by strong and opposing positions. Pragmatically, reality in both respects lies somewhere in between. At either extreme, there is a chance of forgetting either observable behavior or cognitive abilities. That is, our notion of competence includes “criterion behavior” as well as the knowledge, cognitive skills, and affective-motivational dispositions that underlie that behavior. Statistically, we believe that both CTT, in particular GT and other approaches that are based on the decomposition of variance, and more recent latent trait, mixed, and mixture models in the IRT tradition have a role to play in examining the quality of competence measurements.

This paper tries to tidy up “this messy construct.” We do not intend to find “the” one definition and assessment-of-competence measurement. Rather, by systematically sketching conceptual controversies and assessment approaches we attempt to clarify the construct and its measurement. Our discussion of “messy” challenges confronting the definition and measurement of competence begins with definitional issues. We unpack competing definitions and identify commonplaces where there seems to be a modicum of agreement. We also highlight disagreements and suggest how some might be resolved. We then provide examples of how competence is defined in several professions. Next we discuss methodological issues, focusing on how we can move beyond dichotomies by balancing and making the best use of both CTT and IRT. Finally, we conclude by tying key points and issues together.
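The decomposition of variance mentioned above can be sketched for the simplest generalizability-theory case, a one-facet design in which every person responds to every item or task. The code below is a minimal illustration under that assumption, using the standard expected-mean-square estimators; the function names are hypothetical, and truncating negative estimates at zero is one common convention rather than the only option.

```python
def g_study(scores):
    """Estimate variance components for a crossed persons x items design.

    scores: list of rows, one row per person, one column per item.
    Returns estimated person, item, and residual variance components.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for persons, items, and the residual
    # (person-by-item interaction confounded with error).
    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    return {
        "person": max((ms_p - ms_res) / n_i, 0.0),
        "item": max((ms_i - ms_res) / n_p, 0.0),
        "residual": max(ms_res, 0.0),
    }


def g_coefficient(components, n_items):
    """Generalizability coefficient for relative decisions when the
    score is averaged over n_items items (a decision-study projection)."""
    relative_error = components["residual"] / n_items
    return components["person"] / (components["person"] + relative_error)


if __name__ == "__main__":
    # Hypothetical components: projected coefficient for 1, 4, and 8 tasks.
    comps = {"person": 1.0, "item": 0.5, "residual": 1.0}
    for n in (1, 4, 8):
        print(n, round(g_coefficient(comps, n), 3))
```

Separating the person component from item and residual components is what allows a design with more tasks, raters, or occasions to be planned before data are collected.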

Conceptual Framework: Definitions of Competence

The notion of competence was first discussed in the US during the 1970s (Grant, Elbow, & Ewens, 1979). The discussion focused on performance on “criterion tasks” sampled from real-life situations. McClelland (1973) contrasted the “criterion-sampling” approach with testing for aptitude and intelligence. In McClelland’s view, “intelligence and aptitude tests are used nearly everywhere by schools, colleges, and employers. . . The games people are required to play on aptitude tests are similar to the games teachers require in the classroom. . . So it is scarcely surprising that aptitude test scores are correlated highly with grades in school” (1973, p. 1). He argued that we instead should be testing for competence – successful behavior in real-life situations: “If someone wants to know who will make a good teacher, they will have to get videotapes of classrooms, as Kounin (1970) did, and find out how the behaviors of good and poor teachers differ. To pick future businessmen, research scientists, political leaders, prospects for a happy marriage, they will have to make careful behavioral analyses of these outcomes and then find ways of sampling the adaptive behavior in advance” (p. 8).


A contrasting perspective stressed competence’s dispositional, and in particular its cognitive, nature: either generic competence, which is often synonymous with intelligence or information processing abilities, or domain-specific competence, often referred to as expertise. Boyatzis (1982) carried out one of the first empirical studies in this perspective. Based on top managers’ definitions of their competence he defined it as an “underlying characteristic of a person which results in effective and/or superior performance in a job” (p. 97). Spencer and Spencer (1993, p. 9) were more precise:

“A competency is an underlying characteristic of an individual that is causally related to criterion-referenced effective and/or superior performance in a job or situation. Underlying characteristic means the competency is a fairly deep and enduring part of a person’s personality. [. . .] Causally related means that a competency causes or predicts behavior and performance. Criterion-referenced means that the competency actually predicts who does something well or poorly, as measured on a specific criterion or standard.”

So, as we see, a variety of definitions has existed and still exists. The respective representatives mutually criticize each other fiercely for misconceiving the construct, reducing its complexity, ignoring important aspects, and so on (e.g., McMullan et al., 2003). The value added by each of the perspectives is rarely acknowledged.

The dichotomy of a behavioral assessment in real-life situations versus an analytical assessment of dispositions underlying such behavior has much to do with the origins of these different models. The first approach stems from industrial/organizational psychology, which has the selection of candidates best suited for a job as the main purpose in mind. Naturally, underlying dispositions are then not the focus because they are not as close as observed performance in context. Rather, predicting future job performance by sampling typical job tasks and assessing how well a candidate does represents a reliable and valid approach to identify job-person fit (Arthur, Day, McNelly, & Edens, 2003). Many large employers carry out such assessments as part of their recruitment process. It is not important how a candidate has come to his or her competence. What matters is that he or she shows it in situations relevant for the job (Sparrow & Bognanno, 1993). But also in the context of professional certification and licensure, performance criteria and their assessment according to the standards of a profession are foregrounded. Which opportunities to learn a candidate had during his or her training or which traits contribute to performance is not the focus. The license is only awarded if a teacher, nurse, or psychologist is able to do what is required.

In contrast to this selection approach, the second approach stems from educational research and intends to find ways to foster the development of competence. Identifying a person’s characteristics (resources) underlying her or his behavior and how these best can be developed are essential in this approach. An implicit assumption is that these characteristics are amenable to external interventions (Koeppen et al., 2008; Sternberg & Grigorenko, 2003) such as opportunities to learn and systematic training, so that the relationship between educational inputs and competence outcomes is foregrounded and a frequent research topic. In the long run, the purpose is not to identify job-person fit but to identify those opportunities to learn on the individual, classroom, and system level best suited to foster competence development. The German research program, “Modeling and measuring competencies in higher education,” is an example (Blömeke, Zlatkin-Troitschanskaia, Kuhn, & Fege, 2013). The program responds to the increasing discussion about instructional quality in higher education and the new wave of competence-based curricula as a result of the Bologna process’ requirements.

Overcoming Disagreements Due to Oversimplified Dichotomies

The industrial/organizational selection and the educational training approaches to the definition of competence and competence assessments are in some respects distinct. In the following, we unpack the disagreements and suggest how to overcome these. However, we also see substantial commonalities in the various notions of competence – a “framework” of sorts. We highlight these commonalities first.

Agreements in the Definition of Competence

There is some agreement in the two contrasting perspectives laid out above that “competence” (plural “competences”) is the broader term whereas “competency” (plural “competencies”) refers to the different constituents of competence. The first term describes a complex characteristic from a holistic viewpoint whereas the latter takes an analytic stance. The constituents (or resources) may be cognitive, conative, affective, or motivational. In contrast to common views of intelligence as a less malleable trait, competence and competency are regarded as learnable and can thus be improved through deliberate practice (Epstein & Hundert, 2002; Shavelson, 2010; Weinert, 2001).

Furthermore, agreement exists in both perspectives that a competence framework recognizes the importance of real-world situations typical for performance demands in a field as “the” point of reference. The definition of competence therefore has to start from an analysis of authentic job or societal situations and enumerate the tasks as well as the cognition, conation, affect, and motivation involved. And no matter whether one follows the behavioral or the dispositional perspective – such real-world situations should be sampled in measures of competence or in measures of criteria. In both cases, the underlying competencies inferred from such a framework do not necessarily have to be in line with those inferred from a curriculum in school or university.

Beyond Dichotomies: Competence as a Multidimensional Construct

If we agree that competence ultimately refers to real-world performance, either as constituent of the construct or as a validity criterion, several disagreements are resolved. It is then no longer a question whether competence is a set of cognitive abilities only or is a combination of cognition, conation, affect, and motivation. To the degree that conation, affect, and motivation are involved in that performance besides cognition, so too should the definition of competence include them for that domain. Competence thus involves complex intellectual characteristics along with affect-motivation that underlie observable performance. Evidence exists that for long-term job success, such subjective indicators have to be taken into account (Brief & Weiss, 2001). Job satisfaction predicts productivity and performance (Judge, Thoresen, Bono, & Patton, 2001). Work engagement also predicts performance and, in addition, organizational commitment (Bakker, 2011) and health (Hakanen & Schaufeli, 2012). This argument leads back to Snow’s (1994) idea of two pathways that contribute to achievement, namely a cognitive and a commitment pathway. Thus, he included motivational-conative processes in his new concept of aptitude. Lau and Roeser (2002) confirmed this framework empirically with respect to science achievement. Whereas students’ cognitive abilities were the strongest predictors, it turned out that motivational characteristics increased the predictive validity and that these were also the strongest predictors for commitment. A priori, it is impossible to specify which specific facets enter into a definition of competence.
For example, what does a competent physicist know and believe, and what is he or she able to do? Particular profiles of cognition, motivation, etc., can be specified only from detailed observation and other information. Not only is subject-matter knowledge required to solve force and motion problems, so too are problem-solving strategies, analytic reasoning, critical thinking, and the like. Moreover, if competent performance involves working successfully as a team member, this competency would be included in the definition of competence. Thus, any definition of competence should entertain the possibility that competence involves complex cognitive abilities along with affective and volitional dispositions to work in particular situations.

Beyond Dichotomies: Competence as a Horizontal Continuum

Currently, the dichotomy of disposition versus performance comes down to and gets stuck with the question of whether (a) competence is performance in real-world situations, more specifically, whether behavior is the focus of competence, or (b) behavior is the criterion against which cognition and affect-motivation are validated as measures of competence. As we will see, such a dichotomy overlooks an essential question, namely how knowledge, skills, and affect are put together to arrive at performance.

The first position (a) takes a holistic view in which cognition, affect-motivation, and performance are complexly linked together, changing during the course of performance (Corno et al., 2002). A competence assessment, then, involves successfully carrying out concrete tasks in real-world criterion situations; a definition of competence, then, should be based on a thorough analysis of the demands of and variations in these situations. To be sure, knowledge, skill, and affective-motivational components underlie performance but they change during the in-situation performance as the situation moves along. Cognition, affect-motivation, and performance are linked as a system, cobbled together in response to task demands, somewhat differently for each person. This observation is what Oser (2013) had in mind when he pointed out that competence involves a process dimension which he calls a “competence profile” – a set of resources enacted in practice. One important research question in this context is how precisely the different resources are cobbled together, what this interplay depends on, and how the resources can be built up (i.e., what they should look like, e.g., at the end of higher education).

The second position (b) restricts the term “competence” to the sum of cognitive and motivational resources. This approach assumes that the whole is the sum of its (weighted) parts and divides competence into multiple constituents (latent abilities, skills) needed for competent performance. Competencies, then, are used to predict behavior in criterion situations (e.g., Spencer & Spencer, 1993). From this perspective, among others, measures of both declarative and procedural “knowing” tap underlying competencies such that they are applicable to multiple “real-world” situations in which doing is the end game. If this reasoning holds, we should seek a model of competence featuring cost-efficient selected-response measures of declarative and procedural knowledge in a domain. Note that this definition of competence would also lead to a measurement model that accounted for task/response sampling, in addition to scaling scores. Since real-world behavior is the core validity criterion in this case, again a careful analysis of the demands of and the variations in these situations would be crucial. One important research question is about the relation of competence and its constituents (Sadler, 2013): Is it possible to decompose competence exhaustively, as is often done in technology and science? The decomposition reduces complexity and aids understanding – but is it then still the same?

In both perspectives, the behavioral and the dispositional, the question arises as to whether and how persons who possess all of the resources belonging to a competence construct are able to integrate them, such that the underlying competence emerges in performance. This might be an empirical question but would require assessments for each competency.



Figure 1. Modeling competence as a continuum.

Conceptually, this question leads us to point out an important gap in the current dichotomized discussion: Which processes connect cognition and volition-affect-motivation on the one hand and performance on the other hand? Different facets have to be integrated, perhaps to be transformed and/or restructured through practical experience. Processes such as the perception and interpretation of a specific job situation together with decision-making (Schoenfeld, 2010) may mediate between disposition and performance (see Figure 1). Thus, instead of insisting on an unproductive dichotomous view of competence, in particular knowledge versus performance, competence should be regarded as a process, a continuum with many steps in between. Thus, we suggest that trait approaches recognize the necessity to measure behaviorally, and that behavioral approaches recognize the role of cognitive, affective, and conative resources. At this time, we encourage research on competence in higher education emanating from either perspective and paying attention particularly to the steps in between. Our model may help in thinking about these.

Competence Is Also a Continuum in Other Respects

Before we consider particular fields of research on competence, it is worth noting at least briefly that competence is also a vertical continuum in terms of performance levels and of developmental stages. More specifically, one interpretation is that competence is a continuous characteristic with higher and lower levels (more or less competent). Additionally, as competence is a multidimensional construct, a person’s profile might include stronger estimates in one dimension and weaker ones in another. So, the definition of competence includes the notion of how much is enough to be called “competent.” Furthermore, an important research question is whether the different dimensions of competence can compensate for each other (i.e., are additive by nature) or if strength on one cannot compensate for weakness on another dimension (i.e., multiplicative nature of competence dimensions; Koeppen et al., 2008). In the latter case, an interesting follow-up research question would be which minimum threshold has to be in place before someone can show a certain behavior.

Taking a longitudinal, developmental perspective on competence adds complexity. The model might be similar to Figure 1 in that, firstly, some dispositions have to be in place before situation-specific skills can be acquired. Many higher education programs (e.g., teaching or medicine) are built on such an implicit assumption by delivering basic knowledge first before students undergo practical training. But it might as well be that a developmental model would look completely different in that growth or loss continuously happens on all dimensions at the same time (Baltes, Reese, & Lipsitt, 1980; Cattell, 1971). An interesting research question is whether competence changes are then best characterized by linear increase (or decrease), by differentiation processes from more general and basic expressions to more specialized ones, or by qualitative changes as is assumed in the novice-expert paradigm. In the two latter cases, developmental trajectories would imply structural changes in the nature of competence.

Particular Fields of Research on Competence

Many professions are concerned about the nature and assessment of competence. They have to train the next generations of professionals on the one hand and to award licenses or to select candidates on the other hand. Here we look specifically at medicine, teaching, and vocational education to see how each deals with the definition of competence and its measurement.

In medicine, the debate about the meaning of competence has a long tradition and included from the beginning both perspectives: competence development through medical training but also selection at its end in terms of licensing. The debates resulted in Miller’s (1990) widely used pyramid of clinical competence. The pyramid provides a framework for how to think about the different transformation processes that link factual knowledge and job-related behavior by distinguishing between knowledge, competence, performance, and behavior. The level of each category and the relation between categories (e.g., knowledge is regarded as an antecedent to competence) are assumed to be influenced by other characteristics such as beliefs, opportunities to learn, practical experiences, or situational affordances. The Accreditation Council for Graduate Medical Education (http://www.acgme.org) strives to include both the educational and the selection perspectives in its accreditation procedure for medical programs in the US by requesting assessments of different dimensions of clinical competence. Epstein and Hundert (2002) summarize these as a cognitive function – knowledge to solve real-life problems; an integrative function – using biomedical and psychosocial information in clinical reasoning; a relational function – communication with patients and colleagues; and an affective/moral function – the willingness, patience, and emotional awareness to use these skills judiciously and humanely. For each category, specific assessment formats have been developed (Wass, Van der Vleuten, Shatzer, & Jones, 2001): traditional paper-and-pencil tests, standardized performance assessments using laboratory experiments or simulations, and unstandardized performance assessments at the workplace.

Teaching is another field with extensive research on what it means to be competent. Outstanding teacher performance is considered to involve different types of resources, in particular knowledge, skills, beliefs, values, motivation, and metacognition (Shulman, 1987; Schoenfeld, 2010). The corresponding research is mostly driven by the objective of long-run improvement of teacher education. A study by Blömeke et al. (2014), for example, found that mathematics teachers’ perception accuracy of classroom situations and speedy recognition of students’ errors are influenced by their knowledge acquired during teacher education (see also König et al., 2014). Gold, Förster, and Holodynski (2013) showed that it is possible to train perception abilities with respect to classroom management through guided video analysis.
Correspondingly, Stürmer, Könings, and Seidel (2012) confirmed a positive effect of classes in teaching and learning on professional vision. However, selection also plays an important role in teacher education and is addressed differently. The German teacher education system, for example, requires two comprehensive examinations before a license is awarded: a first one after university with typical written and oral knowledge tests and a second one on-site in schools where student teachers have to demonstrate their teaching skills. If one regards these exams as indicators of what is meant to constitute teacher competencies, the German system combines a dispositional and a behavioral perspective.

Finally, in the field of Vocational Education and Training (VET) competence is discussed intensely. Although many different definitions exist here as well (Biemans, Nieuwenhuis, Poell, Mulder, & Wesselink, 2004), some agreement exists with respect to core concepts (Mulder, Gulikers, Biemans, & Wesselink, 2009). Competence is regarded as an integrated set of knowledge, skills, and attitudes. It is regarded as a necessary condition for task performance and for being able to function effectively in a certain situation. Shavelson (2010) presented a definition of competence in VET from the holistic behavioral perspective; the assessment, though, also includes probes of knowledge and skills. The German dual VET system again combines dispositional and behavioral approaches by partly taking place in school – delivering traditional knowledge and completed with theoretical examinations – and partly at the workplace – delivering practical experience in an occupation and completed with examinations in which students are supposed to master real-world challenges.

Importantly, all three fields are heavily affected by the debate about the theory-practice gap between schools or universities and the workplace. A major research focus is therefore on competence-based school and university education, which has become increasingly popular in Western Europe. Instead of following a disciplinary curriculum, defining cognitive outcomes related to typical situations in an occupation and examining them in a performance-based way is regarded as a promising way to raise the quality of the workforce (Handley, 2003). After some initial euphoria, though, implementation turned out to be more difficult, and less related to higher quality, than expected (Eraut, 2003).

Methodological Framework: Assessing Competence

Assessments developed to measure competence, by nature, have to differ from traditional knowledge tests (Bennett, 1993; Birenbaum, 2007). For example, frequent or central real-world situations typical for performance demands in a domain play a crucial role either for determining constituents of competence or validity criteria. Thus, the sampling of these situations is crucial and their representativeness for the universe of tasks has to be ensured (Shavelson, 2012). Moreover, whereas reliability and (construct) validity as classical criteria of test quality remain important, the range of quality criteria has been expanded to address specific characteristics of competence assessments such as authenticity, fairness, transparency, consequences for student achievement and motivation, and cost efficiency (Kane, 2013; Messick, 1995). These requirements impose challenges for competence assessments which currently often receive too little attention.

Challenges and Issues

The analytic view of competence assessment focuses on measuring different latent traits (cognitive, conative, affective, motivational) with different instruments. Assessing the resources one-by-one has the advantage that it identifies specific preconditions for performing well in real life. The approach also has the advantage of diagnostic accuracy because what is measured within reasonable time and cost constraints by a particular scale is a constituent of the broader competence, thereby pinpointing particular strengths and limitations. Because such measures include large numbers of observations, the approach often leads to high reliability. Nevertheless, serious validity concerns exist, most notably construct underrepresentation.

From the holistic view of competence (performance in complex, messy real-life situations), assessments have been developed to estimate real-life performance without accounting for the contribution of specific dispositional resources. Assuming the whole is greater than the sum of its parts, it is argued that assessing the resources one-by-one might distort the actual underlying traits needed for successful performance. The Collegiate Learning Assessment provides a holistic example, sampling tasks from newspapers and other common sources and constructing an assessment around them to tap critical thinking, analytic reasoning, problem solving, and communication (Benjamin, 2013; Shavelson, 2010). However, there are several challenges in this approach, too (Kane, 1992). The first is that there is a tradeoff between testing time and the number of independent samples of behavior that can be collected. Performance tasks are complex and take considerably more time than selected-response or short-answer tasks. Hence, only a limited sample of behavior can be collected in a given amount of time, which imposes limits on generalizability. A second issue is that assessment of the complex student responses which typically are produced in performance assessments introduces considerable amounts of measurement error because it is harder to define and assess quality of responses in complex situations than with respect to clearly defined items. Yet another issue is that different components of extended performance tasks tend to depend on one another, thereby violating the assumption of local independence which is central in most measurement models. This raises questions about how to model the item responses appropriately. In some situations solutions may be found by creating testlets (Wainer, Bradlow, & Wang, 2007), but development of specialized models to deal with this issue may also be needed.
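The tradeoff between testing time and the number of independent behavior samples can be illustrated with a small decision-study calculation in the spirit of generalizability theory. The sketch below is not taken from the article: it assumes a fully crossed person × task × rater design with random facets, and the variance components are hypothetical values chosen only to show a large person-by-task interaction, as is often reported for performance tasks. The names `VarianceComponents` and `relative_g_coefficient` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceComponents:
    """Hypothetical variance components for a crossed p x t x r design."""
    person: float        # universe-score variance (the signal of interest)
    person_task: float   # person-by-task interaction
    person_rater: float  # person-by-rater interaction
    residual: float      # person-by-task-by-rater interaction plus error


def relative_g_coefficient(vc: VarianceComponents, n_tasks: int, n_raters: int) -> float:
    """Generalizability coefficient for relative (rank-order) decisions.

    Relative error variance averages each person-related interaction
    over the number of tasks and raters sampled.
    """
    if n_tasks < 1 or n_raters < 1:
        raise ValueError("n_tasks and n_raters must be at least 1")
    relative_error = (
        vc.person_task / n_tasks
        + vc.person_rater / n_raters
        + vc.residual / (n_tasks * n_raters)
    )
    return vc.person / (vc.person + relative_error)


if __name__ == "__main__":
    # Assumed values for illustration only, not estimates from any study.
    vc = VarianceComponents(person=0.30, person_task=0.40,
                            person_rater=0.02, residual=0.28)
    for n_tasks in (1, 2, 4, 8):
        coefficient = relative_g_coefficient(vc, n_tasks=n_tasks, n_raters=2)
        print(f"tasks={n_tasks}  g-coefficient={coefficient:.3f}")
```

Under these assumed components, adding tasks raises the coefficient much more than adding raters would, which is one way to see why a short assessment built from a few long performance tasks struggles to generalize.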

Overcoming Disagreements Due to Oversimplified Dichotomies

Thus both the analytic and the holistic approaches to assessment are afflicted by issues of validity and reliability. These issues need attention in further work on modeling and measuring competence. The issue space is not primarily a matter of a dichotomous choice between the analytic and the holistic approaches. Rather, it involves how the different approaches may be developed and combined in fruitful ways to improve the reliability and validity of competence assessments. This involves many conceptual and empirical questions, and data rather than opinion are needed to inform future measurement methods. Below we discuss future work in three areas which we see as promising for methodological development, namely assessment formats, conceptual frameworks and dimensionality, as well as modeling techniques.

Beyond Dichotomies: Tapping Into a Broader Range of Assessment Formats

One gets the impression that the unproductive dichotomy of dispositions (analytic) versus performance (holistic) in assessments translates into the use of a limited range of assessment formats, with either multiple-choice items or very complex tasks dominating. It is obvious that knowledge and personality tests as well as performance assessments have important functions to fulfill in a competence assessment. The limitation to one or the other should be of concern because each taps into only part of the construct definition. Using combinations of approaches, we may also be able to cover the processes mediating the transformation of dispositions into performance. Wass et al. (2001) demonstrated the richness of available formats in building competence measurements in medicine that capture different levels of proximity to real-life situations: Besides multiple-choice and constructed-response items or performance assessments in real life or laboratories, they suggested video-based assessments using representative job situations so that the perception of real-life, that is, unstructured, situations can be included. The speed of performance, which provides information not available from accuracy alone (Stanovich, 2009), has also increasingly been examined with the advent of computer-based testing. Blömeke et al. (2014) developed different assessment formats to capture teacher competence in terms of different knowledge facets, perceptual, interpretation, and decision-making skills, and speed of reaction to student errors. And the Comparative Judgment procedure, based on Thurstone’s early work, represents an interesting implementation in assessments of authentic Design and Technology tasks (Kimbell, 2006). This challenges us to make productive, integrative use of performance assessments, traditional discrete items, and other innovative formats in competence measurement.
One potential consequence of combining formats, though, is that when selected-response and performance tasks are scaled together, unless specific weights are assigned to performance data, selected-response data may “swamp” the signal provided by the performance tasks. Additional challenges now arise, among them the (multi)trait-(multi)method issue of distinguishing constructs from methods (Campbell & Fiske, 1959). Do differences between the results of analytic and holistic instruments reflect differences in the methods used, or are we talking about different constructs which should consequently be labeled differently? This problem can be thought of as a sampling problem, that is, of defining the sampling frame for constructing performance assessments – does the frame involve sampling of assessment methods in addition to the sampling of items, raters, and test takers?

Beyond Dichotomies: Essential Unidimensionality/Multidimensionality

One of the fundamental challenges in competence assessment is to reduce a large amount of observational complexity into scores which maintain meaningfulness and interpretability. To this end one of the classic principles upon which measurement is based is the principle of unidimensionality. This principle fundamentally states that the different components (e.g., items, tasks, ratings) of an assessment should reflect one and the same underlying dimension. It should be noted that the principle of unidimensionality does not imply any requirement that the different components should in themselves be simple; they can, for example, be complex authentic tasks as used in performance assessments (Gustafsson & Åberg-Bengtsson, 2010). However, a strict application of the principle of unidimensionality rarely is possible in competence assessments because conceptually it typically is not expected (e.g., the analytic approach including cognition and affect) and empirically it is violated by the presence of method variance and multiple expected dimensions. Such challenges have typically been met by splitting the construct into narrower sub-constructs, each of which satisfies the assumption of unidimensionality. While such approaches typically are successful in the sense that statistical criteria of unidimensionality are met, the approach in itself is self-defeating because the construct itself is splintered into pieces (Gustafsson, 2002). An alternative approach is to focus instead on “essential unidimensionality” which preserves the construct while allowing for additional minor dimensions and different sources of method variance. Models for essential unidimensionality can be implemented in different ways, for example with so-called bi-factor models or hierarchical measurement models (Gustafsson & Åberg-Bengtsson, 2010; Reise, 2012). Such models identify a general factor hypothesized to represent the construct while also allowing for minor dimensions. Given that competence dimensions may be assumed to be multidimensional while at the same time a common underlying dimension is expected, this approach may be particularly useful in developing and understanding competence assessments.
An extended version of this approach is Multidimensional Item Response Theory (MIRT) which is able to model several latent traits simultaneously and thus provides a promising approach to competence assessments. However, it may be argued that the ideas of essential unidimensionality or multidimensionality still do not solve the fundamental dimensionality issue, because there are limits to how far these approaches may be stretched.
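One way to make the idea of essential unidimensionality tangible is to ask how much of the total-score variance a general factor accounts for once minor dimensions are allowed. The sketch below computes omega hierarchical, an index commonly used with bi-factor models, from standardized loadings. It is an illustration rather than the authors' procedure: it assumes orthogonal general and specific factors, and the loadings in the usage example are hypothetical. The function name `omega_hierarchical` is illustrative.

```python
from typing import Sequence


def omega_hierarchical(general: Sequence[float],
                       specific: Sequence[Sequence[float]]) -> float:
    """Share of total-score variance due to the general factor.

    general:  standardized loading of each item on the general factor.
    specific: one row per item, one column per specific (minor) factor;
              zero where an item does not load on that factor.
    Assumes orthogonal factors, as in a standard bi-factor model.
    """
    if len(general) != len(specific):
        raise ValueError("general and specific must describe the same items")
    n_factors = len(specific[0])
    # Unique variance of each standardized item.
    uniqueness = [
        1.0 - g ** 2 - sum(s ** 2 for s in row)
        for g, row in zip(general, specific)
    ]
    if min(uniqueness) < 0:
        raise ValueError("loadings imply a negative unique variance")
    general_part = sum(general) ** 2
    specific_part = sum(
        sum(row[k] for row in specific) ** 2 for k in range(n_factors)
    )
    return general_part / (general_part + specific_part + sum(uniqueness))


if __name__ == "__main__":
    # Hypothetical six-item scale: all items load on the general factor,
    # items 1-3 and items 4-6 each share one minor dimension.
    general = [0.6] * 6
    specific = [[0.4, 0.0]] * 3 + [[0.0, 0.4]] * 3
    print(f"omega hierarchical = {omega_hierarchical(general, specific):.3f}")
```

A value close to the overall reliability of the total score would suggest that the minor dimensions matter little for interpreting that score; how large a value is "large enough" remains a judgment call that this sketch does not settle.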

Beyond Dichotomies: Psychometric Pluralism

For far too long, CTT and IRT have been regarded as another allegedly incompatible dichotomy. We see a continuum from linear CTT models to nonlinear IRT models and beyond. Each theory has something to contribute to our understanding of competence measurement with respect to item/task functioning, scalability, reliability, and validity of assessment scores. Of course, different models have been developed to solve different problems so models should be carefully selected to suit the particular problem at hand.

For example, IRT is useful for forming scales, examining the dimensionality of competence (as pointed out above), estimating persons’ scores, and typifying levels of competence to provide criterion-referenced interpretation. IRT makes two important contributions, especially within the context of criterion-referenced testing: First, IRT produces interval-scale measurements and, second, it links individual performance to levels of performance that can be exemplified by items that an individual at a particular ability level (theta) has some specified probability (e.g., .5) of performing successfully – anchoring the interpretation of the score in the items and not in rank order. This link between performance on items and scale levels is one of the main approaches for investigating the meaning and characteristics of a scale.

CTT, in particular GT, is by contrast useful for assessing the impact of inconsistencies due to tasks, raters, and their combinations with persons, on the basis of which an optimal assessment design can be set forth. This strength is particularly important in the field of competence assessments because rater effects and temporal instability tend to be large in more complex studies. GT can thus be helpful in estimating the extent of measurement error as a first step and then in estimating the effects of redesigning a study by using more or better trained raters or more tasks. For example, Shavelson (2012) suggests an assessment approach based on a criterion-sampling approach and shows the close link to GT – a mixed model sampling theory of measurement. The variance of a score is split up so that the error variance resulting from inconsistencies between raters, task difficulty, and their interactions with each other and test takers can be partialed out and only the variance of interest remains. This approach can be extended by taking measurement methods into account because a particular competence test can be regarded as one instrument out of a broad range of possible instruments.

However, we also believe that the psychometric theories can and should be used in combination, as is sometimes done. For example, GT provides an initial step in that once reliable scores are produced, they can be IRT scaled with a number of different approaches such as partial credit or rater models. Vice versa, generalized linear mixed models and generalized latent variable modeling (e.g., Muthén, 2002; Skrondal & Rabe-Hesketh, 2004) provide ways to analyze typical “GT questions” by explicitly stating hypotheses and testing statistical models, and by offering flexible frameworks in which to deal with measurements from virtually any assessment format, data structures, and a multitude of fixed or random effects (e.g., time, rater, classrooms).

Particular Applications of Interesting Assessment Approaches

Combinations of GT and IRT have been successfully applied and their usefulness demonstrated. Raudenbush, Martinez, Bloom, Zhu, and Lin (2010) integrated GT and IRT in the assessment of group-level quality measures. Characteristics assumed to influence competence development such as classroom quality or opportunities to learn can then be measured in a reliable and valid way. Based on quantifying various sources of error, for example rater inconsistency, temporal instability, and item inconsistencies, Raudenbush et al. (2010) developed a six-step paradigm that systematically integrates GT and IRT for the design of measurements of social settings that minimizes measurement error and thus maximizes statistical power.

We also encourage use of specialized models to approach specific research questions, such as, for example, the stability-change issue of competence. Performance can be regarded as an interaction of competence (latent abilities and dispositions) and situation. A person has to integrate several cognitive and motivational resources in order to master situational demands. Latent State-Trait Theory (LST) has been developed to deal with this challenge. LST is methodologically similar to the (multi)trait-(multi)method approach. It emphasizes that besides the person’s characteristics, effects of the situation and the interaction of person and situation also contribute to the variance of a variable (Steyer, Schmitt, & Eid, 1999). Situational aspects can be divided into systematic variation of the context, such as teaching different classes, and differential situational reactions within similar contexts, due for example to working memory or exhaustion (for more details see Eid & Diener, 1999; Jenßen et al., 2015).
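The variance decomposition that LST relies on can be written out for a deliberately simplified case. The equations below are a sketch under stated assumptions, not the full model of Steyer, Schmitt, and Eid (1999): they assume a single latent trait, unit loadings, no method factors, and mutually uncorrelated components. Here $Y_{ik}$ denotes indicator $i$ measured on occasion $k$, $\xi$ the latent trait, $\zeta_k$ the latent state residual (effects of the situation and of the person-situation interaction on occasion $k$), and $\varepsilon_{ik}$ measurement error.

```latex
\begin{align}
Y_{ik} &= \xi + \zeta_{k} + \varepsilon_{ik} \\
\operatorname{Var}(Y_{ik}) &= \operatorname{Var}(\xi) + \operatorname{Var}(\zeta_{k}) + \operatorname{Var}(\varepsilon_{ik}) \\
\operatorname{Con}(Y_{ik}) &= \frac{\operatorname{Var}(\xi)}{\operatorname{Var}(Y_{ik})}, \qquad
\operatorname{Spe}(Y_{ik}) = \frac{\operatorname{Var}(\zeta_{k})}{\operatorname{Var}(Y_{ik})} \\
\operatorname{Rel}(Y_{ik}) &= \operatorname{Con}(Y_{ik}) + \operatorname{Spe}(Y_{ik})
\end{align}
```

Consistency (Con) is the share of observed variance attributable to the stable trait, occasion specificity (Spe) is the share attributable to the situation and the person-situation interaction, and their sum is the reliability of the indicator. A competence measure with high specificity would therefore be reliable on a given occasion yet still vary considerably across situations.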

Beyond Dichotomies

This paper tried to tidy up the “messy construct,” competence, that has been plagued by misleading dichotomies (e.g., analytic vs. holistic, IRT vs. GT, trait vs. behavior). We did not expect to find “the” one definition and statistical model for competence assessment. Rather, by systematically sketching conceptual and statistical controversies and assessment approaches, we attempted to clarify the construct and its measurement.

We unpacked competing competence definitions (analytic/traits vs. holistic/real-world performance) and identified commonplaces. This led to the construction of a framework for moving beyond dichotomies to show how the analytic and holistic approaches complement one another (Figure 1). The measurement of competence, then, may be viewed along a continuum from traits (cognitive, affective, motivational), through the perception, interpretation, and decision-making that these traits underlie, to the observed behavior that results in a particular real-world situation. Dichotomies arise because one position looks at only one part of the continuum (e.g., underlying traits) while another position looks at a different part (behavior in criterion situation). We hope that the proposed integrated perspective moves us beyond dichotomies.

We unpacked competing statistical approaches to modeling competence-assessment scores, namely IRT (latent trait) versus GT (sampling error variance). Once again we viewed these models not as a dichotomy but as arrayed along a continuum from linear to nonlinear models. Rather than competing, the various statistical models serve different purposes. IRT models may be used for scaling item responses and modeling structural relations, and GT models for pinpointing sources of measurement error variance and thereby enabling the design of reliable measurements.
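As an illustration of how GT pinpoints sources of error variance, the following minimal sketch estimates variance components for a fully crossed persons-by-tasks design and then projects the generalizability coefficient for a longer assessment. The function names and the ratings are hypothetical and not taken from the studies cited; the estimator is the standard expected-mean-square solution for a one-facet design, which is only the simplest case of GT.

```python
from statistics import mean


def g_study_p_x_i(scores):
    """Estimate variance components for a crossed persons x items design.

    scores: list of rows, one row per person, one column per item/task.
    Returns (var_person, var_item, var_residual).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    item_means = [mean(row[j] for row in scores) for j in range(n_i)]

    # Mean squares of the two-way ANOVA without replication
    ms_p = n_i * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in item_means) / (n_i - 1)
    ss_res = sum(
        (scores[p][j] - person_means[p] - item_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Expected-mean-square solutions; negative estimates are set to zero
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    return var_p, var_i, var_res


def relative_g_coefficient(var_p, var_res, n_items):
    """Generalizability coefficient for relative decisions with n_items items."""
    return var_p / (var_p + var_res / n_items)


# Hypothetical ratings: 4 persons x 3 tasks (illustration only)
scores = [
    [2, 3, 4],
    [4, 5, 6],
    [3, 3, 6],
    [5, 7, 6],
]
var_p, var_i, var_res = g_study_p_x_i(scores)
g3 = relative_g_coefficient(var_p, var_res, 3)  # design as observed
g6 = relative_g_coefficient(var_p, var_res, 6)  # decision study: twice as many tasks
```

Comparing `g3` and `g6` shows the design logic: once the error components are known, one can ask how many tasks (or raters, or occasions in richer designs) are needed to reach a target level of dependability.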


Finally, we would like to point out that the proposed framework (Figure 1) is heuristic not only in suggesting multiple new research studies but also in that it can be viewed as a “grand” structural model. The analytic (latent trait) side of the model (left side of Figure 1) includes indicators for cognitive, affective, and motivational traits demanded in particular contexts/situations. Such competencies are structurally related to real-world performance (right side) through a set of perceptual, interpretive, and decision-making processes (middle). Research on competence measurement, then, might work on various parts of the model and even attempt to test the entire model conceptually and statistically. Viewing competence as a continuum and applying a corresponding range of assessment formats required by the framework is conceptually and methodologically challenging. But we believe that solutions exist or can be developed to deal with these challenges, and we tried to sketch out possible approaches to trustworthy competence assessments that overcome the risk of forgetting either observable behavior or cognitive abilities. If our reasoning holds, it opens up a great range of research questions. With the proposed integrated approach and the improvement of measurement of competence, the field of higher education will be in a position to address important, substantive questions. For example, we should be able to examine the developmental trajectories of competence, identify groups of students with differential developmental patterns, and determine effective educational strategies for development. We should be able to go beyond immediate measurement of behavior in situ to longer-term measurements of life outcomes that go beyond earnings and include health, family, and civic and social engagement. We should also be able to study the interaction of perception, interpretation, and decision-making in the education and training of students for particular life outcomes.
Higher education is certainly a field with huge research gaps. By providing this overview and by editing this special ZfP issue, we hope to inspire and encourage many colleagues to look into this field and to take up the challenge of what it means to define and assess competence acquired in higher education.

References

Arthur, W., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56, 125–154.
Bakker, A. B. (2011). An evidence-based model of work engagement. Current Directions in Psychological Science, 20, 265–269.
Baltes, P. B., Reese, H. W., & Lipsitt, L. P. (1980). Life-span developmental psychology. Annual Review of Psychology, 31, 65–110.
Benjamin, R. (2013). The principles and logic of competency testing in higher education. In S. Blömeke, O. Zlatkin-Troitschanskaia, C. Kuhn, & J. Fege (Eds.), Modeling and measuring competencies in higher education: Tasks and challenges (pp. 127–136). Boston, MA: Sense.


Bennett, Y. (1993). The validity and reliability of assessments and self-assessments of work-based learning. Assessment & Evaluation in Higher Education, 18, 83–94.
Berry, C. M., Clark, M. A., & McClure, T. (2011). Black-white differences in the criterion-related validity of cognitive ability tests: A qualitative and quantitative review. Journal of Applied Psychology, 96, 881–906.
Biemans, H., Nieuwenhuis, L., Poell, R., Mulder, M., & Wesselink, R. (2004). Competence-based VET in The Netherlands: Backgrounds and pitfalls. Journal of Vocational Education and Training, 56, 523–538.
Birenbaum, M. (2007). Evaluating the assessment: Sources of evidence for quality assurance. Studies in Educational Evaluation, 33, 29–49.
Blömeke, S., Busse, A., Suhl, U., Kaiser, G., Benthien, J., Döhrmann, M., & König, J. (2014). Entwicklung von Lehrpersonen in den ersten Berufsjahren: Längsschnittliche Vorhersage von Unterrichtswahrnehmung und Lehrerreaktionen durch Ausbildungsergebnisse [Teacher development during the initial years in the profession: Longitudinal prediction of lesson perception and teacher reactions by means of training results]. Zeitschrift für Erziehungswissenschaft, 17, 509–542.
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2013). Call for papers: Assessment of competencies in higher education – a topical issue of the Zeitschrift für Psychologie. Zeitschrift für Psychologie, 221, 202.
Blömeke, S., Zlatkin-Troitschanskaia, O., Kuhn, Ch., & Fege, J. (Eds.). (2013). Modeling and measuring competencies in higher education: Tasks and challenges. Rotterdam, The Netherlands: Sense.
Boyatzis, R. E. (1982). The competent manager. New York, NY: Wiley.
Brennan, R. L. (2001). Generalizability theory. New York, NY: Springer.
Brief, A. P., & Weiss, H. M. (2001). Organizational behavior: Affect in the workplace. Annual Review of Psychology, 53, 279–307.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Cattell, R. B. (1971). Abilities: Their structure, growth, and action. New York, NY: Houghton Mifflin.
Corno, L., Cronbach, L. J., Kupermintz, H., Lohman, D. F., Mandinach, E. B., Porteus, A. W., . . . Talbert, J. E. (2002). Remaking the concept of aptitude: Extending the legacy of R. E. Snow. Mahwah, NJ: Erlbaum.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: Wiley.
Eid, M., & Diener, E. (1999). Intraindividual variability in affect: Reliability, validity, and personality correlates. Journal of Personality and Social Psychology, 76, 662–676.
Epstein, R. M., & Hundert, E. M. (2002). Defining and assessing professional competence. JAMA, 287, 226–235.
Eraut, M. (2003). National vocational qualifications in England: Description and analysis of an alternative qualification system. In G. Straka (Ed.), Zertifizierung non-formell und informell erworbener beruflicher Kompetenzen. Münster, Germany: Waxmann.
Förster, M., Zlatkin-Troitschanskaia, O., Brückner, S., Happ, R., Hambleton, R., Walstad, W. B., . . . Yamaoka, M. (2015). Validating test score interpretations by cross-national comparison: Comparing the results of students from Japan and Germany on an American test of economic knowledge in higher education. Zeitschrift für Psychologie, 223, 14–23. doi: 10.1027/2151-2604/a000195


Gold, B., Förster, St., & Holodynski, M. (2013). Evaluation eines videobasierten Trainingsseminars zur Förderung der professionellen Wahrnehmung von Klassenführung im Grundschulunterricht [Evaluation of a video-based training program to enhance professional perception of classroom leadership in primary school education]. Zeitschrift für Pädagogische Psychologie, 27, 141–155.
Grant, G., Elbow, P., & Ewens, T. (1979). On competence: A critical analysis of competence-based reforms in higher education. San Francisco, CA: Jossey-Bass.
Gustafsson, J.-E. (2002). Measurement from a hierarchical point of view. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 73–95). London, UK: Erlbaum.
Gustafsson, J.-E., & Åberg-Bengtsson, L. (2010). Unidimensionality and interpretability of psychological instruments. In S. E. Embretson (Ed.), Measuring psychological constructs: Advances in model-based approaches. Washington, DC: American Psychological Association.
Hakanen, J. J., & Schaufeli, W. B. (2012). Do burnout and work engagement predict depressive symptoms and life satisfaction? A three-wave seven-year prospective study. Journal of Affective Disorders, 141, 415–424.
Handley, D. (2003). Assessment of competencies in England’s National Vocational Qualification system. In G. Straka (Ed.), Zertifizierung non-formell und informell erworbener beruflicher Kompetenzen. Münster, Germany: Waxmann.
Jenßen, L., Dunekacke, S., Eid, M., & Blömeke, S. (2015). The relationship of mathematical competence and mathematics anxiety: An application of latent state-trait theory. Zeitschrift für Psychologie, 223, 31–38. doi: 10.1027/2151-2604/a000197
Judge, T. A., Thoresen, C. J., Bono, J. E., & Patton, G. K. (2001). The job satisfaction-job performance relationship: A qualitative and quantitative review. Psychological Bulletin, 127, 376–407.
Kane, M. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73.
Kimbell, R. A. (2006). Innovative technological performance. In J. Dakers (Ed.), Defining technological literacy: Towards an epistemological framework (pp. 159–179). Basingstoke, UK: Palgrave.
König, J., Blömeke, S., Klein, P., Suhl, U., Busse, A., & Kaiser, G. (2014). Is teachers’ general pedagogical knowledge a premise for noticing and interpreting classroom situations? A video-based assessment approach. Teaching and Teacher Education, 38, 76–88.
Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie, 216, 61–73. doi: 10.1027/0044-3409.216.2.61
Kounin, J. S. (1970). Discipline and group management in classrooms. New York, NY: Holt, Rinehart, & Winston.
Lau, S., & Roeser, R. W. (2002). Cognitive abilities and motivational processes in high school students’ situational engagement and achievement in science. Educational Assessment, 8, 139–162.
McClelland, D. C. (1973). Testing for competence rather than testing for “intelligence”. American Psychologist, 28, 1–14.
McLachlan, G., & Peel, D. A. (2000). Finite mixture models. New York, NY: Wiley.
McMullan, M., Endacott, R., Gray, M. A., Jasper, M., Miller, C. M. L., Scholes, J., & Webb, C. (2003). Portfolios and assessment of competence: A review of the literature. Journal of Advanced Nursing, 41, 283–294.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Miller, G. E. (1990). The assessment of clinical skills/competence/performance. Academic Medicine: Journal of the Association of American Medical Colleges, 65, 63–67.
Mulder, M., Gulikers, J., Biemans, H., & Wesselink, R. (2009). The new competence concept in higher education: Error or enrichment? Journal of European Industrial Training, 33, 755–770.
Muthén, B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 81–117.
Oser, F. (2013). “I know how to do it, but I can’t do it”: Modeling competence profiles for future teachers and trainers. In S. Blömeke, O. Zlatkin-Troitschanskaia, C. Kuhn, & J. Fege (Eds.), Modeling and measuring competencies in higher education: Tasks and challenges (pp. 45–60). Rotterdam, The Netherlands: Sense.
Raudenbush, S. W., Martinez, A., Bloom, H., Zhu, P., & Lin, F. (2010). Studying the reliability of group-level measures with implications for statistical power: A six-step paradigm [Working Paper]. Chicago, IL: University of Chicago.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47, 667–696.
Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8, 185–205.
Sadler, R. (2013). Making competent judgments of competence. In S. Blömeke, O. Zlatkin-Troitschanskaia, C. Kuhn, & J. Fege (Eds.), Modeling and measuring competencies in higher education: Tasks and challenges (pp. 13–27). Rotterdam, The Netherlands: Sense.
Schoenfeld, A. H. (2010). How we think: A theory of goal-oriented decision making and its educational applications. New York, NY: Routledge.
Shavelson, R. J. (2010). On the measurement of competency. Empirical Research in Vocational Education and Training, 1, 43–65.
Shavelson, R. J. (2012). An approach to testing and modeling competencies. In S. Blömeke, O. Zlatkin-Troitschanskaia, C. Kuhn, & J. Fege (Eds.), Modeling and measuring competencies in higher education: Tasks and challenges. Rotterdam, The Netherlands: Sense.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shulman, L. (1987). Knowledge and teaching: Foundations of the new reform. Harvard Educational Review, 57, 1–22.
Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. London, UK: Chapman and Hall/CRC.


Snow, R. E. (1994). Abilities in academic tasks. In R. J. Sternberg & R. K. Wagner (Eds.), Mind in context: Interactionist perspectives on human intelligence. New York, NY: Cambridge University Press.
Sparrow, P. R., & Bognanno, M. (1993). Competency requirement forecasting: Issues for international selection and assessment. International Journal of Selection and Assessment, 1, 50–58.
Spencer, L. M. Jr., & Spencer, S. M. (1993). Competence at work: Models for superior performance. New York, NY: Wiley.
Stanovich, K. E. (2009). What intelligence tests miss: The psychology of rational thought. New Haven, CT: Yale University Press.
Sternberg, R. J., & Grigorenko, E. L. (Eds.). (2003). The psychology of abilities, competencies, and expertise. Cambridge, MA: Cambridge University Press.
Steyer, R., Schmitt, M., & Eid, M. (1999). Latent state-trait theory and research in personality and individual differences. European Journal of Personality, 13, 389–408.
Stürmer, K., Könings, K. D., & Seidel, T. (2012). Declarative knowledge and professional vision in teacher education: Effect of courses in teaching and learning. British Journal of Educational Psychology, 83, 467–483.
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. Cambridge, UK: Cambridge University Press.
Wass, V., Van der Vleuten, C., Shatzer, J., & Jones, R. (2001). Assessment of clinical competence. Lancet, 357, 945–949.
Weinert, F. E. (2001). Concept of competence: A conceptual clarification. In D. S. Rychen & L. H. Salganik (Eds.), Defining and selecting key competencies (pp. 45–66). Göttingen, Germany: Hogrefe.

Sigrid Blömeke
University of Oslo
Faculty of Education
Centre for Educational Measurement (CEMO)
Niels Henrik Abels hus
Moltke Moes vei 35
0318 Oslo
Norway
Tel. +47 464 18755
E-mail sigribl@cemo.uio.no



Original Article

Validating Test Score Interpretations by Cross-National Comparison

Comparing the Results of Students From Japan and Germany on an American Test of Economic Knowledge in Higher Education

Manuel Förster,1 Olga Zlatkin-Troitschanskaia,1 Sebastian Brückner,1 Roland Happ,1 Ronald K. Hambleton,2 William B. Walstad,3 Tadayoshi Asano,4 and Michio Yamaoka5

1 Gutenberg School of Management and Economics, Johannes Gutenberg University, Mainz, Germany
2 University of Massachusetts, Amherst, MA, USA
3 University of Nebraska, Lincoln, NE, USA
4 Yamamura Gakuen College, Hatoyama, Saitama, Japan
5 Waseda University, Shinjuku, Tokyo, Japan

Abstract. Cross-national assessment of students’ competences in higher education is becoming increasingly important in many disciplines including economics but there are few available instruments that meet psychological standards for assessing students’ economic competence in higher education (HE). One of them is the internationally valid Test of Understanding in College Economics (TUCE), which has been adapted and employed successfully in HE systems in various countries, but the test results have seldom been used for international comparisons of students’ Economic Content Knowledge (ECK). Here, we compare the German and the Japanese test adaptations of the TUCE with reference to the American original in order to determine their suitability for comparative analyses of ECK in HE among these countries. Having critically examined the two test adaptations, we present a comparative analysis of students’ test scores in Germany and Japan and evaluate potential differences with regard to students’ acquisition of ECK while investigating country-specific influence factors.

Keywords: economic competence, cross-national assessment, higher education, Germany, Japan

Zeitschrift für Psychologie 2015; Vol. 223(1):14–23
DOI: 10.1027/2151-2604/a000195

M. Förster et al.: Validating Test Score Interpretations

Relevance and Research Challenges

Cross-national assessment of students’ competences in higher education (HE) is becoming increasingly important (OECD, 2012) in many disciplines, including economics. However, there are few available instruments that meet psychological standards for assessing students’ economic competence in HE (Zlatkin-Troitschanskaia, Förster, Brückner, & Happ, 2014). Increasing internationalization of economic study programs and curricula has made valid assessment of Economic Content Knowledge (ECK) and international comparability of HE learning outcomes crucial. This was illustrated recently by the Assessment of Higher Education Learning Outcomes (AHELO) study by the OECD (2012). Internationally comparable competence assessments in economics are also highly relevant for international rankings, which have become very popular and widely used in this field but which usually have relied on instruments that are valid to a limited extent only (Dehon, McCathie, & Verardi, 2010). Therefore, there is an urgent need to provide an objective, reliable, and valid tool for the assessment and comparison of economic competence.

International comparative analyses in economics in HE face huge challenges. There is enormous heterogeneity of economics study programs, and the structure of the field is very complex. Such comparative analyses have at least two prerequisites (OECD, 2012): First, the content to be compared must be equivalent among the respective countries and in terms of curricular validity; second, comparable test instruments must be used for assessments within the countries. Therefore, tests need to go through a complex adaptation process (Hambleton, 2001) to ensure, for example, that the construct to be assessed remains comparable despite any language and cultural differences. The International Test Commission has issued Test Adaptation Guidelines (TAG) to ensure high-quality test adaptations; however, the TAG provide only a rough orientation. The international adaptation and validation of test instruments is a complex, multifaceted task (AERA, APA, & NCME, 2004). The challenge of ensuring test validity in international comparisons has not yet been resolved. International comparative studies in HE lack methods for validating test score interpretations, particularly in the field of economics (OECD, 2012). It is essential to establish validity of score interpretations, especially when students’ test scores as well as item parameters are being used for extrapolating inferences about the quality of different national HE systems.

One aim of the Johannes Gutenberg University’s WiwiKom project for modeling and measuring professional economic competences in students and graduates is to adapt the internationally valid Test of Understanding in College Economics (TUCE; Walstad, Watts, & Rebeck, 2007) for the assessment of students’ economic competence in Germany (Zlatkin-Troitschanskaia et al., 2014). The TUCE already has been adapted and employed successfully in HE systems in various countries including Japan (Yamaoka, Walstad, Watts, Asano, & Abe, 2010), but the test results have seldom been used for international comparisons of students’ ECK (Rebeck, Walstad, Yamaoka, & Asano, 2009). In this paper, we compare the German and the Japanese test adaptations of the TUCE with reference to the American original in order to determine their suitability for comparative analyses of ECK in HE among these countries. Having critically examined the two test adaptations, we present a comparative analysis of students’ test scores in Germany and Japan and evaluate potential differences with regard to students’ acquisition of ECK while investigating country-specific influence factors.

Theoretical Criteria for Comparing Test Scores

An established approach for comparative studies is Bereday’s four-step approach (1964; Bray, Adamson, & Mason, 2007). The first step involves describing the phenomenon to be compared, which is ECK in this study, based on primary and secondary field-specific material obtained from country- or culture-specific sources. For the present study, this step involved literature reviews, document analyses, database analyses, as well as expert interviews (for analyses and results related to ECK, see Zlatkin-Troitschanskaia et al., 2014; for further findings on ECK discussed also from a comparative perspective, see Yamaoka et al., 2010). The second step, interpretation, involves specifying potential dimensions of the phenomenon to be compared based on current research findings. In this study, we specified the construct of ECK with regard to content- and structure-related cognitive dimensions (Zlatkin-Troitschanskaia et al., 2014). The third step involves juxtaposing the data, which means preliminary matching of data from different countries to prepare them for comparison. As part of the juxtaposition, information on the tertium comparationis should be gathered as well (Bray et al., 2007, p. 87), and comparative conditions for the analysis should be created, in particular by ensuring equivalence of the construct in the target countries. Construct equivalence is ensured through detailed validity analyses such as analyses of content or construct validity. In this study, content validity was analyzed through comparisons of economic study programs in HE and curricular analyses, while construct validity was analyzed using confirmatory factor analysis (CFA) to examine the test and its adaptations for the countries. The fourth step is the actual comparison, which is conducted according to theory-driven aims of examination. Steps 2–4 are described in more detail in the following subsections.

Construct Definition of ECK

We based our definition of ECK on the widely accepted general definition of competence by Weinert (2001). Accordingly, competence in economics enables a person to solve problems in economic situations based on an interaction of cognitive, metacognitive, affective, and self-regulatory dispositions. Thus, we followed a general understanding of competence as being based on cognitive, conative, affective, and motivational resources (Blömeke, Gustafsson, & Shavelson, 2015). Accordingly, competence was considered a continuum in which various dispositions interact depending on the situational requirements and manifest themselves in behavior, for example, in responses to economic items. Previous research has focused predominantly on the assessment of cognitive dispositions (Koeppen, Hartig, Klieme, & Leutner, 2008). This includes the assessment of content knowledge and the associated cognitive processes; both are considered fundamental dimensions of the construct of competence. Modeling approaches in international research in economics (Walstad et al., 2007; Yamaoka et al., 2010) often follow Bloom’s cognitive taxonomy of teaching and learning objectives (Bloom, Englehart, Furst, Hill, & Krathwohl, 1956) and the further developed version by Anderson and Krathwohl (2001), which allows international comparison for these content areas. With the aim of facilitating international comparisons and ensuring international compatibility of concepts, the WiwiKom project examines the cognitive dispositions of competence with regard to microeconomic and macroeconomic content such as determinants of supply and demand, factor markets, theories of the firm, measures of aggregate economic performance, and fiscal policies in connection with different cognitive levels such as remembering and understanding, applying and analyzing, creating and evaluating (Zlatkin-Troitschanskaia et al., 2014).

Thus, in this project, we modeled ECK theoretically as a key cognitive disposition of competence, together with the related cognitive processes involved in responding to economic items, and we assessed them empirically for both dimensions, microeconomics and macroeconomics.


Curricular Framework Conditions in the Countries and Comparability of Test Items

The AHELO study has presented evidence of a common international core curriculum in economics (OECD, 2012). To extend the preliminary AHELO results, we analyzed whether the field of study of economics and the curricular framework conditions were comparable among the target countries. International comparisons must take into account the respective country-specific organizational and curricular requirements in HE (Owen, 2012). In the WiwiKom project, additional organizational and curricular analyses of HE institutions are meant to ensure international compatibility and comparability of the test results and to help generate hypotheses on the differences among countries. A multistep adaptation and validation process was implemented in our study (Zlatkin-Troitschanskaia et al., 2014): The TUCE was validated based on the AERA Standards, which require evidence from five categories: (1) test content, (2) response processes, (3) internal structure, (4) relations to other variables, and (5) consequences of testing (AERA et al., 2004).

First, we conducted document analyses to determine which systematic differences can influence university students’ ECK, for example, differences in students’ economic knowledge from preuniversity education. We tested, in particular, whether the German and Japanese samples were comparable on a systemic level and an institutional level. Analysis of the HE systems in Germany and Japan showed that the systemic and structural differences were quite small and the samples were comparable. Nevertheless, we investigated the prior economic education of students from both countries in this study (for more information on the comparative analysis of the HE systems and on the comparability of the samples, see Brückner, Förster, Zlatkin-Troitschanskaia, & Walstad, in press).

Second, to ensure content validity, we comprehensively analyzed the curricula of economic studies in the US, Germany, and Japan to determine whether the content of the test was similarly central to and representative of the core curricula in these countries (Brückner et al., in press). We analyzed the curricula of 96 courses of study from 64 faculties, including all faculties that participated in the subsequent quantitative surveys.

Third, the WiwiKom project team conducted expert interviews with 32 lecturers and an online rating with 78 lecturers. In this online rating, lecturers evaluated items on a Likert scale from 1 (low) to 7 (high) regarding, for example, the curricular representativeness and difficulty of the items. All 60 TUCE items showed good curricular representativeness (mean = 4.83, SD = 0.592; median = 5, mode = 5). Thus, results of the analysis confirmed that the score of the German adaptation of the TUCE enables valid conclusions to be drawn about the ECK of university students in Germany. It was equally confirmed that the Japanese TUCE adaptation assesses relevant and valid economic curricular content from HE programs in Japan (Rebeck et al., 2009). Furthermore, comparison of the country-specific curricula indicated that there is a similar understanding of the construct of ECK in both countries. The criterion of evidence based on relationships to other variables was considered, for example, in analyses of the influence of prior economic education or study progress on ECK (see Brückner et al., in press). Because our focus in this paper is on AERA criterion 5, we only briefly discuss a few findings on the construct validity of the test in the section below (for more information on testing of AERA validation criteria 1–4, see Zlatkin-Troitschanskaia et al., 2014).

Linguistic and cultural differences also were examined as potential influences on the equivalence of the test adaptations and the interpretation of results. We juxtaposed the German and Japanese adaptations with the English original and compared each item across the three versions. For reference, we used a back translation of the Japanese adaptation. Back translations are commonly used in survey research but provide only a rough approximation to the linguistic quality of a test adaptation (Behling & Law, 2000). The translation was done by a professional translator who was given only the Japanese adaptation and was asked to produce a source-oriented translation into German. We also interviewed the translator on all potential differences we detected. The translation comparison was used to identify items and content areas that might pose problems to subsequent comparative analyses. The above-described analyses showed that the American test instrument underwent similar adaptation processes in Germany and Japan, and the two resulting adaptations are indeed substantially equivalent (Rebeck et al., 2009; Zlatkin-Troitschanskaia et al., 2014). Hence, all 60 TUCE items were adapted successfully and are available for international comparisons.

Still, there could be some factors that cannot be controlled, such as country-specific differences in teaching and learning, and that could affect the psychometric properties of adaptations of the TUCE. Therefore, for this comparative study we investigate whether there are practically relevant psychometric differences in the German and Japanese adaptations of the original American TUCE.
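The summary statistics reported above for the lecturers’ online rating (mean, SD, median, mode) can be computed along the following lines. This is a minimal sketch under stated assumptions: the item labels and ratings are hypothetical, and the text does not say how ratings were aggregated, so averaging per item first and then summarizing across items is an assumption of this example.

```python
from statistics import mean, median, mode, stdev

# Hypothetical ratings (1 = low, 7 = high) of curricular representativeness;
# each list holds the ratings that several lecturers gave to one item.
ratings_by_item = {
    "item_01": [5, 4, 6],
    "item_02": [5, 5, 5],
    "item_03": [4, 4, 4],
    "item_04": [6, 5, 7],
}

# Step 1 (assumed aggregation): average the lecturers' ratings per item
item_means = [mean(r) for r in ratings_by_item.values()]

# Step 2: summarize the item-level values across all items
summary = {
    "mean": round(mean(item_means), 2),
    "sd": round(stdev(item_means), 3),   # sample standard deviation
    "median": median(item_means),
    "mode": mode(round(m) for m in item_means),
}
```

With real data, a mean and median above the scale midpoint of 4 would correspond to the kind of result reported for the 60 TUCE items.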

Test

The original version of the TUCE was developed by the American Council for Economic Education (CEE; Walstad et al., 2007); the TUCE is now in its 4th revised edition. The TUCE assesses ECK in the two commonly distinguished dimensions of microeconomics and macroeconomics, and it is divided into two corresponding parts containing 30 items each. Each TUCE item is further sorted into one of several content areas within the two dimensions. Microeconomic content areas include basic problems, markets and prices, theories of the firm, factor markets, micro role of government, and international microeconomics; macroeconomic content areas include measuring aggregate performance, aggregate supply and demand, money and financial markets, monetary and fiscal policies, policy debates and applications, and international macroeconomics (Walstad et al., 2007).

Since construct validity is central to empirical comparative analyses, our first step was to confirm empirically the theoretically assumed multidimensionality for the German and Japanese test versions using CFA and multidimensional Rasch models. The two-dimensional structure was confirmed based on the entire data for both the German and Japanese test versions. Therefore, in the following analyses, we examined the item scores for both dimensions separately (for more information on evidence of content and construct validity of the TUCE, see Brückner et al., in press). With regard to the psychometric properties of the two dimensions, we examined reliability and validity. Sufficient reliability was established for both the microeconomics and the macroeconomics parts (micro: α = 0.70; macro: α = 0.77). Moreover, content validity of the TUCE was established through the work of six academic economists serving on the test development committee and then affirmed by a national review panel consisting of seven academic economists (Walstad et al., 2007).


Table 1. Distribution of students according to country and study progress

                        Germany        Japan
First year of study     243 (29%)      181 (34%)
Second year of study    354 (42%)      193 (36%)
Third year of study     238 (29%)      156 (29%)
Total                   835 (100%)     530 (100%)

Sample

The following analyses were based on a subsample from the WiwiKom project. The German TUCE data for the comparison were gathered in 2013 through a paper-and-pencil survey employing a balanced incomplete block design (Frey, Hartig, & Rupp, 2009). The German subsample included 835 students from 22 HE institutions who were presented with 30 of the 60 TUCE items. The adjusted sample for Japan comprised 263 students who responded to 30 items on microeconomics and 267 students who responded to 30 items on macroeconomics. These 530 students came from seven universities in Japan. In both countries, the surveys included questions on sociodemographic details, such as gender, age, study progress, courses attended, and prior economic education. In both countries, there were only a few missing values in the underlying contextual and personal variables (0.2–2.7%). This share of less than 5% allowed a robust replacement of these random missing values and would have justified even case-wise deletion. However, since case-wise deletion often causes biased parameter estimates and would have reduced the sample size, we used single imputation (Graham, Cumsille, & Elek-Fisk, 2003) based on the EM algorithm (Dempster, Laird, & Rubin, 1977). In view of the minor uncertainty in terms of missing values, multiple imputation would not have changed the parameters considerably.

Table 1 shows that the students in Germany and Japan were distributed similarly with regard to study progress. To enable a balanced estimation of the measurement model and to avoid bias due to a particularly large percentage of students in a given country or year of study, we weighted the sample so that the responses per item were distributed identically between the countries and among the years of study. In the sample in Germany, each student answered only 3/7ths of all items due to the booklet design. In Japan, each student answered half of the items. Thus, there were 3/7 · 835 + 1/2 · 530 ≈ 622 responses per item. In the weighting procedure, we distributed 624 responses per item equally over the years of study and the countries, so that the weighted sample included 104 responses per item in each dimension (micro and macro), for each year of study (first to third), and for each subsample (Germany and Japan). Thus, the following analyses refer to a sample of 312 students from Germany and 312 students from Japan.
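The arithmetic behind the weighting can be retraced in a few lines. The following sketch is an illustration only, not the authors' procedure: it assumes that a cell weight is simply the target of 104 responses per item divided by the raw number of responses per item in that country-by-year cell. The names `STUDENTS`, `ITEM_SHARE`, `raw_responses`, and `cell_weight` are hypothetical; the counts are taken from Table 1 and the item shares from the text.

```python
from fractions import Fraction

# Students per country and year of study (Table 1).
STUDENTS = {
    "Germany": {1: 243, 2: 354, 3: 238},
    "Japan": {1: 181, 2: 193, 3: 156},
}
# Share of all items each student answered, as stated in the text.
ITEM_SHARE = {"Germany": Fraction(3, 7), "Japan": Fraction(1, 2)}
TARGET = 104  # weighted responses per item in every country-by-year cell


def raw_responses(country, year):
    """Expected raw responses per item in one country-by-year cell."""
    return STUDENTS[country][year] * ITEM_SHARE[country]


def cell_weight(country, year):
    """Assumed weight: target divided by raw responses in the cell."""
    return TARGET / raw_responses(country, year)


cells = [(c, y) for c in STUDENTS for y in STUDENTS[c]]
raw_total = sum(raw_responses(c, y) for c, y in cells)
weighted_total = sum(cell_weight(c, y) * raw_responses(c, y) for c, y in cells)

print(float(raw_total))   # raw responses per item, summed over all six cells
print(weighted_total)     # six cells times the target of 104
for c, y in cells:
    print(c, y, round(float(cell_weight(c, y)), 3))
```

Under this assumption, the raw total is 4360/7 (about 622.9, which the text gives as 622), and the weighted total is 6 × 104 = 624, that is, 312 per country.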

Method

For the comparison of countries in this study, our first step involved applying multigroup CFA to estimate separate measurement models for the German and Japanese samples (Steinmetz, 2013). We followed a step-up approach as suggested by Brown (2006), in which the CFA models for both countries were estimated first without restrictions and then with constraints added progressively. First, we analyzed configural invariance, imposing only an equal factor structure for both countries. Next, we tested for metric invariance, assuming identical factor loadings. Subsequently, we analyzed scalar invariance, testing for equal intercepts or thresholds. The respective models were compared using chi-squared (χ²) tests and common criteria of fit. These analyses helped to identify items that were not measurement invariant and should be submitted to a more detailed qualitative examination. Furthermore, once a sufficient model fit was confirmed, we used partial scalar invariance to estimate differences in the means between the two countries. In accordance with the construct definition and the test design, we first modeled the content areas of microeconomics and macroeconomics separately and analyzed them for measurement invariance.

Our second step involved applying multiple indicators and multiple causes (MIMIC) models (Finch, 2005). The advantage of MIMIC models over multigroup CFA is that MIMIC models allow a more flexible estimation of models and easier integration of covariates. In the MIMIC models, the latent variable and the indicators are both regressed on newly integrated manifest covariates. We tested the effects of differential item functioning (DIF) of certain covariates (gender and study progress) within the countries (Woods, 2009), and we analyzed their influence on the microeconomics and macroeconomics scores of the adapted versions of the TUCE in both countries.
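For orientation, the invariance hierarchy and the MIMIC specification described in this section can be written compactly. The notation below is a common textbook formulation for dichotomous indicators and is not taken from the article itself: all symbols are assumptions introduced for illustration, with g indexing the country (G for Germany, J for Japan), i the item, and x the vector of manifest covariates (gender and the study-progress indicators).

```latex
\begin{align*}
% Multigroup CFA with a latent response variable for dichotomous item i
y^{*}_{ig} &= \lambda_{ig}\,\eta_{g} + \varepsilon_{ig},
&
y_{ig} &=
\begin{cases}
1 & \text{if } y^{*}_{ig} > \tau_{ig},\\
0 & \text{otherwise.}
\end{cases}
\\[6pt]
% Nested invariance constraints across the two countries
\text{configural:}\quad & \text{same factor structure, } \lambda_{ig} \text{ and } \tau_{ig} \text{ free} \\
\text{metric:}\quad & \lambda_{iG} = \lambda_{iJ} \quad \text{for all } i \\
\text{scalar:}\quad & \lambda_{iG} = \lambda_{iJ} \ \text{ and } \ \tau_{iG} = \tau_{iJ} \quad \text{for all } i
\\[6pt]
% MIMIC model within one country
\eta &= \boldsymbol{\gamma}^{\top}\mathbf{x} + \zeta,
&
y^{*}_{i} &= \lambda_{i}\,\eta + \boldsymbol{\beta}_{i}^{\top}\mathbf{x} + \varepsilon_{i}.
\end{align*}
```

In this notation, partial invariance frees the equality constraint for selected items only, the vector γ captures covariate effects on the latent score, and a nonzero direct effect β_i would indicate DIF for item i.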

Results

Measurement Invariance for the Test Dimension of Microeconomics

First, we analyzed the 30 items in the test dimension of microeconomics for configural invariance. Within the baseline model, a CFA model was estimated separately for Germany and for Japan. Model 1 was estimated without restrictions using Mplus software version 7 and a weighted least squares estimator with standard errors and a mean-adjusted and variance-adjusted χ² test statistic (WLSMV) for dichotomous data (Beauducel & Herzberg, 2006). Since the variables were dichotomous, we analyzed thresholds instead of the intercepts used for metric data. Furthermore, we used scaling factors for both countries. The χ² value was significant at the 5% level with a p-value of .034 (Table 2). The root mean square error of approximation (RMSEA) of 0.018 was below the suggested criterion of 0.05 (Browne & Cudeck, 1998), which indicated a good fit.

Our first constraint for the measurement models of both countries was that all items should load positively on the latent variable of microeconomic knowledge. In the German sample, Item 14 had a significant negative factor loading (λ = −0.180), indicating that students with a high latent score performed worse on this item than students with a low latent score. In the Japanese model, the item had a positive loading (λ = 0.148). Due to the negative loading in the German model, we excluded Item 14 from further analyses and calculated a new baseline model. In baseline Model 2 (Table 2), Item 14 was excluded, and the factor loadings and thresholds were estimated freely for both countries. The χ² test was barely significant at the 5% level with a p-value of .045; the ratio of χ² to df was lower as well. Model 2 did not contain any negative factor loadings, and the overall model fit was considered good.

In Model 3, we tested for metric invariance by constraining factor loadings to be equal in both countries. The robust χ² difference test for the WLSMV estimator showed a p-value of .172 (χ² diff = 36.064, df diff = 29), indicating that Model 3 did not have a significantly poorer fit than Model 2. This result confirmed metric invariance and suggested that the factor loadings for both countries indeed could be considered identical.

In Model 4, we added the constraint of equal thresholds for both countries to test for scalar invariance. Model 4 clearly showed a poorer fit than Model 3. The χ² test was highly significant, and the RMSEA was slightly worse than in Model 3. The robust χ² difference test confirmed a significantly worse fit than Model 3 (p-value = .000; χ² diff = 301.345, df diff = 28). Thus, the assumption that the thresholds were equal in both countries was refuted. Inspecting the modification indices, we found that the deviation from measurement invariance was particularly large for the thresholds of Items 4, 22, and 23. Therefore, we tested for partial scalar invariance in Model 5, estimating the thresholds of Items 4, 22, and 23 for both countries without constraints. Model 5 exhibited a clearly better fit than Model 4 but still a clearly poorer fit than Model 3 (Table 2 and χ² difference test compared to Model 3: p-value = .000; χ² diff = 164.249, df diff = 25).
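The p-values of the difference tests follow from referring the reported test statistic to a χ² distribution with the reported degrees of freedom. The sketch below performs only that conversion, using a hypothetical helper `chi2_sf` built on the Python standard library. It does not reproduce the robust difference test itself: with the WLSMV estimator, the difference statistic is not the simple difference of the two models' χ² values, so the inputs here are the statistics as reported in the text.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1), started
    from Q(1, x) = exp(-x) for even df or Q(1/2, x) = erfc(sqrt(x)) for
    odd df, where x = stat / 2 and the target is a = df / 2.
    """
    if stat <= 0:
        return 1.0
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Model 3 vs. Model 2: reported statistic 36.064 with 29 df (reported p = .172).
print(round(chi2_sf(36.064, 29), 3))
# Model 4 vs. Model 3: reported statistic 301.345 with 28 df (reported p = .000).
print(chi2_sf(301.345, 28))
```

The same helper can be applied to the difference tests reported for the macroeconomic models.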

Table 2. Model fits for testing measurement invariance between countries in the microeconomic model

Model  Description                                                                                χ²          df    p-value   χ²/df   RMSEA (p-value)
1      Configural invariance                                                                      884.883     810   .034      1.092   0.018 (.000)
2      Configural invariance without Item 14                                                      821.009     754   .045      1.088   0.014 (.000)
3      Metric invariance without Item 14                                                          854.722     783   .038      1.092   0.014 (.000)
4      Scalar invariance without Item 14                                                          1,050.756   811   .000      1.296   0.025 (.000)
5      Partial scalar invariance with free thresholds of Items 4, 22, and 23 and without Item 14  970.504     808   .000      1.201   0.020 (.000)
6      Scalar invariance without Items 14, 4, 22, and 23                                          790.897     649   .000      1.219   0.021 (.000)

Comparison of Means in Germany and Japan for the Test Dimension of Microeconomics

Our second aim was to estimate differences of means in the German and Japanese models. Although we did not establish scalar invariance, the weighting of the samples ensured that the sample sizes in both countries were the same. Thus, if the estimation of means based on the model of scalar invariance included potentially problematic factor loadings and thresholds of items that normally would differ between the countries, a fair trade-off would be created in the shared German and Japanese measurement model. If the sample sizes were different in the two countries, the estimation probably would be biased toward the parameters of the country with the larger sample size. The underlying Model 5 ensured a fair measurement for both countries, in which the estimation of thresholds was not dominated by either one of the country-specific samples. Based on this measurement model of microeconomic knowledge, the estimation of means showed that the Japanese knowledge score was 0.665 standard deviations (p = .000) below the German score. The German score was set to 0.00, and the variance of both scores was 1.00.

In addition to this model of partial scalar invariance, we estimated Model 6, from which we excluded problematic Items 4, 22, and 23. Consequently, we assumed scalar invariance in both countries for this model instead of freely estimating the thresholds for the remaining 26 items. We generated Model 6 to test whether partial exclusion of the three items would have an effect on the difference of means between the countries. Earlier analyses and simulation studies have shown that freely estimated intercepts or thresholds in a partial measurement invariance model can have a strong influence on the latent variable score in multi-group comparisons (Steinmetz, 2013). In Model 6, the difference of means of microeconomic knowledge was similar. The mean score of the students in Japan was still 0.690 standard deviations below the mean score of the students in Germany. Thus, the freely estimated intercepts in the partial scalar invariance Model 5 did not have a large effect. These results indicated that the difference in means between the students in Germany and those in Japan amounted to approximately 0.68 standard deviations, which we considered substantial.

Invariance for the Test Dimension of Macroeconomics

We followed the same approach for macroeconomic knowledge. The fit values of the respective measurement models are indicated in Table 3. Model 7 was the baseline model. The ratio of χ² to df and the RMSEA indicated a good fit. In the macroeconomic measurement model, there were no negative factor loadings in either of the two countries; hence, we did not need to exclude any items from the model. In Model 8, we tested for metric invariance, imposing identical factor loadings for both countries. The robust χ² difference test indicated a significantly poorer fit to the data than Model 7 (χ² diff = 56.068, df diff = 30, p = .003). According to the analysis of modification indices and to the comparison of the students' responses on the German and the Japanese questionnaires, Items 57 and 42 deviated substantially from the assumed metric invariance between the countries. Factor loadings of these two items were estimated freely in Model 9, which showed a clearly better fit. Almost all fit criteria were on the same level as in baseline Model 7, and the robust χ² difference test to Model 7 was not significant (χ² diff = 38.208, df diff = 28, p = .095), indicating an acceptable fit for the model of partial metric invariance.

In Model 10, we tested for scalar measurement invariance. We estimated the factor loadings of Items 57 and 42 freely in both countries and imposed equal thresholds for all items in both countries. The fit of Model 10 was worse than that of Model 9 (χ² diff = 142.788, df diff = 29, p = .000). Accordingly, we could not assume all item thresholds to be equal in the two countries. We analyzed the modification indices again and found that the thresholds of Items 57 and 48 clearly differed between the countries, which is why we rejected the assumption of equality for them. For Item 57, we thus found noninvariance not only of the factor loading but also of the threshold. In Model 11, we also allowed variation of the thresholds of these two indicators between the countries, which resulted in a better fit. Nevertheless, the χ² difference test to Model 9 was significant (χ² diff = 105.276, df diff = 26, p = .000). Overall, we did not establish scalar invariance or partial scalar invariance. However, the RMSEA was clearly below 0.05, indicating a good overall fit.

Comparison of Means Between Germany and Japan for Test Dimension of Macroeconomics

Similar to the estimations for microeconomics, we estimated the differences of means between both countries for the latent variable of macroeconomic knowledge. We estimated the means based on Model 11, which was used to test partial scalar measurement invariance. The latent score of macroeconomic knowledge in Germany was set to 0, and the variance was 1 in both countries. The mean in Japan was 1.058 standard deviations below the German mean. When we excluded problematic Items 42, 48, and 57 in Model 12, the difference of means changed only slightly to 1.052, indicating that the score differences are quite robust.

Table 3. Model fits for testing measurement invariance in the macroeconomic model

Model  Description                                                                                                χ²          df    p-value   χ²/df   RMSEA (p-value)
7      Configural invariance                                                                                      966.651     810   .000      1.193   0.020 (.000)
8      Metric invariance                                                                                          1,031.461   840   .000      1.228   0.022 (.000)
9      Partial metric invariance with free factor loadings on Items 42 and 57                                     997.071     838   .000      1.190   0.020 (.000)
10     Scalar invariance with free factor loadings on Items 42 and 57                                             1,105.135   867   .000      1.275   0.024 (.000)
11     Partial scalar invariance with free factor loading on Items 42, 48, 57 and free intercepts on Items 48 and 57  1,079.142   864   .000      1.249   0.023 (.000)
12     Scalar invariance without Items 42, 48, 57                                                                 909.324     701   .000      1.297   0.025 (.000)

 2015 Hogrefe Publishing

Zeitschrift für Psychologie 2015; Vol. 223(1):14–23



MIMIC Modeling to Compare Gender and Study Progress Effects on ECK

Overall, the comparison of means showed that students in Japan had a lower latent knowledge score in microeconomics and macroeconomics than students in Germany. The model was based on a sample including male and female undergraduate (bachelor) students in their first, second, or third year. Our analyses focused on the potential influence of these subgroups on the latent score in both countries. To this end, we used country-specific MIMIC models to analyze two aspects: whether the variables of gender and study progress (years of university study) had a similar effect in both countries and in both economic dimensions, and whether there were differential item effects caused by gender or study progress within one country. The question was whether the item effects determined between the countries also could be observed for the subgroups. Since the effect of study progress does not have to be linear, two dummy variables were used to compare the 2nd-year and 3rd-year students to the 1st-year students. The model fit is presented in Table 4; results of the regression are summarized in Table 5.

The RMSEA was below 0.05, and the ratio of χ² to df was below 2 for all the models, which indicated a good fit (Table 4). We found considerable differences between the countries and the test dimensions. Gender seemed to have no effect in Japan but a large effect on both knowledge dimensions in Germany. The increase in microeconomic and macroeconomic knowledge per year of study was comparable in the 2nd year of study in Germany and Japan but considerably higher in Japan during the 3rd year of study. In Germany, 3rd-year students did not achieve higher scores than 2nd-year students. In both countries, the two variables explained about 25% of the variance in macroeconomic knowledge and, thus, clearly more than in microeconomic knowledge (14.5% in Germany and 17.8% in Japan). For the sample being analyzed, we found no DIF effects based on gender or study progress in either of the countries. This means that all the differences within the countries could be explained by differences in the latent score related to gender or years of study. As there were no DIF effects, no subgroup seemed to be disadvantaged by single items.

The MIMIC models showed that students in Japan had a slightly larger increase in knowledge of macroeconomics than of microeconomics. This also is evident from the distribution of the Japanese knowledge scores over the years of study. For this analysis, we used the factor scores from Models 5 and 11 for partial scalar measurement invariance so that the scores were comparable between the countries. Students in Japan had a macroeconomics score of −1.48 in their 1st year of study, −1.04 in their 2nd year, and −0.54 in their 3rd year. Students in Germany had a macroeconomics score of −0.26 in their 1st year of study, 0.14 in their 2nd year, and 0.09 in their 3rd year. The students in Japan started at a considerably lower level than the students in Germany but had considerably higher rates of increase during their 3rd year of study than the students in Germany (see also Models 14 and 16 in Table 4). In Germany, students' macroeconomic knowledge remained rather static in their 3rd year of study, as did their microeconomic knowledge. In Japan, microeconomic knowledge increased constantly over the course of study (1st year M = −0.98, 2nd year M = −0.62, 3rd year M = −0.35), while in Germany, there was no knowledge increase toward the end of bachelor studies (1st year M = −0.19, 2nd year M = 0.14, 3rd year M = 0.06).
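The dummy coding of study progress described in this section can be illustrated as follows. This is a minimal sketch with hypothetical names (`covariates`, `male`, `second_year`, `third_year`), not the authors' data preparation. First-year students form the reference category, so each year coefficient in Table 5 is a contrast against the first year and no linear trend across years is imposed.

```python
def covariates(is_male, study_year):
    """Return the manifest covariates for one student as 0/1 indicators.

    study_year is 1, 2, or 3. First-year students are the reference
    category, so both year dummies are 0 for them.
    """
    if study_year not in (1, 2, 3):
        raise ValueError("study_year must be 1, 2, or 3")
    return {
        "male": int(is_male),
        "second_year": int(study_year == 2),
        "third_year": int(study_year == 3),
    }


for year in (1, 2, 3):
    print(year, covariates(is_male=False, study_year=year))
```

With this coding, a female first-year student has all three indicators equal to 0, so the reported coefficients describe shifts in the latent score relative to that group.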

Table 4. Model fits for the tested MIMIC models

Model  Description                 χ²        df    p-value   χ²/df   RMSEA (p-value)
13     MIMIC Model Micro Germany   526.117   461   .019      1.141   0.014 (.000)
14     MIMIC Model Macro Germany   579.666   492   .004      1.178   0.016 (.000)
15     MIMIC Model Micro Japan     499.532   461   .104      1.083   0.018 (.000)
16     MIMIC Model Macro Japan     632.611   492   .000      1.286   0.033 (.000)

Table 5. Influence of gender and years of study on economic knowledge within one country (MIMIC models)

Variable        Model 13           Model 14           Model 15         Model 16
                Micro Germany (β)  Macro Germany (β)  Micro Japan (β)  Macro Japan (β)
Male student    0.216*             0.397*             0.024            0.026
Second year(a)  0.328*             0.333*             0.295*           0.299*
Third year(a)   0.274*             0.323*             0.484*           0.582*
R²              .145               .264               .178             .254

(a) Compared to 1st-year students; *p < .01.


Discussion

Problematic Items

We established that the samples from Germany and Japan were comparable on a systemic level and a structure level, and we verified content and construct validity for both countries. The aim of the comparative analyses of the two countries was to test the different types of measurement invariance in order to draw conclusions about which items varied considerably in their psychometric properties between the countries and, subsequently, to develop hypotheses about the reasons for the different item functioning with regard to the two adapted test versions. Some test items exhibited different psychometric properties in the cross-national comparison. We examined whether these differences could be explained by the country-specific adaptations of the TUCE. The respective translation and adaptation processes included various linguistic, cultural, as well as content-related and curriculum-based modifications, which may have influenced the psychometric properties of the test items. Below we discuss and provide examples of the observed item differences and preliminary explanations.

We found notable linguistic and cultural differences in Items 4, 22, and 23 between the German and the Japanese test versions. Linguistic differences were detected in Item 4: The German response options were short nominal phrases and followed English syntax closely; the Japanese response options were syntactically more explicit and expressed the hypothetical character of the tasks, which was only implicit in the German and English versions. The more explicit items in the Japanese version may have been easier to respond to than the less explicit items in the German and American test versions. This hypothesis also was supported by the fact that the rate of correct responses for the Japanese item was 55.7% and, thus, considerably higher than the rate of 29.8% for the German version of the item.
Furthermore, we found lexical differences in the response options of Items 22 and 23 that may have rendered the Japanese items slightly more explicit for the students. In Items 57 and 48, we detected differences in the adaptations that were mainly numerical but may have influenced the underlying cognitive processes necessary for responding to the items. Currency measures had to be adapted in both items. This way, the levels of magnitude assigned to macroeconomic measures remained the same in accordance with the factual macroeconomic situational context within the countries. The larger values in the Japanese version might have been more difficult to process cognitively; accordingly, processing the Japanese item might have required a higher level of other latent abilities, such as numerical reasoning, than processing the German item. For a correct interpretation of Item 14, students had to draw on content that was clearly part of the curricula in both countries. However, this was the only TUCE item that required the students to interpret an economic graph. The negative loading in the German model suggests that analyzing graphs might require an ability different from that used to analyze other item formats (Lin, Wilson, & Cheng, 2013). This difference between the countries might be due to systematic differences in teaching and learning that could not be controlled.

In contrast to other tests in international comparative studies such as AHELO, the TUCE adaptations were developed independently in separate projects and were not designed primarily for international comparisons of data, but for national assessments within the respective, country-specific HE systems. Nevertheless, the German and the Japanese TUCE versions enabled international comparisons under certain constraints. The factor loadings in the German and Japanese samples were quite similar, which supported metric measurement invariance in both subdimensions of ECK. Scalar measurement invariance between the countries was not established, as differences in the thresholds were much more pronounced. Overall, the measurement models showed a good fit. Future studies aiming to improve comparability of the two test versions could focus on the identified items and modify them to test the generated hypotheses regarding the cross-national differences in the psychometric item properties.

Differences of Means

We found systematic mean differences in the level of ECK and growth rates between countries. In follow-up analyses, we focused on whether the higher level of ECK of students in Germany at the beginning of their studies could be explained by their previous education in economics (see Brückner et al., in press). Our findings on the knowledge increase rates were not based on a panel of students. Instead, growth rates were calculated from cross-sectional data from different groups. In future research, it would be desirable to test the above results in a truly longitudinal assessment of students in the field of economics in both countries.

The gender effect needs to be discussed critically. We observed the striking gender effect only with the German sample, and this effect also has been reported in other relevant studies on Germany (Zlatkin-Troitschanskaia, Förster, & Kuhn, 2013). Future comparative research should investigate why a gender effect was detected for Germany but not Japan. The multiple-choice item format also should be evaluated critically. Female students are considered more risk-averse than male students: When they respond to multiple-choice items, they tend to avoid guessing more often than their male peers. Therefore, in the TUCE 4 we will analyze the missing values and the students' responses with regard to systematic differences in the missing values or the response behavior of female and male participants. We also will examine whether there were systematic differences in the response behavior or response strategies between the two countries.

Validating Test Score Interpretations

Our analyses of construct equivalence and measurement invariance between Germany and Japan indicated that the assumed test score interpretations of ECK were indeed equally valid between the countries and, thus, could support general interpretations (Hambleton, Merenda, & Spielberger, 2005). Construct equivalence was confirmed by verifying the comparability of framework conditions, interviewing experts from the field, and producing back translations. Once cross-national equivalence of constructs was established theoretically, the key challenge was operationalizing it in a suitable measurement model for testing measurement invariance (Hambleton, 2001). With regard to the requirements specified in the measurement model, we found that the analysis of measurement invariance can provide evidence of systematic differences between countries being compared and can indicate whether those differences are due to variations in the construct or are caused by just methodological artifacts, which may invalidate general interpretations of validity (Kane, 2013). It has been shown that measurement invariance is established most frequently between countries with similar framework conditions. Thus, for countries with comparable educational and economic conditions, such as Germany and Japan, the findings tend to be according to expectations.

Conclusion

Our analyses showed that ECK can be assessed validly and comparably with the German and Japanese versions of the TUCE. Our analyses also revealed that the psychometric properties of items can be influenced by minor linguistic, cultural, or even purely numerical modifications. Cultural adaptation of items can entail problems of equivalence among test versions. Thus, the juxtaposition of different adapted and back-translated test versions provides important and valuable evidence for international comparative competence research. In view of the particular challenges and the research deficit in international comparative research in HE, adapted test instruments should be taken into account specifically for cross-national comparative studies, since such studies can provide a critical evaluation of the psychometric properties of an instrument and can point out potential limitations.

Despite the general usefulness of the German and Japanese adaptations of the TUCE, an intriguing question arises as to the huge difference in scores between the genders in Germany. We are not yet willing to accept that these differences reflect differences in true competence. Instead, test-taking attitudes or affective reactions and behaviors could be more influential. Future research is needed to understand this issue.

This paper illustrates how diverse challenges can arise in the assessment of single competence facets such as ECK. Even if the mixed-method analysis in this study confirmed sufficient content and construct validity of the ECK test, further analyses of the cognitive validity based on cognitive interviews with 36 students indicated that the students drew not only on their ECK when responding to the TUCE items, but also on other cognitive, affective, and motivational dispositions. For example, the concurrent and retrospective cognitive interviews with a randomized purposeful sample of male and female students at different points of their study progress showed ample use of elimination and guessing strategies, which corresponded, however, with a higher probability of selecting an incorrect response. It also was apparent that the emotional state and familiarity with the content area influenced the use of such construct-irrelevant test-taking strategies. Nevertheless, these aspects were not as important as construct-relevant cognitive processes, such as abductive economic reasoning, which were mainly responsible for correct responses to items. Further research in the modeling of economic competence can address in greater detail the questions of how the different competence resources interact and result in performance in specific situations, as well as which assessment formats enable an appropriate and valid measurement to this end (see also Blömeke et al., 2015).

References

AERA, APA, & NCME. (2004). Standards for educational and psychological testing (2nd ed.). Washington, DC: American Psychological Association.
Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. New York, NY: Longman.
Beauducel, A., & Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Structural Equation Modeling, 13, 186–203.
Behling, O., & Law, K. S. (2000). Translating questionnaires and other research instruments: Problems and solutions. Thousand Oaks, CA: Sage.
Bereday, G. (1964). Comparative method in education. New York, NY: Holt, Rinehart, & Winston.
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: Competence viewed as a continuum. Zeitschrift für Psychologie, 223. doi: 10.1027/2151-2604/a000194
Bloom, B. S., Englehart, M. B., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (1956). Taxonomy of educational objectives, the classification of educational goals – Handbook I: Cognitive domain. New York, NY: McKay.
Bray, M., Adamson, B., & Mason, M. (2007). Comparative education research – approaches and methods. Hong Kong, China: Springer.
Brown, T. (2006). Confirmatory factor analysis for applied research. New York, NY: Guilford.
Browne, M. W., & Cudeck, R. (1998). Alternative ways of assessing model fit. In K. A. Bollen (Ed.), Testing structural equation models (Sage focus editions, vol. 154, pp. 136–162). Newbury Park, CA: Sage.
Brückner, S., Förster, M., Zlatkin-Troitschanskaia, O., & Walstad, W. B. (in press). Effects of prior economic education, native language, and gender on economic knowledge of first-year students in higher education. A comparative study between Germany and the United States. In O. Zlatkin-Troitschanskaia & R. Shavelson (Eds.), Assessment of competence in higher education [Special issue]. Studies in Higher Education.
Dehon, C., McCathie, A., & Verardi, V. (2010). Uncovering excellence in academic rankings: A closer look at the Shanghai ranking. Scientometrics, 83, 515–524.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39, 1–38.



Finch, H. (2005). The MIMIC model as a method for detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT Likelihood Ratio. Applied Psychological Measurement, 29, 278–295. doi: 10.1177/0146621605275728
Frey, A., Hartig, J., & Rupp, A. A. (2009). Booklet designs in large-scale assessments of student achievement: Theory and practice. Educational Measurement: Issues and Practice, 28, 39–53.
Graham, J. W., Cumsille, P. E., & Elek-Fisk, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Handbook of psychology. Research methods in psychology (Vol. 2, pp. 87–114). Hoboken, NJ: Wiley.
Hambleton, R. K. (2001). The next generation of the ITC test translation and adaptation guidelines. European Journal of Psychological Assessment, 17, 164–172. doi: 10.1027/1015-5759.17.3.164
Hambleton, R. K., Merenda, P., & Spielberger, C. (2005). Adapting educational and psychological tests for cross-cultural assessment (pp. 3–38). Mahwah, NJ: Erlbaum.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73.
Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie, 216, 61–73. doi: 10.1027/0044-3409.216.2.61
Lin, Y.-H., Wilson, M., & Cheng, C.-L. (2013). An investigation of the nature of the influences of item stem and option representation on student responses to a mathematics test. European Journal of Psychology of Education, 28, 1141–1161.
OECD. (2012). Assessment of Higher Education Learning Outcomes. Feasibility Study Report. Volume 1 – Design and Implementation. Retrieved from http://www.oecd.org/edu/skills-beyond-school/AHELOFSReportVolume1.pdf
Owen, A. L. (2012). Student characteristics, behavior, and performance in economics classes. In G. M. Hoyt & K. McGoldrick (Eds.), International handbook on teaching and learning economics (pp. 341–350). Northampton, MA: Edward Elgar.
Rebeck, K., Walstad, W. B., Yamaoka, M., & Asano, T. (2009). An international comparison of university students’ knowledge of economics: Japan and the United States. Bulletin of Yamamura Gakuen College, 20, 13–43.
Steinmetz, H. (2013). Analyzing observed composite differences across groups: Is partial measurement invariance enough? Methodology, 9, 1–12. doi: 10.1027/1614-2241/a000049



Walstad, W. B., Watts, M., & Rebeck, K. (2007). Test of understanding in college economics: Examiner’s manual (4th ed.). New York, NY: National Council on Economic Education.
Weinert, F. E. (2001). Competencies and key competencies: Educational perspective. In N. J. Smelser & P. B. Baltes (Eds.), International encyclopedia of the social and behavioral sciences (Vol. 4, pp. 2433–2436). Amsterdam, The Netherlands: Elsevier.
Woods, C. M. (2009). Evaluation of MIMIC-model methods for DIF testing with comparison to two-group analysis. Multivariate Behavioral Research, 44, 1–27.
Yamaoka, M., Walstad, W. B., Watts, M. W., Asano, T., & Abe, S. (Eds.). (2010). Comparative studies on economic education in Asia-Pacific region. Tokyo, Japan: Shumpusha.
Zlatkin-Troitschanskaia, O., Förster, M., Brückner, S., & Happ, R. (2014). Insights from a German assessment of business and economics competence. In H. Coates (Ed.), Higher education learning outcomes assessment – international perspectives (pp. 175–197). Frankfurt am Main, Germany: Lang.
Zlatkin-Troitschanskaia, O., Förster, M., & Kuhn, C. (2013). Modeling and measurement of university students’ subject-specific competencies in the domain of business & economics – The ILLEV project. In S. Blömeke, O. Zlatkin-Troitschanskaia, C. Kuhn, & J. Fege (Eds.), Modeling and measuring competencies in higher education (pp. 159–170). Rotterdam, The Netherlands: Sense.

Olga Zlatkin-Troitschanskaia
Gutenberg School of Management and Economics
Johannes Gutenberg University
Jakob-Welder-Weg 9
55128 Mainz
Germany
Tel. +49 6131 39-22009
Fax +49 6131 39-22095
E-mail lsTroitschanskaia@uni-mainz.de

Zeitschrift für Psychologie 2015; Vol. 223(1):14–23


Original Article

Modeling the Competencies of Prospective Business and Economics Teachers

Professional Knowledge in Accounting

Kathleen Schnick-Vollmer,1 Stefanie Berger,2 Franziska Bouley,3 Sabine Fritsch,2 Bernhard Schmitz,1 Jürgen Seifried,2 and Eveline Wuttke3

1 Institute for Psychology, University of Darmstadt, Germany, 2 Economic and Business Education II, University of Mannheim, Germany, 3 Economic and Business Education, Goethe University Frankfurt, Germany

Abstract. Despite the important role that teachers’ professional competencies play, domain-specific models of competence as well as established instruments to measure such competencies are lacking (e.g., Blömeke, Zlatkin-Troitschanskaia, Kuhn, & Fege, 2013). For this reason, a domain-specific model of competence and an instrument to measure prospective business and economics teachers’ professional competence in the domain of accounting were developed. This article focuses on the measurement of professional knowledge, which is a key facet of teachers’ professional competence. A corresponding test instrument is introduced and its measurement quality is reported. The test instrument, administered at 24 German universities (N = 1,158), comprises 49 items distributed among different booklets following a multi-matrix design. All items have well-functioning parameter values. In accordance with our hypothesis, a two-dimensional model fits the data best. The reliabilities of .64 (content knowledge) and .64 (pedagogical content knowledge) are satisfactory. Thus, the developed instrument makes it possible to gain a detailed understanding of prospective teachers’ professional knowledge in accounting.

Keywords: higher education, competence models, Item Response Theory, teachers’ professional knowledge, accounting

Professional competencies are seen as key to success in education and working life. Buzzwords such as “human capital” and “global currency of the 21st century” (OECD, 2012, p. 11, 3) underline the significance of the debate. Due to the relevance of professional competencies, it is essential to examine how they are acquired and how they can be supported. This can only be done on the basis of adequate measurement. Against this background, the measurement and modeling of professional competencies has developed into an active research field. In recent decades, research on competencies has emphasized students’ achievement. Large-scale assessments such as PISA¹ and TIMSS² have shown that students frequently lack necessary competencies. As a result, interest in teacher competencies has increased, since their competencies are seen as a key factor for the effectiveness of

classroom teaching and student achievement (cf. Bromme, 2001; Lipowsky, 2006; Shulman, 1986). Despite a growing body of research in this area (e.g., Baumert et al., 2010; Blömeke, Kaiser, & Lehmann, 2010), there is still a considerable lack of empirical results (e.g., Desimone, 2009; Jude & Klieme, 2008). This applies especially to the professional competencies of prospective teachers in business and economics education, since only little empirical evidence exists for the effectiveness of university teacher training and professional development in this domain (Beck, 2005; Kuhn et al., 2014; Zlatkin-Troitschanskaia, Förster, Brückner, Hansen, & Happ, 2013). Given the importance of business and economics teachers’ competencies on the one hand and the lack of research in this field on the other, the aim of the current study is to address this deficit. Thus, an instrument

¹ The OECD’s Programme for International Student Assessment (PISA) conducts triennial surveys of 15-year-old students’ skills in an effort to evaluate educational systems worldwide.
² The Trends in International Mathematics and Science Study (TIMSS) has been conducted by the International Association for the Evaluation of Educational Achievement (IEA) every four years since 1995 and is aimed largely at students in grades 4 and 8.

Zeitschrift für Psychologie 2015; Vol. 223(1):24–30 DOI: 10.1027/2151-2604/a000196

© 2015 Hogrefe Publishing


K. Schnick-Vollmer et al.: Teachers’ Professional Knowledge in Accounting

to measure the professional competence of prospective teachers in business and economics education in the domain of accounting was developed.

Definitions of Competence

Definitions of competence are numerous, and no clear consensus has been reached yet (Blömeke, Gustafsson, & Shavelson, 2015). In the approach of Koeppen, Hartig, Klieme, and Leutner, “competencies are conceptualized as complex ability constructs that are context-specific, trainable, and closely related to real life” (2008, p. 61). Weinert (2001), in his widely known approach, describes the term “complex ability constructs” in more detail and classifies competencies as “those intellectual abilities, content-specific knowledge, cognitive skills, domain-specific strategies, routines and subroutines, motivational tendencies, volitional control systems, personal value orientations, and social behaviors (combined) into a complex system” (Weinert, 2001, p. 51). With regard to teachers’ competencies, professional knowledge is seen as crucial: it is regarded as the most powerful factor for successful teaching (Ball, Thames, & Phelps, 2008; Hill, Ball, & Schilling, 2008) and as having an important impact on student performance (Hattie, 2009; Hill, Rowan, & Ball, 2005). Professional knowledge is typically understood in the sense of the conceptualization by Shulman (1986), who divided it into three facets: content knowledge (CK), pedagogical content knowledge (PCK), and pedagogical knowledge (PK). The differentiation between CK and PCK in particular drew attention and was confirmed empirically (Krauss, Baumert, & Blum, 2008). In the current study, teachers’ competence is defined as sequences of actions functionally related to classroom instruction in accounting, composed of professional knowledge (CK and PCK), beliefs, motivational orientation, and self-regulatory abilities. With regard to actual behavior or performance, competencies are influenced by situational characteristics and teachers’ personal interpretations thereof. 
Therefore, as mentioned in Blömeke et al. (2015), we also consider competence as a horizontal continuum, since different aspects of competence are linked with one another, interact in specific situations, and thus lead to observable behavior. Based on this definition, an instrument to measure the CK and PCK of (prospective) teachers in business and economics education in the domain of accounting was developed.

Professional Knowledge in Accounting

The domain of business accounting is addressed because accounting is considered to be an important subject in business and economics education and to be crucial for the development of economic competence (cf. Seifried, 2012). Also, previous studies in the field of accounting


(e.g., Seifried, Türling, & Wuttke, 2010; Türling, Seifried, Wuttke, Gewiese, & Kästner, 2011; Wuttke & Seifried, 2013) show that prospective teachers lack central aspects of CK as well as PCK.

Content Knowledge

Content knowledge can be viewed as a necessary prerequisite for structuring classroom instruction with a focus on student understanding (cf. Krauss et al., 2008; Neuweg, 2010; Schlump, 2010). In order to ensure the construct validity of CK in accounting and to identify central content areas, the following steps were carried out. Key learning areas in the subject of accounting were identified by analyzing the curriculum (framework) and textbooks and by conducting expert interviews with experienced teachers. Using open content analysis (cf. Mindnich, Berger, & Fritsch, 2013), three main areas of learning content could be identified: (1) purpose, relevance, and legal basis of accounting, (2) double-entry bookkeeping, and (3) procurement and sales, including the system of value-added taxes. Content area (1) includes a basic understanding of bookkeeping tasks and their importance as well as knowledge of basic legal principles and technical terms. Content area (2) involves the bookkeeping system (e.g., posting rules and account types). Finally, content area (3) focuses on the key tasks of a company (buying and selling processes), including the topic of value-added tax.

Pedagogical Content Knowledge

The content knowledge described above represents a necessary but not sufficient prerequisite for successful instruction (e.g., Neuweg, 2010; Schlump, 2010). Thus, in addition to content knowledge, pedagogical content knowledge represents a crucial facet of prospective teachers’ professional competencies. We modeled domain-specific PCK facets that are relevant to the test instrument’s target population and that also consider the unique structure of accounting instruction (Seifried, 2012). 
As a result, we focus on the two facets of pedagogical content knowledge identified by Shulman (1986), that is, knowledge of how to make content accessible to students and knowledge of students’ thinking (Baumert et al., 2010). In addition, we include knowledge of tasks (e.g., evaluating the cognitive potential of tasks), which represents, just as in mathematics (Kunter et al., 2007), a central determinant for stimulating cognitively activating learning processes in accounting. Thus, we divided pedagogical content knowledge for each content area into the following three facets: (1) knowledge of students’ cognition and typical student errors, (2) knowledge of tasks as instructional tools, and (3) knowledge of multiple representations and explanations (cf. Baumert et al., 2010). A more detailed discussion of these facets is provided by Mindnich et al. (2013).


You are planning to introduce the topic “sale of goods” in your accounting class. Since students often have difficulties with the accurate accounting of the value added tax (VAT), you are planning to neglect the VAT at first. However, your mentor points out that (apart from VAT) there typically is one central comprehension difficulty when the topic sale of goods is introduced. Which comprehension difficulty could this be? Please name one difficulty (keywords are sufficient).

Figure 2. Item 7 requires pedagogical content knowledge (knowledge of students’ thinking and typical student errors) and has a high item difficulty.

Figure 1. Three facets of pedagogical content knowledge, three content areas, and two item difficulties form the three-dimensional model of professional knowledge.

Item Difficulty

Numerous studies (Blömeke, Bremerich-Vos, et al., 2013; Blömeke et al., 2010; Winther, 2010) have focused on the level of cognitive ability necessary to respond correctly to an item as the central criterion of item difficulty (Anderson & Krathwohl, 2001). In this sense, three levels have been identified: (1) reproduction, (2) application, and (3) development and evaluation. Additional characteristics that determine item difficulty in accounting include quantitative aspects (i.e., number of accounts and technical terms) and qualitative aspects of the task (e.g., the different types of accounts and required mathematical skills). Thus, two difficulty levels have been established. However, the a priori determined item difficulties could be confirmed only marginally in the pretests. The empirical results of the pretests therefore form the basis for a recategorization of the items regarding the difficulty levels. Including content, facets of PCK, and item difficulty, a three-dimensional model (Figure 1) was developed. This model forms the basis for the test items. Table 1 provides details on a difficult test item (see Figure 2) and an easy one (see Figure 3).

Research Questions

In this article, the measurement quality of the newly developed instrument is reported. To assess the psychometric quality of the instrument, our hypotheses are the following:

Hypothesis (1): Since our conceptualization of the professional knowledge in accounting is based on Shulman’s (1986) framework, we expect a verification of the two-dimensional model of professional knowledge in accounting (PCK and CK).

Hypothesis (2): Regarding investigations on item level, we expect MNSQ values to lie within the acceptable intervals.

Hypothesis (3): The spectrum of the item difficulties matches the spectrum of the person abilities to a large extent.

Table 1. Description of example items

Item 7 (see Figure 2)
Item difficulty (solution rate): High; d = 1.193 (24%).
Aspect (facet): PCK (knowledge of students’ thinking and typical student errors).
Content area: Procurement and sales.
Format: Open-ended.
Requirement: Prospective teachers have to recognize that students might have problems identifying the revenues from sales of goods and understanding that these are recognized in the income statement.
Example for a correct answer: Differentiation between goods and revenues.

Item 37 (see Figure 3)
Item difficulty (solution rate): Low; d = −1.509 (80%).
Aspect (facet): CK.
Content area: Purpose, relevance, and legal basis of accounting.
Format: CMC.
Requirement: To solve the item, the prospective teachers have to reproduce factual knowledge. Furthermore, the content presented in the sample item requires basic knowledge that is sequenced at the beginning of the curriculum.
Example for a correct answer: One point for three correct answers.
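As a rough plausibility check on how an item difficulty d relates to a solution rate, the following minimal sketch evaluates the dichotomous Rasch model for the difficulty reported for Item 7. The function name and the choice of a person ability of 0 (an "average" person on the logit scale) are assumptions made for illustration; the observed solution rates in the study are averages over the whole sample, so only approximate agreement should be expected.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    theta is the person ability and difficulty the item difficulty,
    both on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# Difficulty of Item 7 as reported in Table 1. theta = 0.0 is an assumed
# reference person, not a value taken from the study.
p_item7 = rasch_probability(theta=0.0, difficulty=1.193)
print(round(p_item7, 3))  # prints 0.233
```

Under these assumptions the model-implied probability is close to the reported solution rate of 24% for Item 7, which illustrates why a positive difficulty value goes together with a low solution rate.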


Assign the following objects either to “fixed assets” or to “current assets” of the balance sheet. Please mark one box for each line.

A. Patent for personal use on an invention in manufacturing, value 8,500.00. Fixed assets [X], Current assets [ ]
B. Pending invoice of a major customer of 50,000.00. Fixed assets [ ], Current assets [X]
C. Stock of oak wood for manufacturing, total value of 100,000.00. Fixed assets [ ], Current assets [X]
D. Finished goods in stock with a total value of 15,000.00. Fixed assets [ ], Current assets [X]

Figure 3. Item 37 requires content knowledge (purpose, relevance, and legal basis of accounting) and has a low item difficulty.

Hypothesis (4): There is no differential item functioning (DIF). Thus, the items are equally suitable for both genders.

Method

Participant Population

The entire sample consists of 1,158 prospective teachers (age range 19–46, M = 24.84, SD = 3.51) from 24 of the 28 German universities that offer a teacher training program in business and economics education. Of the participants, 395 were male, 753 were female, and 10 did not state their gender; 590 were students at the bachelor level, 555 were students at the master level, and 13 did not state their level of study. Participation in the study was voluntary and mostly took place during students’ regular classes.

Instrument

The three-dimensional model of CK, PCK, and item difficulty (Figure 1) forms the basis for the development of the test items. A booklet design (Frey, Hartig, & Rupp, 2009) with seven clusters of seven items each was used. Each cluster was paired exactly twice with each other cluster. In order to ensure that all subsections (CK, PCK, item difficulty; Figure 1) contained the same number of items, we had to add one additional item. Thus, the total item pool consists of 49 items. Of these 49 items, each participant had to answer 28. By choosing this design, the content validity of the professional knowledge test could be improved (Bühner, 2004). The answer format was either open-ended, multiple choice, or complex multiple choice (CMC). In this first step, the answers were coded dichotomously. Items that were not answered, except for items missing by design, were considered as not solved (Kleickmann et al., 2013). In addition to the construct and content validity mentioned above, the instrument was validated by comparing German and Austrian participants (Fritsch et al., 2015). This type of validation is referred to as internal validation (Rost, 2004) and in this case is based on the method of known groups (Hattie & Cooksey, 1984). As expected, the Austrian sample performed significantly better than the German sample. This can be explained by the differences in teacher training between the two countries.
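The pairing property of the booklet design can be illustrated with a small sketch. It assumes, hypothetically, seven booklets of four clusters each (4 × 7 = 28 items per participant, consistent with the figures above) generated by cyclically shifting one base block; the base block, the number of booklets, and all names are illustrative choices, not the booklets actually used in the study.

```python
from collections import Counter
from itertools import combinations

N_CLUSTERS = 7
# Illustrative base block: {0, 3, 5, 6} is a cyclic (7, 4, 2) difference set,
# so shifting it modulo 7 yields a balanced incomplete block design.
BASE_BLOCK = (0, 3, 5, 6)


def build_booklets():
    """Return seven booklets, each a sorted tuple of four cluster indices."""
    return [
        tuple(sorted((cluster + shift) % N_CLUSTERS for cluster in BASE_BLOCK))
        for shift in range(N_CLUSTERS)
    ]


def pair_counts(booklets):
    """Count how often each pair of clusters shares a booklet."""
    counts = Counter()
    for booklet in booklets:
        counts.update(combinations(booklet, 2))
    return counts


booklets = build_booklets()
counts = pair_counts(booklets)
for booklet in booklets:
    print(booklet)
print(sorted(set(counts.values())))  # every pair of clusters co-occurs equally often
```

In this hypothetical layout each cluster appears in four booklets and each of the 21 cluster pairs appears together in exactly two booklets, which is the balance property described in the text.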

Procedure

The test was presented as a paper-and-pencil test. Each data collection session lasted approximately 90 min. Of this time, 40 min were needed to complete the CK and PCK items. Afterwards, participants received 20 euros as compensation for their participation.

Results

The data analysis was carried out with R (Version 3.0.3, package TAM; Kiefer, Robitzsch, & Wu, 2014) and ConQuest (Version 3.0.1; Adams, Wu, & Wilson, 2012) on the basis of Item Response Theory. First, the instrument was examined at the test level. Here, the structure of professional knowledge is of special interest. Second, different item parameters were examined.

Results at Test Level

Initially, the dimensionality of professional knowledge was examined using likelihood ratio test (LRT) statistics. In contrast to the AIC and BIC values, the LRT allows not only a relative comparison but also a significance test based on the difference of the deviances (Rost, 2004). Compared with the one-dimensional model, the two-dimensional model has a significantly smaller deviance, χ²(2, N = 1158) = 28.27, p < .001. This is in line with our Hypothesis (1). However, the latent correlation between CK and PCK of .92 is very high. Based on the two-dimensional model, different statistical models were examined. The two-parameter (2PL) model has a smaller, but not significantly smaller, deviance than the dichotomous Rasch model. Thus, in the present study the Rasch model is preferred. This preference is mainly justified by the difference in person parameter estimation. One disadvantage of 2PL models is that persons with the same total score, and hence the same trait level, receive different ability estimates depending on which items were answered correctly. In that case, the total score is no longer a sufficient statistic for the person parameter, and a marginal estimate is computed instead (e.g., Strobl, 2010). The EAP reliabilities of the subscales of the two-dimensional Rasch model are .64 for both PCK and CK. These are satisfactory values for a performance test (Bühner, 2004).
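The significance statement for the deviance difference can be checked with a short sketch. It takes the reported difference of 28.27 with 2 degrees of freedom and evaluates the upper tail of the chi-square distribution using the closed form that holds for even degrees of freedom; the function name is an illustrative choice, and this is a check of the reported figures rather than a reproduction of the authors' software output.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is valid for positive even degrees of freedom only.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(df // 2)
    )


deviance_difference = 28.27  # reported difference between the 1D and 2D models
p_value = chi2_sf_even_df(deviance_difference, df=2)
print(p_value < 0.001)  # prints True
```

For 2 degrees of freedom the tail probability reduces to exp(-x/2), so a difference of 28.27 lies far beyond the .001 threshold reported above.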

Results at Item Level

After the analysis at test level, the individual items were examined more closely. For this purpose, the MNSQ values, item difficulties, and differential item functioning (Osterlind & Everson, 2009) were analyzed.

MNSQ Values

The weighted MNSQ values of all items were in a very good range, 0.93 ≤ MNSQ ≤ 1.08 (Bond & Fox, 2001; Wright & Linacre, 1994). This confirms our Hypothesis (2).

Item Difficulties

In the next step, item difficulties were analyzed by comparing them to the abilities of the participants. The PCK items in particular are characterized by a good fit between item difficulties and person abilities. Only one item (d = 1.840) is very difficult. Regarding the CK items, the range of the item difficulties covers the range of the person abilities well, but not completely in either direction. Here, both easier and more difficult items could have been formulated. Thus, our Hypothesis (3) can be confirmed to a large extent. However, it has to be stated that the variance of the person abilities is very small. Therefore, in a later step the items should be recoded and evaluated on the basis of a partial credit model.

Differential Item Functioning (DIF)

Regarding gender, no DIF was found in the field of PCK. However, the analyses show a significant DIF in the field of CK (men score significantly better than women). This means that the two groups’ performances can be compared only with caution. Thus, our Hypothesis (4) could be confirmed only partly.

Discussion

Our hypotheses could be confirmed to a large extent. An important result is the link to national and international findings regarding the empirical validation of the division of professional knowledge into the two components of content knowledge and pedagogical content knowledge. Thus, previous findings (e.g., Krauss et al., 2008; Shulman, 1986) are valid for the domain of accounting knowledge as well. The range of item difficulties matches the range of the participants’ abilities. The MNSQ values show that the model fits the data well. However, regarding the DIF analyses, only the PCK items are equally suitable for both male and female participants. This may be explained by the high proportion of multiple choice questions within the CK items. In general, answering multiple choice questions is easier for male participants than for female participants (Ben-Shakhar & Sinai, 1991). The items for measuring PCK include considerably fewer multiple choice questions. With reliabilities of .64 for both PCK and CK, a sufficiently well-functioning instrument was developed.

Limitations

In the current study, we developed and optimized an instrument for measuring the professional competence of prospective teachers in business and economics education in the domain of accounting. In the context of this article, the focus was placed on professional knowledge as one important component of teachers’ professional competence. Still, there are methodological limitations associated with our data collection method. The most essential limitation concerns the operationalization of the competence construct. Competence refers to an individual’s ability and motivation to utilize his or her skills; thus competencies are specific to a certain situation. In the current study, however, we assessed different aspects of competence only separately and in hypothetical rather than real situations, and, especially in terms of pedagogical content knowledge, only as hypothetical action tendencies. However, this is a general problem in the measurement of competencies. One possible solution might be the use of Situational Judgment Tests (SJT; e.g., McDaniel, Hartmann, Whetzel, & Grubb, 2007). Still, the use of SJT also has (practical and methodological) limitations. Furthermore, as mentioned above, the items will be recoded in a more differentiated way and then evaluated on the basis of a partial credit model. The purpose is to obtain more variance in the person abilities.
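To make the idea of a partial credit model concrete, the following minimal sketch computes the category probabilities for one item scored 0 to m in the standard formulation of that model. The threshold values and the person ability are purely hypothetical, and the function name is an illustrative choice; nothing here reflects the planned recoding of the actual items.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for scores 0..m under the partial credit model.

    The probability of score k is proportional to
    exp(sum over j <= k of (theta - thresholds[j])), with an empty sum for k = 0.
    """
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical item with three score categories (0, 1, 2).
probabilities = pcm_category_probabilities(theta=0.0, thresholds=[-0.5, 0.5])
print([round(p, 3) for p in probabilities])
```

With a single threshold the function reduces to the dichotomous Rasch model, which is why recoding items into more than two score categories is a natural extension of the analysis described above.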

Outlook and Conclusions

As we were able to demonstrate the methodological quality of the developed instrument, the next step is to assess the professional competencies of (prospective) teachers in business and economics education/accounting in detail. First of all, the participant characteristics that have been collected with the help of the biographical questionnaire will be investigated (cf. Bouley et al., in press; Fritsch et al., 2015). One question will be whether factors at the personal level (especially prior knowledge and educational background) have an influence on professional knowledge. Additionally, we will analyze how the different competence components interact. Interesting questions are how self-regulation and professional knowledge as well as beliefs and professional knowledge are related. Finally, the test results will be compared with students’ self-assessments.

Acknowledgments

This research is part of the research initiative “Modeling and Measuring Competencies in Higher Education” (Kompetenzmodellierung und Kompetenzerfassung im Hochschulsektor, KoKoHs) funded by the German Federal Ministry of Education and Research.

References

Adams, R. J., Wu, M. L., & Wilson, M. R. (2012). ACER ConQuest 3.0. [computer program]. Melbourne, Australia: ACER.
Anderson, L. W., & Krathwohl, D. R. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. New York, NY: Longman.
Ball, D. L., Thames, M. H., & Phelps, G. C. (2008). Content knowledge for teaching: What makes it special? Journal of Teacher Education, 59, 389–407.
Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., . . . Tsai, Y.-M. (2010). Teachers’ mathematical knowledge, cognitive activation in the classroom, and student progress. American Educational Research Journal, 47, 133–180. doi: 10.3102/0002831209345157
Beck, K. (2005). Ergebnisse und Desiderate zur Lehr-Lern-Forschung in der kaufmännischen Berufsausbildung [Findings and desiderata regarding teaching-learning research in commercial training]. Zeitschrift für Berufs- und Wirtschaftspädagogik, 101, 533–556.
Ben-Shakhar, G., & Sinai, Y. (1991). Gender differences in multiple-choice tests: The role of differential guessing tendencies. Journal of Educational Measurement, 28, 23–35.
Blömeke, S., Bremerich-Vos, A., Kaiser, G., Nold, G., Haudeck, H., Keßler, J.-U., & Schwippert, U. (2013). Professionelle Kompetenzen im Studienverlauf: Weitere Ergebnisse zur Deutsch-, Englisch- und Mathematiklehrerausbildung aus TEDS-LT [Professional competencies over the course of studies. Further findings on the training of German, English and Maths teachers from TEDS-LT]. Münster, Germany: Waxmann.
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: Competence viewed as a continuum. Zeitschrift für Psychologie, 223. doi: 10.1027/2151-2604/a000194
Blömeke, S., Kaiser, G., & Lehmann, R. (Eds.). (2010). TEDS-M 2008: Professionelle Kompetenz und Lerngelegenheiten angehender Primarstufenlehrkräfte im internationalen Vergleich [TEDS-M 2008: Professional competence and learning opportunities of prospective elementary school teachers – an international comparison]. Münster, Germany: Waxmann.
Blömeke, S., Zlatkin-Troitschanskaia, O., Kuhn, C., & Fege, J. (2013). Modeling and measuring competencies in higher education: Tasks and challenges. Rotterdam, The Netherlands: Sense.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Bouley, F., Berger, S., Fritsch, S., Wuttke, E., Seifried, J., Schnick-Vollmer, K., & Schmitz, B. (in press). Zum Einfluss von universitären und außeruniversitären Lerngelegenheiten auf das Fachwissen und fachdidaktische Wissen von Studierenden der Wirtschaftspädagogik [The influence of learning opportunities within and outside of university on the subject knowledge and pedagogical knowledge of students of business and economics education]. Zeitschrift für Pädagogik.
Bromme, R. (2001). Teacher expertise. In N. J. Smelser, P. B. Baltes, & F. E. Weinert (Eds.), International encyclopedia of the behavioral sciences: Education (pp. 15459–15465). London, UK: Pergamon.
Bühner, M. (2004). Einführung in die Test- und Fragebogenkonstruktion [An introduction to test and questionnaire construction]. Munich, Germany: Pearson.
Desimone, L. M. (2009). Improving impact studies of teachers’ professional development: Toward better conceptualizations and measures. Educational Researcher, 38, 181–199. doi: 10.3102/0013189X08331140
Frey, A., Hartig, J., & Rupp, A. A. (2009). Booklet designs in large-scale assessments of student achievement: Theory and practice. Educational Measurement: Issues and Practice, 28, 39–53.
Fritsch, S., Berger, S., Seifried, J., Bouley, F., Wuttke, E., Schnick-Vollmer, K., & Schmitz, B. (2015). Measurement of content knowledge and pedagogical content knowledge in business and economics education – a cross-country comparison between Germany and Austria. In O. Zlatkin-Troitschanskaia & R. Shavelson (Eds.), Assessment of domain-specific professional competencies [Special issue]. Empirical Research in Vocational Education and Training. Manuscript submitted for publication.
Hattie, J. (2009). Visible learning. A synthesis of over 800 meta-analyses relating to achievement. London, UK: Routledge.
Hattie, J., & Cooksey, R. W. (1984). Procedures for assessing the validities of tests using the “known-groups” method. Applied Psychological Measurement, 8, 295–305.
Hill, H. C., Ball, D. L., & Schilling, S. G. (2008). Unpacking pedagogical content knowledge: Conceptualizing and measuring teachers’ topic specific knowledge of students. Journal for Research in Mathematics Education, 39, 372–400.
Hill, H. C., Rowan, B., & Ball, D. L. (2005). Effects of teachers’ mathematical knowledge for teaching on student achievement. American Educational Research Journal, 42, 371–406.
Jude, N., & Klieme, E. (2008). Einleitung [Introduction]. In N. Jude, J. Hartig, & E. Klieme (Eds.), Kompetenzerfassung in pädagogischen Handlungsfeldern (pp. 11–15). Berlin, Germany: BMBF.



K. Schnick-Vollmer et al.: Teachers’ Professional Knowledge in Accounting

Kiefer, T., Robitzsch, A., & Wu, M. (2014). TAM: Test-Analysis Modules. Retrieved from http://cran.r-project.org/web/packages/TAM/index.html
Kleickmann, T., Richter, D., Kunter, M., Elsner, J., Besser, M., Krauss, S., & Baumert, J. (2013). Teachers’ content knowledge and pedagogical content knowledge: The role of structural differences in teacher education. Journal of Teacher Education, 64, 90–106.
Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie, 216, 61–73. doi: 10.1027/0044-3409.216.2.61
Krauss, S., Baumert, J., & Blum, W. (2008). Secondary mathematics teachers’ pedagogical content knowledge and content knowledge: Validation of the COACTIV constructs. International Journal on Mathematics Education, 40, 873–892.
Kuhn, C., Happ, R., Zlatkin-Troitschanskaia, O., Beck, K., Förster, M., & Preuße, D. (2014). Kompetenzentwicklung angehender Lehrkräfte im kaufmännisch-verwaltenden Bereich – Erfassung und Zusammenhänge von Fachwissen und fachdidaktischem Wissen [Competency development in prospective teachers of commercial-administrative subjects – assessment of and relationships between professional knowledge and subject-specific pedagogical knowledge]. Zeitschrift für Erziehungswissenschaft, 17, 149–167.
Kunter, M., Klusmann, U., Dubberke, T., Baumert, J., Blum, W., Brunner, M., . . . Tsai, Y.-M. (2007). Linking aspects of teacher competence to their instruction. Results from the COACTIV Project. In M. Prenzel (Ed.), Studies on the educational quality of schools. The final report on the DFG priority programme (pp. 32–52). Münster, Germany: Waxmann.
Lipowsky, F. (2006). Auf den Lehrer kommt es an: Empirische Evidenzen für Zusammenhänge zwischen Lehrerkompetenzen, Lehrerhandeln und dem Lernen der Schüler [It depends on the teacher: Empirical evidence for correlations between teachers’ competencies, teachers’ behavior, and students’ learning]. In C. Allemann-Ghionda & E. Terhart (Eds.), Kompetenzen und Kompetenzentwicklung von Lehrerinnen und Lehrern (pp. 47–70) [Zeitschrift für Pädagogik, Supplement 51]. Weinheim, Germany: Beltz.
McDaniel, M. A., Hartmann, N. S., Whetzel, D. L., & Grubb, W. L. III (2007). Situational judgment tests, response instructions, and validity: A meta-analysis. Personnel Psychology, 60, 63–91.
Mindnich, A., Berger, S., & Fritsch, S. (2013). Modellierung des fachlichen und fachdidaktischen Wissens von Lehrkräften im Rechnungswesen – Überlegungen zur Konstruktion eines Testinstruments [Modelling the professional and subject-specific pedagogical knowledge of accountancy teachers – deliberations on the construction of an assessment tool]. In U. Faßhauer, B. Fürstenau, & E. Wuttke (Eds.), Jahrbuch Berufs- und Wirtschaftspädagogischer Forschung 2013 (pp. 61–72). Opladen, Germany: Budrich.
Neuweg, H. G. (2010). Grundlagen und Dimensionen der Lehrerkompetenz [Foundations and dimensions of teacher competence]. In R. Nickolaus, G. Pätzold, H. Reinisch, & T. Tramm (Eds.), Handbuch der Berufs- und Wirtschaftspädagogik (pp. 26–31). Bad Heilbrunn, Germany: Klinkhardt.
OECD. (2012). Better skills, better jobs, better lives. A strategic approach to skills policies. Paris, France: OECD Publishing. doi: 10.1787/9789264177338-en
Osterlind, S. J., & Everson, H. T. (2009). Differential item functioning. Thousand Oaks, CA: Sage.
Rost, J. (2004). Lehrbuch Testtheorie – Testkonstruktion [Textbook test theory – test construction] (2nd ed.). Bern, Switzerland: Huber.


Schlump, S. (2010). Kompetenzen von Lehrpersonen zur Konstruktion von Lernaufgaben [Teachers’ task construction competencies]. In H. Kiper, W. Meints, S. Peters, & S. Schlump (Eds.), Lernaufgaben und Lernmaterialien im kompetenzorientierten Unterricht (pp. 224–236). Stuttgart, Germany: Kohlhammer.
Seifried, J. (2012). Teachers’ beliefs at vocational schools – an empirical study in Germany. Accounting Education: An International Journal, 21, 489–514.
Seifried, J., Türling, J. M., & Wuttke, E. (2010). Professionelles Lehrerhandeln – Schülerfehler erkennen und für Lernprozesse nutzen [Professional teacher behavior – recognizing student errors and using them to aid learning processes]. In J. Warwas & D. Sembill (Eds.), Schulleitung zwischen Effizienzkriterien und Sinnfragen (pp. 137–156). Baltmannsweiler, Germany: Schneider.
Shulman, L. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15, 4–14.
Strobl, C. (2010). Das Rasch-Modell: Eine verständliche Einführung für Studium und Praxis [The Rasch model: A coherent introduction for students and practitioners]. Munich, Germany: Hampp.
Türling, J. M., Seifried, J., Wuttke, E., Gewiese, A., & Kästner, R. (2011). “Typische” Schülerfehler im Rechnungswesenunterricht: Empirische Befunde einer Interviewstudie [“Typical” student mistakes in accountancy class: Empirical findings from an interview study]. Zeitschrift für Berufs- und Wirtschaftspädagogik, 107, 390–407.
Weinert, F. E. (2001). Concept of competence: A conceptual clarification. In D. S. Rychen & L. H. Salganik (Eds.), Defining and selecting key competencies (pp. 45–65). Göttingen, Germany: Hogrefe.
Winther, E. (2010). Kompetenzmessung in der beruflichen Bildung [Competency assessment in vocational training]. Bielefeld, Germany: Bertelsmann.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.
Wuttke, E., & Seifried, J. (2013). Diagnostic competence of (prospective) teachers in vocational education: An analysis of error identification in accounting lessons. In K. Beck & O. Zlatkin-Troitschanskaia (Eds.), From diagnostics to learning success. Proceedings in vocational education and training (pp. 225–240). Rotterdam, The Netherlands: Sense.
Zlatkin-Troitschanskaia, O., Förster, M., Brückner, S., Hansen, M., & Happ, R. (2013). Modellierung und Erfassung der wirtschaftswissenschaftlichen Fachkompetenz bei Studierenden im deutschen Hochschulbereich [Modelling and assessing professional economics competence in German students]. In O. Zlatkin-Troitschanskaia, R. Nickolaus, & K. Beck (Eds.), Kompetenzmodellierung und Kompetenzmessung bei Studierenden der Wirtschaftswissenschaften und der Ingenieurwissenschaften (pp. 108–133). Landau, Germany: Verlag Empirische Pädagogik.

Kathleen Schnick-Vollmer
Institute for Psychology
University of Darmstadt
Alexanderstraße 10
64283 Darmstadt
Germany
Tel. +49 6151 16-70976
Fax +49 6151 16-4196
E-mail schnick@psychologie.tu-darmstadt.de

© 2015 Hogrefe Publishing


Original Article

The Relationship of Mathematical Competence and Mathematics Anxiety

An Application of Latent State-Trait Theory

Lars Jenßen,1 Simone Dunekacke,1,2 Michael Eid,3 and Sigrid Blömeke1,4

1 Instructional Research, HU Berlin, Germany, 2 Carl von Ossietzky University, Oldenburg, Germany, 3 Methods and Evaluation, FU Berlin, Germany, 4 Centre for Educational Measurement (CEMO), University of Oslo, Norway

Abstract. In educational contexts, it is assumed that mathematical competence can be viewed as a trait. However, studies have yet to examine whether mathematical competence is actually a stable personality characteristic or rather depends on situational factors. Thus, construct validity has not yet been confirmed in this respect. The present study closes this research gap with regard to prospective pre-school teachers when measured across measurement occasions with similar situational characteristics. This study also examines the idea that math anxiety is a relevant negative predictor of mathematical competence. Both research objectives were examined using latent state-trait theory (LST) modeling, which allows for the investigation of occasion-independent and occasion-specific variability over time. The competence and anxiety of n = 354 prospective pre-school teachers were assessed twice across a period of three weeks. Results indicated no occasion-specific effects and moderate negative relations between math anxiety and all mathematical domains. The utility of LST modeling for construct validation and the investigation of complex relationships are discussed.

Keywords: mathematical competence, math anxiety, latent state-trait theory, pre-school teachers

Theoretical Background

Early Education in the Field of Mathematics and the Mathematical Competence of Prospective Pre-School Teachers

In recent years, several studies have shown that pre-school children are able to develop notable mathematical competence and that this competence predicts their later achievement in mathematics at school (e.g., Krajewski & Schneider, 2009). However, this development strongly depends on the quality of the support provided by pre-school teachers (Reynolds, 1995). Therefore, pre-school teachers should be competent at fostering children’s mathematical development (Burchinal et al., 2008; Klibanoff, Levine, Huttenlocher, Vasilyeva, & Hedges, 2006). According to Shulman’s theoretical work (1986), teachers’ competence can be divided into several content- and pedagogy-related facets. Studies that have empirically tested this model have supported its validity with respect to prospective primary school teachers (Blömeke, Kaiser, & Lehmann, 2010). According to these studies, mathematical competence as one content-related model facet consists of several domains (number and operations; quantity and relation; geometry; data, combinatorics, and chance) and processes (problem solving; modeling; communicating; representing; reasoning; patterns and structuring). Given the frequent use of this model in standards, it can be seen as internationally valid (e.g., Common Core State Standards Initiative, 2014) and was also applied by the German Standing Conference of the Ministers of Education (Kultusministerkonferenz [KMK], 2004). It was also validated in analyses of pre-school teacher education curricula and standards for early education in all federal states in Germany (Jenßen et al., 2013) as part of our KomMa study.1

In Germany, up to 95% of pre-school teachers are trained at early-education vocational schools, which accept students with a middle-school or high-school degree depending on the state. A small proportion of these teachers are trained at universities for applied sciences (Metzinger, 2006).

1 KomMa is a joint research project of the Humboldt University of Berlin and the Alice Salomon University of Applied Sciences Berlin. It is funded by the Federal Ministry of Education and Research (FKZ: 01PK11002A) and part of the funding initiative “Modeling and Measuring Competencies in Higher Education (KoKoHs).”


Zeitschrift für Psychologie 2015; Vol. 223(1):31–38 DOI: 10.1027/2151-2604/a000197



L. Jenßen et al.: Mathematical Competence and Mathematics Anxiety

The regular duration of prospective pre-school teachers’ training in Germany averages 3 years. A systematic analysis of all pre-school teacher education curricula within KomMa showed that mathematics could not be considered a regular subject during the training (Jenßen et al., 2013).

Math Anxiety

In the field of mathematical competence, math anxiety has been discussed as a moderately negative predictor of varying relevance for achievement in math (Ma, 1999). Math anxiety is defined as “feelings of tension and anxiety that interfere with the manipulation of mathematical problems in a wide variety of ordinary life and academic situations” (Richardson & Suinn, 1972, p. 551), and it consists of cognitive and affective components (Ashcraft, 2002). Studies have revealed that math anxiety is more common among females (Miller & Bichsel, 2004) and among pre-school teachers (Gresham, 2007). In addition, studies have shown that there are three main consequences: Teachers who show higher levels of math anxiety are less competent in different mathematical domains (e.g., Rayner, Pitsolantis, & Osana, 2009), avoid mathematical situations (e.g., Chinn, 2012), and transfer their own math anxiety to their students (Beckdemir, 2010). Hence, different kinds of interventions, such as systematic desensitization and cognitive restructuring, were assumed to reduce math anxiety in teacher training and were investigated as such (e.g., Hembree, 1990). However, nothing is known about whether the relationship between mathematical competence and math anxiety occurs on a generalized level or on a situation-specific level.

Latent State-Trait Theory

Studies that have examined the relationship between mathematical competence and math anxiety have indicated that both constructs can be seen as traits, meaning that these constructs are stable and consistent over time and can be generalized across various situations (Liebert & Liebert, 1998). For example, Weinert (2001) defined competence as the ability to successfully master problems in variable situations. In addition, other evidence has suggested that mathematical competence is a stable personality characteristic (Aunola, Leskinen, Lerkkanen, & Nurmi, 2004). According to Klieme, Hartig, and Rauch (2008, p. 5), competence as a trait provides the chance to examine competence characteristics in larger groups of persons because interindividual differences in achievement are assumed to be caused only by the trait (dispositionism, Epstein, 1984). However, Mischel (1968) assumed that situation-specific influences may also affect the measurement of constructs (situationism). This implies that these constructs have unstable and specific portions (occasion-specifics) that are sensitive to situations.

Steyer, Schmitt, and Eid (1999) pointed out that “measurement does not take place in a situational vacuum” (p. 389). There may be different situational effects that influence measurement (cf. Anastasi, 1983) with the result that measures may differ due to the situational specificity of the measurement occasion. “The term ‘situation’ refers to the unobservable psychological conditions that might be relevant for the measurement of the construct considered” (Steyer et al., 1999, p. 394). With regard to the present study, such unobservable conditions might be fatigue, attitudes toward mathematics, or psychophysiological parameters. Observable situational factors (e.g., the composition of the class or material that may prime math anxiety) are purposefully not varied in such studies but are rather held constant. Steyer, Partchev, Seiß, Menz, and Hübner (2000) suggested that achievement scores also contain significant amounts of occasion-specific variance. From a theoretical point of view, it is assumed that anxiety consists of both a trait and a state component (Spielberger, 1972). Furthermore, math anxiety is seen as a reaction to situational aspects (Ashcraft, 2002). Math anxiety measured as a trait was correlated with general trait anxiety and state anxiety. When the situation was systematically manipulated, occasion-specific effects occurred (Goetz, Bieg, Lüdtke, Pekrun, & Hall, 2013; Hembree, 1990).

The Latent State-Trait Theory (LST) allows (1) for the investigation of dispositional differences among persons (“traits”) and (2) for them to be separated from occasion-specific effects as well as from effects of interactions of traits and occasions on obtained scores (Geiser & Lockhart, 2012; Steyer et al., 1999). LST examines these traits and occasion-specifics with regard to variability, meaning that these two concepts can be attributed to interindividual differences. Traits develop across the life span (i.e., characteristics of a person can increase or decrease). Thus, developmental processes do not represent occasion-specifics.
The LST has already been applied in numerous fields in psychology (Geiser & Lockhart, 2012), but it has rarely been applied in educational research (see Eid & Hoffmann, 1998, for its only previous application in educational research concerning students’ interest in the topic of radioactivity). To apply the LST, at least two occasions and at least two test halves (Steyer et al., 1999) are required to investigate trait and occasion-specific effects (the latter including purely occasion-specific effects and the interaction of trait and occasion). The basic assumption according to Eid and Diener (2004) is that the achievement of an individual (state variable S) can be decomposed into an occasion-unspecific variable (trait variable T) and an occasion-specific deviation variable (OS). Considering the measurement error E of an observed variable Y, it is assumed that Y can be decomposed into S and E. With respect to different occasions and different traits measured at each occasion, the following decomposition follows:

$$Y_{ik} = T_i + OS_k + E_{ik}$$

where k refers to the kth occasion of measurement and i refers to the ith trait measured at occasion k. Eid and Diener (1999) referred to this model as the Multistate-Multitrait (MSMT) model. This model is one of the most popular models in LST research. It allows a researcher to investigate the different variance components of traits, occasion-specifics, and measurement error and is thus based on classical test theory (see Figure 1). MSMT models do not need additional method factors because of their indicator-specific trait factors (Geiser & Lockhart, 2012). LST modeling takes place in the framework of structural equation modeling. Modeling in terms of LST requires strong factorial invariance (Meredith, 1993), which means that the factor structure, factor loadings, and intercepts are not allowed to differ significantly across measurement occasions (Geiser et al., 2014). The basic assumption of an MSMT model is: The occasion-specific variable $OS_k$ is common to all traits i measured at occasion k. This means that $OS_k$ accounts for the effects of situational aspects and the interaction of trait and occasion k at the same time. The next assumption is that all occasion-specific variables are uncorrelated. Consequently, it is assumed that the latent trait variables $T_i$ explain stability across occasions and are therefore not indexed with k. Whether an observed variable (indicator) contains significant amounts of occasion-specific or trait variance is in the end an empirical question of model fit.2
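The decomposition above can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions: the variance values, the function names, and the use of normally distributed components are hypothetical and do not come from the study. It simulates one indicator of one trait at two occasions and shows that, because the occasion-specific variables and the errors are uncorrelated across occasions, the covariance between the two occasions approximates the trait variance. The terms "consistency" and "occasion specificity" are the usual LST names for the trait share and the occasion-specific share of the total variance. Separating occasion-specific variance from error variance in real data additionally requires at least two indicators per occasion, as noted in the text.

```python
import random


def simulate_two_occasions(n=20000, var_t=1.3, var_os=0.2, var_e=0.7, seed=1):
    """Simulate one indicator of a single trait at two occasions.

    Each score follows Y = T + OS + E with mutually independent components.
    All variance values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    y1, y2 = [], []
    for _ in range(n):
        t = rng.gauss(0.0, var_t ** 0.5)  # stable trait, identical at both occasions
        # Fresh occasion-specific deviation and measurement error at each occasion.
        y1.append(t + rng.gauss(0.0, var_os ** 0.5) + rng.gauss(0.0, var_e ** 0.5))
        y2.append(t + rng.gauss(0.0, var_os ** 0.5) + rng.gauss(0.0, var_e ** 0.5))
    return y1, y2


def covariance(x, y):
    """Unbiased sample covariance; covariance(x, x) is the sample variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def lst_coefficients(var_t, var_os, var_e):
    """Return (consistency, occasion specificity, reliability) of one indicator."""
    total = var_t + var_os + var_e
    consistency = var_t / total
    specificity = var_os / total
    return consistency, specificity, consistency + specificity


y1, y2 = simulate_two_occasions()
print(round(covariance(y1, y2), 2))  # close to var_t: only the trait is shared
print(round(covariance(y1, y1), 2))  # close to var_t + var_os + var_e
print(lst_coefficients(1.3, 0.2, 0.7))
```

Setting `var_os` to zero in this sketch corresponds to the case of a pure trait measure, in which reliability and consistency coincide.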

Research Questions

The present study examined the relationship between mathematical competence and math anxiety in prospective pre-school teachers. The state of research (e.g., Ma, 1999) suggests a moderate negative correlation between these two constructs. However, almost nothing is known about whether this negative relationship occurs on a stable, generalized level or on an occasion-specific level. The present study also applied LST to enhance our understanding of the nature of mathematical competence in prospective pre-school teachers. From a psychometric point of view, the study represents a construct validation of the competence test because theoretical assumptions and practical procedures in competence research follow from the idea that mathematical competence can be viewed as a trait. Thus, the purpose of this study was to close both research gaps with respect to prospective pre-school teachers by applying the well-established LST method. The main research questions were therefore, (1) Does this mathematical competence test actually measure mathematical competence as a trait as intended? A confirmation would mean that this measure should be influenced only by personal characteristics and not by natural variations; and (2) What is the relationship between mathematical competence and math anxiety with respect to LST? More precisely, this research question asks whether correlational relationships exist on only the trait level, on only the occasion-specific level, or on both levels.


Figure 1. Multistate-multitrait model.

Method

Participants

Three hundred fifty-four prospective pre-school teachers were assessed twice within a time frame of two to three weeks. They came from 16 classes belonging to five vocational schools in the greater areas of Berlin and Bremen/Lower Saxony. The participants’ mean age was M = 22.9 years (SD = 2.1 years). About 83% of the participants were female. The participants differed in their years of training: 41.5% were tested during their first year of training, 33% during their second year of training, and 25.5% during their third year of training. 17.5% of the participants had at least one missing value on one variable.

Instruments

Mathematical competence was measured with a test developed in the KomMa project. This paper-pencil test consists of 24 items combining the mathematical domains and processes described above. Most of the items are presented in a multiple-choice format and some in an open-response format. The interrater reliability (Kappa coefficient) for open-response items was between .95 and .99. All items were coded dichotomously (right/wrong). Results of the pilot study confirmed factorial validity in that the test was able to distinguish between the four domains as hypothesized with six items for each domain. The reliability of each dimension/domain as measured by Cronbach’s alpha ranged from α = 0.80 to 0.86. The content validity of the test was also confirmed through a systematic expert review process (Jenßen et al., 2013).

The Mathematics Anxiety Scale-Revised (MAS-R; Bai, Wang, Pan, & Frey, 2009) was used to examine math anxiety. The questionnaire contains 14 items of which six are positive statements, for example “I find math interesting,” and eight are negative statements, for example “Mathematics makes me feel nervous.” Participants provide their answers on a five-point Likert scale ranging from “totally agree” to “totally disagree.” The positive statements have to be reverse scored so that a high score indicates high anxiety. The questionnaire has satisfactory reliability (Cronbach’s α = 0.87) and is considered valid (Bai, 2011).

2 More details about LST and its methodological implications can be found in Eid and Diener (2004).

Table 1. Raw scores of the Mathematical Competence and Anxiety (Sub-)Scales

                                                 Occasion 1        Occasion 2
Measurement                                      M       SD        M       SD
Mathematical Competence (MC)                     11.39   4.27      10.93   4.40
Number and Operations (NO)                       3.29    1.42      3.06    1.44
Geometry (GE)                                    2.74    1.50      2.69    1.53
Quantity and Relation (QR)                       2.65    1.19      2.53    1.17
Data, Combinatorics, and Chance (DC)             2.77    1.33      2.60    1.34
Mathematics Anxiety Scale – Revised (MAS-R)      43.77   10.71     43.19   10.18
Positively phrased subscale (MAp)                19.86   5.32      19.46   5.26
Negatively phrased subscale (MAn)                23.91   6.71      23.73   6.25
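The scoring rule described for the MAS-R (reverse-score the six positive statements so that a high total indicates high anxiety) can be sketched as follows. This is a hedged illustration only: the function name, the numeric coding (1 = “totally disagree” to 5 = “totally agree”), and the assignment of item numbers to the positive subscale are assumptions, since the text does not give the item key.

```python
def score_mas_r(responses, positive_items, scale_max=5):
    """Score a 14-item anxiety questionnaire with reverse-keyed positive items.

    responses: dict mapping item number -> rating in 1..scale_max, where a
        higher rating means stronger agreement (assumed coding).
    positive_items: item numbers of the positively phrased statements
        (hypothetical here; the real item key is not given in the text).
    """
    positive, negative = 0, 0
    for item, rating in responses.items():
        if not 1 <= rating <= scale_max:
            raise ValueError(f"rating out of range for item {item}: {rating}")
        if item in positive_items:
            # Reverse-score: agreeing with a positive statement means low anxiety.
            positive += scale_max + 1 - rating
        else:
            negative += rating
    return {"MAp": positive, "MAn": negative, "MAS-R": positive + negative}


POSITIVE_ITEMS = {1, 2, 3, 4, 5, 6}  # hypothetical key: six positive, eight negative
most_anxious = {i: (1 if i in POSITIVE_ITEMS else 5) for i in range(1, 15)}
least_anxious = {i: (5 if i in POSITIVE_ITEMS else 1) for i in range(1, 15)}
print(score_mas_r(most_anxious, POSITIVE_ITEMS))   # {'MAp': 30, 'MAn': 40, 'MAS-R': 70}
print(score_mas_r(least_anxious, POSITIVE_ITEMS))  # {'MAp': 6, 'MAn': 8, 'MAS-R': 14}
```

With six positive and eight negative items on a five-point scale, the two extreme response patterns give subscale scores of 6 to 30 and 8 to 40 and a total of 14 to 70.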

Procedure

The assessment took place during regular instruction times and at the same time for each class. First, participants completed the MAS-R. Second, to avoid priming effects, participants completed other instruments (e.g., about their mathematical pedagogical content knowledge or their self-efficacy) before they began working on the mathematics test. This procedure was similar at occasion 1 and occasion 2.

Data Analysis

A series of structural equation models was applied to examine the research questions. First, an MSMT model of mathematical competence was estimated. Four indicators were used to represent the four subdimensions of the test. Second, an MSMT model of math anxiety was estimated. Two indicators were built: one for the negative items and one for the positive items. For each model, strong factorial invariance was examined, and occasion-specific factors were tested. In a third step, we examined the complex relationship of math anxiety and mathematical competence in an integrated LST model. All statistical analyses were computed with the Mplus 5.2 software package and took the clustered data structure into account (MLR estimator; TYPE = complex) with the 16 classes representing the second level (Muthén & Muthén, 2007). Missing data were handled by using the FIML procedure.

Results

Raw Scores

The raw scores for each (sub-)trait and each occasion are reported in Table 1. The mathematical competence score had a potential range of 0–24, and the scores from each mathematical dimension could range from 0 to 6. The Mathematics Anxiety Scale – Revised score could range from 14 to 70. The positively phrased math anxiety score could range from 6 to 30 and the negatively phrased math anxiety score could range from 8 to 40.

Mathematical Competence

Four trait variables were specified according to the four mathematical domains representing mathematical competence. Each trait variable was specified by two indicators (one indicator per occasion), and each indicator was an item parcel that was formed by summing six items. We first examined measurement invariance across occasions. Results indicated that all factor loadings could be restricted to 1. However, the intercept of the indicator “number and operations” at occasion 2 was 0.14 and significantly different from 0 (p < .05); all other intercepts were fixed to zero. A first estimation of the basic MSMT model revealed an acceptable model fit (χ²(19) = 30.46, p = .046, RMSEA = 0.042 [0.01; 0.07], SRMR = 0.02, CFI = 0.99) but also negative variances of the occasion-specific variables. Therefore, these variances were fixed to zero, and the model was re-estimated. The modified MSMT model for mathematical competence fit the data well, χ²(21) = 30.93, p = .07, χ²/df = 1.5, RMSEA = 0.037 [0.00; 0.06], SRMR = 0.03, CFI = 0.99. The estimated parameters of this model are presented in Table 2. The correlations between the four subdimensions ranged from .6 (quantity and relation with data, combinatorics, and chance) to .8 (number and operations with geometry). All correlations were significant (p < .001). The proportion of variance in the indicators that was explained by the trait variables ranged from .66 (number and operations at occasion 2) to .74 (data, combinatorics, and chance at occasion 2).

Math Anxiety

Two trait variables were specified: one for the negatively phrased math anxiety items and one for the positively phrased math anxiety items. Just as for the mathematical competence model, each trait variable was specified by two indicators that represented parcels that were formed by summing the corresponding items. Again, measurement invariance was tested first. All factor loadings could be restricted to 1. The intercept of the negative indicator was fixed to zero, and the positive indicator at occasion 2 had a significant intercept of 0.07 (p < .05). A basic MSMT model was tested. The model fit was good (χ²(2) = 2.24, p = .3258, RMSEA = 0.02 [0.00; 0.11], SRMR = 0.01, CFI = 1.00). Since the data indicated negative or nonsignificant occasion-specific variances, we estimated a modified MSMT model with occasion-specific variances fixed to 0; this model fit the data well (χ²(4) = 3.65, p = .894, χ²/df = 0.9, RMSEA = 0.00 [0.00; 0.08], SRMR = 0.01, CFI = 1.00). The estimated model parameters are presented in Table 2. The correlation between the positive math anxiety statements in their reversed version and the negative ones was 0.67 (p < .001). Thus, the two indicators accounted for different facets of math anxiety. The proportion of variance in the indicators that was explained by the trait variables ranged from 0.83 (negative indicator at occasion 1) to 0.88 (positive indicator at occasion 1).

Table 2. Estimated model parameters

                                   Var(T)          Mean(T)         Var(OS1)    Var(OS2)    Var(e1)        Var(e2)
Positively phrased math anxiety    24.60* (2.35)   19.89* (0.31)   0 (fixed)   0 (fixed)   3.47* (0.68)   3.55* (0.62)
Negatively phrased math anxiety    36.29* (2.38)   23.87* (0.27)   0 (fixed)   0 (fixed)   7.58* (1.31)   5.78* (1.63)
Number and operations              1.32* (0.16)    3.24* (0.12)    0 (fixed)   0 (fixed)   0.73* (0.08)   0.68* (0.09)
Geometry                           1.64* (0.16)    2.69* (0.13)    0 (fixed)   0 (fixed)   0.61* (0.06)   0.65* (0.07)
Quantity and relation              0.98* (0.13)    2.56* (0.08)    0 (fixed)   0 (fixed)   0.42* (0.06)   0.42* (0.04)
Data, combinatorics, and chance    1.29* (0.17)    2.62* (0.09)    0 (fixed)   0 (fixed)   0.52* (0.06)   0.45* (0.05)

Notes. *p < .001, T = indicator-specific trait variance; OSk = occasion-specific variance at occasion k; ei = residual variable of indicator i, standard errors in parentheses.
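The explained-variance figures quoted in the text can be checked against the estimates in Table 2. The sketch below rests on assumptions that should be read as such: all loadings equal 1 and the occasion-specific variances are 0 (both stated in the text), and Var(e1) and Var(e2) are taken to belong to the indicators at occasions 1 and 2. Under these assumptions the explained share of an indicator's variance is Var(T) / (Var(T) + Var(e)). The dictionary and function names are illustrative only.

```python
# (Var(T), Var(e1), Var(e2)) for each indicator-specific trait, copied from Table 2.
TABLE_2 = {
    "MAp": (24.60, 3.47, 3.55),
    "MAn": (36.29, 7.58, 5.78),
    "NO": (1.32, 0.73, 0.68),
    "GE": (1.64, 0.61, 0.65),
    "QR": (0.98, 0.42, 0.42),
    "DC": (1.29, 0.52, 0.45),
}


def explained_variance(var_trait, var_error, var_os=0.0):
    """Share of indicator variance not due to measurement error (loadings of 1)."""
    systematic = var_trait + var_os
    return systematic / (systematic + var_error)


for name, (var_t, var_e1, var_e2) in TABLE_2.items():
    occasion_1 = explained_variance(var_t, var_e1)
    occasion_2 = explained_variance(var_t, var_e2)
    print(f"{name}: occasion 1 = {occasion_1:.2f}, occasion 2 = {occasion_2:.2f}")
```

For example, number and operations at occasion 2 gives 1.32 / (1.32 + 0.68) = 0.66, and the negatively phrased anxiety indicator at occasion 1 gives 36.29 / (36.29 + 7.58), which rounds to 0.83.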

Final Model Representing the Complex Relationship of Mathematical Competence and Math Anxiety

To model the complex relationship between the different domains of mathematical competence and the different subdimensions of math anxiety, the two models were integrated into an overall MSMT model. This MSMT model was fit to the data, χ²(49) = 74.46, p = .011, χ²/df = 1.5, RMSEA = 0.04 [0.02; 0.06], SRMR = 0.03, CFI = 0.99. Considering the complexity of the model, the fit could be regarded as acceptable, although the deviance in the chi-square value was still significant (Schermelleh-Engel, Moosbrugger, & Müller, 2003). The correlations between mathematical competence and math anxiety are presented in Table 3. All correlations between the latent trait variables were significant, negative, and moderate in size. The highest correlation was found between geometry and the positively phrased trait of math anxiety. The lowest correlation was found between the negatively phrased trait of math anxiety and data, combinatorics, and chance. The positive math anxiety statements appeared to be more strongly related to mathematical domains than the negative ones. Also, number and operations as well as geometry appeared to be more strongly associated with math anxiety than the other domains. Some of the differences in the correlations were significant (p < .05).
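Table 3 reports variances on the diagonal, covariances below it, and correlations above it. The relation between the three is the standard one: a correlation is the covariance divided by the square root of the product of the two variances. The following minimal sketch (function name chosen for illustration) recomputes two of the reported latent correlations from the printed variances and covariances.

```python
import math


def correlation(cov_xy, var_x, var_y):
    """Convert a covariance into a correlation using the two variances."""
    return cov_xy / math.sqrt(var_x * var_y)


# Number and operations with geometry: covariance 1.18, variances 1.32 and 1.65.
print(round(correlation(1.18, 1.32, 1.65), 2))     # 0.8 (reported as 0.80)
# Positively with negatively phrased math anxiety: covariance 19.97.
print(round(correlation(19.97, 24.58, 36.34), 2))  # 0.67
```

Small deviations in the second decimal place can arise for other cells because the printed variances and covariances are themselves rounded.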

Table 3. Variances (diagonal), covariances (lower triangular matrix), and correlations (upper triangular matrix) for the latent occasion-specific and latent trait variables

        NO      GE      QR      DC      OSMC1   OSMC2   MAp     MAn     OSMA1   OSMA2
NO      1.32    0.80    0.75    0.67    0.00*   0.00*   0.36    0.30    0.00*   0.00*
GE      1.18    1.65    0.71    0.60    0.00*   0.00*   0.38    0.34    0.00*   0.00*
QR      0.85    0.90    0.98    0.60    0.00*   0.00*   0.30    0.27    0.00*   0.00*
DC      0.87    0.89    0.67    1.29    0.00*   0.00*   0.32    0.24    0.00*   0.00*
OSMC1   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*
OSMC2   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*
MAp     2.05    2.41    1.50    1.81    0.00*   0.00*   24.58   0.67    0.00*   0.00*
MAn     2.10    2.61    1.60    1.61    0.00*   0.00*   19.97   36.34   0.00*   0.00*
OSMA1   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*
OSMA2   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*   0.00*

Notes. *Fixed; all correlations were significant (p < .001); NO = Number and Operations; GE = Geometry; QR = Quantity and Relation; DC = Data, Combinatorics, and Chance; MAp = positively phrased math anxiety; MAn = negatively phrased math anxiety; OSMCk = Occasion-Specific variable for mathematical indicators at occasion k; OSMAk = Occasion-Specific variable for math anxiety indicators at occasion k.

Summary and Discussion

The aim of the present study was to shed light on the relationship between mathematical competence and math anxiety in prospective pre-school teachers. In addition, the present study was intended to demonstrate the usefulness of LST. It can be used for the construct validation of competence tests or to examine complex relationships. It is possible to identify indicators representing a trait rather than occasion-specific constructs and vice versa with the help of LST. Therefore, LST can also be used in test construction. In the present study, the data were in line with the theoretical assumptions about mathematical competence as a trait and thereby supported the construct validity of the KomMa test. Math anxiety as assessed here also seems to be a trait. The results basically indicate that prospective pre-school teachers differ interindividually in their trait characteristics. These differences in mathematical competence and math anxiety cannot be attributed to situational factors but rather to stable person characteristics (Klieme et al., 2008, p. 5). The results of the present study imply that, as long as the contextual conditions are quite similar across situations and not too much time goes by, no occasion-specific variability can be found in everyday settings in math anxiety or in mathematical competence. Future research is needed to determine whether this result can be generalized across other settings, in particular to those with varying conditions. From the point of view of LST, the results of the present study on mathematics competence resemble the results of Danner, Hagemann, Schankin, Hager, and Funke’s (2011) study in which LST modeling was applied to different intelligence measures. In their study, significant occasion-specific variability was not found either. One might therefore cautiously hypothesize that, in general, cognitive constructs do not comprise situational occasion-specific variance (Steyer et al., 2000). Since math anxiety also seems to be a trait if assessed with the MAS-R, it might be suggested that math anxiety has characteristics that are similar to the stable anxiety schema in anxiety disorders (Ashcraft, 2002). However, it might again be the case that occasion-specific variance in math anxiety occurs when situations are systematically varied (Goetz et al., 2013). Similar to Ma’s (1999) meta-analysis, our data revealed a moderate negative relation between math anxiety and mathematical competence. These relations existed in all mathematical domains. In addition, our study was the first to present evidence that math anxiety is a substantial phenomenon in prospective pre-school teachers in Germany. Our examination of measurement invariance showed a slight but significant decline in competence in “number and operations” at occasion 2. This result might be due to a lack of motivation to work on the tests at occasion 2. Studies have indicated that motivational aspects play an important role in the complexity of mathematical competence and math anxiety (Zakaria & Nordin, 2008). Simultaneously, math anxiety declined significantly at occasion 2 for the positively phrased indicator. A possible explanation might be that participants were less anxious about mathematics because of their experiences at occasion 1 (Beckdemir, 2010).

A limitation of the present study is its representativeness. The sample did not reflect the full heterogeneity of prospective pre-school teachers’ training because it did not capture the full range of years of training and did not represent prospective pre-school teachers from federal states other than Berlin and Bremen/Lower Saxony. Thus, the generalizability of our findings is limited to the groups represented in our sample. Future research should examine whether the same findings apply to other groups of prospective pre-school teachers.

Conclusions

If the results of the present study can be replicated and a stable relationship between math anxiety and mathematical competence can in fact be supported, interventions that are designed to reduce math anxiety in order to minimize its negative effects on achievement should focus on the stable facets of math anxiety. From research on anxiety disorders and the general theoretical assumptions about anxiety, it is well known that cognitions play a role in shaping the stable part of anxiety (e.g., Morris, Davis, & Hutchings, 1981). Therefore, attempting to modify the cognitive parts of the math anxiety schema (e.g., cognitive restructuring) might be a worthwhile approach. Hembree (1990) showed the effectiveness of such an intervention in comparison with other interventions that focused on situational aspects such as relaxation training. Further research is needed on the effects of the mathematical competence and math anxiety of pre-school teachers on their interactions with children and the development of children’s mathematics achievement during pre-school. Possible research questions are whether math anxiety is transmitted from teachers to children or whether pre-school teachers avoid mathematical situations when they are highly anxious. Furthermore, pedagogical content knowledge and pedagogical knowledge are also important facets of pre-school teachers’ professional competence. However, little is known about how they are related to math anxiety, and therefore, further research is needed.

References

Anastasi, A. (1983). Traits, states, and situations: A comprehensive view. In H. Wainer & S. Messick (Eds.), Principles of modern psychological measurement (pp. 345–356). Hillsdale, NJ: Erlbaum.
Ashcraft, M. H. (2002). Math anxiety: Personal, educational, and cognitive consequences. Current Directions in Psychological Science, 11, 181–185.
Aunola, K., Leskinen, E., Lerkkanen, M.-K., & Nurmi, J.-E. (2004). Developmental dynamics of math performance from preschool to grade 2. Journal of Educational Psychology, 96, 699–713.
Bai, H. (2011). Cross-validating a Bidimensional Mathematics Anxiety Scale. Assessment, 18, 115–122.
Bai, H., Wang, L. S., Pan, W., & Frey, M. (2009). Measuring mathematics anxiety: Psychometric analysis of a Bidimensional Affective Scale. Journal of Instructional Psychology, 36, 185–193.
Beckdemir, M. (2010). The pre-service teachers’ mathematics anxiety related to depth of negative experiences in mathematics classroom while they were students. Educational Studies in Mathematics, 75, 311–328.
Blömeke, S., Kaiser, G., & Lehmann, R. (Eds.). (2010). TEDS-M 2008. Professionelle Kompetenz und Lerngelegenheiten angehender Primarstufenlehrkräfte im internationalen Vergleich [TEDS-M 2008: Professional competence and learning opportunities of prospective elementary school teachers – an international comparison]. Münster, Germany: Waxmann.
Burchinal, M., Howes, C., Pianta, R., Bryant, D., Early, D., Clifford, R., & Barbarin, O. (2008). Predicting child outcomes at the end of kindergarten from the quality of pre-kindergarten teacher-child interactions and instructions. Applied Developmental Science, 12, 140–153.
Chinn, S. (2012). Beliefs, anxiety, and avoiding failure in mathematics. Child Development Research, 1–8. doi: 10.1155/2012/396071
Common Core State Standards Initiative. (2014). Common Core State Standards for Mathematics. Retrieved from http://www.corestandards.org/wp-content/uploads/Math_Standards.pdf
Danner, D., Hagemann, D., Schankin, A., Hager, M., & Funke, J. (2011). Beyond IQ: A latent state-trait analysis of general intelligence, dynamic decision making, and implicit learning. Intelligence, 39, 323–334.
Eid, M., & Diener, E. (1999). Intraindividual variability in affect: Reliability, validity, and personality correlates. Journal of Personality and Social Psychology, 76, 662–676.
Eid, M., & Diener, E. (2004). Global judgments of subjective well-being: Situational variability and long-term stability. Social Indicators Research, 65, 245–277.
Eid, M., & Hoffmann, L. (1998). Measuring variability and change with an item response model for polytomous variables. Journal of Educational and Behavioral Statistics, 23, 193–215.
Epstein, S. (1984). The stability of behavior across time and situations. In R. Zucker, J. Aronoff, & A. I. Rabin (Eds.), Personality and the prediction of behavior (pp. 209–268). San Diego, CA: Academic Press.
Geiser, C., Keller, B. T., Lockhart, G., Eid, M., Cole, D. A., & Koch, T. (2014). Distinguishing state variability from trait change in longitudinal data: The role of measurement (non)invariance in latent state-trait analyses. Behavior Research Methods. Advance online publication. doi: 10.3758/s13428-014-0457-z
Geiser, C., & Lockhart, G. (2012). A comparison of four approaches to account for method effects in Latent State-Trait Analyses. Psychological Methods, 17, 255–283.
Goetz, T., Bieg, M., Lüdtke, O., Pekrun, R., & Hall, N. C. (2013). Do girls really experience more anxiety in mathematics? Psychological Science, 24, 2079–2087. doi: 10.1177/0956797613486989
Gresham, G. (2007). A study of mathematics anxiety in preservice teachers. Early Childhood Education Journal, 35, 181–188.
Hembree, R. (1990). The nature, effects, and relief of mathematics anxiety. Journal for Research in Mathematics Education, 21, 33–46.
Jenßen, L., Dunekacke, S., Baack, W., Tengler, M., Wedekind, H., Grassmann, M., & Blömeke, S. (2013, August). Validating an assessment of pre-school teachers’ mathematical knowledge. Paper presented at the 37th Conference of the Group for the Psychology of Mathematics Education in Kiel, Germany.
Klibanoff, R. S., Levine, S. C., Huttenlocher, J., Vasilyeva, M., & Hedges, L. V. (2006). Preschool children’s mathematical knowledge: The effect of teacher “Math Talk”. Developmental Psychology, 42, 56–69.
Klieme, E., Hartig, J., & Rauch, D. (2008). The concept of competence in educational contexts. In J. Hartig, E. Klieme, & D. Leutner (Eds.), Assessment of competencies in educational contexts: State of the art and future prospects (pp. 3–22). Göttingen, Germany: Hogrefe.
Krajewski, K., & Schneider, W. (2009). Early development of quantity to number-word linkage as a precursor of mathematical school achievement and mathematical difficulties: Findings from a four-year longitudinal study. Learning and Instruction, 19, 513–526.
Kultusministerkonferenz (KMK). (2004). Bildungsstandards im Fach Mathematik für den Primarbereich: Beschluss vom 15.10.2004 [Educational standards in mathematics in primary school education: Resolution passed October 15, 2004]. Munich, Germany: Wolters Kluwer.
Liebert, R. M., & Liebert, L. L. (1998). Personality strategies and issues. Pacific Grove, CA: Brooks.
Ma, X. (1999). A meta-analysis of the relationship between anxiety toward mathematics and achievement in mathematics. Journal for Research in Mathematics Education, 30, 520–540.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543.
Metzinger, A. (2006). Geschichte der Erzieherinnenausbildung als Frauenberuf [History of kindergarten teacher training as a female profession]. In L. Fried & S. Roux (Eds.), Pädagogik der frühen Kindheit. Handbuch und Nachschlagewerk (pp. 348–358). Weinheim, Germany: Beltz.
Miller, H., & Bichsel, J. (2004). Anxiety, working memory, gender, and math performance. Personality and Individual Differences, 37, 591–606.
Mischel, W. (1968). Personality and assessment. New York, NY: Wiley.
Morris, L. W., Davis, M. A., & Hutchings, C. H. (1981). Cognitive and emotional components of anxiety: Literature review and a revised worry–emotionality scale. Journal of Educational Psychology, 73, 541–555.
Muthén, L. K., & Muthén, B. O. (2007). Mplus User’s Guide (5th ed.). Los Angeles, CA: Muthén & Muthén.
Rayner, V., Pitsolantis, N., & Osana, H. (2009). Mathematics anxiety in preservice teachers: Its relationship to their conceptual and procedural knowledge of fractions. Mathematics Education Research Journal, 21, 60–85.
Reynolds, A. (1995). One year of preschool intervention or two: Does it matter? Early Childhood Research Quarterly, 10, 1–31.
Richardson, F., & Suinn, R. (1972). The mathematics anxiety rating scale; psychometric data. Journal of Counseling Psychology, 19, 551–554.
Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the Fit of Structural Equation Models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research Online, 8, 23–74.
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15, 4–14.
Spielberger, C. (1972). Anxiety: Current trends in research. London, UK: Academic Press.
Steyer, R., Partchev, I., Seiß, K., Menz, S., & Hübner, T. (2000). Zur Anwendung von State-Trait-Modellen in der wehrpsychologischen Eignungsdiagnostik unter besonderer Berücksichtigung computerunterstützter Tests [On the application of state-trait models in military psychological aptitude testing with particular regard to computer-aided tests]. Retrieved from http://www.metheval.uni-jena.de/materialien/publikationen/bwpbericht000317.pdf
Steyer, R., Schmitt, M., & Eid, M. (1999). Latent state-trait theory and research in personality and individual differences. European Journal of Personality, 13, 389–408.
Weinert, F. E. (2001). Concept of competence: A conceptual classification. In D. S. Rychen & L. H. Salganik (Eds.), Defining and selecting key competencies. Göttingen, Germany: Hogrefe.
Zakaria, E., & Nordin, N. M. (2008). The effects of mathematics anxiety on matriculation students as related to motivation and achievement. Eurasia Journal of Mathematics, Science & Technology Education, 4, 27–30.

Zeitschrift für Psychologie 2015; Vol. 223(1):31–38

Lars Jenßen Department of Instructional Research HU Berlin Geschwister-Scholl-Straße 7 10117 Berlin Germany Tel. +49 30 2093-1936 Fax +49 30 2093-1828 E-mail lars.jenssen@hu-berlin.de

© 2015 Hogrefe Publishing


Original Article

Assessment of Mathematical Competencies and Epistemic Cognition of Preservice Teachers

Benjamin Rott,1 Timo Leuders,2 and Elmar Stahl3

1Faculty of Mathematics, University of Duisburg-Essen, Germany, 2Institute for Mathematics Education, University of Education, Freiburg, Germany, 3Institute for Media in Education, University of Education, Freiburg, Germany

Abstract. Assessment in higher education requires multifaceted instruments to capture competence structures and development. The construction of a competence model of preservice mathematics teachers’ mathematical abilities and epistemic beliefs allows for comparing different groups at different stages of their studies. We investigated 1st and 4th semester students with respect to their epistemic beliefs on the certainty of mathematical knowledge (assessed by both denotative and connotative judgments) and their mathematical abilities (defined as critical thinking with respect to mathematical problem situations). We show that students’ beliefs change during the first four semesters, and that the level of critical thinking does not depend on belief orientation but goes along with the level of sophistication of the epistemic judgments uttered by the students. Keywords: epistemic beliefs, mathematical competencies, sophisticated beliefs, critical thinking

Higher education is a key to the development of modern society, especially in disciplines like teacher education, because these disciplines deal with the transfer of knowledge and are supposed to have a leverage effect. While the measurement of competencies in primary and secondary education has evolved considerably during the past decades (Hartig, Klieme, & Leutner, 2008), there is still a need for competence models for assessment in higher education (Blömeke, Zlatkin-Troitschanskaia, Kuhn, & Fege, 2013). Although instruments for assessing competencies in teacher education on a large scale are available (Blömeke, Suhl, & Döhrmann, 2013; Hill, Rowan, & Loewenberg Ball, 2005; OECD, 2010), many questions on the underlying competence structures remain open. Competencies are considered as complex abilities closely related to performance in real-life situations (Blömeke, Gustafsson, & Shavelson, 2015; Hartig et al., 2008; Shavelson, 2010) and they comprise and integrate constructs such as skills, abilities, knowledge, beliefs, and motivation (Weinert, 2001). Competence models are expected to specify the structure of competencies in predefined areas and constitute a framework for validly measuring competencies. Within the specific area of mathematics teacher education it has been studied how competencies develop and how abilities and beliefs interconnect (e.g., Schoenfeld, 2003; Staub & Stern, 2002); however, there is still a lack of approaches that analyze these competence structures by means of psychometric competence models.

Therefore, in our study (which is embedded in a larger initiative “Modeling and Measuring Competencies in Higher Education,” cf. Blömeke, Zlatkin-Troitschanskaia, et al., 2013) we construct and use a competence model that incorporates epistemic beliefs about mathematics and critical thinking in mathematics as two central cognitive dimensions of preservice mathematics teachers. Figure 1 presents an overview of the relevant constructs and variables of the study, which are explained in the following sections.

Zeitschrift für Psychologie 2015; Vol. 223(1):39–46. DOI: 10.1027/2151-2604/a000198

Figure 1. Dimensions and development of mathematical competencies. [Figure labels: high level, low level; mathematical thinking (critical thinking, algorithmic thinking, autonomous mind); epistemic beliefs on mathematical knowledge and knowing (sophisticated beliefs, denotative judgments, inflexible beliefs, connotative beliefs).]

Structure of the Belief Dimension

Epistemic beliefs can be defined as learners’ beliefs about the nature of knowledge and knowing (Hofer & Pintrich, 1997). A growing amount of empirical evidence shows that epistemic beliefs are related to several aspects of learning processes and to learning outcomes (e.g., Buehl & Alexander, 2006; Hofer & Pintrich, 1997). Furthermore, students’ epistemic beliefs are affected by their teachers’ epistemic beliefs and their teaching style (e.g., Brownlee & Berthelsen, 2008). Adequate epistemic beliefs are considered as a prerequisite to successfully complete higher education (e.g., Bromme, 2005) and for an elaborated understanding of scientific findings. Beliefs about the nature of knowledge are regarded as a prerequisite for an active civic participation in modern science- and technology-based societies (Bromme, 2005). However, they are rarely included in competence models, as, for example, in the area of mastering languages (cf. Klieme et al., 2004, p. 135). In our study we wish to introduce a model which incorporates beliefs on the nature of knowledge in the area of mathematics.

To empirically assess beliefs, one can draw on various theories on their structure:

a) Epistemic beliefs are usually seen as multidimensional. A widely accepted structure was proposed by Hofer and Pintrich (1997), who differentiated between two general areas of epistemic beliefs (nature of knowledge and nature or process of knowing) with two dimensions each (certainty and simplicity as well as source and justification of knowledge). For the aim of this study we focus on beliefs about the certainty of knowledge, which is central to mathematics (Kline, 1980). In the public opinion mathematics is regarded as a domain of certain knowledge and many misconceptions on mathematics are related to this view (Muis, 2004; Weber, Inglis, & Mejia-Ramos, 2014).

b) Most theoretical approaches include different levels of specificity of epistemic beliefs. Hofer (2006) distinguishes between general epistemic beliefs, disciplinary perspectives on beliefs, and discipline-specific beliefs. There is strong evidence for the assumption that epistemic beliefs are context-related (e.g., Stahl, 2011). Hence, in our study we examine interactions between mathematical knowledge and beliefs about mathematics on a discipline-specific level. To further account for the context specificity of epistemic beliefs we distinguish between beliefs about mathematics expressed in either a connotative or a denotative way. Connotative judgments are defined as judgments about the nature of mathematics that are activated spontaneously when no further context is given; for example, when a student is asked whether he or she generally thinks that mathematical knowledge is rather certain or uncertain. It is assumed that these judgments are directly related to the discipline-specific epistemic beliefs. Denotative judgments are generated with respect to a specific (mathematical) situation and are expected to be more reflected and more context-specific (Bromme, Kienhues, & Stahl, 2008); for example, when a student is asked to choose between two given positions and give arguments for his or her choice.

c) Nearly all existing approaches address the development of beliefs during education toward more sophisticated epistemic beliefs. It is often assumed that a strong absolutistic view on knowledge, which stresses its certainty and stability and considers knowledge as an accumulation of facts that can be transferred by authority, is not appropriate in many contexts in which individuals have to judge knowledge claims. On the other hand, a strong relativistic view on knowledge that stresses its uncertainty and instability and that might result in an acceptance of different viewpoints without deeper reflection is also not appropriate for reflected epistemic judgments (Krausz, 2010). Therefore we agree with Bromme et al. (2008), who describe sophisticated epistemic beliefs “as those beliefs which allow for context-sensitive judgments about knowledge claims.”¹ This is in line with an evaluativistic perspective (e.g., King & Kitchener, 2002) which accepts a certain degree of uncertainty and changeability of truth. This also goes well with the fact that academic disciplines have commonly accepted methods and standards in the development of scientific knowledge. Therefore, aspects like the ontology of a discipline (Bromme et al., 2008), the specific context (e.g., Elby & Hammer, 2001), and the sociocultural context (e.g., Buehl & Alexander, 2006) should be taken into account when knowledge claims are judged. This view of sophistication as an epistemologically reflected way to deal with knowledge claims supports the idea of Stahl (2011) that it is necessary to distinguish between the epistemic orientation (e.g., more absolutistic, more relativistic, more evaluativistic) and the sophistication of the judgment (level of reflection).

Structure of the Knowledge Dimension

When assessing knowledge (as the second cognitive dimension of our competence model) we do not measure mathematical achievement in terms of students’ knowledge of the content of university courses. A theoretically more coherent picture of the students’ competence can be achieved by capturing the quality of the use of mathematical knowledge by drawing on the concept of critical thinking.

¹ Please note that we do not intend to use “sophistication” to convey a value judgment (cf. Muis, 2004, p. 332), but rather to indicate a reflected use of arguments.

Facione (1990, p. 3) “understand[s] critical thinking to be purposeful, self-regulatory judgment which results in interpretation, analysis, evaluation, and inference, as well as explanation of the evidential, conceptual, methodological, criteriological, or contextual considerations upon which that judgment is based. [. . .].” Though many different conceptualizations of critical thinking exist (e.g., in philosophy, psychology, and education) the following abilities are commonly agreed upon (cf. Lai, 2011, p. 9): analyzing arguments, claims, or evidence; making inferences using inductive or deductive reasoning; judging or evaluating; and making decisions or solving problems.

To locate critical thinking within cognition, Stanovich and Stanovich (2010) propose a tripartite model of thinking, adapting and extending dual process theory (e.g., Kahneman, 2003). They distinguish the subconscious thinking of an “autonomous mind” from the conscious thinking of an “algorithmic mind” and a “reflective mind” (see Figure 2). Critical thinking is identified with the functioning of the reflective mind, which may override algorithmic processes. When solving mathematical problems, critical thinking can be attributed to those processes that consciously regulate the algorithmic use of mathematical procedures. A context-specific operationalization of critical thinking can be considered as a competence component (Weinert, 2001). Consequently, tasks to measure critical thinking that reflect this definition should (i) reflect discipline-specific solution processes but should not require higher level mathematics, (ii) require a reflective component of reasoning and judgment when solving a task or evaluating the solution, (iii) reflect an appropriate variation of difficulty within the population.

Figure 2. The tripartite model of thinking (see Stanovich & Stanovich, 2010); the broken horizontal line represents the key distinction in dual process theory. [Figure labels: Reflective Mind, Algorithmic Mind, Autonomous Mind.]

Resting upon these conceptualizations of the belief and knowledge dimensions of mathematical competence, we intend to assess the levels of sophistication within the competencies of future mathematics teachers and their development at different stages of their university education:

Hypothesis 1: We expect students at the beginning of their mathematical education to show lower levels of mathematical thinking and more inflexible beliefs. This finding would be coherent with the picture of mathematics conveyed in school (Schoenfeld, 1989).

Hypothesis 2: We expect that we can distinguish between students’ epistemic orientations (certain vs. uncertain) on the one hand and the sophistication of their argumentations on the other hand, reflecting the reported discussion on context specificity of epistemic judgments. This structure should be seen more clearly in 4th semester students than in 1st semester students, due to the influence of reflective elements of courses in mathematics and mathematics education.

Hypothesis 3: Finally, we expect that this deeper reflection of the discipline induces connotative beliefs to be more in line with denotative beliefs in higher semesters.

By these analyses we intend to find indications that our instruments are sensitive enough to identify competence profiles and changes in competencies, also when used within long-term longitudinal settings in subsequent studies.

The Study

Measuring the impact of university education on beliefs and critical thinking may pose severe problems. Arum and Roksa (2011) report that most students do not improve their critical thinking skills during their first 2 years of university studies. The CBMS (2012, p. 55) summarizes the state of mathematics teacher education in the United States (which is comparable to Germany): “A primary goal of a mathematics major program is the development of mathematical reasoning skills. This may seem like a truism to higher education mathematics faculty, to whom reasoning is second nature. But precisely because it is second nature, it is often not made explicit in undergraduate mathematics courses.” Therefore it may prove difficult to assess the change of students’ beliefs or higher order skills with any instrument in regular teacher education. However, we situated our research within the teacher education program of the “University of Education Freiburg,” which specializes in teacher education and appoints only specialists in education, in educational as well as in all subject matter courses. During the first four semesters, future mathematics teachers attend courses on education, mathematics education, and mathematics. All courses comprise reflective elements on the methods and the epistemology of mathematics as a discipline, for example, the role of proof on different levels of rigor, experimentation in mathematics, etc. (Barzel et al., in press). Therefore we expect to find relevant growth in competence areas that address the level of reflection on mathematical processes and meta-mathematical issues.

Participants of the study were students at the University of Education Freiburg with the aim of teaching mathematics in primary and secondary schools. In 2013/2014 two groups were observed longitudinally at two times of measurement each: Group 1 at the beginning and end of the 1st semester (T1, n = 105; T2, n = 87; full survey on all beginning students) and Group 2 at the beginning and end of the 4th semester (T3, n = 42; T4, n = 59; participants of a seminar on epistemological processes in mathematics). At T1 and T3 we collected data regarding (i) discipline-specific denotative epistemic beliefs focusing on certainty of mathematical knowledge, (ii) discipline-related connotative epistemic beliefs, and (iii) critical thinking in mathematics. The denotative beliefs have additionally been gathered at T2 and T4 to capture changes of beliefs.

Methods

Measuring Epistemic Beliefs (Denotative vs. Connotative)

To gain access to denotative epistemic beliefs and to the sophistication of the students’ judgments, we conducted a preliminary interview study (Rott, Leuders, & Stahl, 2014). The interviewees’ answers revealed a broad spectrum of possible responses that helped us to construct a web-based questionnaire. In this questionnaire we used open-ended questions and prompts explaining controversial points of view toward mathematics as a scientific discipline to acquire a vivid account of our participants’ epistemic judgments and the corresponding arguments. We developed a coding manual for the belief orientation (mathematics as certain vs. uncertain) and the level of argumentation (as inflexible vs. sophisticated). Overall, 293 responses (from the participants of all four measurement times) were independently coded by two trained raters (average Cohen’s κ = 0.88).

To measure the connotative epistemic beliefs we used the CAEB (Connotative Aspects of Epistemological Beliefs) questionnaire by Stahl and Bromme (2007), which can be used in relation to different disciplines or contexts. It consists of 24 pairs of contrastive adjectives like simple vs. complex. The respondents are supposed to judge the character of a discipline on a 7-point Likert scale for each contrastive pair. Stahl and Bromme (2007) validated the instrument in two studies with more than 1,000 participants each and identified two factors via factor analysis: Texture (beliefs about the structure and accuracy of knowledge) and Variability (beliefs about the stability and dynamics of knowledge). In the present study we instructed our participants to complete the CAEB with “mathematics as a scientific discipline” in mind. Due to our focus on epistemic judgments about the certainty of mathematics we constructed one factor focusing on certainty, based on expert ratings and on an exploratory factor analysis. The factor consists of 10 items (Cronbach’s α = 0.71, see Table 1 for the items and factor loadings).
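For readers unfamiliar with the two coefficients reported above, the following minimal sketch shows how Cohen's κ (inter-rater agreement) and Cronbach's α (internal consistency) are commonly computed. It is an illustration under stated assumptions, not the authors' analysis: the function names, the example codes, and the example ratings are all hypothetical, and the study's raw data are not reproduced here.

```python
from collections import Counter
from statistics import variance


def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement between two raters' categorical codes."""
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Agreement expected by chance, from each rater's marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)


def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` holds one list of item scores per person."""
    k = len(responses[0])
    item_variances = [
        variance([person[i] for person in responses]) for i in range(k)
    ]
    total_variance = variance([sum(person) for person in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical belief-orientation codes from two raters (not study data).
rater_a = ["certain", "certain", "uncertain", "uncertain",
           "certain", "uncertain", "certain", "uncertain"]
rater_b = ["certain", "certain", "uncertain", "uncertain",
           "certain", "uncertain", "certain", "certain"]
kappa = cohens_kappa(rater_a, rater_b)

# Hypothetical 7-point ratings: 4 respondents x 3 adjective pairs.
ratings = [[2, 3, 2], [5, 5, 6], [4, 3, 4], [6, 7, 6]]
alpha = cronbach_alpha(ratings)

print(f"kappa = {kappa:.2f}, alpha = {alpha:.2f}")
```

Note that `cohens_kappa` is undefined when chance agreement equals 1 (both raters use a single category throughout); a production implementation would need to handle that case.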

Table 1. Adjective pairs and corresponding factor loadings for the CAEB dimension “Certainty”

Item (Mathematics as a discipline is ...)   Factor loading
precise – imprecise                         .77
exact – vague                               .77
absolute – relative                         .74
sorted – unsorted                           .71
certain – uncertain                         .70
temporary – everlasting                     .60
definite – ambiguous                        .53
stable – unstable                           .50
confirmable – unconfirmable                 .45
accepted – disputed                         .38

Zeitschrift für Psychologie 2015; Vol. 223(1):39–46

Measuring Critical Thinking

Within the tripartite model of thinking (Figure 2), critical thinking can be operationalized by situations that demand a critical override of algorithmic mathematical solutions by reflective and evaluative processes, such as in the paradigmatic bat-and-ball task by Kahneman and Frederick (2002): “A bat and a ball cost $1.10 in total. The bat costs $1 more than the ball. How much does the ball cost?” The spontaneous, algorithmically produced answer that most people come up with is $0.10. A critical thinker would question this answer and realize that the ball should cost $0.05, whereas people who do not use critical thinking do not evaluate their first thought and therefore do not adapt their solution.

By adapting and developing more than 20 items of similar character within the domain of mathematics, we constructed a test to measure the students’ critical thinking ability (for sample items, see Appendix). All items were related to mathematical situations and were located on a curricular level that demanded only knowledge from lower secondary education. After validating the items in a preliminary study, the final test consisted of 11 items that were rated dichotomously. In the present study, we used a Rasch model to transform our students’ test scores into values on a one-dimensional competence scale (software RUMM 2030 by Andrich, Sheridan, & Luo, 2009). After eliminating two items because of underdiscrimination (fit residual > 2.5) in connection with floor and ceiling effects, respectively, the model showed good fit residuals for each remaining item (all values between −2.5 and 2.5) and no significant differences between the observed overall performance of each trait group and its expected performance (overall χ² = 36.2; df = 27; p = .11).
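The arithmetic behind the bat-and-ball task can be made explicit. If the ball costs b, the bat costs b plus the stated difference, so b + (b + difference) = total and b = (total − difference) / 2. The short sketch below checks both the intuitive and the correct answer with exact fractions. The function name `ball_price` is an illustrative assumption, not part of the test instrument.

```python
from fractions import Fraction


def ball_price(total, difference):
    """Solve ball + (ball + difference) = total for the price of the ball."""
    return (total - difference) / 2


total = Fraction("1.10")
difference = Fraction("1.00")

# Correct answer: (1.10 - 1.00) / 2.
correct = ball_price(total, difference)

# The intuitive answer of 0.10 does not satisfy the task:
# a 0.10 ball plus a 1.10 bat would cost 1.20 in total.
intuitive = Fraction("0.10")
intuitive_total = intuitive + (intuitive + difference)
```

The same relation applies to the table-tennis variant listed in the Appendix (€10.20 in total, bat €10 more expensive), where `ball_price(Fraction("10.20"), Fraction(10))` gives €0.10.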

Results

Denotative Beliefs and Sophistication

The development of the students’ denotative epistemic beliefs and their degree of sophistication is summarized in Table 2. At the beginning of their university studies (T1), more than two thirds of the students regard mathematical knowledge as uncertain (cf. Hypothesis 1 above). This is slightly surprising and does not conform to the finding that mathematics as taught in school shapes a view of mathematics as a collection of static and reliable procedures (Muis, 2004; Schoenfeld, 1989). Nevertheless, at the end of their 1st semester (T2), the majority of our students tend to see mathematical knowledge as certain. Based on the students’ responses in the questionnaire, we attribute this change of judgment (which is significant; McNemar test: χ² = 7.2, p = .012) mostly to the lectures the students attended in their 1st semester, which paid considerable attention to mathematical reasoning and proof (e.g., geometrical proofs or mathematical induction). Of the 4th semester students (T3 and T4), the majority regard mathematical knowledge as uncertain. In their responses most of them point to deficiencies of the review procedure of mathematical publications or to the many unsolved mysteries regarding prime numbers (e.g., the existence of infinitely many twin primes, Goldbach’s conjecture, etc.).

With respect to the degree of the students’ sophistication (inflexible/sophisticated), one can detect an increase in the proportion of students who use sophisticated arguments (T1: 9.5%; T2: 11.5%; T3: 21.4%; T4: 28.8%). As expected, our data show that the two test dimensions, certainty and sophistication, are unrelated (testing all groups at T1 and T3 on independence yields: χ² = 0.06, p = .811, cf. Hypothesis 2).

Table 2. Distribution of the students’ denotative epistemic beliefs and their degree of sophistication

Measurement time                          Certain and   Certain and     Uncertain and   Uncertain and   Sum
                                          inflexible    sophisticated   inflexible      sophisticated
Group 1 (1st semester)   (T1) Beginning   29 (27.6%)    3 (2.9%)        66 (62.9%)      7 (6.7%)        105 (100%)
Group 1 (1st semester)   (T2) End         48 (55.2%)    2 (2.3%)        29 (33.3%)      8 (9.2%)        87 (100%)
Group 2 (4th semester)   (T3) Beginning   8 (19.0%)     3 (7.1%)        25 (59.5%)      6 (14.3%)       42 (100%)
Group 2 (4th semester)   (T4) End         7 (11.9%)     4 (6.8%)        35 (59.3%)      13 (22.0%)      59 (100%)

B. Rott et al.: Assessment of Mathematical Competencies and Epistemic Cognition of Preservice Teachers
© 2015 Hogrefe Publishing

Connotative Beliefs

The certainty dimension of the connotative beliefs measured by the CAEB instrument provides metric data ranging from 1 to 7 (Table 3). In what follows, the data regarding the connotative beliefs and the critical thinking scores will each be sorted by both judgment and sophistication.

Do the CAEB data fit the denotative belief orientations? Low CAEB scores indicate connotative beliefs of certainty, and indeed students who articulated certain beliefs in the denotative questionnaire have lower CAEB scores (see the “Certain” and “Uncertain” entries in Table 3). This effect is especially visible for 4th semester students. However, a two-way ANOVA does not show significant effects for the group (1st/4th semester; F = 0.20; p = .652) or the students’ denotative judgments (certain/uncertain; F = 3.54; p = .062) and no interaction effect (F = 2.88; p = .092), indicating the disparity of connotative and denotative judgments.

Do the connotative beliefs correlate with the quality of the argumentation? As expected, there are no differences in the connotative beliefs between students who argue inflexibly and those who argue sophisticatedly (t = 0.93; df = 145; p = .354), showing that both dimensions should be distinguished.

Critical Thinking

The Rasch model of the students’ critical thinking ability provides metric latent variables ranging from −2.83 to 2.81, with low values indicating a low ability (see Table 4).

Does critical thinking depend on the denotative judgment? A two-way ANOVA was used to investigate possible differences between the students’ critical thinking scores sorted by their judgment: The 4th semester students show higher ability scores than 1st semester students (F = 9.54; p = .002); there is no significant difference between students regarding mathematical knowledge as “certain” and students regarding it as “uncertain” (F = 1.04; p = .310), and no interaction effect (F = 0.02; p = .886), which is in accordance with our expectations.

Does critical thinking correlate with the degree of sophistication? Another two-way ANOVA was used to analyze the critical thinking scores sorted by the quality of the argumentation. The group effect is confirmed (F = 9.54; p = .002). As expected, students arguing sophisticatedly show higher ability scores than students arguing in an inflexible way (F = 4.76; p = .031). Thus, there is a marked connection between sophistication of beliefs and the ability to think critically when solving tasks. There
is no interaction effect (F = 1.60; p = .208), even though the mean difference is much more prominent in the 4th semester.

Table 3. Means (and standard deviations) of the groups regarding connotative epistemic beliefs (CAEB dimension “Certainty”)

                         Total                 Certain              Uncertain             Inflexible            Sophisticated
First semester (T1)      3.23 (1.15) n = 105   3.20 (1.29) n = 32   3.25 (1.09) n = 73    3.32 (1.10) n = 95    2.44 (1.34) n = 10
Fourth semester (T3)     3.54 (1.31) n = 42    2.90 (1.02) n = 11   3.77 (1.34) n = 31    3.47 (1.31) n = 33    3.80 (1.35) n = 9
All students combined    3.32 (1.20) n = 147   3.13 (1.23) n = 43   3.40 (1.19) n = 104   3.35 (1.15) n = 128   3.08 (1.48) n = 19

Table 4. Means (and standard deviations) of mathematical critical thinking

                         Total                    Certain                 Uncertain                Inflexible               Sophisticated
First semester (T1)      −0.439 (0.853) n = 105   −0.327 (0.885) n = 32   −0.488 (0.840) n = 73    −0.459 (0.836) n = 95    −0.255 (1.033) n = 10
Fourth semester (T3)     0.110 (0.997) n = 42     0.268 (1.295) n = 11    0.054 (0.887) n = 31     −0.054 (0.987) n = 33    0.711 (0.826) n = 9
All students combined    −0.282 (0.927) n = 147   −0.175 (1.023) n = 43   −0.327 (0.886) n = 104   −0.354 (0.891) n = 128   0.203 (1.041) n = 19

Discussion

Our aim was to better understand the complex structure of competencies in higher education. We approached this question by focusing on preservice mathematics teachers and on two cognitive dimensions: epistemic beliefs and mathematical knowledge. We especially looked at denotative beliefs about the certainty of mathematical knowledge, the sophistication of the students’ argumentation and their relations to connotative beliefs, and critical thinking ability in 1st and 4th semester students.

We were able to construct and adapt instruments for measuring these cognitive dimensions: (1) To capture the denotative component of beliefs we did not construct a scale but used specific contextual information (texts on certainty) to elicit a reflected, elaborate judgment.² These judgments could be evaluated reliably by a rating procedure that distinguished not only the direction of the judgment but also its level of sophistication. (2) Furthermore, we constructed a reliable scale for measuring the connotative dimension of beliefs focusing on certainty by relying on the more general framework of the CAEB instrument. (3) The dimension of mathematical ability could be operationalized within the general framework of critical thinking. The availability of such an instrument (instead of using more general tests of critical thinking, such as Ennis & Weir, 1985) is pivotal for keeping the theoretical focus of our investigation on the context of mathematics.

Our inspections of the connection between the beliefs captured by connotative and denotative judgments revealed that these perspectives should be regarded as disparate. Whether this result generalizes to areas other than the certainty of mathematical knowledge (such as the epistemological status of mathematical knowledge: constructed vs. discovered) remains an open question.
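The independence check reported in the Results (χ² = 0.06, p = .811 for all groups at T1 and T3) can be retraced from the counts in Table 2. The sketch below is a plausible reconstruction, assuming a Pearson chi-square test without continuity correction on the pooled T1 and T3 counts; the text does not state which variant was used, and the function name is illustrative.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With one degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts from Table 2, T1 and T3 pooled.
# Rows: certain, uncertain; columns: inflexible, sophisticated.
pooled = [[29 + 8, 3 + 3],
          [66 + 25, 7 + 6]]
chi2, p_value = chi_square_2x2(pooled)
```

Under these assumptions the statistic and p value agree with the reported figures after rounding, which is consistent with belief orientation and sophistication being unrelated in this sample.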

The comparison between students of the 1st and 4th semester showed that the instruments are sensitive to the increase in sophistication in cognitive dimensions: The level of critical thinking when solving mathematical problems grows alongside the sophistication of (one facet of) the belief system; this finding is in line with research on problem solving (e.g., Schoenfeld, 1985). However, the proportion of students showing sophisticated beliefs is rather low, which calls for optimizing the instrument to reveal more differentiated levels.

We also found evidence for the independence of denotative beliefs and the sophistication of the corresponding arguments. Most models of epistemic development predict a change from inflexible beliefs of certainty to sophisticated beliefs of uncertainty of knowledge during education in schools and universities. Our data show that there are sophisticated representatives of “certain knowledge” as well as unreflective representatives of “uncertain knowledge.” We propose to deliberately distinguish between beliefs and the sophistication of their representation; cf. Greene and Yu (2014) for a similar critique of models of epistemic cognition.

Limitations and Further Studies

There are several limitations to this study that can be addressed methodologically in further investigations. For example, reliably rating the judgment and sophistication of denotative beliefs depends on the students’ willingness to fill out our questionnaire and write down an argumentation. Also, although the items of the critical thinking test show face validity with respect to the theoretical model, a test of differential validity of this scale with respect to mathematical knowledge has yet to be undertaken.

There are also limitations regarding the selection of our sample, as it only consists of students from a university specializing in teacher education. We do not assume that the observed development of cognitive competence dimensions holds true for teacher students in general. We plan to take a considerably larger sample from different universities in Germany in a subsequent study. Even within the University of Education Freiburg the sample is partially biased, as we assessed all students within the 1st semester but only participants of one particular seminar from the 4th semester. The latter group should in principle also cover a full cohort, as all students have to attend this seminar in the course of their studies. However, due to flexible study regulations, not all students attend this seminar in their 4th semester, which explains the considerably smaller size of that group in this study. Further studies will take this into account by means of an improved longitudinal design.

Analyzing students of different semesters on the basis of the theoretical competence model revealed some interesting facts on the development of cognitive competence dimensions in higher education. Since the theoretical framework used represents a rather domain-specific approach (focusing on mathematics teacher students and on the certainty of mathematics), we cannot draw conclusions as to the generalizability of our findings to other domains or groups. Nevertheless, we see an opportunity to connect our constructs to broader competence models in further research, aiming to answer questions like this: Can we distinguish the competence structure or development in mathematics of different groups of students with different professional goals (such as mathematics, teaching, or engineering)? This suggests that the instruments should be refined and further validated, for example in experimental designs, allowing for the identification of factors that strongly influence competence development.

² In these texts we used reflections on the role of proof for certainty. Please note that the German word “Beweis” used in the study has the sole meaning of “proof” as “rigorous deduction.” There is no indication that our students confused this with other concepts of “proof” like “evidence by a series of examples” (for this the German word “prüfen” would be used).

References

Andrich, D., Sheridan, B. E., & Luo, G. (2009). RUMM2030: Rasch unidimensional models for measurement. Perth, Australia: RUMM Laboratory.
Arum, R., & Roksa, J. (2011). Academically adrift. Limited learning on college campuses. Chicago, IL: University of Chicago Press.
Barzel, B., Eichler, A., Holzäpfel, L., Leuders, T., Maaß, K., & Wittmann, G. (in press). Vernetzte Kompetenzen statt träges Wissen – Ein Studienmodell zur konsequenten Vernetzung von Fachwissenschaft, Fachdidaktik und Schulpraxis [A network of competencies instead of dull knowledge – a study model on the consistent interlinking of science, didactics, and teaching practice in schools]. In R. Biehler, R. Hochmuth, A. Hoppenbrock, & H. E. Rück (Eds.), Lehren und Lernen von Mathematik in der Studieneingangsphase – Herausforderungen und Lösungsansätze. Berlin, Germany: Springer.
Blömeke, S., Suhl, U., & Döhrmann, M. (2013). Assessing strengths and weaknesses of teacher knowledge in Asia, Eastern Europe and Western countries: Differential item functioning in TEDS-M. International Journal of Science and Mathematics Education, 11, 795–817.
Blömeke, S., Zlatkin-Troitschanskaia, O., Kuhn, C., & Fege, J. (Eds.). (2013). Modeling and measuring competencies in higher education. Rotterdam, The Netherlands: Sense.
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: Competence viewed as a continuum. Zeitschrift für Psychologie, 223. doi: 10.1027/2151-2604/a000194
Bromme, R. (2005). Thinking and knowing about knowledge – a plea for critical remarks on psychological research programs on epistemological beliefs. In M. Hoffmann, J. Lenhard, & F. Seeger (Eds.), Activity and sign – grounding mathematics education (pp. 191–201). New York, NY: Springer.
Bromme, R., Kienhues, D., & Stahl, E. (2008). Knowledge and epistemological beliefs: An intimate but complicate relationship. In M. S. Khine (Ed.), Knowing, knowledge and beliefs. Epistemological studies across diverse cultures (pp. 423–441). New York, NY: Springer.
Brownlee, J., & Berthelsen, D. (2008). Developing relational epistemology through relational pedagogy. In M. S. Khine (Ed.), Knowing, knowledge and beliefs. Epistemological studies across diverse cultures (pp. 405–422). New York, NY: Springer.
Buehl, M. M., & Alexander, P. A. (2006). Examining the dual nature of epistemological beliefs. International Journal of Educational Research, 45, 28–42.
CBMS – Conference Board of the Mathematical Sciences. (2012). The mathematical education of teachers II. Providence, RI: American Mathematical Society.
Elby, A., & Hammer, D. (2001). On the substance of sophisticated epistemology. Science Education, 85, 554–567.
Ennis, R. H., & Weir, E. (1985). The Ennis-Weir critical thinking essay test. Pacific Grove, CA: Midwest.
Facione, P. A. (1990). Critical thinking: A statement of expert consensus for purposes of educational assessment and instruction [Executive Summary “The Delphi Report”]. Millbrae, CA: California Academic Press.
Greene, J. A., & Yu, S. B. (2014). Modeling and measuring epistemic cognition: A qualitative re-investigation. Contemporary Educational Psychology, 39, 12–28.
Hartig, J., Klieme, E., & Leutner, D. (Eds.). (2008). Assessment of competencies in educational contexts: State of the art and future prospects. Göttingen, Germany: Hogrefe.
Hill, H. C., Rowan, B., & Loewenberg Ball, D. (2005). Effects of teachers’ mathematical knowledge for teaching on student achievement. American Educational Research Journal, 42, 371–406.
Hofer, B. (2006). Beliefs about knowledge and knowing: Domain specificity and generality. Educational Psychology Review, 18, 67–76.
Hofer, B. K., & Pintrich, P. R. (1997). The development of epistemological theories: Beliefs about knowledge and knowing and their relation to learning. Review of Educational Research, 67, 88–140.
Kahneman, D. (2003). A perspective on judgment and choice. American Psychologist, 58, 697–720.
Kahneman, D., & Frederick, S. (2002). Representativeness revisited: Attribute substitution in intuitive judgment. In T. Gilovich, D. Griffin, & D. Kahneman (Eds.), Heuristics and biases: The psychology of intuitive judgment (pp. 49–81). New York, NY: Cambridge University Press.
King, P. M., & Kitchener, K. S. (2002). The reflective judgment model: Twenty years of research on epistemic cognition. In B. K. Hofer & P. R. Pintrich (Eds.), Personal epistemology: The psychology of beliefs about knowledge and knowing. Mahwah, NJ: Erlbaum.
Klieme, E., Avenarius, H., Blum, W., Döbrich, P., Gruber, H., Prenzel, M., . . . Vollmer, H. J. (Eds.). (2004). The development of national educational standards – an expertise. Berlin, Germany: Bundesministerium für Bildung und Forschung (BMBF).
Kline, M. (1980). Mathematics: The loss of certainty. Oxford, UK: Oxford University Press.
Krausz, M. (Ed.). (2010). Relativism. A contemporary anthology. New York, NY: Columbia University Press.
Lai, E. R. (2011). Critical thinking: A literature review. Upper Saddle River, NJ: Pearson Assessment. Retrieved from www.pearsonassessments.com/hai/images/tmrs/criticalthinkingreviewfinal.pdf
Muis, K. R. (2004). Personal epistemology and mathematics: A critical review and synthesis of research. Review of Educational Research, 74, 317–377.
OECD. (2010). PISA 2009 assessment framework. Retrieved from www.oecd.org/pisa/pisaproducts/44455820.pdf
Rott, B., Leuders, T., & Stahl, E. (2014). “Is mathematical knowledge certain? – Are you sure?” An interview study to investigate epistemic beliefs. mathematica didactica, 37, 118–132.
Schoenfeld, A. H. (1985). Mathematical problem solving. Orlando, FL: Academic Press.
Schoenfeld, A. H. (1989). Explorations of students’ mathematical beliefs and behavior. Journal for Research in Mathematics Education, 20, 338–355.
Schoenfeld, A. H. (2003). How can we examine the connections between teachers’ world views and their educational practices? Issues in Education, 8, 217–227.
Shavelson, R. J. (2010). On the measurement of competency. Empirical Research in Vocational Education and Training, 2, 41–63.
Stahl, E. (2011). The generative nature of epistemological judgments: Focusing on interactions instead of elements to understand the relationship between epistemological beliefs and cognitive flexibility. In J. Elen, E. Stahl, R. Bromme, & G. Clarebout (Eds.), Links between beliefs and cognitive flexibility – lessons learned (pp. 37–60). Dordrecht, The Netherlands: Springer.
Stahl, E., & Bromme, R. (2007). The CAEB: An instrument for measuring connotative aspects of epistemological beliefs. Learning and Instruction, 17, 773–785.
Stanovich, K. E., & Stanovich, P. J. (2010). A framework for critical thinking, rational thinking, and intelligence. In D. Preiss & R. J. Sternberg (Eds.), Innovations in educational psychology: Perspectives on learning, teaching and human development (pp. 195–237). New York, NY: Springer.
Staub, F. C., & Stern, E. (2002). The nature of teachers’ pedagogical content beliefs matters for students’ achievement gains: Quasi-experimental evidence from elementary mathematics. Journal of Educational Psychology, 94, 344–355.
Weber, K., Inglis, M., & Mejia-Ramos, J. P. (2014). How mathematicians obtain conviction: Implications for mathematics instruction and research on epistemic cognition. Educational Psychologist, 49, 36–58.
Weinert, F. E. (2001). Concept of competence: A conceptual classification. In D. S. Rychen & L. H. Salganik (Eds.), Defining and selecting key competencies. Göttingen, Germany: Hogrefe.

Benjamin Rott
Faculty of Mathematics
University of Duisburg-Essen
Thea-Leymann-Str. 9
45127 Essen
Germany
Tel. +49 201 183-4297
Fax +49 201 183-2426
E-mail benjamin.rott@uni-due.de

Appendix

Sample Items of the Mathematical Critical Thinking Test

A low Rasch location indicates an easy item; a low Rasch fit residual indicates a high discriminatory power.

Item 1. In a gamble, a regular six-sided die with four green faces and two red faces is rolled 20 times. You win €25 if a certain sequence of results is shown. Which sequence would you bet on? □ RGRRR □ GRGRRR □ GRRRRR
  Empirical frequency of solutions: 0.21
  Rasch location (fit residual): 1.24 (−1.064)

Item 2. If the sum of the digits of an integer is divisible by three, then it cannot be a prime number. This statement is □ correct □ incorrect
  Empirical frequency of solutions: 0.31
  Rasch location (fit residual): 0.537 (0.855)

Item 3. A table-tennis bat and a ball cost €10.20 in total. The bat costs €10 more than the ball. How much does the ball cost?
  Empirical frequency of solutions: 0.43
  Rasch location (fit residual): 0.032 (−1.665)

Item 4. A sequence of 6 squares made of matches consists of 19 matches (see the figure). How many matches does a sequence of 30 squares consist of?
  Empirical frequency of solutions: 0.68
  Rasch location (fit residual): −1.152 (−0.227)
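The note that a low Rasch location indicates an easy item follows from the form of the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between person ability and item location. The sketch below uses the standard textbook formula; it does not reproduce how RUMM2030 estimates the parameters, and the ability and location values are hypothetical.

```python
import math


def rasch_probability(ability, location):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - location)))


# Hypothetical values for illustration: one person, an easier and a harder item.
ability = 0.0
p_easy = rasch_probability(ability, location=-1.0)
p_hard = rasch_probability(ability, location=1.0)
```

When ability equals the item location the probability is exactly 0.5; items located below a person’s ability are solved with a probability above 0.5, and items located above it with a probability below 0.5.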


Original Article

Scientific Reasoning in Higher Education
Constructing and Evaluating the Criterion-Related Validity of an Assessment of Preservice Science Teachers’ Competencies

Stefan Hartmann,1 Annette Upmeier zu Belzen,1 Dirk Krüger,2 and Hans Anand Pant3

1 Biology Education, HU Berlin, Germany
2 Biology Education, FU Berlin, Germany
3 Institute for Quality Development in Education, HU Berlin, Germany

Abstract. The aim of this study was to develop a standardized test designed to measure preservice science teachers’ scientific reasoning skills, and to initially evaluate its psychometric properties. We constructed 123 multiple-choice items, using 259 students’ conceptions to generate highly attractive multiple-choice response options. In an item response theory-based validation study (N = 2,247), we applied multiple regression analyses to test hypotheses based on groups with known attributes. As predicted, graduate students performed better than undergraduate students, and students who studied two natural science disciplines performed better than students who studied only one natural science discipline. In contrast to our initial hypothesis, preservice science teachers performed less well than a control group of natural sciences students. Remarkably, an interaction effect of the degree program (bachelor vs. master) and the qualification (natural sciences student vs. preservice teacher) was found, suggesting that preservice science teachers’ learning opportunities to explicitly discuss and reflect on the inquiry process have a positive effect on the development of their scientific reasoning skills. We conclude that the evidence provides support for the criterion-based validity of our interpretation of the test scores as measures of scientific reasoning competencies.

Keywords: scientific reasoning skills, preservice teachers, assessment

Methods of scientific inquiry, such as experimenting and modeling, play a key role in scientific practice as well as in science classrooms (American Association for the Advancement of Science, 1993; Bybee, 2002; National Research Council, 2012; Popper, 2003). To become competent scientists, science students need to develop a deeper understanding about “the characteristics of the scientific enterprise and processes through which scientific knowledge is acquired” (Schwartz, Lederman, & Crawford, 2004, p. 611). How do theories work? What is the purpose of a scientific model? What is a good research question? How are hypotheses tested? What inferences can be drawn from empirical data? Not only scientists should be able to answer these questions: To teach subjects like biology, chemistry, and physics effectively in school classrooms, science teachers too need to develop a conceptual understanding of the scientific method, and to acquire competencies in this area (Liu, 2010; Schwartz et al., 2004).

In many teacher education programs, preservice science teachers therefore have to attend several academic lectures and practical classes that are held for natural sciences students. As an example, at German universities preservice science teachers’ curricula partly overlap with the curricula of natural sciences students during the undergraduate phase of academic training (e.g., Der Präsident der Humboldt-Universität zu Berlin, 2007a, 2007b, 2007c). In addition to science lectures, future biology, chemistry, and physics teachers attend seminars on teaching and instruction. In these seminars, scientific inquiry methods are taught as instruction techniques that can be used in science classrooms (Liu, 2010).

The importance of inquiry methods and scientific reasoning skills in academic programs for science students and preservice science teachers leads to the following questions: How competent are science students and preservice science teachers in the field of scientific reasoning? How do their competencies develop during the phase of academic training? What differences can be found between students of different academic education programs? To answer these questions, we analyze the development of science students’ and preservice science teachers’ competencies in the area of scientific reasoning. This article is about the first two stages of the project, the construction of a new test of scientific reasoning competencies and the evaluation of its criterion-related validity. We give a brief overview of the theoretical concept of scientific reasoning, describe the construction process of a standardized test, and present and discuss the findings of a study in which we investigated three aspects of criterion-related validity of the test. In a third stage of the project, the test will be used to assess the development of students’ competencies longitudinally from the beginning to the end of academic training.

Zeitschrift für Psychologie 2015; Vol. 223(1):47–53 DOI: 10.1027/2151-2604/a000199

Scientific Reasoning Competencies

Domain-specific cognitive skills or “competencies” (see Blömeke, Gustafsson, & Shavelson, 2015) are often characterized as “dispositions that are acquired and needed to successfully cope with certain situations or tasks” (Koeppen, Hartig, Klieme, & Leutner, 2008, p. 62). Within the domain of the natural sciences, the competencies that are needed to acquire scientific knowledge have been referred to as scientific thinking (Kuhn, Amsel, & O’Loughlin, 1988), scientific reasoning (Giere, Bickle, & Mauldin, 2006; Klahr, 2000; Koslowski, 1996), and scientific inquiry skills (Burns, Okey, & Wise, 1985; Liu, 2010). Following the hypothetico-deductive approach (Brody, de la Peña, & Hodgson, 1993; Godfrey-Smith, 2003; Popper, 2003), common inquiry techniques of natural sciences are observing, comparing, experimenting, and modeling (Crawford & Cullin, 2005; Gott & Duggan, 1998; Grosslight, Unger, Jay, & Smith, 1991; Justi & Gilbert, 2003; Klahr, 2000; Zimmermann, 2007). “There is a general pattern to all scientific reasoning” (Giere et al., 2006, p. 6), which can be defined as a context-specific form of problem-solving.

Mayer (2007) proposes a structural model of scientific reasoning that describes this pattern as a set of four subskills: formulating research questions, generating hypotheses, planning investigations, and analyzing and interpreting data (see also Liu, 2010). This model refers to the method of experimenting, but can also be applied to observing, comparing, and modeling (Wellnitz, Hartmann, & Mayer, 2010). However, certain aspects of scientific modeling are not covered by this model. In the study presented in this paper, Mayer’s structural model is therefore extended by three subskills of a scientific reasoning framework that refers to the inquiry method of scientific modeling (Krell, Upmeier zu Belzen, & Krüger, 2014). These subskills are defining the purpose of models, testing models, and changing models. Combining the subskills of both frameworks, seven aspects of scientific reasoning have been identified, each of which marks one step of a general inquiry process. Being competent in the field of scientific reasoning is defined as a cognitive disposition that enables students to apply each of these seven steps to real-life scientific problems.

Prior Research

“The obvious way to learn about scientific reasoning is by learning to be a scientist [but] this is not . . . the best way” (Giere et al., 2006, p. 6). In fact, both practical work (laboratories) and theoretical lessons (lectures) proved to be effective ways to support the development of scientific reasoning competencies in higher education (Lawson et al., 2000). Duschl and Grandy (2013) demonstrated that the most effective way to improve students’ scientific reasoning skills is through lectures that “explicitly . . . involve building and refining questions, measurements, representations, models and explanations” (p. 2126). According to Hodson (2014), “there are major advantages in addressing scientific inquiry [and] making this understanding explicit to students” (p. 9). Empirical studies in higher education showed that academic programs with longer and broader phases of specialized scientific training result in higher scientific reasoning skills (Kunz, 2012). Students adjust their reasoning strategies to the specific conditions of different content areas, and their overall skills increase as they proceed from domain to domain (Glaser, Schauble, Raghavan, & Zeitz, 1992). These skills are highly generalizable across different scientific disciplines (Godfrey-Smith, 2003).

According to these findings, three aspects have a positive effect on the development of scientific reasoning competencies: the number of learning opportunities, the variety of content domains in which scientific reasoning is trained, and teaching methods that explicitly address the inquiry process and the Nature of Science (in Hodson’s (2014) terms, “learning about science” rather than “learning science”). As the number of learning opportunities increases with the time students spend at universities, we conclude that science students’ and preservice science teachers’ competencies in scientific reasoning undergo significant progress during the phase of academic training. As learners’ scientific reasoning skills increase as they proceed through different content domains, learning opportunities in different natural sciences should also lead to increased scientific reasoning competencies. Finally, preservice science teachers, who learn about science by attending seminars in which they explicitly discuss and reflect on the inquiry process and the nature of science, should perform better in a scientific reasoning test than natural sciences students, who do not receive this kind of training.

Hypotheses

The long-term goal of our project is to evaluate the development of the scientific reasoning competencies of preservice science teachers. After a standardized test was constructed (see Method section of this article), we conducted a cross-sectional study to initially evaluate the test’s psychometric properties. In this study, criterion-based validity was tested using the known-groups method (Cronbach & Meehl, 1955; Hattie & Cooksey, 1984; Rupp & Pant, 2006). Following this approach, we formulated three hypotheses that predict differences in test outcomes between groups with known attributes:

Hypothesis (1): Graduate students perform better than undergraduate students, reflecting the positive effect of learning opportunities in academic training.

Hypothesis (2): Students who study two natural sciences perform better than students whose learning opportunities focus on one natural science, reflecting the positive effect of learning opportunities across different content domains.

Hypothesis (3): Preservice science teachers perform better than a control group of natural sciences students, reflecting the positive effect of learning opportunities to explicitly discuss and reflect on the inquiry process.

Method

To assess students’ scientific reasoning competencies, we developed a standardized paper-and-pencil test. Three item developers were provided with conceptual guidelines in the form of a test construction manual. The manual contained a detailed competence description for each of the seven subskills, alongside sample items. Following our framework of scientific reasoning, we constructed test items for each subskill (formulating questions, generating hypotheses, planning investigations, interpreting data, judging the purpose of models, testing models, and changing models). About one third of the items refer to real-life scientific problems from the field of biology, one third to chemistry, and one third to physics. To increase face validity as well as content validity, seven subject-matter experts and a psychometric consultant supervised the item construction process. Their supervision included critical discussion of the relevance and the representativeness of each item to the framework described in the test manual, as well as suggestions for improvement. All items underwent several revisions, and only items that eventually received no further comments, suggestions, or objections from the experts were approved for use in the test.

Test Construction

In a first stage, we constructed 183 open-ended test items. A sample of 259 preservice science teachers and natural sciences students processed these items to generate written answers that reflect their conceptions of scientific reasoning. The test developers and the seven subject-matter experts categorized the written responses into scientifically adequate and non-adequate conceptions. The students’ answers were then used to formulate multiple-choice options. Answers that reflected the cognitive processes described in the construction manual were used as correct options, whereas alternative conceptions were used as distractors. The transfer of students’ conceptions into correct and incorrect answering options is expected to improve content validity, and in a test constructed this way, not only the correct responses but also the “distractors match common student ideas” (Sadler, 1998, p. 265). For 166 items, one correct and three incorrect options could be generated from the students’ answers. In the case of the remaining items, the answers given by the students were not sufficient to formulate the required number of multiple-choice options. These items were not used for further testing. To ensure that the correct multiple-choice options adequately represent the competencies described in the manual, all items underwent a final revision by the experts, thus enhancing content validity as well as face validity. The psychometric properties of the 166 items were initially tested in a pilot study (N = 578). Based on item difficulty, discrimination parameters, and item characteristic curves, 123 items were selected for the final test.

Booklet Design and Sample

The 123 items were distributed across 41 item blocks. An unbalanced incomplete matrix design (Gonzalez & Rutkowski, 2010) was used to assign the blocks to 20 test booklets. Each booklet contained 6 blocks (18 items), two blocks for each content domain (biology, chemistry, physics). All booklets contained items for each of the seven subskills. The booklets were used in a cross-sectional study with 2,247 participants (55% female) at universities in Germany and Austria.¹ The sample consisted of 1,096 preservice science teachers in academic training and 1,151 natural sciences students. Two hundred thirteen students studied two natural sciences (e.g., biology and physics). Students were on average 22.45 years old (SD = 4.28). Three hundred fifty-five students were graduates. Differential Item Functioning (DIF) analyses (Adams, Wilson, & Wu, 1997) were conducted to ensure that the items measure the same underlying trait for each group of participants. For the vast majority of items, only negligible DIF was found.
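The arithmetic behind the booklet design reported in the Booklet Design and Sample section can be checked with a short sketch. The counts (123 items, 41 blocks, 20 booklets, 6 blocks per booklet) come from the text; the assumption that every block holds the same number of items is only inferred from 123 / 41 = 3 and 18 / 6 = 3, and all variable names are illustrative.

```python
from fractions import Fraction

# Counts reported in the text.
N_ITEMS = 123
N_BLOCKS = 41
N_BOOKLETS = 20
BLOCKS_PER_BOOKLET = 6

# Assumption: all blocks have equal size (123 / 41 = 3 items each).
items_per_block = N_ITEMS // N_BLOCKS
items_per_booklet = BLOCKS_PER_BOOKLET * items_per_block

# Total number of block positions available across all booklets.
block_slots = N_BOOKLETS * BLOCKS_PER_BOOKLET

# Average number of booklets in which a block appears.
mean_appearances = Fraction(block_slots, N_BLOCKS)

# A fully balanced design would need every block to appear equally often.
is_balanced_possible = block_slots % N_BLOCKS == 0
```

Because the 120 block positions cannot be split evenly across 41 blocks, some blocks must appear in more booklets than others, which is consistent with the design being described as unbalanced.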

¹ Students of the following universities participated in the study: FU Berlin, HU Berlin, RWTH Aachen, TU Berlin, University of Bremen, University of Duisburg-Essen, University of Cologne, University of Potsdam (all Germany), University of Innsbruck, University of Salzburg, and University of Vienna (all Austria).

© 2015 Hogrefe Publishing

Zeitschrift für Psychologie 2015; Vol. 223(1):47–53

Results

Item Parameters and Test Reliability

ACER ConQuest 3.0 (Adams, Wu, & Wilson, 2012) was used for data analysis. Plausible values (Wu, 2005) were drawn to estimate the students’ abilities. On average, the students responded to 17.42 test items (SD = 1.88). The range of person abilities was covered well by items with matching difficulties. Item parameter estimates range from −2.81 to 1.64. We found acceptable infit MNSQs, ranging from 0.93 to 1.09. The EAP/PV reliability estimate of the test was 0.544.
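For readers unfamiliar with the infit statistic, the following is a minimal textbook sketch of the information-weighted mean square for one dichotomous item under a Rasch model. It assumes person abilities and the item difficulty are known; it is not the estimation routine used by ConQuest, and the function names are hypothetical.

```python
import math


def rasch_probability(theta, delta):
    """Probability of a correct response for ability theta and difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def infit_mnsq(responses, thetas, delta):
    """Information-weighted mean square (infit) for a single item.

    responses: 0/1 scores of the persons who answered the item
    thetas:    ability values of the same persons (assumed known here)
    delta:     difficulty of the item
    """
    squared_residuals = 0.0
    total_variance = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, delta)
        squared_residuals += (x - p) ** 2      # observed minus expected, squared
        total_variance += p * (1.0 - p)        # model variance of the response
    return squared_residuals / total_variance
```

Values near 1 indicate that the observed residual variation matches what the model expects; values below 1 point to overfit and values above 1 to underfit.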

Differences Between Known Groups

To test our hypotheses, a latent regression model was applied (Adams et al., 1997). Latent regression allows group differences in the means of the ability parameters to be examined directly in the IRT model. Three independent variables were included as predictors of the latent variable. Degree program (0 = undergraduate program; 1 = graduate program) reflects the number of learning opportunities: Graduate students have had more learning opportunities than undergraduate students. Discipline (0 = one natural science discipline; 1 = two natural science disciplines) reflects the comprehensive scope of the learning contents: Students who study two natural science disciplines have to adapt their reasoning skills to a broader variety of content domains. Qualification (0 = science student; 1 = preservice science teacher) was added to test for the effect of seminars and lectures in which the inquiry process and the nature of science are extensively discussed. These seminars are mandatory for preservice science teachers in academic training at German universities. Natural sciences students were included as a control group because their academic training is similar to the training of science teachers, but they do not attend such seminars. Analyses indicated no multicollinearity among the predictors. Results of the regression analysis are shown in Table 1.

Our first hypothesis predicted that graduate students perform better than undergraduate students. The regression effect of the variable degree program reflects the achievement of graduate students in comparison to the achievement of undergraduate students. We found a significant effect for the variable, indicating higher achievement scores for the group of graduate students (Table 1).

Our second hypothesis stated that students who study two natural science disciplines perform better than students who study one natural science. We found a significant effect of the variable discipline, indicating higher achievement for students who study two natural sciences (Table 1).

Our third hypothesis predicted that preservice science teachers perform better than science students. The regression effect of the variable qualification reflects the difference between preservice science teachers’ and science students’ abilities. We found a very small but significant negative effect of the variable, meaning that preservice science teachers performed slightly lower than science students (Table 1). Figure 1 shows the effect broken down by the phase of academic training. In the undergraduate phase, the mean achievement scores of preservice science teachers were lower than the mean achievement scores of science students, whereas the opposite effect was found in the graduate phase. If the interaction between degree program and qualification is included in the regression model, it shows a significant effect (Table 2).

Table 1. Latent regression of qualification, discipline, and degree program on the achievement scale (predictor variables, unstandardized regression coefficients, standard errors)

Predictor variable                                    B           SE (B)
Qualification (1 = preservice science teacher)        −0.107**    0.021
Discipline (1 = two natural sciences disciplines)      0.116*     0.035
Degree program (1 = graduate)                          0.283**    0.028

Note. *p < .01. **p < .001.

Figure 1. Mean achievement scores of natural sciences students and preservice science teachers, broken down by the phase of academic training (undergraduate vs. graduate). Error bars indicate ±2 SE.
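The latent regression described in this section can be written compactly. The article does not print the model, so the following is only a sketch under the assumption of a dichotomous Rasch measurement model with normally distributed residuals; the symbols (θ for person ability, δ for item difficulty, β for the regression coefficients) are introduced here for illustration.

```latex
\begin{align}
P(X_{ij} = 1 \mid \theta_i) &= \frac{\exp(\theta_i - \delta_j)}{1 + \exp(\theta_i - \delta_j)}, \\
\theta_i &= \beta_0
  + \beta_1\,\mathrm{Qualification}_i
  + \beta_2\,\mathrm{Discipline}_i
  + \beta_3\,\mathrm{Degree}_i
  + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma^2).
\end{align}
```

Under this reading, each coefficient reported as B is the difference in mean ability (in logits) between the group coded 1 and the group coded 0, holding the other two predictors constant.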

Table 2. Latent regression of qualification, discipline, degree program, and the interaction of qualification and degree program on the achievement scale (predictor variables, unstandardized regression coefficients, standard errors)

Predictor variable                                                          B           SE (B)
Qualification (1 = preservice science teacher)                              −0.141**    0.023
Discipline (1 = two natural science disciplines)                             0.113*     0.036
Degree program (1 = graduate)                                                0.151*     0.045
Qualification × Degree program (1 = graduate preservice teacher)             0.218**    0.057

Note. *p < .01. **p < .001.
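To see how the interaction model reproduces the pattern in Figure 1, the coefficients of Table 2 can be combined into predicted group differences. This is an illustrative sketch: the magnitudes are taken from Table 2, the negative sign of the qualification coefficient is an assumption based on the text (preservice teachers score lower in the undergraduate phase), and `predicted_offset` is a hypothetical helper.

```python
# Coefficients from Table 2 (logit scale).
# Assumption: Qualification is negative, as described in the text.
B_QUALIFICATION = -0.141
B_DISCIPLINE = 0.113
B_DEGREE = 0.151
B_INTERACTION = 0.218


def predicted_offset(teacher, two_disciplines, graduate):
    """Predicted ability relative to the reference group.

    The reference group is undergraduate science students with one discipline
    (all dummy variables equal to 0).
    """
    return (B_QUALIFICATION * teacher
            + B_DISCIPLINE * two_disciplines
            + B_DEGREE * graduate
            + B_INTERACTION * teacher * graduate)


# Teacher minus science student, within each phase of academic training.
gap_undergraduate = predicted_offset(1, 0, 0) - predicted_offset(0, 0, 0)
gap_graduate = predicted_offset(1, 0, 1) - predicted_offset(0, 0, 1)
```

Under these assumptions the gap is about −0.141 logits in the undergraduate phase and about +0.077 logits in the graduate phase, which matches the reversal described for Figure 1.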


S. Hartmann et al.: Scientific Reasoning in Higher Education

Discussion

Overall, the findings were in accordance with the hypothesized group differences, thus providing evidence from which initial validity statements can be drawn. In our first hypothesis, we predicted that graduate students perform better than undergraduate students. We found a significant regression effect of the variable degree program, supporting the hypothesis. This finding is in accordance with the increase of scientific reasoning skills during academic training that has been described in other studies (Glaser et al., 1992; Kunz, 2012; Lawson et al., 2000). As the group means differ in the expected directions, we conclude that this evidence provides support for the validity of our interpretation of the test scores as measures of scientific reasoning. However, one can think of alternative explanations for this result. Scientific reasoning is not the only skill that develops during the phase of academic training. As the test items refer to problems and phenomena of the natural sciences, the group differences could indicate an increase of a more general skill from this area, such as content knowledge. On the other hand, we believe that this risk was minimized by the extensive involvement of subject-matter experts in the item construction process. Another alternative explanation lies within the tested sample: Graduates can be seen as a selection of students who have been high-achieving during the phase of undergraduate education, therefore biasing the results of a cross-sectional comparison of both groups. To test for competence development, longitudinal designs are needed. In the next stage of our project, we will conduct a longitudinal study to test preservice science teachers’ competencies four times during the phase of academic training.

The second hypothesis stated that students who study two natural sciences perform better than students who only study one natural science. Results of the regression analysis showed a positive regression coefficient for the variable discipline, supporting the hypothesis. This finding is in accordance with prior studies which found that a broader variety of learning opportunities leads to an increase in scientific reasoning skills (Glaser et al., 1992; Kunz, 2012; Lawson et al., 2000). Again, the group differences were in the expected direction, therefore supporting our interpretation of the test scores as measures of scientific reasoning.

The third hypothesis accounted for the learning opportunities in academic programs for preservice science teachers. In seminars on teaching and instruction, preservice teachers learn about science (Hodson, 2014) by discussing and reflecting on the inquiry process and the nature of science. That way, competencies in the field of scientific reasoning are explicitly trained. If our test measures such competencies, students who have had these specific learning opportunities should perform better in the test than students who have not had such opportunities (Duschl & Grandy, 2013; Giere et al., 2006). The results of the initial regression analysis did not support this hypothesis: Overall, a negative effect was found, indicating that preservice science teachers perform slightly lower than natural sciences students. Broken down by degree (undergraduate vs. graduate), the preservice teachers performed lower than the science students only in the undergraduate phase, but higher in the graduate phase of academic training. When the interaction of the variables qualification and degree program is added to the regression model, it shows a significant effect. A possible explanation lies within the graduate curricula of teacher education programs: Most of the seminars in which scientific reasoning and the nature of science are taught explicitly are held during the graduate phase of academic training (Der Präsident der Humboldt-Universität zu Berlin, 2007a). Therefore, the impact of these learning opportunities only shows for graduate students, but not for undergraduates. This finding is in accordance with our hypothesis. Again, this interpretation is hypothetical and needs longitudinal assessment to be tested. The next stage of this study will be used to further investigate this result.

The results also provide evidence to rate the psychometric properties of the final test instrument. The range of person abilities was covered well by items with matching parameter estimates, indicating that the test was neither too easy nor too difficult for the tested sample. All item infit MNSQs were close to the expected value of 1.00 (Bond & Fox, 2007). The item with the highest overfit had an infit MNSQ of 0.93; the infit of the item with the highest underfit was 1.09. Overall, these results indicate a good model-data fit, with only minor redundancies in the responses. In terms of item fit, we conclude that all 123 items are suitable for measurement. The EAP/PV reliability of the test was 0.544. Even though this value is lower than for most psychological tests, it matches well with the reliabilities of standardized tests for scientific reasoning competencies (Mannel, 2011: 0.23–0.66; Neumann, 2011: 0.55; Terzer, 2013: 0.46; Wellnitz, 2012: 0.59). Yet, the EAP/PV reliability “seems to be of limited use as an index of measurement instrument quality” (Adams, 2005, p. 170), and low reliabilities “can be compensated for by larger samples” (Adams, 2006, p. 25). In studies that focus on the estimation of population parameters instead of the abilities of individual students, it is therefore “not uncommon . . . to implement tests that have low reliability” (Adams, 2005, p. 170).
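The idea that low reliability can be offset by a larger sample can be illustrated with a simple classical test theory approximation. This is not the derivation given by Adams (2005, 2006) and it ignores the plausible-value methodology used in the study; it only assumes that observed score variance equals true score variance divided by reliability. The function names are hypothetical.

```python
import math


def standard_error_of_mean(true_variance, reliability, n):
    """Approximate standard error of a group mean estimated from n test takers.

    Assumption: observed variance = true variance / reliability.
    """
    observed_variance = true_variance / reliability
    return math.sqrt(observed_variance / n)


def effective_sample_size(reliability, n):
    """Number of perfectly measured persons giving the same precision as n."""
    return reliability * n
```

Under this approximation, the 2,247 participants measured with a reliability of 0.544 would yield group means about as precise as roughly 1,222 persons measured without error, which shows why population-level comparisons remain feasible even though individual scores are imprecise.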

Conclusion

The aim of this study was to construct a standardized test designed to examine preservice science teachers’ competencies in the field of scientific reasoning, and to initially evaluate its psychometric properties. We conducted a cross-sectional validation study (N = 2,247) in which we tested three hypotheses that predict differences between known groups. Three lines of evidence supported the validity of our interpretation of the test results as measures of scientific reasoning competencies: First, graduate students performed better than undergraduate students, reflecting the positive effect of the increasing number of learning opportunities during the time students spend at university. Second, students who study two natural sciences performed better than students who study one natural science, reflecting the positive effect of learning opportunities that students receive as they proceed across different content domains within the natural sciences. The third hypothesis predicted that preservice science teachers perform better than a control group of natural sciences students. The hypothesized effect was found only for graduate students, not for undergraduates. It reflects the positive impact of lectures and seminars in which preservice teachers explicitly discuss and reflect on the inquiry process. As the vast majority of these seminars are held during the graduate phase of academic training, the result supports our hypothesis.

In addition to the empirical findings, the extensive involvement of subject-matter experts in the item construction process and the use of students’ conceptions to construct multiple-choice answering options further improve the validity by increasing both the relevance and the representativeness of the items. The analyses presented in this paper examined validity rather than reliability. Even though evidence for validity provides indirect evidence for reliability as well, the low EAP/PV reliability found in this study is subject to ongoing discussion, and improvements to the booklet design are currently under way. In the upcoming longitudinal phase of our project, we will also implement additional tools to evaluate different aspects of validity, for example, a thinking-aloud validation study and a multi-trait-multi-method (MTMM) study, including general cognitive ability, knowledge about the Nature of Science, and an alternative scientific reasoning test as covariates.

Overall, the findings led us to the conclusion that the test can be used for the upcoming stage of our project, where scientific reasoning competencies will be tested in a longitudinal design. This longitudinal study will provide insights into the competence development from the beginning to the end of academic training, and help us evaluate and improve the contents and structure of science teacher education.

References

Adams, R. J. (2005). Reliability as a measurement design effect. Studies in Educational Evaluation, 31, 162–172. Adams, R. J. (2006, April). Reliability and item response modelling. Myths, observations and applications. Paper presented at the 13th International Objective Measurement Workshop, Berkeley, CA, USA. Adams, R. J., Wilson, M. R., & Wu, M. L. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22, 46–75. Adams, R. J., Wu, M. L., & Wilson, M. R. (2012). ConQuest 3.0 [computer software]. Camberwell, Australia: Australian Council for Educational Research. American Association for the Advancement of Science. (1993). Benchmarks for science literacy: Project 2061. New York, NY: Oxford University Press. Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: Competence viewed as a continuum. Zeitschrift für Psychologie, 223. doi: 10.1027/2151-2604/a000194


Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. London, UK: Erlbaum. Brody, T. A., de la Peña, L., & Hodgson, P. E. (1993). The philosophy behind physics. Berlin, Germany: Springer. Burns, J. C., Okey, J. R., & Wise, K. C. (1985). Development of an integrated process skill test: TIPS II. Journal of Research in Science Teaching, 22, 169–177. Bybee, R. W. (2002). Teaching science as inquiry. In J. Minstrell & E. H. van Zee (Eds.), Inquiring into inquiry learning and teaching in science (pp. 20–46). Washington, DC: American Association for the Advancement of Science. Crawford, B., & Cullin, M. (2005). Dynamic assessments of preservice teachers’ knowledge of models and modelling. In K. Boersma, M. Goedhart, O. de Jong, & H. Eijkelhof (Eds.), Research and the quality of science education (pp. 309–323). Dordrecht, The Netherlands: Springer. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. Der Präsident der Humboldt-Universität zu Berlin. (Ed.). (2007a). Fachübergreifende Studienordnung für das Masterstudium für das Lehramt [Interdisciplinary study regulations for the Master of Education program]. Berlin, Germany: Humboldt University. Retrieved from http://www.amb. hu-berlin.de/2007/99/9920070 Der Präsident der Humboldt-Universität zu Berlin (Ed.). (2007b). Studien- und Prüfungsordnung für das Bachelorstudium Biologie: Kernfach Biologie und Beifach Chemie im Monostudiengang. Beifach Biologie im Monostudiengang [Study regulations for the Bachelor program in biology: Major in biology and minor in chemistry. Minor in biology]. Berlin, Germany: Humboldt University. Retrieved from http://www.amb.hu-berlin.de/2007/62/6220070 Der Präsident der Humboldt-Universität zu Berlin (Ed.). (2007c). 
Studien- und Prüfungsordnung für das Bachelorstudium Biologie: Kernfach und Zweitfach im Kombinationsstudiengang mit Lehramtsoption [Study regulations for the Bachelor program in biology: Major and minor in a dual program for preservice teachers]. Berlin, Germany: Humboldt University. Retrieved from http://www.amb. hu-berlin.de/2007/68/6820070 Duschl, R. A., & Grandy, R. (2013). Two views about explicitly teaching the nature of science. Science and Education, 22, 2109–2139. Giere, R. N., Bickle, J., & Mauldin, R. F. (2006). Understanding scientific reasoning. Independence, KY: Wadsworth/ Cengage Learning. Glaser, R., Schauble, L., Raghavan, K., & Zeitz, C. (1992). Scientific reasoning across different domains. In E. Corte, M. Linn, H. Mandl, & L. Verschaffel (Eds.), Computer-based learning environments and problem solving (pp. 345–371). Heidelberg, Germany: Springer. Godfrey-Smith, P. (2003). Theory and reality: An introduction to the philosophy of science. Chicago, IL: University of Chicago Press. Gonzalez, E., & Rutkowski, L. (2010). Practical approaches for choosing multiple-matrix sample designs. IEA-ETS Research Institute Monograph, 3, 125–156. Gott, R., & Duggan, S. (1998). Investigative work in the science curriculum. Buckingham, UK: Open University Press. Grosslight, L., Unger, C., Jay, E., & Smith, C. (1991). Understanding models and their use in science: Conceptions of middle and high school students and experts. Journal of Research in Science Teaching, 28, 799–822. Hattie, J. A., & Cooksey, R. W. (1984). Procedures for assessing the validity of tests using the ‘‘known groups’’ method. Applied Psychological Measurement, 8, 295–305.




Hodson, D. (2014). Learning science, learning about science, doing science: Different goals demand different learning methods. International Journal of Science Education, 36, 2534–2553. doi: 10.1080/09500693.2014.899722 Justi, R. S., & Gilbert, J. K. (2003). Teachers’ views on the nature of models. International Journal of Science Education, 25, 1369–1386. Klahr, D. (2000). Exploring science. The cognition and development of discovery processes. Cambridge, MA: MIT Press. Koeppen, K., Hartig, J., Klieme, E., & Leutner, D. (2008). Current issues in competence modeling and assessment. Zeitschrift für Psychologie, 216, 61–73. Koslowski, B. (1996). Theory and evidence. The development of scientific reasoning. Cambridge, MA: MIT Press. Krell, M., Upmeier zu Belzen, A., & Krüger, D. (2014). Students’ levels of understanding models and modelling in biology: Global or aspect-dependent? Research in Science Education, 44, 109–132. Kuhn, D., Amsel, E., & O’Loughlin, M. (1988). The development of scientific thinking skills. San Diego, CA: Academic Press. Kunz, H. (2012). Professionswissen von Lehrkräften der Naturwissenschaften im Kompetenzbereich Erkenntnisgewinnung [Science teachers’ professional knowledge in scientific inquiry]. (Doctoral dissertation, University of Kassel, Germany). Retrieved from https://kobra.bibliothek.unikassel.de/bitstream/urn:nbn:de:hebis:34-2012012040403/9/ DissertationHagenKunz.pdf Lawson, A. E., Clark, B., Cramer-Meldrum, E., Falconer, K. A., Sequist, J. M., & Kwon, Y.-J. (2000). Development of scientific reasoning in college biology: Do two levels of general hypothesis-testing skills exist? Journal of Research in Science Teaching, 37, 81–101. Liu, X. (2010). Using and developing measurement instruments in science education. A Rasch modeling approach. Charlotte, NC: Information Age Publishing. Mannel, S. (2011). Assessing scientific inquiry. Development and evaluation of a test for the low-performing stage. Berlin, Germany: Logos. Mayer, J. (2007). 
Erkenntnisgewinnung als wissenschaftliches Problemlösen [Inquiry as scientific problem solving]. In D. Krüger & H. Vogt (Eds.), Theorien in der biologiedidaktischen Forschung. Ein Handbuch für Lehramtsstudenten und Doktoranden (pp. 177–186). Berlin, Germany: Springer. National Research Council. (2012). Education for life and work: Developing transferable knowledge and skills in the 21st century. Washington, DC: National Academy Press. Neumann, I. (2011). Beyond physics content knowledge. Modeling competence regarding nature of science inquiry and nature of scientific knowledge. Berlin, Germany: Logos. Popper, K. R. (2003). The logic of scientific discovery. London, UK: Routledge.



Rupp, A. A., & Pant, H. A. (2006). Validity theory. In N. J. Salkind (Ed.), Encyclopedia of measurement and statistics (pp. 1032–1035). Thousand Oaks, CA: Sage. Sadler, P. M. (1998). Psychometric models of student conceptions in science: Reconciling qualitative studies and distractor-driven assessment instruments. Journal of Research in Science Teaching, 35, 265–296. Schwartz, R. S., Lederman, N. G., & Crawford, B. A. (2004). Developing views of nature of science in an authentic context: An explicit approach to bridging the gap between nature of science and scientific inquiry. Science Education, 88, 610–645. Terzer, E. (2013). Modellkompetenz im Kontext Biologieunterricht – Empirische Beschreibung von Modellkompetenz mithilfe von Multiple-Choice Items [Model competence in biology education – empirical description of model competence using multiple-choice items] (Doctoral dissertation, HU Berlin, Germany). Retrieved from http://edoc.hu-berlin.de/dissertationen/terzer-eva-2012-12-19/PDF/terzer.pdf Wellnitz, N. (2012). Kompetenzstruktur und -niveaus von Methoden der naturwissenschaftlichen Erkenntnisgewinnung [Competence structure and levels of methods of scientific inquiry]. Berlin, Germany: Logos. Wellnitz, N., Hartmann, S., & Mayer, J. (2010). Developing a paper-and-pencil-test to assess students’ skills in scientific inquiry. In G. Çakmaki & F. Taşar (Eds.), Contemporary science education research: Learning and assessment (pp. 289–294). Ankara, Turkey: ESERA. Wu, M. (2005). The role of plausible values in large-scale surveys. Studies in Educational Evaluation, 31, 114–128. Zimmermann, C. (2007). The development of scientific thinking skills in elementary and middle school. Developmental Review, 27, 172–223.

Stefan Hartmann HU Berlin Biology Education Invalidenstrasse 42 10115 Berlin Germany Tel. +49 30 2093-98306 Fax +49 30 2093-8311 E-mail stefan.hartmann@hu-berlin.de



Original Article

Assessing Professional Vision in Teacher Candidates
Approaches to Validating the Observer Extended Research Tool

Kathleen Stürmer and Tina Seidel
School of Education, TUM, Munich, Germany

Abstract. In this study, we present an approach to validating the video-based Observer Extended Research Tool, which empirically captures prospective teachers’ professional vision in a standardized yet contextualized way. We extended the original Observer tool with the aim of providing a reliable, efficient measure for subpopulations in different consecutive phases of teacher education (university and induction phase). To this end, we expanded the measure to include a broader spectrum of knowledge about effective teaching by drawing on a cognitive process-oriented teaching and learning model, while at the same time reducing the number of test items to ensure an economically manageable assessment tool. In the validation study, we tested the extent to which the extension meets the criteria of context validity, reliability, and sensitiveness for different subpopulations. The participants were 317 preservice teachers and teacher candidates who worked with the Observer Extended Research Tool. Measurement quality was investigated using methods of item response theory. Our results confirm that the Observer Extended Research Tool provides a reliable measure of description, explanation, and prediction as aspects of professional vision within and across different subpopulations in teacher education.

Keywords: teacher education, professional vision, competence assessment, video-based

Teacher education faces the challenge of assessing the effectiveness of its programs and choosing indicators and instruments that provide valid and reliable measures of educational outcomes (Darling-Hammond, 2010; Seidel, 2012). With the aim of supporting students’ learning processes proximal to effective classroom teaching (Brouwer, 2010; Darling-Hammond & Bransford, 2005; Grossman et al., 2009), the measure of knowledge acquisition representing the integration of theory and practice is seen as a crucial part of teacher education (Borko, 2004; Cochran-Smith & Zeichner, 2005; Putnam & Borko, 2000; Seidel, Blomberg, & Renkl, 2013). Recent research has made progress in modeling teacher knowledge related to effective teaching practice by drawing on Shulman’s (1987) conceptualization. It has also made progress in providing empirical evidence of the structure in content knowledge, pedagogical content knowledge, and generic pedagogical knowledge by using standardized knowledge tests (Baumert et al., 2010; Blömeke et al., 2009; Döhrmann, Kaiser, & Blömeke, 2012; Hill, Rowan, & Ball, 2005; Voss, Kunter, & Baumert, 2011). However, research is lacking when it comes to measuring aspects of teacher knowledge that refer to the contextualized and situated nature of real-world demands of the job (Blömeke, Gustafsson, & Shavelson, 2015; Borko, 2004).

Zeitschrift für Psychologie 2015; Vol. 223(1):54–63 DOI: 10.1027/2151-2604/a000200

In this regard, the concept of professional vision (Goodwin, 1994) as a situation-specific skill combining knowledge and practice (Blömeke et al., 2015) offers a promising approach. In the last few years, the concept has become an increasingly important element in describing the initial processes of integrated knowledge acquisition within university-based teacher education (Santagata & Guarino, 2010; Star & Strickland, 2008; Stürmer, Könings, & Seidel, 2013; Wiens, Hessberg, LoCasale-Crouch, & DeCoster, 2013). Three aspects of professional vision have been described in qualitative research: the description, explanation, and prediction of classroom situations. The video-based Observer tool (Seidel, Blomberg, & Stürmer, 2010a) is the first measurement tool to empirically capture these aspects with regard to knowledge about goal clarity, teacher support, and learning climate (important components of effective teaching) in a standardized yet contextualized way. This measure is an indicator of the current state of preservice teachers’ acquired knowledge in this domain. It can be used promptly for feedback on teaching and formative assessment in the context of university-based teacher education. However, to track prospective teachers’ professional development, economical measures are needed that capture the learning trajectories of subpopulations in different consecutive phases of education. In Germany, for example, initial teacher education is split into two phases: (1) acquiring conceptual knowledge about content, pedagogical content, and pedagogy in 3–5 years of education at university (university phase), and (2) a 2-year induction phase at selected schools, accompanied by seminars that are organized in the state where the school is located and devoted to teacher candidates’ practical learning (induction phase). Focusing on a comprehensive use in the two consecutive phases of theory-based and practice-based teacher education, we extended the original Observer tool by expanding the facets of knowledge on which the measurement of description, explanation, and prediction is based, drawing on the teaching and learning (TL) components of a cognitive process-oriented teaching and learning model. In this paper, we present findings of a validation study of the measurement with regard to its context representativeness, its reliability, and its sensitiveness for different subpopulations of teacher education programs.

© 2015 Hogrefe Publishing

K. Stürmer & T. Seidel: Professional Vision Assessment in Teacher Candidates

Professional Vision as Indicator of Integrated Knowledge Acquisition

Teachers’ professional vision is considered a prerequisite of effective teaching practice (Grossman et al., 2009; Sherin, 2001). It refers to the ability to notice and interpret features of classroom events relevant for student learning (van Es & Sherin, 2002; Sherin, 2007). It requires conceptual knowledge of effective teaching and learning (Borko, 2004) as well as the ability to apply this knowledge to the situation observed (Berliner, 1991; Sherin & van Es, 2009). Two interconnected processes can be distinguished: (1) noticing and (2) knowledge-based reasoning (van Es & Sherin, 2008).

Noticing

Noticing involves the identification of classroom situations that, from a professional perspective, are crucial to effective instructional practice (Seidel & Stürmer, 2014). When defining relevant situations, different knowledge can be applied (van Es & Sherin, 2008). In our research, we focus on knowledge about principles of teaching and learning (Grossman & McDonald, 2008) as an aspect of generic pedagogical knowledge (Shulman, 1987), which represents a basic component of teacher education (Hammerness, Darling-Hammond, & Shulman, 2002; Voss et al., 2011). Knowledge about teaching and learning, as an element of generic pedagogical knowledge, is based on research into teaching effectiveness. In the last decade, a substantial number of empirical studies have investigated the effects of teaching on student learning. Understanding teaching as a process of creating and fostering learning environments in which students are supported in activities that have a good chance of improving learning, Seidel and Shavelson (2007) made the common results of those studies explicit in their meta-analysis by integrating the variety of effective teaching variables into the five components of a cognitive process-oriented teaching and learning (TL) model (Bolhuis, 2003). These components are: goal setting, orientation, execution of learning activities, evaluation of learning processes, and teacher guidance and support (regulation).

All TL components show positive and differential effects on the cognitive and motivational-affective aspects of students’ learning (Fraser, Walberg, Welch, & Hattie, 1987; Hattie, 2009; Seidel & Shavelson, 2007). Goal setting, which refers to the teacher’s clarification of the short- and long-term goals of the lesson, has been shown, for example, to be an important condition for students’ experience of their competence, autonomy, and social relatedness (e.g., Kunter, Baumert, & Köller, 2007). The component orientation focuses on the transition from goals to the execution of learning activities. This includes transparency as to how the goals will be achieved (e.g., mentioning the learning activities that will take place) and how the lesson will be structured. Execution of learning activities includes the social, cognitive, and motivational stimulation of the learners. It is characterized by the teacher’s support of social interactions between learners and the teacher’s provision of opportunities for processing information. Regulation refers to the monitoring of students’ learning processes. It includes teachers’ feedback on learning outcomes, their support in choosing appropriate learning strategies, and their prompting of self-regulated learning situations. Finally, evaluation includes a retrospective look at students’ progress toward the learning goals, as well as at the learning processes that took place within the lesson.

Knowledge-Based Reasoning

Seidel and Stürmer (2014) reviewed the literature and identified three major aspects involved in professional reasoning: description, explanation, and prediction. Description is the ability to identify and differentiate relevant events without making any further judgments. Explanation refers to the ability to use what one knows to reason about a situation. This means linking classroom events to professional knowledge and classifying situations according to the components of teaching involved. Prediction refers to the ability to predict the consequences of observed events in terms of student learning. It draws on broader knowledge about teaching and student learning as well as its application to classroom practice.

The ability to take a reasoned approach to events noticed provides insights into the quality of teachers’ mental representations of knowledge, and the application of those representations in the classroom (Borko, 2004). Research into expertise shows certain differences between novice and experienced in-service teachers (Hammerness et al., 2002; Seidel & Prenzel, 2007). It provides evidence that professional vision is an ability that can be learned (Berliner et al., 1988). Novices are capable of describing classroom situations, but their ability to accurately explain and predict the consequences and outcomes of those situations lags behind that of experienced in-service teachers. It is assumed that the ability to explain and predict observed events requires more integrated and flexible knowledge structures (Berliner, 2001). However, for modeling prospective teachers’ learning processes and designing corresponding learning environments, instruments that empirically capture the structure of professional vision are required.

Figure 1. Assessment of professional vision in the Observer and the Observer Extended Research Tool.

Assessing Professional Vision: The Observer Tool

Taking into account the contextual nature of professional vision, measurements have to be devised that go beyond traditional knowledge tests (Blömeke et al., 2015; Seidel & Stürmer, 2014). In this respect, qualitative analyses, which use videos to prompt professional knowledge, are prominent (van Es & Sherin, 2008; Kersting, 2008). Noticing and reasoning abilities are assessed by open questions that are analyzed qualitatively. Changes in professional vision are then described according to the categories applied in these analyses. Findings generally reveal positive developments in professional development programs or university courses on teaching and learning, namely more precise descriptions and a more systematic use of professional knowledge about teaching and learning (Santagata & Angelici, 2010; Star & Strickland, 2008). However, qualitative approaches are limited when investigating larger samples of teachers. For example, when evaluating the progress of prospective teachers over time, standardized measures that are suitable for formative assessment are helpful. Although these measures might be less sensitive to the fine-tuned processes of noticing and reasoning, they would provide a valid and reliable indicator of the major objective of teacher education programs: applicable and integrated knowledge about teaching and learning.

The Observer tool (Seidel et al., 2010a; Seidel, Blomberg, & Stürmer, 2010b) is the first video-based tool that assesses preservice teachers’ professional vision in a standardized yet contextualized way. The assessment focuses on description, explanation, and prediction abilities with regard to knowledge about goal clarity, teacher support, and learning climate (see Figure 1). Since this was a first attempt to empirically capture the structure of professional vision (Seidel & Stürmer, 2014), these components were selected because they represented a balanced knowledge base, integrating the TL components of the cognitive process-oriented teaching and learning model. Goal clarity served as an indicator of the successful preparation for learning, which includes the aspects of goal setting and orientation. Teacher support served as a guiding process involved in the execution and regulation of learning activities, and learning climate served as an indicator of the motivational-affective classroom context.

In the Observer tool, test items combined video clips recorded from real classroom situations with standardized ratings (Seidel & Stürmer, 2014). Participants were shown six 2–4 minute clips from a pool of 12 selected clips that functioned as item prompts (Seidel & Stürmer, 2014). The clips showed instruction in different subjects (mathematics, physics, history, English as a foreign language) at the secondary level (8th and 9th grade). Each clip represented two components (i.e., goal clarity, teacher support, or learning climate). Results of different studies have confirmed that all clips were perceived as authentic and cognitively activating, and were equally regarded by participants as examples of the focused component (Seidel & Stürmer, 2014; Seidel et al., 2010b). The clips were embedded in ratings referring to goal clarity, teacher support, or learning climate. The participants’ ability to describe, explain, and predict relevant classroom interactions and outcomes was measured by six items per ability (a total of 18 items per component and 36 items per clip). Responses were chosen from a 4-point Likert scale (1 = “disagree” to 4 = “agree”).

Video clips and ratings were integrated into an online platform, in which participants were first shown a clip. After watching, they were asked to indicate which component the clip was representative of. In a further step, participants had the opportunity to watch the clip again prior to answering the ratings targeting their professional vision. In total, participants answered 112 test items (after scaling), and processing took 90 min on average.

Participants’ responses were compared with a quality expert norm. The norm was established based on independent ratings of the Observer test items by three researchers in the field of teaching and learning. The experts had 100–400 hrs of experience in observing classroom situations. The direct expert agreement (Cohen’s κ = .79) was used as a blueprint for students’ responses. In cases in which experts disagreed, a consensus validation was established. Student responses were compared to the expert norm and recoded into 0 = miss expert rating and 1 = hit expert rating (Seidel & Stürmer, 2014).

Analyses of the psychometric properties of the instrument based on Item Response Theory (IRT) have confirmed that the Observer provides a valid and reliable assessment of professional vision (Jahn, Stürmer, Seidel, & Prenzel, 2014; Seidel et al., 2010b). In two scaling studies with more than 1,000 preservice teachers from different universities who were educated for various school tracks (i.e., primary, secondary), three models that might describe the structure of professional vision (one-, two-, and three-dimensional) were applied and the scaling results were compared. The results showed that the three-dimensional model, which separated professional vision into the distinct abilities of description, explanation, and prediction, had the best fit indices. The Observer is an economically manageable tool (Jahn, Prenzel, Stürmer, & Seidel, 2011) that is sensitive enough to capture the development of professional vision within university-based teacher education (Stürmer et al., 2013). However, the suitability of the Observer as a tool for assessing professional vision within the theory-based (university) and practice-based (induction at schools) phases of teacher education has yet to be proven.

Furthermore, the assessment is based on knowledge about goal clarity, teacher support, and learning climate, which restricts the interpretations drawn from the measure to this context (Kane, 1994). Against the background requirement that competence measures should provide as representative a sampling of tasks from the real-world situation as possible (Shavelson, 2012), the question arises as to what extent the Observer captures professional vision in a reliable way when the knowledge foci are more differentiated and include components that summarize the state of the art of knowledge about effective teaching.
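The expert-norm scoring described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names and all ratings are hypothetical, agreement is computed as unweighted Cohen's κ for two raters only (the study used three experts), and a "hit" is assumed to mean an exact match with the expert rating on the 4-point scale.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters who rated the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which both raters chose the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def score_against_norm(responses, expert_norm):
    """Recode ratings: 1 = hit expert rating, 0 = miss expert rating."""
    # Assumption: only an exact match with the norm counts as a hit.
    return [int(r == e) for r, e in zip(responses, expert_norm)]


# Hypothetical ratings on the 4-point scale (1 = "disagree" ... 4 = "agree").
expert_1 = [4, 4, 1, 2, 3, 1, 4, 2]
expert_2 = [4, 3, 1, 2, 3, 1, 4, 1]
expert_norm = [4, 4, 1, 2, 3, 1, 4, 2]  # assumed result of consensus validation
participant = [4, 3, 1, 1, 3, 2, 4, 2]

kappa = cohens_kappa(expert_1, expert_2)
hits = score_against_norm(participant, expert_norm)
print(f"kappa = {kappa:.2f}, hits = {hits}")
```

The dichotomous hit/miss vector produced in the last step is the kind of data that a Rasch-type scaling model takes as input.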

Research Questions

With the Observer Extended Research Tool we aimed to provide a reliable and efficient outcome measure within different subpopulations of teacher education. We therefore expanded the variety of test items with regard to knowledge about the five TL components of the cognitive process-oriented teaching and learning model. In our study, three research questions are addressed to ensure the validity of the measure: (1) Does the extension ensure that the video clips used are discernible examples of the TL components of the cognitive process-oriented teaching and learning model? (2) Does the extension lead to a reliable measure of professional vision with its three aspects: description, explanation, and prediction? (3) Does the extension provide a comparable measure for preservice teachers and teacher candidates with regard to reliability and item difficulty?

Methods

The Extension of the Observer Tool

The extension included two main steps: Firstly, the cognitive process-oriented TL model was used as a theoretical framework to specify and expand the number of TL components. Secondly, to ensure efficient deployment within teacher education, the number of test items per knowledge facet was reduced (a duration of 90 min should not be exceeded). Because our test construction followed the requirements of IRT (OECD, 2005), the reliability and the person ability are not influenced by the number of test items or by the specific selection of test items within a target knowledge facet (Bond & Fox, 2001). However, this presupposes that the test items are representative of the facet of target knowledge (TL component) and that test quality (difficulty, discrimination, and item fit) is ensured.

Selection of Video Clips

Video clips were classified according to the TL components of goal setting, orientation, execution of learning activities, evaluation of learning processes, and regulation. The existing clips from the original Observer version were assigned to those components (see Figure 1). Clips on goal clarity in the original version were specified under either goal setting or orientation. Clips originally representing teacher support were differentiated under either regulation (e.g., events of teacher feedback) or execution of learning activities (e.g., teacher elicitation and/or prompts for student work). For the evaluation component, new clips were identified from the original video clip pool, since this component was not addressed in the original Observer version. In the extension, two clips represented each TL component. As with the previous version, one clip illustrates a good practice example of the component, and the other addresses a typical, and often more critical, example. The two examples were rotated to control for placement effects. With the extension to five TL components, the tool includes 10 clips.
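The clip design and the resulting item counts can be made explicit with a small Python sketch. It is an illustration only: the variable names are hypothetical, and it assumes that the structure of six ratings per ability and component reported for the original Observer tool carries over to the extended version. The total of 180 candidate ratings is derived here by multiplication and is not a figure reported in the study.

```python
from itertools import product

TL_COMPONENTS = [
    "goal setting",
    "orientation",
    "execution of learning activities",
    "regulation",
    "evaluation of learning processes",
]
EXAMPLE_TYPES = ["good practice", "typical/critical"]
ABILITIES = ["description", "explanation", "prediction"]
ITEMS_PER_ABILITY = 6  # per component, as described for the original tool


def items_per_clip(components_per_clip):
    """Number of ratings attached to one clip."""
    return components_per_clip * len(ABILITIES) * ITEMS_PER_ABILITY


# Two clips per TL component: one good-practice and one more critical example.
clips = [
    {"component": component, "example": example}
    for component, example in product(TL_COMPONENTS, EXAMPLE_TYPES)
]

original_items_per_clip = items_per_clip(2)  # each clip covered two components
extended_items_per_clip = items_per_clip(1)  # each clip covers one component
candidate_ratings = len(clips) * extended_items_per_clip  # before item selection

print(len(clips), original_items_per_clip, extended_items_per_clip, candidate_ratings)
```

Under these assumptions, restricting each clip to a single component halves the ratings per clip, which is what keeps the larger number of clips within the intended processing time.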
Standardized Ratings

Ratings connected to video clips of the original Observer tool were evaluated and, if necessary, adapted for the new categorization. As each clip in the extended version represents only one TL component, the test item number could be halved (from 36 to 18 test items per clip). For the evaluation component, additional ratings were developed based on the same criteria applied in the original version by focusing on the aspects of describing, explaining, and predicting events (Seidel & Stürmer, 2014). For those ratings, another expert norm was established based on the independent responses of two researchers from the original expert rating team (Cohen’s κ = .69). Examples of the ratings are presented in Table 1.

Table 1. Example ratings tapping into knowledge-based reasoning

Item stems: Description and Explanation, “In the excerpt that you saw . . .”; Prediction, “Based on what you saw . . .”

Goal setting
  Description: The teacher clarifies what the students are supposed to learn.
  Explanation: The students have the opportunity to activate their prior knowledge of the topic.
  Prediction: The students will be able to align their learning process to the learning objective.

Orientation
  Description: The teacher clarifies how the lesson will run.
  Explanation: The students have the opportunity to recognize what is expected from them.
  Prediction: The students will be able to concentrate on their learning.

Execution of learning activities
  Description: The teacher poses a question that only could be answered with yes or no.
  Explanation: The students have the opportunity to use space for own thinking.
  Prediction: The students will be able to engage into learning.

Regulation
  Description: The teacher gives supportive feedback.
  Explanation: The students have the opportunity to feel supported in their learning.
  Prediction: The students will be able to develop their interest in the learning content.

Evaluation
  Description: The teacher summarizes the content of the lesson.
  Explanation: Learners have the opportunity to summarize the content of the lesson on their own.
  Prediction: Learners can experience being competent learners.

Sample

German teacher education provides an interesting context in which to study whether the Observer Extended Research Tool is suitable for capturing professional vision in different subpopulations (preservice teachers and teacher candidates) across teacher education phases. In Germany, teacher education has two consecutive phases: the university phase and an induction phase. In this study, we accessed subpopulations in both phases. Our sample consists of 317 participants, divided into 141 preservice teachers (66.0% female; age M = 21.84, SD = 4.48) and 176 teacher candidates (77.3% female; age M = 27.35, SD = 2.56).

Research Design

Preservice teachers enrolled at eight different German universities at which our project partners or other collaborating researchers were teaching were invited to participate in the study. An online link hosting the extended tool was randomly sent out to lecturers. They were asked to promote participation on a voluntary basis. The tool had to be completed within one week.

Teacher candidates were invited to participate in the context of the joint research project “The role of broad educational knowledge and the acquisition of professional competence of teacher candidates for career entry” (BilWiss), in which the study is integrated. In the project, the generic pedagogical knowledge of all teacher candidates in one German state, North Rhine-Westphalia (NRW), was assessed in the seminars that guided their internship at regional schools. In these seminars, teacher candidates were also asked to participate in the Observer Extended Research Tool on a voluntary basis. As with preservice teachers, participants received an online link and had one week after the announcement to complete the tool. In total, 26.3% of all teacher candidates in NRW participated in the study. Comparing the teacher candidates to the full cohort investigated in the BilWiss project, we found no differences in gender and high school grade point average (GPA). However, participants received a slightly better evaluation from their mentors for their practical achievement in the internship compared to the full cohort, t(609) = 2.66, p = .01, d = 0.21.

Data Analysis

Research Question (1) focuses on whether the extension ensures that the selected clips represent discernible examples of the TL components of the cognitive process-oriented teaching and learning model. The research team checked each clip to see which TL component was represented in the clip. After watching a clip, participants were also asked to check which TL component was represented (yes/no answer for each component). They had no knowledge of the assessment of the clips by the research team. We calculated the mean agreement (in percentage) between the participants and the research team. The mean agreement was calculated for the full sample of preservice teachers and teacher candidates.

Table 2. Video clips as discernible examples of TL components

Observer, N = 152 (preservice teachers)
  Represented TL component            Discernible example
  Goal clarity                        62.0
  Teacher support                     76.0
  Learning climate                    80.0

Observer Extended, N = 317 (preservice teachers/teacher candidates)
  Represented TL component            Discernible example
  Goal setting                        67.5
  Orientation                         61.5
  Execution of learning activities    71.0
  Regulation                          65.5
  Evaluation                          63.5

Note. Discernible example: % participant agreement with experts.

Research Question (2) focuses on whether the extension of the Observer Research Tool leads to a reliable measure of professional vision with the three aspects of description, explanation, and prediction, as shown in the first version. In line with the scaling results of the original version (Jahn et al., 2014; Seidel & Stürmer, 2014), we used a three-dimensional Rasch model to examine the psychometric properties. Using the software ConQuest (Wu, Adams, & Wilson, 1997), we analyzed the quality of the test items (difficulty, discrimination, and the mean square fit index; Bond & Fox, 2001) to select a consistent item pool. To obtain an exact estimation of scale indices (Rauch & Hartig, 2010), unidimensional model estimations (EAP/PV reliability and variance of the person ability parameters indicating item discrimination) for the description, explanation, and prediction scales were run first, followed by an overall scale (professional vision). The scale indices were compared to the scale indices of the original Observer measure, which was based on a sample that included 152 preservice teachers (Seidel & Stürmer, 2014).

Research Question (3) focuses on whether the extension of the tool provides a comparable measure for preservice teachers and teacher candidates. We conducted two separate analyses. First, we investigated whether professional vision and the aspects of description, explanation, and prediction were reliably measured within each subpopulation. We calculated unidimensional model estimations (EAP/PV reliability and variance of person ability parameters) for the description, explanation, and prediction scales, as well as an overall scale (professional vision), separately for preservice teachers and teacher candidates. Second, we analyzed whether the processing of the Observer Research Tool leads to differential item functioning (DIF) in both groups by applying a multifaceted model, in which we integrated the additional interaction term “Item × Facet.” A test item shows DIF when the probability of answering the item correctly cannot be explained exclusively by person ability and item difficulty (Adams & Carstensen, 2002), or in other words, by competence. DIF analyses show whether items are more difficult or easier to answer for a certain group after controlling for group differences in abilities. For example, DIF items would disadvantage or advantage certain persons when the question requires not only the target knowledge under investigation but also context knowledge that is only available to one group of persons. According to the Educational Testing Service, the extent of DIF is categorized into insignificant (< .43), moderate (< .64), and high (> .64) (Penfield & Algina, 2006).
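The logic behind the Rasch model and the DIF classification can be sketched in Python as follows. This is a simplified illustration rather than the ConQuest analysis used in the study: DIF is represented here as the difference between two hypothetical group-specific item difficulties (in logits), the parameter values are invented, and a value of exactly .64 is assigned to the "high" category, which the cut-offs cited above leave open.

```python
import math


def rasch_probability(theta, difficulty):
    """Rasch model: probability of a hit given person ability and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def classify_dif(dif_value):
    """Classify the absolute size of DIF using the cut-offs cited in the text."""
    size = abs(dif_value)
    if size < 0.43:
        return "insignificant"
    if size < 0.64:
        return "moderate"
    return "high"  # assumption: the boundary value .64 counts as high


# Hypothetical group-specific item difficulties in logits.
difficulty_preservice = {"item_a": 0.10, "item_b": 0.95, "item_c": -0.20}
difficulty_candidates = {"item_a": 0.00, "item_b": 0.20, "item_c": -0.70}

# Positive values mean the item is harder for preservice teachers.
dif = {
    item: difficulty_preservice[item] - difficulty_candidates[item]
    for item in difficulty_preservice
}
labels = {item: classify_dif(value) for item, value in dif.items()}
share_high = sum(label == "high" for label in labels.values()) / len(labels)

# Two persons with the same ability but from different groups on item_b:
p_preservice = rasch_probability(1.0, difficulty_preservice["item_b"])
p_candidate = rasch_probability(1.0, difficulty_candidates["item_b"])
print(labels, round(share_high, 2), round(p_preservice, 2), round(p_candidate, 2))
```

The final comparison shows what DIF means in practice: two persons with identical ability have different hit probabilities on the same item because the item is harder in one group.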

Results

Discernible Video Examples of the Underlying Cognitive Process-Oriented TL Model

In the original version of the Observer Research Tool, preservice teachers judged the selected clips as examples of the three underlying TL components. Table 2 shows the previous findings for goal clarity, teacher support, and learning climate (with an average of four clips per component) in a sample of 152 preservice teachers (Seidel & Stürmer, 2014). After the clips were assigned to the five TL components of the cognitive process-oriented teaching and learning model, we examined the extent to which participants agreed with the classification. Table 2 shows that the participants overall agreed that the clips represented discernible examples of the TL components. For most of the clips, they showed the highest agreement with the TL components intended by the research team. Only Clip 1 showing orientation and Clip 1 showing evaluation also have moderate agreement with other TL components. The proportions of agreement with the intended component were as follows: goal setting: Clip 1 = 0.71, SD = 0.45; Clip 2 = 0.64, SD = 0.48; orientation: Clip 1 = 0.43, SD = 0.41; Clip 2 = 0.80, SD = 0.40; execution of learning activities: Clip 1 = 0.64, SD = 0.48; Clip 2 = 0.78, SD = 0.41; regulation: Clip 1 = 0.65, SD = 0.47; Clip 2 = 0.71, SD = 0.45; and evaluation of learning processes: Clip 1 = 0.42, SD = 0.49; Clip 2 = 0.85, SD = 0.35.
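The relation between the clip-level proportions and the component-level percentages in Table 2 can be illustrated with a brief Python sketch. The function names are hypothetical, and the sketch assumes that the component value is the simple mean of its two clips and that agreement is coded 0/1 per participant.

```python
import math


def component_agreement(clip_proportions):
    """Mean agreement (in %) across the clips that represent one TL component."""
    return 100 * sum(clip_proportions) / len(clip_proportions)


def binary_sd(p):
    """Standard deviation of a 0/1 agreement indicator with mean p."""
    return math.sqrt(p * (1 - p))


# Clip-level proportions as reported for two of the components.
goal_setting = component_agreement([0.71, 0.64])
evaluation = component_agreement([0.42, 0.85])

print(round(goal_setting, 1), round(evaluation, 1), round(binary_sd(0.71), 2))
```

For these two components the averaged clip values reproduce the percentages listed in Table 2. Because the agreement indicator is dichotomous, its standard deviation follows from the proportion itself, which is why the reported SDs cluster between roughly 0.35 and 0.50.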

Reliable Measure of Description, Explanation, and Prediction as Aspects of Professional Vision

In our second research question, we were interested in whether the extension of the tool, including a reduced number of test items, leads to a similarly reliable measure of professional vision in terms of the three aspects of description, explanation, and prediction. Therefore, the psychometric properties of the test items were analyzed in terms of difficulty, discrimination, and the mean square fit index (0.75 ≤ MNSQ ≤ 1.30; see Bond & Fox, 2001).
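How a mean square fit criterion of this kind can be used to screen items is sketched below in Python. The sketch uses the textbook information-weighted (infit) mean square for dichotomous Rasch items; the fit statistic reported by the scaling software may be computed differently, and the function names and all MNSQ values are hypothetical.

```python
import math


def rasch_probability(theta, difficulty):
    """Rasch model: probability of a hit given person ability and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def infit_mnsq(responses, thetas, difficulty):
    """Information-weighted mean square fit for one dichotomous item."""
    squared_residuals = 0.0
    information = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, difficulty)
        squared_residuals += (x - p) ** 2  # observed minus expected score
        information += p * (1 - p)         # model variance of the response
    return squared_residuals / information


def within_fit_bounds(mnsq, lower=0.75, upper=1.30):
    """Apply the fit criterion 0.75 <= MNSQ <= 1.30."""
    return lower <= mnsq <= upper


# Hypothetical fit values for four items.
mnsq_by_item = {"item_a": 0.98, "item_b": 1.42, "item_c": 0.71, "item_d": 1.12}
retained = [item for item, value in mnsq_by_item.items() if within_fit_bounds(value)]
print(retained)
```

Values near 1 indicate that responses vary about as much as the model predicts; items far above or below that value are candidates for removal from the item pool.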



Table 3. Scale indices for professional vision

Comparison between Observer Research Tool versions
                       Observer (N = 112 items)     Observer Extended (N = 41 items)
                       Reliability    Variance      Reliability    Variance
Professional vision    0.96           1.24          0.87           0.99
Description            0.90           0.80          0.60           0.64
Explanation            0.91           1.33          0.65           1.30
Prediction             0.97           2.14          0.67           1.47

Comparison between subpopulations (Observer Extended)
                       Preservice teachers (N = 141)   Teacher candidates (N = 176)
                       Reliability    Variance         Reliability    Variance
Professional vision    0.98           0.67             0.89           1.32
Description            0.70           0.53             0.66           0.78
Explanation            0.68           0.81             0.71           1.85
Prediction             0.70           0.95             0.72           2.00

Note. EAP/PV reliability and variance values based on estimations of unidimensional models for the different professional vision scales.

The application of these criteria resulted in an item pool of 41 items, which fitted the three-dimensional model. To obtain an exact estimation of scale indices (Rauch & Hartig, 2010), unidimensional model estimations for the description, explanation, and prediction scales were run first, followed by an overall scale (professional vision) in the second step of analysis. The scale indices were compared to the scale indices of the first Observer measure (see Table 3). The indices of the Observer Extended Research Tool with reduced test items and more clips are lower than the indices of the original version. However, the scales still show acceptable reliability and good variance of person ability parameters as indicators of item discrimination, with up to r2 = 1.47 for the prediction scale.

multifaceted model indicate differential item difficulties (v2[38] = 272.35, p = .00). However, the DIF analyses show that only two items have high DIF values (Item 1 [goal setting/prediction] = .73; Item 2 [orientation/ explanation] = .81) to the disadvantage of preservice teachers. With a 4.78% DIF for all 41 items, the critical border of 25% (indicating the substantial DIF of the tool) is not exceeded (Penfield & Algina, 2006). Therefore, the use of the Observer Extended Research Tool to compare the professional vision of preservice teachers and teacher candidates is supported.

Discussion Comparable Measure for Preservice Teachers and Teacher Candidates Our third research question focuses on whether the extension provides a comparable measure for preservice teachers and teacher candidates. Firstly, we were interested in whether professional vision was reliably measured within each subpopulation. Secondly, we investigated whether the Observer Extended Research Tool is suitable for a comparative use between both teacher education groups by providing similar item difficulties caused only by person ability and item difficulty. With regard to reliability, we calculated professional vision scales within the two subpopulations. Table 3 shows satisfying scale indices; the reliabilities are good, and similar in both groups. The items seem to discriminate more strongly between teacher candidates, with variance of up to r2 = 2.00 for the prediction scale. To ensure comparative use between both groups, we tested for differential item functioning. The results of a Zeitschrift für Psychologie 2015; Vol. 223(1):54–63

The aim of this study was to enhance the first version of the Observer Research Tool, which has been shown to capture professional vision of preservice teachers in a reliable and valid way. In response to the need for economical measures for capturing learning outcomes and development in teacher education that are suitable for formative assessment in the long term (Darling-Hammond, 2006; Seidel, 2012), the extension aimed at comprehensive use in the two consecutive phases of theory-based and practice-based teacher education. We expanded the facets of knowledge on which the measurement of description, explanation, and prediction as aspects of professional vision is based on, to a broader variety. We draw on the teaching and learning components (TL) of a cognitive process-oriented teaching and learning model. As the original version was a first attempt to empirically capture the structure of professional vision, the component goal clarity, teacher support, and learning climate were selected because they represented a balanced knowledge base by integrating different components of effective teaching. However, given that competence measures should  2015 Hogrefe Publishing


K. Stürmer & T. Seidel: Professional Vision Assessment in Teacher Candidates

provide as representative a sampling of tasks of the real-world situation as possible (Shavelson, 2012), the question arises as to what extent the Observer Extended Research Tool captures professional vision in a reliable way when the measure is more differentiated, and thus shows an extended picture of knowledge application about effective teaching. In the last few years, a substantial number of empirical studies have investigated the effects of teaching on student learning. In the cognitive process-oriented teaching and learning model, the common results of those studies are made explicit (Seidel & Shavelson, 2007) by integrating the gamut of effective teaching variables into five TL components, namely, goal setting, orientation, execution of learning activities, evaluation of learning processes, and teacher guidance and support (regulation). In this respect, we substantially extended the measure of professional vision with regard to differentiated knowledge application by providing test items covering TL components that summarize the state of the art of knowledge about effective teaching. Specifying the number of test items to the five TL components also meant that we had to increase the number of video clips as item prompts while, at the same time, we decided to reduce the number of test items per TL component to ensure efficient handling of the instrument. In the extended version, knowledge application is measured across a broader variety of test items (five TL components) but with a smaller number of items (one prompt representing a positive example and one prompt representing a negative/ambiguous example instead of two examples per occurrence). When the requirements of IRT are followed in test construction (OECD, 2005), the reliability of the test and person ability is not influenced by the number of test items or the specific selection of test items within a target knowledge facet (Bond & Fox, 2001).
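The IRT argument above can be illustrated with a minimal sketch, assuming a dichotomous Rasch model with already calibrated item difficulties. The function names, the difficulty values, and the response patterns are hypothetical and chosen only for illustration; they are not taken from the Observer Extended Research Tool or from the scaling software used in the study.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate_ability(responses, difficulties, iterations: int = 50) -> float:
    """Maximum-likelihood ability estimate via Newton-Raphson.

    Assumes the item difficulties are known and that the response pattern
    is neither all 0 nor all 1 (no finite estimate exists in those cases).
    """
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        # First derivative of the log-likelihood: observed minus expected score.
        gradient = sum(x - p for x, p in zip(responses, probs))
        # Test information at theta (negative second derivative).
        information = sum(p * (1.0 - p) for p in probs)
        theta += gradient / information
    return theta


# Hypothetical difficulties (in logits) for a longer and a shorter item set.
long_form = [-1.0, -0.5, 0.5, 1.0]
short_form = [-1.0, 1.0]

theta_long = estimate_ability([1, 1, 0, 0], long_form)
theta_short = estimate_ability([1, 0], short_form)
```

In this sketch both item sets place the person at about 0 logits, so the location of the estimate does not depend on which calibrated items were administered. The shorter set carries less test information, however, so its estimate would be expected to be less precise.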
However, this presupposes that the test items are representative for the facets of target knowledge, or, in other words, TL components (Research Question 1). Our results show that the participants strongly agreed that the selected clips for the extended instrument were discernible examples of the full TL model. The two clips with a lower agreement rate by 40% (and similar agreements on other components) also underline the fact that it is hardly possible to identify clips only representing one component and exemplify once more the complexity of classroom situations (van Es & Sherin, 2008). A second prerequisite with regard to a reduced item battery is the evidence that the quality of items is ensured and that the selected items are suitable for a reliable measure (Research Question 2). Accordingly, we selected an item pool of 41 test items meeting the requirements of difficulty, discrimination, and fit. The results of the scaling analyses indicate that scale indices of professional vision as overall ability and its aspects of description, explanation, and prediction are lower than in the first version. However, reliability and variance as indicators of item discrimination are still satisfactory for the Observer Extended Research Tool. Focusing on the use of the instrument to compare learning developments over the course of teacher education, including both phases with a focus on theory (university,


preservice) and practice (internship, candidates), we also applied scaling analyses to the two subpopulations from both phases. Both groups had similar results, and their findings were replicated for the whole sample (Research Question 3). Within the groups, description, explanation, and prediction as aspects of professional vision were measured reliably. Furthermore, DIF analyses support the use of the Observer Extended Research Tool for comparing both groups. In teacher education, there is a need to assess effectiveness and to choose indicators and instruments that provide valid and reliable measures of teacher education outcomes (Darling-Hammond, 2006; Seidel, 2012). Teacher education faces the challenge of extending the established self-ratings of competencies and paper-pencil knowledge tests through tools that assess aspects of knowledge acquisition representing the integration of theory and practice (Blömeke et al., 2015; Cochran-Smith & Zeichner, 2005). At the same time, it must take into account the contextualized and situated nature of teacher knowledge (Borko, 2004). The video-based Observer Research Tool (Seidel et al., 2010a) is the first measurement tool that empirically captures professional vision as an indicator of integrated knowledge in a standardized yet contextualized way within university-based teacher education. Our study indicates that the extension leads to an economically manageable tool that, in the long term, might be suitable for different phases of teacher education. This paper presented findings regarding the use of the tool in two phases: one with a focus on theory and another with a focus on practice. Further studies will have to replicate the promising results by taking more heterogeneous samples into account.

Acknowledgments

This study is part of the research project BilWiss “The role of broad educational knowledge and the acquisition of professional competence of teacher candidates for career entry” (01PK11007C). The project is funded by the German Federal Ministry of Education and Research as part of the priority research program “Modeling and measuring competencies in higher education.” We would like to thank the preservice teachers and teacher candidates who participated in this study, as well as our project partners Mareike Kunter, Detlev Leutner, and Ewald Terhart, and their research teams.

References

Adams, R., & Carstensen, C. H. (2002). Scaling outcomes. In R. Adams & M. Wu (Eds.), PISA 2000 technical report (pp. 149–162). Paris, France: OECD.
Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., . . . Tsai, Y.-M. (2010). Teachers’ mathematical knowledge, cognitive activation in the classroom, and student progress. American Educational Research Journal, 47, 133–180.
Berliner, D. C. (1991). Perceptions of student behavior as a function of expertise. Journal of Classroom Interaction, 26, 1–8.

Berliner, D. C. (2001). Learning about learning from expert teachers. International Journal of Educational Research, 35, 463–482. doi: 10.1016/S0883-0355(02)00004-6
Berliner, D. C., Stein, P., Sabers, D. S., Clarridge, P. B., Cushing, K. S., & Pinnegar, S. (1988). Implications of research on pedagogical expertise and experience in mathematics teaching. In D. A. Grouws & T. J. Cooney (Eds.), Perspectives on research on effective mathematics teaching (pp. 67–95). Reston, VA: National Council of Teachers of Mathematics.
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. (2015). Beyond dichotomies: Competence viewed as a continuum. Zeitschrift für Psychologie, 223. doi: 10.1027/2151-2604/a000194
Blömeke, S., Kaiser, G., Lehmann, R., König, J., Döhrmann, M., Buchholtz, C., & Hacke, S. (2009). TEDS-M: Messung von Lehrerkompetenzen im internationalen Vergleich [TEDS-M: Assessment of teacher competencies in international comparison]. In O. Zlatkin-Troitschanskaia, K. Beck, D. Sembill, R. Nickolaus, & R. Mulder (Eds.), Lehrprofessionalität. Bedingungen, Genese, Wirkungen und ihre Messung (pp. 181–210). Weinheim, Germany: Beltz.
Bolhuis, S. (2003). Towards process-oriented teaching for self-directed lifelong learning: A multidimensional perspective. Learning and Instruction, 13, 327–347. doi: 10.1016/s0959-4752(02)00008-7
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch Model. Mahwah, NJ: Erlbaum.
Borko, H. (2004). Professional development and teacher learning: Mapping the terrain. Educational Researcher, 33, 3–15. doi: 10.3102/0013189x033008003
Brouwer, N. (2010). Determining long term effects of teacher education. In P. Peterson, E. Baker, & B. McGaw (Eds.), International encyclopedia of education (Vol. 7, pp. 503–510). Oxford, UK: Elsevier.
Cochran-Smith, M., & Zeichner, K. M. (Eds.). (2005). Studying teacher education: The report of the AERA Panel on Research and Teacher Education. Mahwah, NJ: Erlbaum.
Darling-Hammond, L. (2006). Assessing teacher education: The usefulness of multiple measures for assessing program outcomes. Journal of Teacher Education, 57, 120–138. doi: 10.1177/0022487105283796
Darling-Hammond, L. (2010). Teacher education and the American future. Journal of Teacher Education, 61, 35–47. doi: 10.1177/0022487109348024
Darling-Hammond, L., & Bransford, J. D. (Eds.). (2005). Preparing teachers for a changing world: What teachers should learn and be able to do. San Francisco, CA: Jossey-Bass.
Döhrmann, M., Kaiser, G., & Blömeke, S. (2012). The conceptualisation of mathematics competencies in the international teacher education study TEDS-M. ZDM – International Journal on Mathematics Education, 44, 325–340.
Fraser, B. J., Walberg, H. J., Welch, W. W., & Hattie, J. A. (1987). Syntheses of educational productivity research. International Journal of Educational Research, 11, 145–252.
Goodwin, C. (1994). Professional vision. American Anthropologist, 96, 606–633. doi: 10.1525/aa.1994.96.3.02a00100
Grossman, P., Compton, C., Igra, D., Ronfeldt, M., Shahan, E., & Williamson, P. W. (2009). Teaching practice: A cross-professional perspective. Teachers College Record, 111, 2055–2100.
Grossman, P., & McDonald, M. (2008). Back to the future: Directions for research in teaching and teacher education. American Educational Research Journal, 45, 184–205. doi: 10.3102/0002831207312906
Hammerness, K., Darling-Hammond, L., & Shulman, L. S. (2002). Toward expert thinking: How curriculum case writing prompts the development of theory-based professional knowledge in student teachers. Teaching Education, 13, 219–243.
Hattie, J. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. New York, NY: Routledge.
Hill, H. C., Rowan, B., & Ball, D. L. (2005). Effects of teachers’ mathematical knowledge for teaching on student achievement. American Educational Research Journal, 42, 371–406. doi: 10.3102/00028312042002371
Jahn, G., Prenzel, M., Stürmer, K., & Seidel, T. (2011). Varianten einer computergestützten Erhebung von Lehrerkompetenzen [Variants of computer-based teacher competency assessment]. Unterrichtswissenschaft, 39, 136–153.
Jahn, G., Stürmer, K., Seidel, T., & Prenzel, M. (2014). Professionelle Unterrichtswahrnehmung von Lehramtsstudierenden: Eine Scaling-up Studie des Observe-Projekts [Professional teaching perception of student teachers: A scaling-up study from the Observe project]. Zeitschrift für Entwicklungspsychologie und pädagogische Psychologie, 46, 171–180.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425–461. doi: 10.3102/00346543064003425
Kersting, N. (2008). Using video clips of mathematics classroom instruction as item prompts to measure teachers’ knowledge of teaching mathematics. Educational and Psychological Measurement, 68, 845–861.
Kunter, M., Baumert, J., & Köller, O. (2007). Effective classroom management and the development of subject-related interest. Learning and Instruction, 17, 494–509.
OECD. (2005). PISA 2003 technical report. Paris, France: OECD.
Penfield, R. D., & Algina, J. (2006). A generalized DIF effect variance estimator for measuring unsigned differential test function in mixed format tests. Journal of Educational Measurement, 43, 295–312.
Putnam, R. T., & Borko, H. (2000). What do new views of knowledge and thinking have to say about research on teacher learning? Educational Researcher, 29, 4–15.
Rauch, P., & Hartig, J. (2010). Multiple-choice versus open-ended response formats of reading test items: A two-dimensional IRT analysis. Psychological Test and Assessment Modeling, 52, 354–379.
Santagata, R., & Angelici, G. (2010). Studying the impact of the lesson analysis framework on preservice teachers’ abilities to reflect on videos of classroom teaching. Journal of Teacher Education, 61, 339–349. doi: 10.1177/0022487110369555
Santagata, R., & Guarino, J. (2010). Using video to teach future teachers to learn from teaching. ZDM – International Journal on Mathematics Education, 43, 133–145. doi: 10.1007/s11858-010-0292-3
Seidel, T. (2012). Implementing competence assessment in university education. Empirical Research in Vocational Education and Training, 4, 91–94.
Seidel, T., Blomberg, G., & Renkl, A. (2013). Instructional strategies for using video in teacher education. Teaching and Teacher Education, 34, 56–65. doi: 10.1016/j.tate.2013.03.004
Seidel, T., Blomberg, G., & Stürmer, K. (2010a). Observer: A video-based tool to diagnose teachers’ professional vision. Unpublished instrument. Retrieved from http://ww3.unipark.de/uc/observer_engl/demo/kv/
Seidel, T., Blomberg, G., & Stürmer, K. (2010b). “Observer” – Validation of a video-based instrument for measuring the perception of professional education. In E. Klieme, D. Leutner, & M. Kenk (Eds.), Kompetenzmodellierung – Zwischenbilanz des DFG-Schwerpunktprogramms und Perspektiven des Forschungsansatzes [Competence modeling – Summary and perspectives of the research approach]. [Special issue]. Zeitschrift für Pädagogik, (Suppl. 56), 296–306.
Seidel, T., & Prenzel, M. (2007). Wie Lehrpersonen Unterricht wahrnehmen und einschätzen – Erfassung pädagogisch-psychologischer Kompetenzen bei Lehrpersonen mit Hilfe von Videosequenzen [Teachers’ perceptions and evaluations in the classroom – assessment of teachers’ pedagogical and psychological competencies with the aid of video sequences]. In M. Prenzel, I. Gogolin, & H.-H. Krüger (Eds.), Kompetenzdiagnostik. [Special issue]. Zeitschrift für Erziehungswissenschaft, (Suppl. 8), 201–218.
Seidel, T., & Shavelson, R. J. (2007). Teaching effectiveness research in the past decade: The role of theory and research design in disentangling meta-analysis results. Review of Educational Research, 77, 454–499. doi: 10.3102/0034654307310317
Seidel, T., & Stürmer, K. (2014). Modeling and measuring the structure of professional vision in pre-service teachers. American Educational Research Journal, 51, 739–771. doi: 10.3102/0002831214531321
Shavelson, R. J. (2012). Assessing business-planning competence using the Collegiate Learning Assessment as a prototype. Empirical Research in Vocational Education and Training, 4, 77–90.
Sherin, M. G. (2001). Developing a professional vision of classroom events. In T. Wood, B. S. Nelson, & J. Warfield (Eds.), Beyond classical pedagogy: Teaching elementary school mathematics. Mahwah, NJ: Erlbaum.
Sherin, M. G. (2007). The development of teachers’ professional vision in video clubs. In R. Goldman, R. Pea, B. Barron, & S. J. Derry (Eds.), Video research in the learning sciences (pp. 383–395). Mahwah, NJ: Erlbaum.
Sherin, M. G., & van Es, E. (2009). Effects of video club participation on teachers’ professional vision. Journal of Teacher Education, 60, 20–37. doi: 10.1177/0022487108328155
Shulman, L. S. (1987). Knowledge and teaching: Foundations of the new reform. Harvard Educational Review, 57, 1–22.
Star, J. R., & Strickland, S. K. (2008). Learning to observe: Using video to improve preservice mathematics teachers’ ability to notice. Journal of Mathematics Teacher Education, 11, 107–125. doi: 10.1007/s10857-007-9063-7

 2015 Hogrefe Publishing


Stürmer, K., Könings, K. D., & Seidel, T. (2013). Declarative knowledge and professional vision in teacher education: Effect of courses in teaching and learning. British Journal of Educational Psychology, 83, 467–483. doi: 10.1111/j.2044-8279.2012.02075.x
van Es, E., & Sherin, M. G. (2002). Learning to notice: Scaffolding new teachers’ interpretations of classroom interactions. Journal of Technology and Teacher Education, 10, 571–596.
van Es, E., & Sherin, M. G. (2008). Mathematics teachers’ “learning to notice” in the context of a video club. Teaching and Teacher Education, 24, 244–276. doi: 10.1016/j.tate.2006.11.005
Voss, T., Kunter, M., & Baumert, J. (2011). Assessing teacher candidates’ general pedagogical/psychological knowledge: Test construction and validation. Journal of Educational Psychology, 103, 952–969. doi: 10.1037/a0025125
Wiens, P. D., Hessberg, K., LoCasale-Crouch, J., & DeCoster, J. (2013). Using a standardized video-based assessment in a university teacher education program to examine preservice teachers knowledge related to effective teaching. Teaching and Teacher Education, 33, 24–33. doi: 10.1016/j.tate.2013.01.010
Wu, M. L., Adams, R. J., & Wilson, M. R. (1997). ConQuest: Multi-Aspect Test Software. Camberwell, Australia: Australian Council for Education Research.

Kathleen Stürmer School of Education TUM Marsstr. 20–22 80335 Munich Germany Tel. +49 89 289-25120 Fax +49 89 289-25199 E-mail kathleen.stuermer@tum.de

Zeitschrift für Psychologie 2015; Vol. 223(1):54–63


Opinion

Gaining Substantial New Insights Into University Students’ Self-Regulated Learning Competencies

How Can We Succeed?

Barbara Schober,1 Julia Klug,1 Gregor Jöstl,1 Christiane Spiel,1 Markus Dresel,2 Gabriele Steuer,2 Bernhard Schmitz,3 and Albert Ziegler4

1 Department of Applied Psychology: Work, Education, and Economy, University of Vienna, Austria, 2 Department of Psychology, University of Augsburg, Germany, 3 Department of Human Sciences, Technical University of Darmstadt, Germany, 4 Department of Psychology, University of Erlangen-Nuremberg, Germany

Why Is It Important to Discuss New Directions?

Self-regulated learning (SRL) is a major issue in current educational research. A comprehensive body of evidence points to the relevance of SRL for creating lasting learning success in many learning contexts (Zimmerman & Schunk, 2011). SRL competences are of particular importance for success in higher education because students have to deal with rather unstructured contexts and diverse learning challenges (Peverly, Brobst, Graham, & Shaw, 2003). Despite SRL’s undeniable relevance and the large body of research attesting to this (Winne, 2005), some core issues, especially regarding learning at universities, have not been solved yet. We still do not know which components of SRL in which combination are crucial for success at university. Which aspects of SRL are relevant in which learning phases in which contexts? How do situational and personal factors interact? How do these competences actually develop under different institutional conditions? Why do we still find substantial knowledge deficits in this intensively researched field? A closer look makes it obvious that research often concerns very specific details of the complex SRL construct, such as the interrelations among specific SRL components, teachers’ effects on specific students’ SRL strategies, or the effects of very specific contexts (e.g., Eccles & Wigfield, 2002). Furthermore, a variety of research approaches are used, based on different models, measures, and study designs. Consequently, results are often inconsistent. Approaches and results therefore remain rather unconnected, and no comprehensive picture can emerge. However, if we want to create instructional designs that promote SRL at universities, we need a deeper, comprehensive understanding of SRL competences and their development.

Zeitschrift für Psychologie 2015; Vol. 223(1):64–65 DOI: 10.1027/2151-2604/a000201

To reach more coherence and advance in research on SRL competencies in complex learning settings like universities, we suggest an integrative approach in terms of theory and measurement in this opinion paper.

Theoretical Models of SRL – Could They Be Integrated?

At present, there are several coexisting, rather separated models of SRL (Puustinen & Pulkkinen, 2001), classified as component-oriented models (e.g., Boekaerts, Pintrich, & Zeidner, 2000), defining three core strategy dimensions of SRL (cognitive, metacognitive, resource management strategies) and process-oriented models (e.g., Zimmerman & Schunk, 2011), defining SRL as a cyclical process made up of consecutive phases. In addition, SRL competences can be understood as knowledge about SRL that can be differentiated into three types of knowledge: declarative, procedural, and conditional knowledge, which also can be considered as consecutive stages of competence development (Dresel & Haugwitz, 2005). However, these dimensions of knowledge are not taken into account in the aforementioned models. Wirth and Leutner (2008) suggested defining SRL “as a learner’s competence to autonomously plan, execute, and evaluate learning processes, which involves continuous decisions on cognitive, motivational, and behavioral aspects of the cyclic process of learning” (p. 103). This is a first step toward integration, but a systematic theoretical integration of the components, process, and knowledge types of SRL is still lacking. We (Dresel et al., in press) recently suggested a framework model that takes this challenge into account. In the proposed “3D cube model,” the strategy dimensions of SRL are related to their specific meaning at different phases of the SRL process and to the kinds of knowledge necessary at each phase, respectively.

Measures of SRL Competences – Could They Be Used in a More Coherent Way?

Existing measures can be divided into three waves (Panadero & Järvelä, 2014): (1) self-reports from a trait-like perspective (e.g., questionnaires, interviews), (2) “online” measures (e.g., thinking aloud protocols, traces) following a process perspective, and (3) measures that also function as interventions (e.g., learning diaries). In most cases, a lack of validity remains a basic critique. Often, declarative strategy knowledge is measured without taking real-life situations into account. Especially since SRL is conceptualized as a competence (Wirth & Leutner, 2008), new measures that take domain and situation specificity into account are recommended. Thus, as is the case with theoretical approaches, measures are criticized for being too specific as well as too artificial and irrelevant for complex real-life learning situations. Consequently, we again argue for integration, here in the sense of systematic multi-method – multi-informant approaches (Azevedo, 2009). Furthermore, the measures ought to be directly connected to an integrated theoretical model as described above. The challenge is not simply to use many different instruments combining data on components, knowledge, and processes (see the model integration above), but to explicitly derive directly connected and meaningfully combined sets of instruments from a coherent model. Based on such a model that takes into account the evidence that not all aspects of SRL are relevant in every phase of learning and in every context (Schober, 2014), a very specific combination of measures could be used. Subsequently, such an integrated approach could be extended to a longitudinal design investigating contextual effects and, in a further step, conducting interventions. In sum, integration and thinking in more holistic dimensions could be identified as central desiderata for gaining deeper insight into underlying mechanisms of successful SRL. One might argue that this concern is a huge challenge, not really new, and presumably the case in many fields. We would agree, but it seems as if there is still a need for transferring this recognition into research practice, especially for SRL, which is highly complex but deeply relevant for sustainable learning success.

Acknowledgment

This research was supported by the German Federal Ministry for Education and Research.


References

Azevedo, R. (2009). Theoretical, conceptual, methodological, and instructional issues in research on metacognition and self-regulated learning: A discussion. Metacognition and Learning, 4, 87–95. doi: 10.1007/s11409-009-9035-7
Boekaerts, M., Pintrich, P., & Zeidner, M. (Eds.). (2000). Handbook of self-regulation. Orlando, FL: Academic Press.
Dresel, M., & Haugwitz, M. (2005). The relationship between cognitive abilities and self-regulated learning: Evidence for interactions with academic self-concept and gender. High Ability Studies, 16, 201–218. doi: 10.1080/13598130600618066
Dresel, M., Schmitz, B., Schober, B., Spiel, C., Ziegler, A., Engelschalk, T., . . . Steuer, G. (in press). Competencies for successful self-regulated learning in higher education: Structural model and empirical evidence from expert interviews. Studies in Higher Education.
Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53, 109–132. doi: 10.1146/annurev.psych.53.100901.135153
Panadero, E., & Järvelä, S. (2014, August). Third wave on self-regulation measurement: When measuring is also an intervention. Paper presented at the EARLI SIG1 conference, Madrid, Spain.
Peverly, S. T., Brobst, K. E., Graham, M., & Shaw, R. (2003). College adults are not good at self-regulation: A study on the relationship of self-regulation, note taking, and test taking. Journal of Educational Psychology, 95, 335–346. doi: 10.1037/0022-0663.95.2.335
Puustinen, M., & Pulkkinen, L. (2001). Models of self-regulated learning: A review. Scandinavian Journal of Educational Research, 45, 269–286. doi: 10.1080/00313830120074206
Schober, B. (2014, March). Kompetenzen zum Selbstregulierten Lernen an Hochschulen – erste Befunde aus dem Projekt PRO-SRL [Competencies regarding self-regulated learning at universities – first findings from the project PRO-SRL]. Invited paper presented at the Symposium Bildungsforschung 2020, Berlin, Germany.
Winne, P. H. (2005). A perspective on state-of-the-art research on self-regulated learning. Instructional Science, 33, 559–565. doi: 10.1007/s11251-005-1280-9
Wirth, J., & Leutner, D. (2008). Self-regulated learning as a competence: Implications of theoretical models for assessment methods. Zeitschrift für Psychologie, 216, 102–110. doi: 10.1027/0044-3409.216.2.102
Zimmerman, B. J., & Schunk, D. H. (Eds.). (2011). Handbook of self-regulation of learning and performance. New York, NY: Taylor & Francis.

Barbara Schober Department of Applied Psychology: Work, Education, and Economy University of Vienna Universitätsstraße 7 1010 Vienna Austria Tel. +43 1 4277-47322 Fax +43 1 4277-847322 E-mail barbara.schober@univie.ac.at

The Opinion section of this journal aims to encourage further inquiry and debate. The opinions expressed in the contributions to this section are those of the authors and not necessarily those of the journal, the editors, or the publisher.

© 2015 Hogrefe Publishing



Call for Papers

“Neural Plasticity in Rehabilitation and Psychotherapy – New Perspectives and Findings”

A Topical Issue of the Zeitschrift für Psychologie

Guest Editors: Wolfgang H. R. Miltner (Institute of Psychology, University of Jena, Germany) and Otto W. Witte (Hans Berger Department of Neurology at Jena University Hospital, Jena, Germany)

Neural plasticity has become a major focus of neuroscience. Recent theories and experimental observations of this research field have radically altered our views about the brain and its capacity and shown that the brain continuously adapts its structures and functions to changing environments across the whole lifespan. A relatively new subarea of this field is focused on structural and functional plasticity of the brain as a result of behavioral and cognitive training and training of emotion regulation in several areas of therapy and rehabilitation (neurology, neuropsychology, behavioral and cognitive-behavioral interventions, psychotherapy, physiotherapy and sports therapy, etc.). Most research in this field is guided by the conviction that any positive outcomes of therapy and rehabilitative measures will only occur when the interventions significantly change the underlying physiological and pathological structures and/or functions of the brain. We are looking for original empirical articles as well as review-type articles or meta-analyses that focus on training-induced or therapy-induced brain plasticity in the fields of sensory, motor, or sensorimotor disorders, language disorders, cognitive or affective disorders, or psychopathological conditions. We are especially interested in contributions that advance our current knowledge by addressing new perspectives and new methods of intervention and testing the training-induced or therapy-induced brain plasticity.

How to submit: Interested authors should submit a letter of intent including: (1) a working title for the manuscript, (2) names, affiliations, and contact information for all authors, and (3) an abstract detailing the content of the proposed manuscript to either of the guest editors, Wolfgang H. R. Miltner (wolfgang.miltner@uni-jena.de) or Otto W. Witte (otto.witte@med.uni-jena.de).

Zeitschrift für Psychologie 2015; Vol. 223(1):66 DOI: 10.1027/2151-2604/a000202

There is a two-stage submissions process. Initially, authors are requested to submit only abstracts of their proposed papers. Authors invited to submit a full paper should then do so. All papers will undergo full peer review. Deadline for submission of abstracts is April 15, 2015. Deadline for submission of full papers is August 15, 2015. The journal seeks to maintain a short turnaround time, with the final version of the accepted papers being due by November 15, 2015. The topical issue will be published as issue 2 (2016). For additional information, please contact: Wolfgang H. R. Miltner (wolfgang.miltner@uni-jena.de) or Otto W. Witte (otto.witte@med.uni-jena.de).

About the Journal

The Zeitschrift für Psychologie, founded in 1890, is the oldest psychology journal in Europe and the second oldest in the world. One of the founding editors was Hermann Ebbinghaus. Since 2007 it has been published in English and is devoted to publishing topical issues that provide state-of-the-art reviews of current research in psychology. For detailed author guidelines, please see the journal’s website at www.hogrefe.com/journals/zfp/


See sample pages at www.hogrefe.com !

Also available as E-book! Robert Edenborough & Marion Edenborough

The Psychology of Talent

Exploring and Exploding the Myths

Managing and nurturing talent in the workplace – this book provides practical and well-founded guidance for psychologists and HR professionals, as well as exploding numerous myths surrounding talent The core concepts in this book are the idea of talent, how it can be assessed, and how it can be nurtured and put to effective use in the workplace. Line managers, HR professionals, business or industrial/organizational psychologists, and consultants will find their understanding challenged and extended – and discover many helpful suggestions to improve their professional practices. The authors explore various psychological tools and approaches that can be pressed into service in connection with talent. Uniquely, they also set the psychological assessment of talent in the context of attitudes to talent and various myths and misunderstandings about it. This easy-to-read volume will be of interest to anyone concerned with understanding how talent can be pressed into service to improve performance in the workplace.

2012, viii + 152 pp., hardcover ISBN 978-0-88937-396-9 US $34.80 / £ 19.90 / € 24.95

Table of Contents: Preface, Acknowledgements Chapter 1: The Meaning and the Psychology of Talent Chapter 2: The Winning Team Chapter 3: Black Boxes and Dark Arts Chapter 4: Methodologies for Understanding Talent Chapter 5: Received Wisdom or Received Ignorance? Chapter 6: Getting Below the Surface: Standards for Talent Chapter 7: Positive Psychology and the Strengths Movement Chapter 8: Assessing Potential and Developing Talent Chapter 9: Moving Forward with Talent Hogrefe Publishing 30 Amberwood Parkway · Ashland, OH 44805 · USA Tel: (800) 228-3749 · Fax: (419) 281-6883 E-Mail: customerservice@hogrefe.com

Hogrefe Publishing Merkelstr. 3· 37085 Göttingen · Germany Tel: +49 551 999 500 · Fax: +49 551 999 50 111 E-Mail: customerservice@hogrefe.de

Hogrefe Publishing c/o Marston Book Services Ltd 160 Eastern Ave., Milton Park · Abingdon, OX14 4SB · UK Tel: +44 1235 465577 · Fax: +44 1235 465556 E-mail: direct.orders@marston.co.uk

Order online at www.hogrefe.com or call toll-free (800) 228-3749 (US only)



RIAS

Reynolds Intellectual Assessment Scales and Screening

German-language adaptation of the Reynolds Intellectual Assessment Scales™ (RIAS) and the Reynolds Intellectual Screening Test™ (RIST) by Cecil R. Reynolds and Randy W. Kamphaus

by Priska Hagmann-von Arx & Alexander Grob

The RIAS is a time-efficient, easy-to-administer test for estimating intelligence across practically the entire lifespan (3 to 99 years). It was developed in the USA, and the German-language adaptation of the instrument is now available for the first time.

The RIAS comprises a Verbal Intelligence Index and a Nonverbal Intelligence Index, each composed of two subtests. The T-scores of the four subtests can be summed and converted into the Composite Intelligence Index (Gesamtintelligenz Index), which provides an estimate of global intelligence. A Composite Memory Index (Gesamtgedächtnis Index) is formed from two additional memory subtests. The intelligence indices correspond to conventional IQ scores. The integrated RIST serves as a screening version and allows an even more economical, reliable, and valid estimate of intelligence.

Norms: N = 2145; ages 3;0 to 99;11 years

Administration time: With a practiced and experienced examiner, administration takes approximately 20 to 25 minutes. The screening (RIST) can be administered in about half that time. The two additional memory subtests take a further 10 to 15 minutes.

Complete test, consisting of: manual, 20 RIAS record forms, 20 RIST record forms, stimulus books 1, 2, and 3, privacy screen, and case. Order number 03 172 01, € 650.00 / CHF 873.00


Available from your Testzentrale:
Herbert-Quandt-Str. 4 · D-37081 Göttingen · Tel.: 0049-(0)551 99950-999 · Fax: -998 · E-Mail: testzentrale@hogrefe.de · www.testzentrale.de
Länggass-Strasse 76 · CH-3000 Bern 9 · Tel.: 0041-(0)31 30045-45 · Fax: -90 · E-Mail: testzentrale@hogrefe.ch · www.testzentrale.ch

[Figures: sample pages from the RIAS record form and from the stimulus books]


Instructions to Authors – Zeitschrift für Psychologie

The Zeitschrift für Psychologie, originally founded in 1890, is the oldest psychology journal in Europe and the second oldest in the world. Now, reflecting the change in the lingua franca of science from German to English, it is being published completely in English. The Zeitschrift für Psychologie publishes high-quality research from all branches of empirical psychology that is clearly of international interest and relevance, and does so in four topical issues per year. Each topical issue is carefully compiled by guest editors and generally features one broad Review Article accompanied by Original Articles from leading researchers as well as additional shorter contributions such as Research Spotlights (presenting details of individual studies or summaries of particularly interesting work in progress), Horizons (summarizing important recent or future meetings or outlining future directions of work), and Opinion pieces that provide a platform for both established and alternative views on aspects of the issue’s topic. The guest editors and the editorial team are assisted by an experienced international editorial board and external reviewers to ensure that the journal’s strict peer-review process is in keeping with its long and honorable tradition of publishing only the best of psychological science. The subjects being covered are determined by the editorial team after consultation within the scientific community, thus ensuring topicality. The Zeitschrift für Psychologie thus brings convenient, cutting-edge compilations of the best of modern psychological science, each covering an area of current interest.

Please read the following information carefully before submitting a document to Zeitschrift für Psychologie:

A call for papers is issued for each topical issue. Current calls are available on the journal’s website at www.hogrefe.com/journals/zfp.
Manuscripts should be submitted as Word or RTF documents via e-mail attachment to either the responsible guest editor(s) or the Editor-in-Chief (Prof. Dr. Bernd Leplow, Institute of Psychology, University of Halle-Wittenberg, Brandbergweg 23, D-06120 Halle, Germany, Tel. +49 345 552-4358(9), Fax +49 345 552-7218, E-mail bernd.leplow@psych.uni-halle.de). Names of authors are usually made known to reviewers, although blind reviewing is available on request. Authors who prefer blind reviewing should state this when first submitting their manuscript and should remove all potentially identifying information from the manuscript, replacing names and any indication of the university where a study was conducted by neutral place-holders.

The Title Page of each paper or article should include, in the following order: Title of the article; author name(s) (preceded by first names, but with no academic titles given); name of the institute or clinic (if there is more than one author or institution, affiliations should be indicated using superscript Arabic numerals); and an address for correspondence (including the name of the corresponding author with fax and phone numbers).

An Abstract (maximum length 150 words) should be provided on a separate page for original and review articles. A maximum of 5 keywords should be given after the abstract.

Reference Citations in the text and in the reference list proper should follow conventions listed in the Publication Manual of the American Psychological Association 6th ed., referred to hereinafter as the APA Manual. For example:

Bezchlibnyk-Butler, K.Z., Jr., & Jeffries, J.J. (2007). Clinical handbook of psychotropic drugs (17th ed.). Cambridge, MA: Hogrefe.
Czigler, I., Csibra, G., & Ambró, Á. (1994). Event-related potentials and aging: Identification of deviant visual stimuli. Journal of Psychophysiology, 8, 193–210.
O’Malley, S. (in press). Psychosocial treatments for drug abuse. In C. Stefanis, H. Hippius, & D. Naber (Eds.), Psychiatry in progress: Vol. 2. Research in addiction: An update (pp. 129–136). Cambridge, MA: Hogrefe.

Tables should be numbered using Arabic numerals. Tables must be cited in the text (e.g., “As shown in Table 1, ...”). Each table should be printed on a separate sheet. Below the table number, a brief descriptive title should be given; this should then be followed by the body of the table. It is recommended that each table should also include a brief explanatory legend.

Figures should be numbered using Arabic numerals. Each figure must be cited in the text (e.g., “As illustrated in Figure 1, ...”) and should be accompanied by a legend on a separate sheet. As online submission requires papers to be submitted as one file, figures and tables etc. should be embedded or appended to the paper and not be sent as separate files. However, upon acceptance of an article, it may be necessary for figures to be supplied separately in a form suitable for better reproduction: preferably high-resolution (300 dpi) or vector graphics files. Where this is necessary, the corresponding author will be notified by the publishers. Figures will normally be reproduced in black and white only. While it is possible to reproduce color illustrations, authors are reminded that they will be invoiced for the extra costs involved.

Length of Articles: Manuscripts submitted should not exceed the following lengths (including references, tables, and figures): Review articles – 60,000 characters and spaces (approx. 8,500 words); original articles – 50,000 characters and spaces (approx. 7,000 words); Research Spotlights – 20,000 characters and spaces (approx. 2,800 words); Opinion – 9,000 characters and spaces (approx. 1,200 words); Horizons – 9,000 characters and spaces (approx. 1,200 words).

Scientific Nomenclature and Style: Authors should follow the guidelines of the APA Manual regarding style and nomenclature. Authors should avoid using masculine generic forms in their manuscripts. General statements about groups of people should be written in gender-neutral form; when presenting examples, authors may alternate between female and male forms throughout their text.

Language: It is recommended that authors who are not native speakers of English have their papers checked and corrected by a native-speaker colleague before submission. Standard US American spelling and punctuation as given in Webster’s New Collegiate Dictionary should be followed.

Proofs: PDF proofs will be sent to the corresponding author. Changes of content or stylistic changes may only be made in exceptional cases in the proofs. Corrections that exceed 5% of the typesetting costs may be invoiced to the authors.

Offprints: Hogrefe will send the corresponding author of each accepted paper free of charge an e-offprint (PDF) of the published version of the paper when it is first released online. This e-offprint is provided for the author’s personal use, including for sharing with coauthors (see also “Online Rights for Journal Articles” in the Advice for Authors on the journal’s web page at www.hogrefe.com).
Copyright Agreement: By submitting an article, the author confirms and guarantees on behalf of him-/herself and any co-authors that the manuscript has not been submitted or published elsewhere, and that he or she holds all copyright in and titles to the submitted contribution, including any figures, photographs, line drawings, plans, maps, sketches, and tables, and that the article and its contents do not infringe in any way on the rights of third parties. The author indemnifies and holds harmless the publisher from any third-party claims. The author agrees, upon acceptance of the article for publication, to transfer to the publisher the exclusive right to reproduce and distribute the article and its contents, both physically and in nonphysical, electronic, or other form, in the journal to which it has been submitted and in other independent publications, with no limitations on the number of copies or on the form or the extent of distribution. These rights are transferred for the duration of copyright as defined by international law. Furthermore, the author transfers to the publisher the following exclusive rights to the article and its contents:

1. The rights to produce advance copies, reprints, or offprints of the article, in full or in part, to undertake or allow translations into other languages, to distribute other forms or modified versions of the article, and to produce and distribute summaries or abstracts.
2. The rights to microfilm and microfiche editions or similar, to the use of the article and its contents in videotext, teletext, and similar systems, to recordings or reproduction using other media, digital or analog, including electronic, magnetic, and optical media, and in multimedia form, as well as for public broadcasting in radio, television, or other forms of broadcast.
3. The rights to store the article and its content in machine-readable or electronic form on all media (such as computer disks, compact disks, magnetic tape), to store the article and its contents in online databases belonging to the publisher or third parties for viewing or downloading by third parties, and to present or reproduce the article or its contents on visual display screens, monitors, and similar devices, either directly or via data transmission.
4. The rights to reproduce and distribute the article and its contents by all other means, including photomechanical and similar processes (such as photocopying or facsimile), and as part of so-called document delivery services.
5. The right to transfer any or all rights mentioned in this agreement, as well as rights retained by the relevant copyright clearing centers, including royalty rights to third parties.

Online Rights for Journal Articles: Guidelines on authors’ rights to archive electronic versions of their manuscripts online are given in the Advice for Authors on the journal’s web page at www.hogrefe.com.

June 1, 2012. © 2012 Hogrefe Publishing


Zeitschrift für Psychologie
Founded by Hermann Ebbinghaus and Arthur König in 1890
Volume 223 / Number 1 / 2015
ISSN-L 2151-2604 • ISSN-Print 2190-8370 • ISSN-Online 2151-2604
www.hogrefe.com/journals/zfp

Editor-in-Chief: Bernd Leplow
Associate Editors: Edgar Erdfelder · Herta Flor · Dieter Frey · Friedrich W. Hesse · Heinz Holling · Christiane Spiel

Assessment of Competencies in Higher Education
Sigrid Blömeke, Jan-Eric Gustafsson, and Richard J. Shavelson (Editors)

In our globalized, knowledge-based societies with increased demands for competencies in the workforce, higher education institutions must ensure that their graduates have the competencies they need to succeed. Research on assessment of such competencies is only just beginning. The papers presented here are high-quality studies that integrate theory and methods to provide readers with an overview of the current state of research.

Contents include:

Beyond Dichotomies: Competence Viewed as a Continuum
Sigrid Blömeke, Jan-Eric Gustafsson, and Richard J. Shavelson

The Relationship of Mathematical Competence and Mathematics Anxiety: An Application of Latent State-Trait Theory
Lars Jenßen, Simone Dunekacke, Michael Eid, and Sigrid Blömeke

Modeling the Competencies of Prospective Business and Economics Teachers: Professional Knowledge in Accounting
Kathleen Schnick-Vollmer, Stefanie Berger, Franziska Bouley, Sabine Fritsch, Bernhard Schmitz, Jürgen Seifried, and Eveline Wuttke

Validating Test Score Interpretations by Cross-National Comparison: Comparing the Results of Students From Japan and Germany on an American Test of Economic Knowledge in Higher Education
Manuel Förster, Olga Zlatkin-Troitschanskaia, Sebastian Brückner, Roland Happ, Ronald K. Hambleton, William B. Walstad, Tadayoshi Asano, and Michio Yamaoka

Assessment of Mathematical Competencies and Epistemic Cognition of Preservice Teachers
Benjamin Rott, Timo Leuders, and Elmar Stahl

Scientific Reasoning in Higher Education: Constructing and Evaluating the Criterion-Related Validity of an Assessment of Preservice Science Teachers’ Competencies
Stefan Hartmann, Annette Upmeier zu Belzen, Dirk Krüger, and Hans Anand Pant

Assessing Professional Vision in Teacher Candidates: Approaches to Validating the Observer Extended Research Tool
Kathleen Stürmer and Tina Seidel

Opinion: Gaining Substantial New Insights Into University Students’ Self-Regulated Learning Competencies: How Can We Succeed?

ISBN 978-0-88937-473-7