European Journal of Psychological Assessment

Volume 35 / Number 1 / 2019

Official Organ of the European Association of Psychological Assessment




European Journal of Psychological Assessment

Volume 35 / Number 1 / 2019

Official Organ of the European Association of Psychological Assessment


Editor-in-Chief

Samuel Greiff, Cognitive Science and Assessment, ECCS unit, 11, Porte des Sciences, 4366 Esch-sur-Alzette, Luxembourg (Tel. +352 46 6644-9245, E-mail samuel.greiff@uni.lu)

Editors-in-Chief (past)

Karl Schweizer, Germany (2009–2012), E-mail k.schweizer@psych.uni-frankfurt.de Matthias Ziegler, Germany (2013–2016), E-mail zieglema@hu-berlin.de

Editorial Assistant

Lindie van der Westhuizen, Cognitive Science and Assessment, ECCS unit, 11, Porte des Sciences, 4366 Esch-sur-Alzette, Luxembourg, (Tel. +352 46 6644-5578, E-mail ejpaeditor@gmail.com)

Associate Editors

Mark Allen, Australia; Juan Ramón Barrada, Spain; Nicolas Becker, Germany; Gary N. Burns, USA/Sweden; Laurence Claes, Belgium; Marjolein Fokkema, The Netherlands; Penelope Hasking, Australia; Dragos Iliescu, Romania; Stefan Krumm, Germany; Lena Lämmle, Germany; Anastasiya Lipnevich, USA; Marcus Mund, Germany; René Proyer, Germany; John F. Rauthmann, USA; Ronny Scherer, Norway; Eunike Wetzel, Germany; Matthias Ziegler, Germany

Editorial Board

Rebecca Pei-Hui Ang, Singapore Roger Azevedo, USA R. Michael Bagby, Canada Yossef S. Ben-Porath, USA Nicholas F. Benson, USA Francesca Borgonovi, France Janine Buchholz, Germany Vesna Busko, Croatia Eduardo Cascallar, Belgium Mary Louise Cashel, USA Carlo Chiorri, Italy Lee Anna Clark, USA Paul De Boeck, USA Scott L. Decker, USA Andreas Demetriou, Cyprus Annamaria Di Fabio, Italy Christine DiStefano, USA Stefan Dombrowski, USA Fritz Drasgow, USA Peter Edelsbrunner, Switzerland Kadriye Ercikan, USA Rocı́o Fernández-Ballesteros, Spain Marina Fiori, France Brian F. French, USA Arthur C. Graesser, USA Patrick Griffin, Australia Jan-Eric Gustafsson, Sweden

Founders

Rocı́o Fernández-Ballesteros and Fernando Silva

Supporting Organizations

The journal is the official organ of the European Association of Psychological Assessment (EAPA). The EAPA was founded to promote the practice and study of psychological assessment in Europe as well as to foster the exchange of information on this discipline around the world. Members of the EAPA receive the journal in the scope of their membership fees. Further, the Division for Psychological Assessment and Evaluation, Division 2, of the International Association of Applied Psychology (IAAP) is sponsoring the journal: Members of this association receive the journal at a special rate (see below).

Publisher

Hogrefe Publishing, Merkelstr. 3, D-37085 Göttingen, Germany, Tel. +49 551 999-500, Fax +49 551 999-50111, E-mail publishing@hogrefe.com, Web http://www.hogrefe.com North America: Hogrefe Publishing, 7 Bulfinch Place, 2nd floor, Boston, MA 02114, USA, Tel. +1 866 823-4726, Fax +1 617 354-6875, E-mail customerservice@hogrefe-publishing.com, Web http://www.hogrefe.com

Production

Regina Pinks-Freybott, Hogrefe Publishing, Merkelstr. 3, D-37085 Göttingen, Germany, Tel. +49 551 999-500, Fax +49 551 999-50111, E-mail production@hogrefe.com

Subscriptions

Hogrefe Publishing, Herbert-Quandt-Strasse 4, D-37081 Göttingen, Germany, Tel. +49 551 50688-900, Fax +49 551 50688-998, E-mail zeitschriftenvertrieb@hogrefe.de

Advertising/Inserts

Melanie Beck, Hogrefe Publishing, Merkelstr. 3, D-37085 Göttingen, Germany, Tel. +49 551 999-500, Fax +49 551 999-50111, E-mail marketing@hogrefe.com

ISSN

ISSN-L 1015-5759, ISSN-Print 1015-5759, ISSN-Online 2151-2426

Copyright Information

© 2019 Hogrefe Publishing. This journal as well as the individual contributions and illustrations contained within it are protected under international copyright law. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without prior written permission from the publisher. All rights, including translation rights, reserved.

Publication

Published in 6 issues per annual volume (new in 2017; 4 issues from 2004 to 2016)

Subscription Prices

Calendar year subscriptions only. Rates for 2019: Institutions – from US $483.00/€370.00 (print only; pricing for online access can be found in the journals catalog at hgf.io/journals2019); Individuals – US $254.00/€199.00 (print & online). Postage and handling – US $16.00/€12.00. Single copies: US $85.00/€66.50 + postage and handling. Special rates: IAAP/Colegio Oficial de Psicólogos members: €129.00, US $164.00 (+ €18.00, US $24.00 postage and handling); EAPA members: Included in membership

Payment

Payment may be made by check, international money order, or credit card, to Hogrefe Publishing, Merkelstr. 3, D-37085 Göttingen, Germany, or, for North American customers, to Hogrefe Publishing, 7 Bulfinch Place, 2nd floor, Boston, MA 02114, USA.

Electronic Full Text

The full text of the European Journal of Psychological Assessment is available online at http://econtent.hogrefe.com and in PsycARTICLES.

Abstracting/Indexing Services

The journal is abstracted/indexed in Current Contents / Social & Behavioral Sciences (CC/S&BS), Social Sciences Citation Index (SSCI), Social SciSearch, PsycINFO, Psychological Abstracts, PSYNDEX, ERIH, and Scopus. 2017 Impact Factor 1.985, 5-year Impact Factor 2.357, Journal Citation Reports (Clarivate Analytics, 2018)

Editorial Board (continued)

Ronald K. Hambleton, USA William Hanson, Canada Sonja Heintz, Switzerland Sven Hilbert, Germany Joeri Hofmans, Belgium Therese N. Hopfenbeck, UK Jason Immekus, USA Jan Henk Kamphuis, The Netherlands David Kaplan, USA James C. Kaufman, USA Eun Sook Kim, USA Muneo Kitajima, Japan Radhika Krishnamurthy, USA Klaus Kubinger, Austria Patrick Kyllonen, USA Kerry Lee, Hong Kong Chung-Ying Lin, Hong Kong Jin Liu, USA Patricia A. Lowe, USA Romain Martin, Luxembourg R. Steve McCallum, USA Helfried Moosbrugger, Germany Kevin R. Murphy, Ireland Janos Nagy, Hungary Tuulia M. Ortner, Austria Marco Perugini, Italy K. V. Petrides, UK

Aaron Pincus, USA Kenneth K. L. Poon, Singapore Ricardo Primi, Brazil Richard D. Roberts, USA Willibald Ruch, Switzerland Leslie Rutkowski, Norway Jesus F. Salgado, Spain Douglas B. Samuel, USA Manfred Schmitt, Germany Heinz Schuler, Germany Martin Sellbom, New Zealand Valerie J. Shute, USA Stephen Stark, USA Jonathan Templin, USA Katherine Thomas, USA Stéphane Vautier, France Fons J.R. van de Vijver, The Netherlands Michele Vecchione, Italy David Watson, USA Nathan C. Weed, USA Alina von Davier, USA Cilia Witteman, The Netherlands Moshe Zeidner, Israel Johannes Zimmermann, Germany Ada Zohar, Israel Bruno Zumbo, Canada



Contents

Editorial

And Yet Another New Year’s Resolution: Pushing EJPA’s Advance Articles Samuel Greiff

1

Original Articles

The Relationship Between Faking and Response Latencies: A Meta-Analysis Laurențiu P. Maricuțoiu and Paul Sârbescu

3

Is Reliability Compromised Towards the End of Long Personality Inventories? Martin Bäckström and Fredrik Björklund

14

Working Alliance Inventory for Children and Adolescents (WAI-CA): Development and Psychometric Properties Bárbara Figueiredo, Pedro Dias, Vânia Sousa Lima, and Diogo Lamela

22

Measurement Invariance of English and French Language Versions of the 20-Item Toronto Alexithymia Scale Carolyn A. Watters, Graeme J. Taylor, Lindsay E. Ayearst, and R. Michael Bagby

29

Psychometric Properties of the Basic Psychological Need Satisfaction and Frustration Scale – Intellectual Disability (BPNSFS-ID) Noud Frielink, Carlo Schuengel, and Petri J. C. M. Embregts

37

Psychometric Properties of the Online Arabic Versions of BDI-II, HSCL-25, and PDS Pirko Selmo, Tobias Koch, Janine Brand, Birgit Wagner, and Christine Knaevelsrud

46


The Effect of Alternative Scoring Procedures on the Measurement Properties of a Self-Administered Depression Scale: An IRT Investigation on the CES-D Scale Noboru Iwata, Akizumi Tsutsumi, Takafumi Wakita, Ryuichi Kumagai, Hiroyuki Noguchi, and Naotaka Watanabe

55

Is SMS APPropriate? Comparative Properties of SMS and Apps for Repeated Measures Data Collection Erin I. Walsh and Jay K. Brinker

63

Psychometric Properties of the Borderline Personality Features Scale for Children-11 (BPFSC-11) in a Sample of Community Dwelling Italian Adolescents Andrea Fossati, Carla Sharp, Serena Borroni, and Antonella Somma

70

Applying the Latent State-Trait Analysis to Decompose State, Trait, and Error Components of the Self-Esteem Implicit Association Test Francesco Dentale, Michele Vecchione, Valerio Ghezzi, and Claudio Barbaranelli

78



Multistudy Reports

Validation of the Adult Substance Abuse Subtle Screening Inventory-4 (SASSI-4) Linda E. Lazowski and Brent B. Geary

86

Perceived Mutual Understanding (PMU): Development and Initial Testing of a German Short Scale for Perceptual Team Cognition Michael J. Burtscher and Jeannette Oostlander

98

Assessing Positive Orientation With the Implicit Association Test Giulio Costantini, Marco Perugini, Francesco Dentale, Claudio Barbaranelli, Guido Alessandri, Michele Vecchione, and Gian Vittorio Caprara

109

Assessing Personality With Multi-Descriptor Items: More Harm Than Good? Johannes Schult, Rebecca Schneider, and Jörn R. Sparfeldt

117

Detecting Random Responses in a Personality Scale Using IRT-Based Person-Fit Indices Tour Liu, Tian Lan, and Tao Xin

126

A Need for Cognition Scale for Children and Adolescents: Structural Analysis and Measurement Invariance Ulrich Keller, Anja Strobel, Rachel Wollschläger, Samuel Greiff, Romain Martin, Mari-Pauliina Vainikainen, and Franzis Preckel

137




Editorial

And Yet Another New Year's Resolution: Pushing EJPA's Advance Articles

Samuel Greiff
Institute of Cognitive Science and Assessment (COSA), University of Luxembourg, Luxembourg

A new year, 2019, has just started and everywhere you turn, you will find flashbacks of 2018, plans and thoughts on how 2019 will be different from 2018, and, of course, New Year's resolutions! And – spoiler alert – this editorial won't hold back on New Year's resolutions either. What virtually all New Year's resolutions have in common is that we want to do something better than in the past: do more sports, eat healthier, spend more time with the family, write more manuscripts, procrastinate less, and so forth. Despite the good nature of New Year's resolutions, attempts toward increasing their chance of success seem to work moderately well at best (Alan Marlatt & Kaplan, 1972; Oscarsson, Rozental, Andersson, & Carlbring, 2017). However, their popularity remains unbroken, and even journals, or at least their editors, tend to have New Year's resolutions (e.g., Woodruff, 2019). EJPA is no different, but our New Year's resolutions are specific enough to have a high likelihood of becoming reality – and they have implications for readers and authors.

The core New Year's resolution for EJPA is to reduce the number of published advance articles, often called "online-first articles", that we have accumulated over the years, and to assign them more quickly to an issue. This is not a new idea, by the way. In fact, the increase in the number of issues per year from 4 to 6 in 2017 was targeted at reducing the number of articles that had not yet been assigned to an issue, but due to the high number of submissions and their high quality, this attempt utterly failed. Thus, as we move into 2019 we are still faced with a substantial number of advance articles. This means that after acceptance of an article, it will take over 24 months until the paper is assigned to an issue and receives page numbers, even though it is finalized, available online (including a DOI), and citable usually no later than 6 months after acceptance. In a way, this is a bit like having a new novel (personally, I have been waiting for the sequels of some novels for plenty of years...) that is available as an e-book, but then it takes several years until the regular paper version is out. And sometimes it is just nicer to feel the paper in your hands... but there are other reasons why it is important to authors that their manuscript is assigned to an issue.

What does this mean specifically? First of all, we will obviously continue the six-issues-per-year cycle, but in addition, as a temporary measure, Hogrefe, the publishing house of EJPA, has agreed to publish issues of double their usual length in 2019, so this year – thanks to the publisher – you will find thicker issues in your mailbox. This will help EJPA reduce the time between acceptance of an article and its assignment to a specific issue to about 1 year, which is in line with what you see in many other major journals as well. For authors, this means that their papers are assigned to an issue much more quickly; for readers, it means that you might want to allocate more time to assessment-related topics and to the papers you will find in EJPA in 2019.

Related to this, my personal New Year's resolution is – among many other things, such as leading a healthy life and being a better person – not to write lengthy editorials, and at least for this one, it worked out. So, we have two resolutions in this editorial, both specific and actionable: (1) decreasing the backlog and (2) not writing extensive editorials. Interestingly, one is mastery-oriented, the other avoidance-oriented, in line with a study showing that around two thirds of all New Year's resolutions are mastery-oriented and one third avoidance-oriented (Woodruff, 2019).

But wait, what about your New Year's resolution? In case you haven't made any and need some suggestions, how about these:

– Continue to submit your best work to EJPA. We are looking forward to receiving your manuscripts;
– Attend the 15th ECPA in Brussels, Belgium, July 7–10, 2019, and meet great colleagues;
– Get involved, be it in EAPA as the association housing the journal or in EJPA as its flagship journal, be it as author, reviewer, board member, reader...;
– Give us feedback: What would you like to see improved in the journal? Where are we doing a good job? Let us know at ejpaeditor@gmail.com.

Whichever of the above you choose, the editorial team of EJPA sends best wishes for a prosperous year 2019.

References

Alan Marlatt, G., & Kaplan, B. E. (1972). Self-initiated attempts to change behavior. A study of New Year's resolutions. Psychological Reports, 30, 123–131. https://doi.org/10.2466/pr0.1972.30.1.123
Oscarsson, M., Rozental, A., Andersson, G., & Carlbring, P. (2017). New Year's resolutions. A large-scale randomized controlled trial. Paper presented at the 9th Swedish Congress on Internet Interventions, Linköping, Sweden, November 3, 2017.
Woodruff, T. K. (2019). New Year's resolutions; New year reviewers; New year of review. Endocrinology, 60, 36–37. https://doi.org/10.1210/en.2018-01003

Samuel Greiff
Institute of Cognitive Science and Assessment (COSA)
University of Luxembourg
11, Porte des Sciences
4366 Esch-sur-Alzette
Luxembourg
samuel.greiff@uni.lu

European Journal of Psychological Assessment (2019), 35(1), 1–2
https://doi.org/10.1027/1015-5759/a000521


Original Article

The Relationship Between Faking and Response Latencies: A Meta-Analysis

Laurențiu P. Maricuțoiu and Paul Sârbescu
Department of Psychology, West University of Timișoara, Romania
DOI: 10.1027/1015-5759/a000361

Abstract: The purpose of this meta-analysis was to analyze the relationship between faking and response latencies (RL). Research studies included in online databases, as well as papers identified in previous reviews, were considered for selection. Inclusion criteria for the studies were (a) to have an experimental faking condition, (b) to measure RL using a computer, and (c) to provide data for calculating Cohen's d effect sizes. Overall effects were significant in the case of the honest versus fake good condition (d = 0.20, Z = 3.05, p < .05) and in the case of the honest versus fake bad condition (d = 0.39, Z = 2.21, p < .05). Subgroup analyses indicated moderator effects of item type, with larger effects computed on RL of positively keyed items, as compared with RL of negatively keyed items.

Keywords: faking, social desirability, response latencies, meta-analysis

Personality assessment is largely based on questionnaires that require respondents to evaluate their agreement with various behaviors, or to indicate the degree to which a particular behavior is characteristic of them. In any assessment context, the person who is responding to a personality inventory has the option to answer truthfully or untruthfully, in accordance with their own objectives. Faking is defined by Ziegler, MacCann, and Roberts (2011) as a deliberate and intentional behavior that helps a person to achieve personal goals. Fake-good behaviors involve presenting the self in a more positive manner, as compared with an honest self-evaluation, while fake-bad behaviors involve presenting the self in a more negative manner, as compared with an honest self-evaluation. Systematic reviews showed that all personality variables included in the Big Five are fakable (Viswesvaran & Ones, 1999); therefore, the possibility of analyzing untruthful responses generalizes to most (if not all) personality variables. Because distorted answering can be a matter of respondents' intentions, psychologists have searched for methods to identify the occurrence of this phenomenon (Fluckinger, McDaniel, & Whetzel, 2008). Technological development and the spread of computerized administration of questionnaires allowed psychologists to record not only the responses to a personality questionnaire, but also the time needed by respondents to formulate that response. Response latencies represent the time spent between the appearance of the item (usually on a computer screen) and the moment a

participant responds to that item. Even from the beginnings of computerized administration of personality inventories, researchers like Dunn, Lushene, and O’Neil (1972) suggested that response latencies can be used for detecting faking tendencies. Research studies provided divergent results regarding the capability of response latencies to detect faking tendencies, therefore the generalized efficacy of using response latencies to identify faking is still unknown (Fluckinger et al., 2008). The present meta-analysis reviews the efficacy of response latencies in differentiating between honest and dishonest responding. To the best of our knowledge, this is the first quantitative analysis of research studies that investigated the latency differences between honest and dishonest responding. Therefore, the present article summarizes what is known about the relationship between faking and response latencies. Furthermore, findings from the present study provide conclusions with higher levels of generalizability (as compared with previous research studies) and suggest directions for future investigations.

Response Latencies and Faking

A widespread pop psychology belief states that "lying takes time." Contrary to this statement, early research studies found that shorter response latencies are associated with higher scores on social desirability scales (Dunn et al., 1972), and other studies showed that participants in faking




conditions have shorter response latencies as compared with participants in the standard condition (Hsu, Santelli, & Hsu, 1989). Hsu et al. (1989) interpreted these results as evidence of semantic evaluation of the items. According to the semantic evaluation perspective, socially desirable responses require an evaluation of the meaning of the item, while honest respondents access self-referenced information from memory in order to provide an answer. Therefore, the differences between the latencies exist because it is easier to evaluate the meaning of an item than to remember how often one has behaved in accordance with the content of that item (Hsu et al., 1989). However, other research studies found negative relationships between response latencies and distorted answering (Holden & Kroner, 1992), and reported that participants in faking conditions answer more slowly as compared with participants in control (or standard) experimental conditions (Holden, Kroner, Fekken, & Popham, 1992). The longer response latencies were attributed to inconsistencies between the respondent's answer and the self-schema, or to the higher levels of emotional arousal generated by the fear of being detected (Vasilopoulos, Reilly, & Leaman, 2000). According to the self-schema model (Holden et al., 1992), honest respondents answer consistently with their self-schemas, while dishonest respondents decide not to provide self-schematic information, after an evaluation of schematic information. As a consequence, honest respondents have shorter latencies because they use fewer cognitive processes, as compared with dishonest respondents. Other authors (Vasilopoulos et al., 2000) suggested that honest respondents present their self-schema information, while dishonest respondents access a schema of an ideal respondent. Therefore, dishonest responding will take longer as compared with honest responding, because the schema of an ideal respondent is less accessible than the self-schema. Support for this perspective can also be found in the results of the meta-analysis on cues of lying conducted by DePaulo et al. (2003), which reported that response latencies were longer when the social actors did not plan their answers. The debate between the semantic and self-referenced models is still not settled, and current research is still providing evidence in favor of the semantic (Van Hooft & Born, 2012) or the self-referenced (Shoss & Strube, 2011) interpretation of the relationships between response latencies and faking. Consequently, we formulated the first hypothesis:

Hypothesis 1: Response latencies of participants in faking conditions will be different from the response latencies of respondents in control (or honest) conditions.

Moderators of the Relationship Between Faking and Response Latencies

Previous research studies discussed several moderator variables of the relationship between faking tendencies and response latencies. These moderator variables are well documented in the literature, but not all researchers have controlled their influence (see the Appendix for more information). Because response latencies represent the time spent between the appearance of the item and the participant's response, their value can be influenced by personal characteristics (e.g., reading speed, experience with computerized testing) and by item characteristics (e.g., item length, item difficulty). Previous research studies showed that item characteristics and reading speed account for more than half of the latency variance (Dunn et al., 1972; Tetrick, 1989). To limit the effect of these variables, some researchers (Holden et al., 1992; Holden & Kroner, 1992; Holden, 1995) used the double standardization procedure. In a double standardization procedure, latencies are first standardized across the items within each participant, and then latencies are standardized within each item using the means and standard deviations of the control group. The double standardization method was criticized because it does not allow for the investigation of between-respondent differences (Brunetti, Schlottmann, Scott, Mihura, & Hollrah, 1998; Vasilopoulos et al., 2000). Because of this shortcoming, researchers controlled for reading speed by randomizing the participants (Brunetti et al., 1998), by regressing the response latencies on a measure of item complexity (Robie et al., 2000), or by using covariate variables (Vasilopoulos et al., 2000). Because of the large amount of latency variance accounted for by personal and item characteristics, we anticipated that stronger effects would be reported by research studies that controlled for these characteristics, as compared with research studies that did not control the effects of such variables. Therefore, we formulated the second hypothesis:

Hypothesis 2: Research studies that controlled for personal and item characteristics will report larger effect sizes, as compared with research studies that did not control for personal and item characteristics.

Other moderator variables documented in the literature are the type of answer provided by the respondent (endorsing/rejecting the item) and the type of item (positively/negatively keyed). Findings reported by previous research studies have shown that "true" responses have shorter latencies as compared with "false" responses (Dunn et al., 1972; Robie et al., 2000; Tetrick, 1989). Tetrick (1989) interpreted this result using findings from the cognitive literature, which suggested that comparisons between



similar stimuli are faster than comparisons between different stimuli. Moreover, studies reported that response latencies are influenced by an interaction effect between the answer type (endorsement/rejection) and the item type (positively/negatively keyed). One robust finding of previous studies (Dunn et al., 1972; Holden, 1995; Robie et al., 2000; Tetrick, 1989) is that the smallest latencies were found for the true responses to items that are positively keyed. Holden (1995) suggested that "true" responses to positively keyed items are congruent with a schema of desirable responding; therefore, these responses are quicker than other types of responses, which are less schema-congruent. In conclusion, we addressed the following two hypotheses:

Hypothesis 3: The effects computed on response latencies of endorsing answers will be larger, as compared with the effects computed on response latencies of rejecting answers.

Hypothesis 4: The effects computed on response latencies of answers to positively keyed items will be larger, as compared with the effects computed on response latencies of answers to negatively keyed items.
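To make the double standardization procedure invoked in Hypothesis 2 concrete, the following minimal sketch shows one way it could be implemented for a participants × items matrix of raw latencies; the function name, the NumPy-based implementation, and the example values are illustrative assumptions rather than code used in any of the reviewed studies.

import numpy as np

def double_standardize(latencies, control_mask):
    """Doubly standardize a participants x items matrix of response latencies.

    Step 1: z-standardize each participant's latencies across items
            (removes differences in overall speed, e.g., reading speed).
    Step 2: z-standardize each item's latencies across participants,
            using the mean and SD of the control (honest) group only
            (removes item characteristics such as length or difficulty).
    """
    lat = np.asarray(latencies, dtype=float)

    # Step 1: within-participant standardization (row-wise).
    row_means = lat.mean(axis=1, keepdims=True)
    row_sds = lat.std(axis=1, ddof=1, keepdims=True)
    z_within = (lat - row_means) / row_sds

    # Step 2: within-item standardization using control-group statistics.
    ctrl = z_within[np.asarray(control_mask, dtype=bool)]
    item_means = ctrl.mean(axis=0, keepdims=True)
    item_sds = ctrl.std(axis=0, ddof=1, keepdims=True)
    return (z_within - item_means) / item_sds

# Illustrative example: 4 participants (first 2 honest controls) x 3 items.
raw = [[1.8, 2.4, 3.1],
       [2.0, 2.9, 3.5],
       [1.2, 1.6, 2.0],
       [2.5, 3.4, 4.2]]
z = double_standardize(raw, control_mask=[True, True, False, False])
print(z.round(2))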

The Present Meta-Analysis

Literature Search and Study Selection

A systematic search of papers indexed by Web of Science, PsycINFO, EconLit, ERIC, DOAJ, Academic Search Premier, ProQuest Dissertations and Theses, and Scopus was conducted, using the term "response latency" in combination with the following words: "social desirability", "honesty", "faking", "fake good", "fake bad", "lie scales". In addition to the results of this search, we also looked for papers in two reviews (DePaulo et al., 2003; Viswesvaran & Ones, 1999), and subsequently identified another two papers (the entire process of study selection and study analysis is presented in Figure 1). The search yielded 84 titles (68 published research papers and 16 unpublished PhD theses) and their abstracts, which were analyzed by both authors individually. We selected only the studies that, in their abstracts, reported measuring response latencies of questionnaires. We excluded theoretical papers, studies that did not report measuring response latencies, studies that measured response latencies for other purposes (e.g., Stroop tasks), or studies that correlated the response latencies of


personality scales with social desirability scores. A study was excluded only if both evaluators considered it not eligible for the present meta-analysis. Based on this analysis, 47 studies were considered for further analysis. Out of the 47 papers, 45 were available in full-text on the online databases we had access to.

Study Inclusion Criteria

In order to consider an article eligible for inclusion in the present meta-analysis, we used the following cumulative criteria: (a) the research had to include an experimental faking condition, (b) response latencies (as the dependent variable) had to be measured using a computer, and (c) the authors had to report statistical indices that allowed the computation of effect sizes. We did not apply any restrictions regarding the professional or ethnic background of the participants or their geographical location. Ultimately, 16 studies were considered eligible for inclusion in the present meta-analysis. These studies contained 22 independent samples and 155 effect sizes.

Study Coding Study coding was accomplished in two stages: (a) extraction of relevant information for computing effect sizes and (b) analysis and coding of study characteristics. In the first stage, we independently analyzed each eligible article and extracted the information needed for computing effect sizes: experimental and control group sample sizes, means and standard deviations or values of statistical tests (t test), if the previous ones were not available. In the second stage, we independently analyzed the Method section of each article and examined the characteristics of participants, research design, and procedure. Regarding the sample characteristics, we looked for the occupation, the gender distribution and the nationality of participants, and whether they received any incentive for their participation in the study. Regarding the research design, we identified the design type (between or counterbalanced within), the faking condition type (normal, with coaching or incentive), and whether randomization was applied. Regarding procedure, we looked for information regarding the reason for faking (e.g., to get a job), the motivation for faking (e.g., money), whether participants were warned about faking detection methods, and whether participants knew that their response latencies are being recorded. The information concerning the treatment of raw response latencies (no treatment, simple standardization, double standardization, or log transformation) was usually presented in the Method section of the research papers. If the authors did not present any information concerning the transformation of European Journal of Psychological Assessment (2019), 35(1), 3–13




raw response latencies, we included the research study in the no treatment category. We extracted information concerning analyses that accounted for item type or answer type from the descriptive statistics tables, and we computed different effect sizes for each of these categories.

[Figure 1. The PRISMA flow diagram of the paper analysis. Records identified through database searching (n = 84); additional records identified in previous reviews and reference lists (n = 3); records screened (n = 84); records excluded (n = 39); records considered for further analysis (n = 45); records not available in full text (n = 2); full-text articles assessed for eligibility (n = 46); papers that did not measure response latencies (n = 10); papers that did not analyze the relationship between response latencies and faking (n = 6); papers that did not report proper statistical results (n = 11); non-experimental research papers (n = 3); studies included in the meta-analysis (n = 16).]

Statistical Analysis

Comprehensive Meta-Analysis Version 2.0 software (Borenstein, Hedges, Higgins, & Rothstein, 2005) was used for all calculations. The effect size used for the present meta-analysis is the standardized mean difference, also known as Cohen's d. According to Cohen (1988), a value around .20 indicates a weak effect, a value around .50 indicates a medium effect, and a value around .80 indicates a strong effect. A positive effect size indicates that participants in faking conditions had longer response times than those in the control condition, while a negative effect size indicates that participants in faking conditions responded more quickly than participants in the control condition. Depending on the data available from the articles, the standardized mean difference was calculated in two different ways: from means and standard deviations, or from t-test values. Because of the variety of research designs, procedures, and participants' nationalities, we assumed a random variation of the "true" effect size from one study to another. Therefore, a random-effects meta-analysis was conducted. For studies that reported multiple outcomes (studies that allowed for the computation of more than one effect size), we averaged the effect sizes into a combined effect. These calculations were accomplished with the Comprehensive Meta-Analysis Version 2.0 software (Borenstein et al., 2005).
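As an illustration of the effect size computations described above, the sketch below derives the standardized mean difference from group means and standard deviations or from an independent-samples t value, and averages multiple outcomes into a combined effect; the function names and example values are hypothetical, and this is not output from the Comprehensive Meta-Analysis software that was actually used.

import math

def d_from_means(m_fake, sd_fake, n_fake, m_honest, sd_honest, n_honest):
    """Cohen's d from group means and SDs, using the pooled standard deviation.
    Positive values indicate longer latencies in the faking condition."""
    pooled_var = ((n_fake - 1) * sd_fake**2 + (n_honest - 1) * sd_honest**2) / (n_fake + n_honest - 2)
    return (m_fake - m_honest) / math.sqrt(pooled_var)

def d_from_t(t_value, n_fake, n_honest):
    """Cohen's d recovered from an independent-samples t statistic."""
    return t_value * math.sqrt(1.0 / n_fake + 1.0 / n_honest)

def combined_effect(effect_sizes):
    """Average several effect sizes from the same sample into one combined effect
    (a simple unweighted mean, mirroring the averaging of multiple outcomes)."""
    return sum(effect_sizes) / len(effect_sizes)

# Hypothetical study reporting latencies (in seconds) for faking vs. honest groups.
d1 = d_from_means(m_fake=3.2, sd_fake=0.9, n_fake=50, m_honest=2.9, sd_honest=0.8, n_honest=50)
d2 = d_from_t(t_value=1.8, n_fake=50, n_honest=50)
print(round(d1, 2), round(d2, 2), round(combined_effect([d1, d2]), 2))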



The observed (or post hoc) statistical power of the meta-analytical results was computed using the formulas provided by Hedges and Pigott (2001), which were detailed by Borenstein et al. (2009). For each average effect size, we computed the observed statistical power for a .05 significance criterion, using the following information: the number of independent samples (k), the average effect size, the average experimental group size, the average control group size, and the between-studies variance (τ²). Observed statistical power above .80 is a generally accepted threshold (Cohen, 1988), which indicates an 80% probability of finding a difference that does exist.
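The following sketch illustrates, under simplifying assumptions (equal group sizes across studies and a known between-studies variance), how such an observed power value can be approximated in the spirit of Hedges and Pigott (2001); the function and the input values are hypothetical and do not reproduce the exact computations reported here.

import math
from scipy.stats import norm

def observed_power(k, d, n_exp, n_ctrl, tau2, alpha=0.05):
    """Approximate post hoc power of the two-sided test of the mean effect in a
    random-effects meta-analysis: within-study variance of d plus the
    between-studies variance tau2, averaged over k studies."""
    # Typical within-study variance of a standardized mean difference.
    v_within = (n_exp + n_ctrl) / (n_exp * n_ctrl) + d**2 / (2 * (n_exp + n_ctrl))
    # Variance of the random-effects mean across k studies.
    v_mean = (v_within + tau2) / k
    lam = d / math.sqrt(v_mean)          # noncentrality of the Z test
    z_crit = norm.ppf(1 - alpha / 2)
    return (1 - norm.cdf(z_crit - lam)) + norm.cdf(-z_crit - lam)

# Hypothetical, illustrative inputs (not the values underlying the reported powers).
print(round(observed_power(k=19, d=0.20, n_exp=60, n_ctrl=100, tau2=0.04), 2))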

Results

The present meta-analysis is based on 16 research studies with 22 independent samples. The main characteristics of these studies are synthesized in Table 1, and a list of the characteristics for each research paper is presented in the Appendix.

Overall Effects

The first hypothesis of this meta-analysis anticipated that dishonest participants would have different response latencies, as compared with honest participants. The overall results (Table 2) are statistically significant in the case of the honest versus fake good condition (d = 0.23, Z = 2.84, p < .05), and not statistically significant in the case of the honest versus fake bad condition (d = 0.38, Z = 1.22, p > .05). Following the computation of the overall results, we examined the data for extreme effect sizes that might have a disproportionate influence on the analyses. In the case of the honest versus fake good condition, we identified three research studies that reported effect sizes more than two standard errors below (Hsu et al., 1989, d = −.87) or above (Esser & Schneider, 1998, d = 1.19; Holden et al., 1992, Study 2, d = 1.02) the average effect size. An analysis that omitted these three research studies yielded similar results (d = 0.20, Z = 3.05, p < .01), with small changes in terms of observed statistical power or between-study heterogeneity. In the case of the honest versus fake bad condition, we identified two research studies that reported effect sizes more than two standard errors below (Hsu et al., 1989, d = −.90) or above (Holden et al., 1992, Study 3, d = 1.5) the average effect size. A restricted analysis of the honest versus fake bad condition yielded statistically significant results (d = 0.39, Z = 2.21, p < .05). Although the average effect size resulting from the restricted analysis (d = 0.39) is not very different from



the initial result (d = 0.38), the results of the restricted analysis had smaller heterogeneity and superior statistical power, as compared with the initial analysis. These results supported the first hypothesis of the present meta-analysis in the case of both types of faking tendencies. The heterogeneity of effects was assessed using the traditional chi-square (or Q test) and the I2 index. The Q test is used for verifying whether the differences between the studies and their averaged effect are either marginally or statistically significant (Borenstein et al., 2009). Although the Q test is not adequate for random-effect meta-analyses (such as the present meta-analysis), we reported it because (a) its value is important for computation of other dispersion indices such as I2, and (b) because a significant Q test is an argument in favor of using a random-effects meta-analysis (for a more in-depth discussion, see Borenstein et al., 2009, p. 107–125). The I2 index estimates the percentage of effect variance that can be attributed to systematic between-study variations, and values above 50% can be considered moderate to high (Borenstein et al., 2009). The values of I2 indicated high between-study variance that can be attributed to systematic differences between these studies. Taken as a whole, the results concerning Q and I2 support the presence of moderator variables affecting the variation of results from one study to another.
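For readers who want to see how Q, I², and a random-effects summary fit together, here is a minimal illustrative sketch using the DerSimonian–Laird estimator; the authors performed these computations with Comprehensive Meta-Analysis, so this is only an assumed, simplified reimplementation with made-up inputs.

import math

def random_effects_summary(d_values, variances):
    """Illustrative DerSimonian-Laird random-effects summary for study-level
    effect sizes d with within-study variances. Returns heterogeneity Q,
    I2 (in %), between-studies variance tau2, the pooled effect, and its Z."""
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    d_fixed = sum(wi * di for wi, di in zip(w, d_values)) / sum(w)
    q = sum(wi * (di - d_fixed) ** 2 for wi, di in zip(w, d_values))
    df = len(d_values) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    w_re = [1.0 / (v + tau2) for v in variances]          # random-effects weights
    d_re = sum(wi * di for wi, di in zip(w_re, d_values)) / sum(w_re)
    se_re = math.sqrt(1.0 / sum(w_re))
    return {"Q": q, "df": df, "I2": i2, "tau2": tau2, "d": d_re, "Z": d_re / se_re}

# Hypothetical study-level effects (not the actual data set of this meta-analysis).
summary = random_effects_summary(
    d_values=[0.35, 0.10, -0.05, 0.60, 0.25],
    variances=[0.04, 0.05, 0.06, 0.08, 0.05],
)
print({key: round(val, 2) for key, val in summary.items()})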

Moderator Analysis

The other three hypotheses of this meta-analysis addressed the potential moderating effect of the answer type, the item type, and the double standardization procedure on the relationship between faking and response latencies. In these analyses we used only the honest versus fake good effect, as there were too few studies concerning the honest versus fake bad effect to create study subgroups. The results concerning the moderators of the honest versus fake good effect (Table 3) suggested that the answer type and the item type had a moderating effect. When endorsing an item, participants in the fake good condition had significantly longer response latencies (d = 0.33, Z = 2.91, p < .05), as compared with participants who were instructed to answer honestly. These differences are not statistically significant in the case of answers that rejected the item (d = 0.04, Z = 0.42, ns), or in the case of effects that did not control for the type of answer (d = 0.16, Z = 1.31, ns). The analysis of item type as a moderator variable yielded significant results in the case of positively keyed items (d = 0.45, Z = 2.13, p < .05), with significant between-studies variance, Q(6) = 40.19, p < .01, I² = 85.07 (all Q and I² values are presented in Table 3). This means that, for the positively keyed





Table 1. Characteristics of the samples included in the review (the percentage of congruent evaluations for each characteristic is given in parentheses)

Occupation of participants (100): Students: 14 samples; Employees: – samples; Unemployed adults: 2 samples; Inmates: 1 sample; Participants with alcohol/drug-related problems: 1 sample; Job applicants: 1 sample
Geographical area of the participants (100): USA: 10 samples; Canada: 8 samples; The Netherlands: 1 sample; Germany: 2 samples; Croatia: 1 sample
Gender distribution (% of women) (92.34): Average: 56.3%
Type of reward received (100): No reward: 14 samples; Money: 5 samples; Credits: 2 samples; Mixed: 1 sample
Randomization of participants (94.74): Yes: 20 samples; No: 2 samples
Type of sample (94.74): Between: 19 samples; Within: 2 samples; Mixed: 1 sample
Reason for faking (96.36): Imaginary job: 9 samples; Just "because": 11 samples; Transfer to another prison: 1 sample; Imaginary college: 1 sample
Motivation for faking (100): No motivation: 20 samples; Prize: 2 samples
Participants were notified about faking detection methods (96.36): Yes: 8 samples; No: 13 samples; Yes for the Fake Good with Coaching condition: 1 sample
Participants knew their response latencies were being measured (100): No: 21 samples; Yes for the Fake Good with Coaching condition: 1 sample
Data treatment procedure (88.23): Double standardization: 13 samples; No treatment: 5 samples; Log transformation: 2 samples; Simple standardization: 2 samples

items, participants in the fake good condition had significantly longer response latencies than participants in the control condition. Other results indicated significant effects for studies in which the double standardization procedure was used (d = 0.32, Z = 3.27, p < .01), and nonsignificant effects for studies that used other methods for controlling participant and item characteristics (d = 0.11, Z = 0.60, ns) or for studies that did not control such characteristics (d = 0.09, Z = 0.45, ns).

Publication Bias

Publication bias can affect any systematic review (Borenstein et al., 2009) and can artificially increase the average effect size reported by any meta-analysis. Because the present meta-analysis found a small average effect size, we estimate that publication bias has little influence on this result. We investigated publication bias using graphical methods (the funnel plot) and found that the studies included in the present review are distributed symmetrically around the mean effect size, indicating that publication bias had little influence on the conclusions of the present meta-analysis.
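A funnel plot of this kind can be drawn by plotting each study's effect size against its standard error; the sketch below uses matplotlib and purely illustrative data, not the studies analyzed in this review.

import matplotlib.pyplot as plt

# Hypothetical study-level effect sizes and standard errors (illustrative only).
effects = [0.35, 0.10, -0.05, 0.60, 0.25, 0.15, 0.40]
std_errors = [0.20, 0.22, 0.25, 0.28, 0.21, 0.15, 0.24]
mean_effect = sum(effects) / len(effects)

plt.scatter(effects, std_errors)
plt.axvline(mean_effect, linestyle="--")   # symmetry reference line
plt.gca().invert_yaxis()                   # more precise studies appear at the top
plt.xlabel("Effect size (d)")
plt.ylabel("Standard error")
plt.title("Funnel plot (illustrative data)")
plt.show()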

Discussion

The present review assessed whether response latencies to a questionnaire can be used to discriminate between honest and dishonest respondents. We were interested in providing a general answer to this question because numerous




Table 2. Overall effects of faking on response latencies

Honest versus fake good (all studies): k = 22; d [95% CI] = 0.23 [0.07; 0.39]; Z = 2.84**; observed power = .84; Q(21) = 57.19**; I² = 63.28; total N control = 2,061; total N faking = 1,236
Honest versus fake good (without outliers): k = 19; d [95% CI] = 0.20 [0.07; 0.33]; Z = 3.05**; observed power = .88; Q(18) = 31.09**; I² = 42.09; total N control = 1,984; total N faking = 1,159
Honest versus fake bad (all studies): k = 7; d [95% CI] = 0.38 [−0.23; 0.99]; Z = 1.22; observed power = .22; Q(6) = 34.54**; I² = 82.63; total N control = 251; total N faking = 252
Honest versus fake bad (without outliers): k = 5; d [95% CI] = 0.39 [0.04; 0.74]; Z = 2.21*; observed power = .60; Q(4) = 4.94; I² = 19.06; total N control = 202; total N faking = 203

Notes. k = the number of samples included in the analysis; d = the average effect size; Z = the statistical test used for computing the significance of the average effect size; observed power = the observed statistical power of the result; Q = the statistical test used for the estimation of heterogeneity; I² = the proportion of effect size variance that can be attributed to moderator variables. *p < .05, **p < .01.

Table 3. Moderator analysis of the honest versus fake good effect

Answer type
Endorsed: k = 10; d [90% CI] = 0.33 [0.11; 0.55]; Z = 2.91**; observed power = 0.86; Q(9) = 22.29**; I² = 59.63
Rejected: k = 5; d [90% CI] = 0.04 [−0.13; 0.20]; Z = 0.42; observed power = 0.07; Q(4) = 1.26; I² = 0
Mixed: k = 12; d [90% CI] = 0.16 [−0.08; 0.41]; Z = 1.31; observed power = 0.30; Q(11) = 37.27**; I² = 70.49

Item type
Positively keyed: k = 7; d [90% CI] = 0.45 [0.04; 0.87]; Z = 2.13*; observed power = 0.59; Q(6) = 40.19**; I² = 85.07
Negatively keyed: k = 7; d [90% CI] = 0.11 [−0.04; 0.26]; Z = 1.49; observed power = 0.33; Q(6) = 3.96; I² = 0
Mixed: k = 15; d [90% CI] = 0.21 [−0.01; 0.43]; Z = 1.86; observed power = 0.61; Q(14) = 47.01**; I² = 70.22

Control of personal and item characteristics
Double standardization: k = 13; d [90% CI] = 0.32 [0.13; 0.52]; Z = 3.27**; observed power = 0.91; Q(12) = 23.26**; I² = 48.42
Other method for control: k = 4; d [90% CI] = 0.11 [−0.24; 0.45]; Z = 0.60; observed power = 0.09; Q(3) = 5.16; I² = 41.86
No control: k = 5; d [90% CI] = 0.09 [−0.29; 0.46]; Z = 0.45; observed power = 0.07; Q(4) = 26.81**; I² = 85.08

Notes. k = the number of samples included in the analysis; d = the average effect size; Z = the statistical test used for computing the significance of the average effect size; Q = the statistical test used for the estimation of heterogeneity; I² = the proportion of effect size variance that can be attributed to moderator variables. *p < .05, **p < .01.

decisions are based on self-reported data and are therefore influenced by the quality of such data. The overall effects were statistically significant, suggesting that honest respondents are quicker, as compared with dishonest respondents. According to our results, this difference is significant both for fake good and for fake bad tendencies. However, the overall result for fake bad tendencies should be interpreted with caution because of the modest statistical power. Heterogeneity indices suggested that moderator variables have a large impact on these results. Moderator analyses of the fake good effects indicated a significant effect only for responses to positively keyed items, and not for negatively keyed items. Because of the small number of studies, we could not investigate the influence of moderator variables for the fake bad effects. Current models on impression-managed responding lead to divergent conclusions regarding the relationship between response latencies and faking. These theoretical perspectives suggested that faking involves quicker responses as compared with honest responding (the case of the semantic exercise model – Hsu, Santelli, & Hsu, 1989), or that faking involves slower responses as compared with honest

responding (the case of the self-referenced model or the adopted-schema model). The overall results of the present meta-analysis provide support for the schema models of honest responding (Holden et al., 1992). Our results also suggested that the type of answer provided by the respondent had a moderation effect. Honest and dishonest respondents have statistically different response latencies when endorsing an item, and similar response latencies when rejecting an item. Furthermore, analyses have indicated that, for positively keyed items, honest respondents have significantly shorter response latencies, as compared with respondents who are trying to present themselves in a positive manner. This effect seems to be absent in the case of negatively keyed items, or in the case of research studies that did not take into account the type of item. In both cases we observed small differences between honest and dishonest responding, but these results do not have sufficient statistical power to reach statistical significance. Regarding the interaction between the two moderators, the present meta-analysis did not have enough research studies to analyze such relationships, but we expect to find significant differences between latencies of



honest and latencies of dishonest respondents for endorsement of positively keyed items, and not in the case of rejection of positively keyed items. The overall results of the present meta-analysis indicated a significant effect and large between-study variation of effect sizes. The heterogeneity of results from previous studies was diminished when we controlled for the type of item, but significant proportions of between-study variance remained unexplained. To address the problem of heterogeneity, we investigated the moderating effect of study characteristics. Because researchers agreed that item characteristics and reading speed are important variables that account for large proportions of latency variance (Dunn et al., 1972; Tetrick, 1989), we considered it is important to control for their possible influence on the results of any research. Our analyses found that research studies that used the double standardization procedure to control for personal and item characteristics obtained significant positive effects (in support of the self-referenced responding model), while the research studies that used the raw response latencies obtained smaller effects. As a consequence, we encourage future research studies to control for these characteristics. Controlling for personal and item characteristics when analyzing response latencies data has the benefit of removing large proportions of latency variance (more than 50% of all variance, according to Dunn et al., 1972). In our opinion, this could explain why research studies that used the double standardization procedure provided more similar results (had small heterogeneity indices), as compared with research studies that did not control for these characteristics, and obtained very different results from one study to the other (had large heterogeneity indices). Finally, although the present meta-analysis identified some interesting relationships between response latencies and faking, our analysis of the literature provided theoretical explanations only for the main effects, and not for the moderation effects. Although previous researchers (Holden, 1995; Robie et al., 2000; Tetrick, 1989) interpreted similar findings using results from cognitive sciences, it is our opinion that further research is needed to fully understand the cognitive mechanisms responsible for the interaction effect identified in this review. For example, Ziegler (2011) concluded that fakers evaluate the meaning of the item and the importance of an item in terms of the situation (or scenario) that needs to be faked, prior to the selfschema analysis. This finding is not fully integrated in the schema models of faking, and can add alternative explanations to the differences in response latencies identified in this meta-analysis. Therefore, the most important issue regarding the use of response latencies as indicator of faking is the development of an integrative theoretical



background that will explain the main effects, the moderation effects, and other possible moderator variables.

Limitations The findings of the present meta-analysis have limitations. The first limitation is that we could not access the full-text version of all research studies that we have found in our initial search. However, we estimate that the inclusion of the two research studies could have determined a reduction of the between-study heterogeneity, but it is unlikely that it could have major influences on the averaged effect sizes computed in the present meta-analysis. The second limitation is related to unexplained betweenstudy heterogeneity. Although study heterogeneity is a part of the random-effects model and we identified one important moderator variable, indices such as the I2 suggested the presence of moderators. Because these moderators remain unknown, the large proportion of unexplained heterogeneity represents a limitation. The third limitation is that research studies included in this meta-analysis analyzed items that assessed normal (e.g., the traits included in the Big Five Model) or pathological (e.g., the MMPI scales) personality traits. Therefore, it is hazardous to generalize our results to other types of self-reported data (e.g., evaluation of peer/subordinate performance) that might be collected in organizational settings. The fourth limitation is that we had a modest sample of studies in the analysis of the main effects (overall fake-good and fake-bad effects), and in the moderator analyses. As a consequence, some of these analyses had a statistical power below the .80 threshold, which represents a limitation of the present meta-analysis. However, the main conclusions of this paper have optimal statistical power. Finally, another limitation is that we had to average the multiple effects reported by an individual study. Although this approach is suggested by Borenstein et al. (2009) and was implemented automatically by the software used in our analyses, it diminishes the differences that might have occurred from one questionnaire scale to another. Therefore, it can have an impact on the moderation analyses presented in this meta-analysis.

Implications for Future Research and Practice

The results of the present review suggested that dishonest responding generates longer response latencies as compared with honest responding, regardless of whether it aims




at generating a positive or a negative self-presentation. We believe the effects are large enough to encourage further research studies on the relationship between faking and response latencies. Although these effects are statistically significant, the heterogeneity indices suggested that there are some unknown moderator variables that require further research studies. For this reason, we believe it is too early to derive any practical implications, based on the results of the present meta-analysis. Future research studies should focus on the intra-individual differences in response latencies (differences between items from the same respondent), because the results of the present paper are based on between-subjects effects (comparisons between participants in honest vs. faking conditions). This data aggregation implies that we still have little information about how people respond to individual items; therefore, it is difficult to use our results with the purpose of detecting individual faking tendencies. Because faking generates increased response latencies only for positively keyed items, researchers could compute an intra-individual index using the difference in response latencies between positively-keyed and negatively-keyed items from the same participant (as an intra-individual index). In addition to the research studies aimed at investigating the predictive properties of such an intra-individual index, we believe that more efforts are needed for theory development. At this moment, the mechanisms responsible for the effects identified in this meta-analysis are generally unknown. In our opinion, a better understanding of these mechanisms should lead to smaller between-studies variability and to better integration of response latencies in the psychological practice. Acknowledgments The authors would like to thank the two anonymous reviewers for their valuable comments. This work was supported by a grant of the Romanian Ministry of Education, CNCS – UEFISCDI, Project Number PN-IIRU-PD-2012-3-0161, and Project Number PN-II-ID-PCE2012-4-0621. This organization had no role in the design and implementation of the study.
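As a purely illustrative sketch of the intra-individual index proposed in the Implications section above, the following code computes, for each respondent, the difference between mean (standardized) latencies on positively keyed and negatively keyed items; the function name, the data, and the key assignments are hypothetical assumptions, not a validated scoring rule.

import numpy as np

def intra_individual_index(z_latencies, positively_keyed):
    """Per-respondent indicator: mean standardized latency on positively keyed
    items minus mean standardized latency on negatively keyed items. Larger
    values would be expected under faking good, because faking lengthened
    latencies mainly on positively keyed items."""
    z = np.asarray(z_latencies, dtype=float)
    pos = np.asarray(positively_keyed, dtype=bool)
    return z[:, pos].mean(axis=1) - z[:, ~pos].mean(axis=1)

# Illustrative doubly standardized latencies: 3 respondents x 4 items
# (items 0 and 1 positively keyed, items 2 and 3 negatively keyed).
z = [[0.9, 0.7, -0.1, 0.0],
     [0.1, -0.2, 0.2, 0.1],
     [1.2, 0.8, 0.1, -0.2]]
print(intra_individual_index(z, positively_keyed=[True, True, False, False]).round(2))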

References

References marked with an asterisk indicate studies included in the meta-analysis.

Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2005). Comprehensive meta-analysis (Version 2) [Computer software]. Englewood, NJ: Biostat.
Borenstein, M., Hedges, L., Higgins, J., & Rothstein, H. (2009). Introduction to meta-analysis. Chichester, UK: Wiley.



*Brunetti, D. G., Schlottmann, R. S., Scott, A. B., Mihura, J. L., & Hollrah, J. L. (1998). Instructed faking and MMPI-2 response latencies: The potential for assessing response validity. Journal of Clinical Psychology, 54, 143–153. doi: 10.1002/(SICI)10974679(199802)54:2<143:AID-JCLP3>3.0.CO;2-T Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum. DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H. (2003). Cues to deception. Psychological Bulletin, 129, 74–118. doi: 10.1037/0033-2909. 129.1.74 Dunn, T. G., Lushene, R. E., & O’Neil, H. F. (1972). Complete automation of the MMPI and a study of its response latencies. Journal of Consulting and Clinical Psychology, 39, 381–387. doi: 10.1037/h0033855 *Eakin, D. E. (2004). Detection of feigned posttraumatic stress disorder: A multimodal assessment strategy (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses (UMI No 3154807). *Esser, C., & Schneider, J. F. (1998). Differentielle Reaktionslatenzzeiten beim Bearbeiten von Persönlichkeitsfragebogen als möglicher Indikator für Verfälschungstendenzen [Differential response latencies as a possible indicator for detecting faking on personality test items]. Zeitschrift für Differentielle und Diagnostische Psychologie, 19, 246–257. doi: http://www. psyjournals.com/content/120245 Fluckinger, C. D., McDaniel, M. A., & Whetzel, D. L. (2008). Review of faking in personnel selection. In M. Mandal (Ed.), In search of the right personnel. New Delhi, India: McMillian. *Gore, B. A. (2000). Reducing and detecting faking on a computeradministered biodata questionnaire (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses (UMI No. 9954139). Hedges, L. V., & Pigott, T. D. (2001). The power of statistical tests in meta-analysis. Psychological Methods, 6, 203–217. doi: 10.1037/1082-989X.6.3.203 *Holden, R. R. (1995). Response latency detection of fakers on personnel tests. Canadian Journal of Behavioural Science/ Revue canadienne des sciences du comportement, 27, 343–355. doi: 10.1037/0008-400X.27.3.343 *Holden, R. R. (1998). Detecting fakers on a personnel test: Response latencies versus a standard validity scale. Journal of Social Behavior and Personality, 13, 387–398. http://psycnet. apa.org/psycinfo/1998-10358-014 *Holden, R. R., & Hibbs, N. (1995). Incremental validity of response latencies for detecting fakers on a personality test. Journal of Research in Personality, 29, 362–372. doi: 10.1006/ jrpe.1995.1021 Holden, R. R., & Kroner, D. G. (1992). Relative efficacy of differential response latencies for detecting faking on a self-report measure of psychopathology. Psychological Assessment, 4, 170–173. doi: 10.1037/1040-3590.4.2.170 *Holden, R. R., & Lambert, C. E. (2015). Response latencies are alive and well for identifying fakers on a self-report personality inventory: A reconsideration of van Hooft and Born (2012). Behavior Research Methods, 47, 1436–1442. doi: 10.3758/ s13428-014-0524-5 *Holden, R. R., Kroner, D. G., Fekken, G. C., & Popham, S. M. (1992). A model of personality test item response dissimulation. Journal of Personality and Social Psychology, 63, 272–279. doi: 10.1037/0022-3514.63.2.272 *Holtgraves, T. (2004). Social desirability and self-reports: Testing models of socially desirable responding. Personality and Social Psychology Bulletin, 30, 161–172. doi: 10.1177/ 0146167203259930


*Hsu, L. M., Santelli, J., & Hsu, L. R. (1989). Faking detection validity and incremental validity of response latencies to MMPI subtle and obvious items. Journal of Personality Assessment, 53, 278–295. doi: 10.1207/ s15327752jpa5302_6 *Konradt, U., Syperek, S., & Hertel, G. (2011). Testing on the Internet: Faking a web-based self-administered personality measure. Journal of Business and Media Psychology, 2, 1–10. www.journal-bmp.de *Parmac, M., Galic, Z., & Jerneic, Z. (2009). Vrijeme latencije kao indikator iskrivljavanja odgovora na upitnicima ličnosti [Response latency as an indicator of personality test item response dissimulation]. Suvremena Psihologija, 12, 43–61. doi: http://hrcak.srce.hr/file/122956 *Robie, C., Curtin, P. J., Foster, T. C., Phillips, H. L., Zbylut, M., & Tetrick, L. E. (2000). The effects of coaching on the utility of response latencies in detecting fakers on a personality measure. Canadian Journal of Behavioural Science/Revue canadienne des Sciences du comportement, 32, 226–233. doi: 10.1037/h0087119 Shoss, M. K., & Strube, M. J. (2011). How do you fake a personality test? An investigation of cognitive models of impressionmanaged responding. Organizational Behavior and Human Decision Processes, 116, 163–171. doi: 10.1016/j.obhdp. 2011.05.003 Tetrick, L. E. (1989). An exploratory investigation of response latency in computerized administrations of the MarloweCrowne social desirability scale. Personality and Individual Differences, 10, 1281–1287. doi: 10.1016/0191-8869(89) 90240-7 *Van Hooft, E. A. J., & Born, M. P. (2012). Intentional response distortion on personality tests: Using eye-tracking to understand response processes when faking. Journal of Applied Psychology, 97, 301–316. doi: 10.1037/a0025711


*Vasilopoulos, N. L., Reilly, R. R., & Leaman, J. A. (2000). The influence of job familiarity and impression management on self-report measure scale scores and response latencies. Journal of Applied Psychology, 85, 50–64. doi: 10.1037/00219010.85.1.50 Viswesvaran, C., & Ones, D. S. (1999). Meta-analyses of fakability estimates: Implications for personality measurement. Educational and Psychological Measurement, 59, 197–210. doi: 10.1177/00131649921969802 Ziegler, M. (2011). Applicant faking: A look into the black box. The Industrial and Organizational Psychologist, 49, 29–36. http:// www.siop.org/tip/july11/06ziegler.aspx Ziegler, M., MacCann, C., & Roberts, R. D. (2011). Faking: Knowns, unknowns, and points of contention. In M. Ziegler, C. MacCann, & R. D. Roberts (Eds.), New perspectives on faking in personality assessment (pp. 3–16). New York, NY: Oxford University Press.

Received September 1, 2014
Revision received November 11, 2015
Accepted December 27, 2015
Published online October 7, 2016

Laurentiu P. Maricutoiu
Department of Psychology
West University of Timișoara
4 Vasile Pârvan Blvd., room 504
300223 Timișoara
Romania
Tel./Fax +40 256 592252
E-mail lmaricutoiu@gmail.com


Appendix

Characteristics of the studies included in the meta-analysis

[Appendix table: for each included study or sample (marked with an asterisk in the References), the table reports the study, the nationality of the sample (Canada, US, Germany, Croatia, or The Netherlands), the participants' occupation (e.g., undergraduate students, employees, job applicants, unemployed, convicts), whether separate analyses were conducted by item type and by answer type, the treatment applied to the response-latency data (double standardization, no treatment, or other), and the effect sizes (d values) for the control versus fake-good and control versus fake-bad comparisons.]


Original Article

Is Reliability Compromised Towards the End of Long Personality Inventories?

Martin Bäckström and Fredrik Björklund

Department of Psychology, Lund University, Sweden

Abstract: During very long self-rating sessions there is a risk that respondents will be tired and/or lose interest. Is this a concern for users of long personality inventories, such that the reliability becomes threatened in the latter half when respondents have made hundreds of personality self-ratings? Two thousand three hundred fifty-two volunteers completed long (about 500 items) personality inventories on the Internet, where items were presented in a unique random order for each participant. Perhaps counterintuitively, there was no evidence that reliability is threatened as respondents approach the end of a long personality inventory. If anything, the ratings in the second half of the inventories had higher reliability than ratings in the first half. Ratings were quicker towards the end of the inventories, but equally reliable. The criterion validity, estimated using Paunonen's Behavior Report Form, was maintained too. The current results provide little reason to mistrust responses to items that appear towards the end of long personality inventories.

Keywords: personality assessment, self-ratings, reliability, personality inventory

If you ask laypeople about the reliability of very long inventories, their gut feeling is likely to be that respondents will become fatigued and start rating more randomly after a while. The same kind of argument has occasionally been put forward by researchers in the field of personality measurement (e.g., Burisch, 1984; Robins, Hendin, & Trzesniewski, 2001), coupled with recommendations to use shorter versions of inventories, which are less repetitive and should be less prone to produce invalid protocols. The present study concerns whether respondents are able to maintain a consistent way of responding even when there is a substantial number of personality items to rate. What empirical basis exists for the belief that reliability is compromised in the latter half of long personality inventories? This question is important, since it concerns whether researchers and others with an interest in personality measurement would be well advised to think carefully about the added value of long inventories, but the current research literature does not appear to give an answer to it. Although general assumptions that the final items of long personality inventories bring about less reliable ratings are sometimes voiced, and respondents tend to report making more random responses towards the end of a long inventory (Berry et al., 1992), there appears to be no systematic research on the issue. The present study is a first attempt to fill this gap.

Kurtz and Parrish (2001) have defined the term "protocol validity," which refers to whether an individual protocol is interpretable via the standard algorithms for scoring and assigning meaning. A number of studies have investigated whether personality inventories have protocol validity in relation to linguistic incompetence, careless inattentiveness, and deliberate misrepresentation. The present study concerns protocol validity in very long personality inventories. Respondents make their ratings on a computer, over the Internet, which has been found to result in a somewhat higher number of duplicate protocols, more long strings of the same response category, and a higher number of unacceptable missing responses (Johnson, 2005). Consistency, that is, that raters respond to questions from the same scales in a similar way, appears to be less of a problem (Johnson, 2005). If having rated a large number of items leads to less reliable ratings and less valid subscales, protocol validity should decrease. This suggests the possibility that the measurement problems associated with Web-administrated personality inventories may appear more towards the end of long inventories. If, after having rated many of the items, the respondents become bored, they may start to rate in a more random fashion, making the second half of the inventory less reliable. We investigate this issue in two studies, using two different personality inventories. Both of the inventories are very long in comparison with most personality inventories. We are not investigating differences between long and short inventories, but specifically whether the scales that are based on ratings that appear at the end of long inventories yield lower reliability indices than the ones that appear at the beginning.

Study 1

The purpose of the first study is to examine the basis for the belief that reliability decreases towards the end of very long personality inventories. The analyses concern comparisons of measures of consistency (reliability indices). They include standard indicators of reliability (alpha), structural equation models (SEM) including invariance analyses, and, in addition, a comparison of the first half and the last half of the items that each respondent rated. Analyses always concerned the reliability of the factors (the Big Five factors) based on the subscales included in the inventory. We also examine a central process-related variable, response time. If response time is constant across the inventory, this may indicate that the respondents are not tired or bored and have not fallen into a more haphazard way of responding. If, however, response time gradually decreases (or increases) as respondents approach the end of the inventory, some motivational factor may have influenced the ratings. If response time is related to a tendency to boredom and/or more haphazard responding, it should be correlated with reliability. We investigate the relationship between response time and reliability, but do not expect that faster response times necessarily come at the price of reduced rater consistency.

Method

The sample consisted of 1,991 respondents, 71.4% females, mean age 31.9 years (SD = 11.5), mostly spontaneous Internet visitors to http://www.pimahb.com. The materials consisted of the Swedish version of the IPIP-AB5C personality inventory (Bäckström, Björklund, & Larsson, 2009), which measures the Big Five factors by means of nine subscales each. In total, the inventory has 486 items, and in addition to these items there were 32 extra items not used in the analyses. Items were presented in a new random order for each participant. The Web application forced respondents to rate every item (on a 5-point Likert scale). To estimate the degree to which reliability was influenced by the number of items that had been responded to, we created two versions of all items from the inventory, one based on the 259 items rated first and the other based on the 259 items rated last (i.e., the first vs. second half).


Since items were presented randomly, across all participants all items were included in both the first and the last half of the scale. This way we could control for differences in item quality and content (e.g., avoiding that the last half was more reliable because the final items happened to be reliable). Again, the scales of the first and last half of the inventory, across all respondents, were based on the exact same set of items. In this setup, one participant's first-half items do not match another participant's first-half items (i.e., the same item can be included, for different participants, in either the first or the second half of the inventory). This makes it possible to investigate reliability at the subscale (facet) level, since almost all participants responded to one or more items from each subscale of the inventory. The homogeneity of the factors as measured by their subscales was high for all five scales; alpha was .90, .90, .90, .91, and .86 for Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness, respectively. The lowest corrected subscale correlation with the main scales was 0.31, but 40 out of 45 subscales had a corrected subscale correlation above 0.50, and the mean was 0.66. Since the factor scales were homogeneous and all subscales worked as indicators of their factor scale, we hoped to find similar homogeneity in both the first and the second half of the ratings. To estimate reliability we used the well-known Cronbach's alpha and the omega measures based on confirmatory factor analysis (CFA; McDonald, 1999; Revelle & Zinbarg, 2009). To estimate the difference in reliability we conducted invariance analyses on the two parts of the ratings. We estimated the reliability of the five factors in the FFM (Five-Factor Model): Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness, all measured by the subscales included in the inventory. Each factor consisted of nine facets (measured by at most 12 items each), so the total number of reliability coefficients was five from the first half paired with five from the last half of the inventory. In the CFA models we controlled for the fact that the facets in the AB5C model have secondary loadings on the other factors of the Big Five (Hofstee, de Raad, & Goldberg, 1992). For example, regarding the Extraversion factor, there was only one facet that loaded uniquely on the main factor. Two of its facets had secondary loadings on Agreeableness (one positive and one negative), two facets had secondary loadings on Conscientiousness, and so on. Again, since items were randomly distributed, uniquely for each participant, the number of items in the subscales varied. For some participants all items of a subscale were presented in the first half, and items from that subscale were then missing in the last part. For these rare cases we used data imputation by means of the AMOS regression procedure.
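The per-participant split described above can be made concrete with a short sketch. The following Python code is illustrative only (the authors report using AMOS and standard reliability routines); the array layout, the toy dimensions, and the simulated ratings are assumptions for demonstration.

```python
import numpy as np

def cronbach_alpha(parts):
    """Cronbach's alpha for an (n_respondents x n_parts) matrix of part scores."""
    parts = np.asarray(parts, dtype=float)
    parts = parts[~np.isnan(parts).any(axis=1)]        # listwise drop incomplete rows
    k = parts.shape[1]
    part_var = parts.var(axis=0, ddof=1).sum()
    total_var = parts.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - part_var / total_var)

def half_subscale_scores(responses, presentation_rank, subscale_items):
    """Score each subscale twice per respondent: once from the items that respondent
    happened to rate in the first half of the session and once from the items rated
    in the second half (presentation order is random and unique per respondent)."""
    n_persons, n_items = responses.shape
    half = n_items // 2
    first = np.full((n_persons, len(subscale_items)), np.nan)
    second = np.full((n_persons, len(subscale_items)), np.nan)
    for p in range(n_persons):
        in_first_half = presentation_rank[p] < half    # boolean mask over items
        for s, items in enumerate(subscale_items):
            items = np.asarray(items)
            early = items[in_first_half[items]]
            late = items[~in_first_half[items]]
            if early.size:
                first[p, s] = responses[p, early].mean()
            if late.size:
                second[p, s] = responses[p, late].mean()
    return first, second

# toy run with simulated ratings (random data, so both alphas will be near zero;
# with real data the comparison between the two alphas is the quantity of interest)
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 40)).astype(float)   # 200 persons, 40 items
ranks = np.array([rng.permutation(40) for _ in range(200)])    # per-person item order
subscales = [list(range(i, i + 8)) for i in range(0, 40, 8)]   # 5 subscales x 8 items
first, second = half_subscale_scores(responses, ranks, subscales)
print(cronbach_alpha(first), cronbach_alpha(second))
```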



To further probe potential loss of reliability, we performed additional analyses related to consistency and investigated whether there was a difference in reliability at the very start and the very end of the ratings (based on single items). Lastly, we added information about response times to investigate whether they differed between the first and second half of the ratings. The Results section summarizes the results; detailed statistics for this first study are available in the Electronic Supplementary Material. We used an alpha level of p < .01 for significance testing.

Results

The AB5C measures the FFM, but instead of having a simple structure, most of the facets are defined to have both a primary loading and a secondary loading on the FFM factors, in what the founders refer to as a circumplex model (Hofstee et al., 1992). However, the main interest here is not the model, but reliability. We estimated Cronbach's alpha based on all nine facets of each of the five AB5C factors and found the alphas for the first half of the inventory to be .85, .84, .85, .86, and .79 for Extraversion (E), Agreeableness (A), Conscientiousness (C), Emotional Stability (ES), and Openness (O), respectively. The Cronbach alphas for the second half of the inventory were .84, .86, .85, .85, and .82, respectively. In other words, they were almost the same. The reliabilities for the five factors based on CFA are displayed in Table 1. These measures closely followed the alpha values. The reliability was .87, .88, .88, .89, and .83 for the first half and .89, .89, .89, .89, and .86 for the second half. In no case was the reliability higher for the first half than for the second, but note that we controlled for the other four factors that were part of the structure. The fit indices from the invariance analysis (see Table 1) showed that the departure from measurement invariance was significant only for Emotional Stability, suggesting that the loadings were otherwise very similar in the first and second half. This corroborates the picture that reliability is not at risk, since the omega reliability coefficients are based on the loadings. We also tested for invariance of the measurement intercepts, and this departure from invariance was significant for all factors. However, there was no trend in the distribution of intercepts; the mean difference in intercepts for the five factors was .009, .017, .000, .003, and .034 for E, A, C, ES, and O, respectively. The fit indices shown in Table 1 suggest that the deviations from intercept invariance were rather small (in fact, the root mean square errors of approximation [RMSEAs] showed a somewhat better fit overall after restricting the intercepts to be equal). Generally, the variability between raters was somewhat higher in the first half of the inventory than in the last.
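For readers unfamiliar with omega, a simplified single-factor version can be computed directly from factor loadings and residual variances. This is only a sketch: the models reported here additionally include secondary loadings on the other Big Five factors, which the simple formula below ignores, and the numbers are made up.

```python
def mcdonald_omega(loadings, residual_variances):
    """Omega for a single factor: squared sum of loadings divided by
    (squared sum of loadings + sum of residual variances)."""
    s = sum(loadings)
    return s * s / (s * s + sum(residual_variances))

# toy example: nine standardized indicators, all loading .70 (made-up values)
print(round(mcdonald_omega([0.7] * 9, [1 - 0.7 ** 2] * 9), 2))   # prints 0.9
```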

M. Bäckström & F. Björklund, Is Reliability Compromised?

The mean standard deviation across the 45 scales was 0.68 for the first half and 0.66 for the second. If this variability had been systematic, it could have led to higher reliability for the first half, but this was not the case. This suggests that the higher variability in the first half was the result of noise, in other words random rating. One way to capture higher relative stability in the last half of the inventory is to measure how consistent the respondents were in relation to their final score on the scales. We estimated the consistency in the first and last half by taking the absolute difference between their ratings and their final score. If the ratings are closer to the final score in the last half than in the first, it suggests that the raters were more consistent towards the end of the inventory. The inventory consisted of 45 subscales. Using pairwise t-tests, the consistency of the first and last half was compared, and it was found that the ratings of single items in the last half were significantly closer to the final score in 39 of the 45 subscales, and no scale was significantly more consistent in the first half. The mean t value for the 45 pairwise tests was 3.42, but the mean effect size (mean/SD, i.e., Cohen's d) was only 0.08. Nevertheless, this is clear evidence that raters were capable of being at least as consistent toward the end of the inventory. These reliability analyses were complemented with a perhaps more idiosyncratic one. It tested whether the ratings of the first single item of a subscale (among the first 30 ratings of the inventory) were more consistent with the total subscale (the mean value of all items) than the ratings of the last item of the same scale (the mean number of completed ratings before this item was 473). The mean correlation (uncorrected) between the first item and the total subscale was r = .532, whereas for the last item it was r = .534. This suggests that the last item was not influenced more by fatigue and boredom than the first item. Turning to the issue of whether response times tend to influence rating reliability, we first tested whether response times were longer or shorter for items in the second half of the inventory. There were response time data from 601 respondents (the measure of response time was added in the middle of the data collection). Due to factors such as slow updates to the Web server, some response times were very long and were set to 10 s. The mean difference between items of the first half (4.348 s) and items of the second half (3.959 s) was 0.388 s, a small but significant difference, t(600) = 26.1. The mean correlation between the response times of the first and the second half of the inventory was as high as r = .94. Did respondents with shorter response times (participants with mean response times below the median) rate less reliably than respondents with relatively longer response times (participants with mean response times above the median)?


M. Bäckström & F. Björklund, Is Reliability Compromised?

17

Table 1. Cronbach's alpha and omega, and measurement invariance for the inventory of the first study (IPIP AB5C)

                                              E        A        C        ES       O
First half                 α                 0.85     0.84     0.85     0.86     0.79
                           ω                 0.89     0.88     0.88     0.89     0.83
Second half                α                 0.84     0.86     0.85     0.86     0.82
                           ω                 0.88     0.89     0.89     0.89     0.86
Unconstrained (df = 62)    χ²             1,369.3  1,485.4  1,072.4  1,199.0  1,411.2
                           CFI               0.900    0.853    0.929    0.922    0.871
                           RMSEA             0.085    0.089    0.079    0.079    0.086
Measurement weights        CFI               0.900    0.853    0.928    0.920    0.870
(df = 53)                  RMSEA             0.078    0.081    0.072    0.073    0.079
Δ Unconstrained            Δχ²              10.13     4.82    18.86    34.04    18.08
                           Δp                0.34     0.85     0.026   <0.001    0.034
                           ΔCFI             <0.001   <0.001    0.001    0.002    0.001
                           ΔRMSEA            0.007    0.008    0.007    0.006    0.007
Measurement intercepts     CFI               0.898    0.850    0.927    0.918    0.866
(df = 44)                  RMSEA             0.073    0.076    0.068    0.069    0.075
Δ Measurement weights      Δχ²              44.85    44.55    29.08    70.82    53.11
                           Δp               <0.001   <0.001    0.001   <0.001   <0.001
                           ΔCFI              0.002    0.003    0.001    0.002    0.004
                           ΔRMSEA            0.005    0.005    0.004    0.004    0.004

Notes. E = Extraversion; A = Agreeableness; C = Conscientiousness; ES = Emotional stability; O = Openness; CFI = Comparative Fit Index; RMSEA = Root mean square error of approximation; α = Cronbach's alpha; ω = McDonald's omega based on CFA; Δ = Difference: less restricted minus more restrictive model.

The mean alpha of the five scales of the first half was .835 for the group with short response times and .825 for the group with longer response times. For the second half the same figures were .850 and .825. Together this suggests that although respondents had shorter response times in the second half, this did not affect their level of consistency. We also tested if consistency and response times correlated, but this correlation was not significant.
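The consistency measure used above (the absolute deviation of each single-item rating from the respondent's final subscale score, compared between the first- and second-rated halves) can be sketched as follows. This is an illustrative Python version, not the authors' code, and it pools deviations over subscales per respondent, whereas the article reports one paired test per subscale.

```python
import numpy as np

def consistency_by_half(responses, presentation_rank, subscale_items):
    """Per respondent: mean absolute deviation of single-item ratings from that
    person's final subscale score, separately for items rated in the first half
    and in the second half of the session (lower values = more consistent)."""
    n_persons, n_items = responses.shape
    half = n_items // 2
    dev_first = np.empty(n_persons)
    dev_second = np.empty(n_persons)
    for p in range(n_persons):
        in_first = presentation_rank[p] < half
        d_first, d_second = [], []
        for items in subscale_items:
            items = np.asarray(items)
            final_score = responses[p, items].mean()     # score based on all items
            d_first.extend(np.abs(responses[p, items[in_first[items]]] - final_score))
            d_second.extend(np.abs(responses[p, items[~in_first[items]]] - final_score))
        dev_first[p] = np.mean(d_first)
        dev_second[p] = np.mean(d_second)
    return dev_first, dev_second

# a paired comparison of dev_first and dev_second (e.g., scipy.stats.ttest_rel)
# would then mirror the kind of first-half versus second-half test reported above
```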

Study 2

The results of Study 1 suggest that the reliability of personality self-ratings is maintained towards the end of long inventories. This might come as a surprise to laypeople, and perhaps to some experienced researchers too. Regardless of this, the results need to be corroborated. Therefore, in Study 2 we turn to another long personality inventory, to once again investigate the reliability of subscales based on items that appear at the beginning versus the end. Furthermore, there could potentially be a small but systematic loss of validity towards the end of long personality inventories. As the reliability levels were so high in both halves of the inventory in Study 1, the risk of validity loss should be slight. Nevertheless, in Study 2 Paunonen's (2003) well-known measure of criterion validity is included to enable direct comparison of whether ratings at the beginning versus the end of a long inventory differ with regard to criterion validity.

Method

Materials

Three hundred sixty-one participants (mean age 24.1, SD = 5.5, 62% females) participated in the study, which concerned the validity of a new personality inventory (Bäckström, Björklund, & Lindén, 2014). Most of them were students at Swedish universities and they were all compensated with a movie ticket for their help. Two inventories measuring the same factors were administered on the Web, plus a number of extra items, such that in total there were 547 items. All items were presented in a new random order for each participant, and the Web application forced participants to rate every item. The two inventories both measured the Big Five, and the present study used subscales from both inventories, resulting in a single large Big Five inventory. The two inventories were the IPIP-NEO (Goldberg, 1990; McCrae & Costa, 1987) and the IPIP-NEO neutralized (Bäckström et al., 2014). The IPIP-NEO includes 300 items measuring 30 different facets of the five-factor model of personality. The present study included only 20 facets based on a total of 200 items. The IPIP-NEO was developed to mimic the NEO-PI-R


(Costa & McCrae, 1992) using unique items, but measuring the same kind of constructs and it has been shown to have good psychometric properties. It has been used previously in similar research as this (e.g., Johnson, 2005). The present version was in Swedish and has been validated previously (Bäckström et al., 2014). It consisted of 247 items with a total of 20 subscales. In total we used 200 items in this study, but there were often more items measuring each facet (each subscale consisted of between 12 and 20 items, but only 10 items were used to calculate the scales). All items were rated on a 5-point Likert scale. Extra Items There were some extra items in the study, which were not used. Some were experimental items not included in the ordinary scales of the two personality inventories. There were also two measures of socially desirable responding from the IPIP scales (Self-deception and Impression management, 28 items), and a scale measuring happiness. Criterion Ratings To measure criterion validity we used the Behavior Report Form from Paunonen (2003). In total there were 32 criterion items measuring all sorts of concepts and behaviors, such as alcohol consumption, smoking, medication use, traffic violations, rated intelligence, dating frequency, leisure interests, and more. Specific criteria were not of interest; the purpose of the behavioral reports was to enable comparison of the correlation between the personality scales of the first and second half of the inventories. Procedure The participants were recruited from the campus of a Swedish university, most of them were students. During recruitment they were informed about the study and the Web address of the inventory. After completion of all 547 items they were asked to take rest. They then completed other tests and inventories, including the Behavior Report Form. For the attempt to replicate the findings from Study 1, some of the estimations were performed again in Study 2. Here, the five factors included in the SEM models originated from two different inventories, therefore the facets were measured with two subscales. These pairs of subscales were allowed to correlate in the CFA models. To investigate if they capture the same personality content factors, the two parts of the inventory will be correlated with one another. To estimate how criterion validity was influenced by the fact that the relevant scale had items that appeared in the first versus second half of the inventory, we correlated both halves of the personality inventories, first and last, with the 32 criteria from the Behavior Report Form. From the large matrix of these European Journal of Psychological Assessment (2019), 35(1), 14–21

M. Bäckström & F. Björklund, Is Reliability Compromised?

estimations we extracted the mean validity, the variability in validity (a crude measure of discriminative validity) and the maximum validity for each scale to each criterion. Finally, we created validity indices consisting of average mean validities, average variability validities and average maximum validities. These three measures, one set from the first half and one set from the second half of the inventories, will be compared to investigate criterion validity. Data screening showed that the scales were approximately normally distributed. There were few outliers and we chose to include these data values. As concerns criterion validity, we examine whether the ratings that are made towards the end of the inventory have weaker relationships to actual behavior. This would result in a decrease in both the average mean and the average maximal criterion validity, as expressed in the indices. Weaker relationships to actual behavior would also decrease the discriminative criterion validity and thereby decrease the average variability validity (the standard deviation of the validities).
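The validity indices described in this paragraph can be summarized in a few lines of code. The sketch below is illustrative only; the array names are hypothetical, and whether absolute correlations were used when averaging is an assumption, not something stated in the text.

```python
import numpy as np

def criterion_validity_indices(scale_scores, criteria):
    """Correlate every personality scale with every criterion, then summarize the
    scale x criterion matrix by its average mean validity, average variability
    (SD of a scale's validities across criteria), and average maximum validity."""
    n_scales, n_criteria = scale_scores.shape[1], criteria.shape[1]
    r = np.empty((n_scales, n_criteria))
    for i in range(n_scales):
        for j in range(n_criteria):
            r[i, j] = np.corrcoef(scale_scores[:, i], criteria[:, j])[0, 1]
    abs_r = np.abs(r)                              # absolute validities (an assumption)
    return {
        "average_mean_validity": abs_r.mean(axis=1).mean(),
        "average_variability": abs_r.std(axis=1, ddof=1).mean(),
        "average_maximum_validity": abs_r.max(axis=1).mean(),
    }
```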

Results

Two issues are investigated in this study. The first concerns whether reliability is compromised by a very long inventory. Table 2 shows the reliabilities of the first half and the second half. It is clear that the reliability is as high in the last half of the ratings as in the first half. In fact, the reliabilities as measured with Cronbach's alpha were somewhat higher for the last half. This suggests that participants' ratings were as reliable during the second half of the inventory as during the first. The omega reliability coefficients were based on CFA (see Table 2). For both inventories, omega was higher for the last half of the inventory. The invariance testing of the factor loadings (i.e., the measurement weights) suggested that only one factor was close to significant. All fit indices (see Table 2) suggested that the reliability differences between the first and second half were negligible. Invariance estimation in relation to the intercepts showed that two factors, C and ES, departed significantly from invariance. On the other hand, the fit indices suggested that the lack of invariance was small; most of the RMSEAs were in fact smaller when intercepts were fixed. To summarize, all results supported the null hypothesis that reliability is not compromised towards the end of very long personality inventories. The second issue concerned whether criterion validity is compromised by long inventories, which can be tested since the results supported the reliability of long inventories. The average mean validities between the IPIP inventory and the Behavior Report Form criteria were .09 for both the first and second half.


M. Bäckström & F. Björklund, Is Reliability Compromised?

19

Table 2. Cronbach's alpha, omega, and invariance for the inventory of the second study (IPIP-NEO and IPIP-NEO Neutralized combined)

                                              E        A        C        ES       O
First half                 α                 0.82     0.77     0.82     0.89     0.80
                           ω                 0.79     0.73     0.76     0.85     0.73
Second half                α                 0.86     0.81     0.84     0.91     0.85
                           ω                 0.84     0.77     0.78     0.88     0.79
Unconstrained (df = 56)    χ²               124.8    223.1    212.9     95.5    134.1
                           CFI               0.957    0.870    0.917    0.981    0.952
                           RMSEA             0.064    0.092    0.089    0.053    0.067
Measurement weights        CFI               0.953    0.871    0.919    0.982    0.950
(df = 48)                  RMSEA             0.060    0.082    0.079    0.046    0.061
Δ Unconstrained            Δχ²               17.7      7.2      2.4      3.6     10.8
                           Δp                0.023    0.518    0.026    0.892    0.214
                           ΔCFI              0.004    0.001    0.002    0.001    0.002
                           ΔRMSEA            0.004    0.010    0.010    0.007    0.006
Measurement intercepts     CFI               0.955    0.865    0.921    0.978    0.949
(df = 40)                  RMSEA             0.053    0.076    0.071    0.046    0.056
Δ Measurement weights      Δχ²                2.7     16.5      4.1     22.4     11.4
                           Δp                0.953    0.036    0.001    0.004    0.001
                           ΔCFI              0.002    0.006    0.002    0.004    0.001
                           ΔRMSEA            0.007    0.006    0.008   <0.001    0.005

Notes. E = Extraversion; A = Agreeableness; C = Conscientiousness; ES = Emotional stability; O = Openness; CFI = Comparative Fit Index; RMSEA = Root mean square error of approximation; α = Cronbach's alpha; ω = McDonald's omega based on CFA; Δ = Difference: less restricted minus more restrictive model.

The average variabilities were .07 for both parts, and the average maximum correlation to any criterion was .31 for both, coefficients that are on par with those of Paunonen's (2003) original study. The same figures for the neutralized inventory were .09, .07, and .29, respectively. Together these results did not indicate that criterion validity was compromised towards the end of the long inventory. In addition, we checked whether the factors of the first and the last half of the inventory correlated. It was found that the correlations were .89, .87, .88, .90, and .89 for Extraversion, Agreeableness, Conscientiousness, Emotional stability, and Openness, respectively. Correlations of this size would have been impossible if the last (or the first) half of the inventory had been unreliable.

Discussion

Main Findings

The present study concerned the data quality of extensive personality self-ratings. The gut feeling that scales based on items that appear towards the end of long personality inventories are associated with reliability and validity issues (e.g., Burisch, 1984; Robins et al., 2001) was not supported by the results. Rather, the results from comparing reliability estimates for scales that appear early versus late in the inventories suggest that raters remain concentrated enough to complete them in the intended way and rate consistently over quite a substantial number of items. Criterion validity was maintained for the scales that appeared in the latter half of the inventory. And although ratings were made somewhat faster in the latter half of the inventory, they were equally reliable. In other words, there was little evidence of compromised data quality, a finding which is corroborated by different reliability indices (e.g., alpha, omega, estimations based on invariance analysis, correlations between scales, correlations with criteria, and consistency between the very first and very last items). These indices are different perspectives on the same basic thing and some of them are redundant, but all are included for completeness. Together this suggests that items appearing towards the end of longer personality inventories do not impair the overall reliability of the scales based on them. This is not to say that longer tests are always to be preferred over shorter ones. Rather, the choice of instrument should depend on the research question and the context at hand. Long inventories may be very difficult to administer. There may be attrition concerns (Rolstad, Adler, & Rydén, 2011), not least in Internet studies, although this was not studied here. Also, expecting a long questionnaire may decrease the willingness to participate (Galesic & Bosnjak, 2009).



Shorter scales may be preferred when part of a package of instruments, particularly if there are many respondents (which compensates for the somewhat lower reliability) and personality is not focal. If personality is the main focus of the study, using very short personality scales may come with the cost of under- or overestimating the importance of personality (Credé, Harms, Niehorster, & Gaye-Valentine, 2012). In what way are the present results relevant to other research methods using very long questionnaires or a very large number of items, such as surveys? Generally, personality questionnaires include a number of randomly distributed items measuring different factors, while the typical survey does not mix items regarding different subjects of interest. The present research seems to suggest that reliable responding during a long session is possible. However, very long surveys can be demanding, increasing the risk of poorer data quality (Galesic & Bosnjak, 2009), for example when using an open-ended response format or when the type of questions or the response format changes several times. On the other hand, it is likely that the repetition of similar items that characterizes personality questionnaires is less taxing on the participant's cognitive resources. Attitudinal research is more similar to personality research than survey research is, but not all attitude scales concern the participant's own attitude. Our results should be generalizable to attitude scales to the extent that the items are similar to personality items, that is, concern accessible self-related information.

Limitations

We limit the interpretation of our results to personality self-ratings. Our results do not concern surveys (which focus on representing the population and may have just one item per topic, making it difficult to investigate consistency in responses from single respondents) or population studies, or self-ratings in general. Rather, they concern self-ratings in long personality inventories specifically. We cannot conclude from the present results whether reliability and validity are maintained towards the end of a package of tests of similar size as the inventories used here but where the scale content refers to something other than personality. We note, however, that the results dovetail with similar research on self-reported food preferences (Bendig, 1955), which suggests a potential for generalization (see also Knowles, 1988, who used 30-item tests). The results are also limited to volunteering respondents. The present results are only generalizable to populations who are willing to finish the entire inventory. It is possible that those who are forced to take personality inventories get bored or fatigued more quickly, and respond less reliably. Finally, the results from the present study should not be transferred to the level of the individual. Our results allow for particular respondents to be tired, bored, inconsistent, etc. In fact, the results have little to say about the causes of the ratings that were made. Several factors may contribute to explaining why the ratings turned out the way they did, which is a related but different research question than that of the present study. We limit ourselves to the issue of reliability and criterion validity in responses to scales with items appearing towards the end of long personality inventories. In future studies, boredom, fatigue, and other factors that have been suggested to influence the reliability of self-ratings may be assessed directly and related to reliability and validity estimates.

Conclusions

Some investigations, whether theoretically driven or, for example, testing the lexical hypothesis, call for a substantial amount of personality self-ratings. The present results suggest that researchers need not worry too much about the reliability of scales having items that appear towards the end of personality inventories when conducting such studies.

Electronic Supplementary Material

The electronic supplementary material is available with the online version of the article at http://dx.doi.org/10.1027/1015-5759/a000363

ESM 1. Table (PDF). Supplementary statistics for Study 1.

References Bäckström, M., Björklund, F., & Larsson, M. R. (2009). Five-factor inventories have a major general factor related to social desirability which can be reduced by framing items neutrally. Journal of Research in Personality, 43, 335–344. doi: 10.1016/ j.jrp.2008.12.013 Bäckström, M., Björklund, F., & Larsson, M. R. (2014). Criterion validity is maintained when items are evaluatively neutralized: Evidence from a full-scale five-factor model inventory. European Journal of Personality, 28, 620–633. doi: 10.1002/ per.1960 Bendig, A. W. (1955). Rater reliability and “judgmental fatigue”. Journal of Applied Psychology, 39, 451–454. doi: 10.1037/ h0046015 Berry, D. R., Wetter, M. W., Baer, R. A., Larsen, L., Clark, C., & Monroe, K. (1992). MMPI-2 random responding indices: Validation using a self-report methodology. Psychological Assessment, 4, 340–345. doi: 10.1037/1040-3590.4.3.340 Burisch, M. (1984). Approaches to personality inventory construction: A comparison of merits. American Psychologist, 39, 214–227. doi: 10.1037/0003-066X.39.3.214 Costa, P. T., & McCrae, R. R. (1992). Revised NEO personality inventory (NEO-PI-R) and NEO five-factor inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources. Ó 2016 Hogrefe Publishing


M. Bäckström & F. Björklund, Is Reliability Compromised?

Credé, M., Harms, P., Niehorster, S., & Gaye-Valentine, A. (2012). An evaluation of the consequences of using short measures of the Big Five personality traits. Journal of Personality and Social Psychology, 102, 874–888. doi: 10.1037/a0027403 Galesic, M., & Bosnjak, M. (2009). Effects of questionnaire length on participation and indicators of response quality in a Web survey. The Public Opinion Quarterly, 73, 349–360. doi: 10.1093/ poq/nfp031 Goldberg, L. R. (1990). An alternative “description of personality”: The Big Five factor structure. Journal of Personality and Social Psychology, 59, 1216–1229. doi: 10.1037/0022-3514.59.6.1216 Hofstee, W. K., de Raad, B., & Goldberg, L. R. (1992). Integration of the Big Five and circumplex approaches to trait structure. Journal of Personality and Social Psychology, 63, 146–163. doi: 10.1037/0022-3514.63.1.146 Johnson, J. A. (2005). Ascertaining the validity of individual protocols from Web-based personality inventories. Journal of Research in Personality, 39, 103–129. doi: 10.1016/j.jrp.2004. 09.009 Knowles, E. (1988). Item context effects on personality scales: Measuring changes the measure. Journal of Personality and Social Psychology, 55, 312–320. doi: 10.1037/0022-3514.55. 2.312 Kurtz, J. E., & Parrish, C. L. (2001). Semantic response consistency and protocol validity in structured personality assessment: The case of the NEO-PI-R. Journal of Personality Assessment, 76, 315–332. doi: 10.1207/S15327752JPA7602_12 McCrae, R. R., & Costa, T. P. Jr. (1987). Validation of the fivefactor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52, 81–90. doi: 10.1037/0022-3514.52.1.81 McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.


Paunonen, S. V. (2003). Big five factors of personality and replicated predictions of behavior. Journal of Personality and Social Psychology, 84, 411–422. doi: 10.1037/0022-3514.84.2.411
Revelle, W. (2015). psych: Procedures for personality and psychological research (Version 1.5.1). Evanston, IL: Northwestern University. http://CRAN.R-project.org/package=psych
Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, 145–154. doi: 10.1007/S11336-008-9102-Z
Robins, R. W., Hendin, H. M., & Trzesniewski, K. H. (2001). Measuring global self-esteem: Construct validation of a single-item measure and the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 27, 151–161. doi: 10.1177/0146167201272002
Rolstad, S., Adler, J., & Rydén, A. (2011). Response burden and questionnaire length: Is shorter better? A review and meta-analysis. Value in Health, 14, 1101–1108. doi: 10.1016/j.jval.2011.06.003

Received May 25, 2015
Revision received December 15, 2015
Accepted December 27, 2015
Published online November 7, 2016

Martin Bäckström
Department of Psychology
Lund University
Box 117
221 00 Lund
Sweden
Tel. +46 2 2201-179
E-mail martin.backstrom@psy.lu.se



Original Article

Working Alliance Inventory for Children and Adolescents (WAI-CA): Development and Psychometric Properties

Bárbara Figueiredo (1), Pedro Dias (2), Vânia Sousa Lima (2), and Diogo Lamela (3)

(1) School of Psychology, University of Minho, Braga, Portugal
(2) Universidade Católica Portuguesa, Centro de Estudos em Desenvolvimento Humano, Faculdade de Educação e Psicologia, Porto, Portugal
(3) Lusófona University of Porto, Portugal

Abstract: The purpose of this study was to validate a version of the Working Alliance Inventory (WAI) for children and adolescents (WAI-CA). The sample included 109 children/adolescents aged between 7 and 17 years, outpatients in a Clinical Psychology Unit (Portugal), who completed the WAI-CA between psychotherapy sessions 3 and 35. A subsample of 30 children/adolescents aged between 10 and 14 years filled out both the WAI-CA and the WAI within a one- to two-week interval. A subsample of 57 children/adolescents aged between 7 and 17 years filled out the WAI-CA, and their accompanying parent the WAI. Results show high internal consistency (Cronbach's alpha ranging from .71 to .89) and good external validity. Significant differences were found in the Bond subscale according to age, gender, and diagnosis, with higher values in children compared to adolescents, in girls compared to boys, and in participants with internalizing and externalizing problems compared to participants with school problems. Moderate to strong significant correlations were found between children/adolescents' WAI-CA and WAI scores, and weak correlations between children/adolescents' WAI-CA scores and parents' WAI scores. Results suggest that the WAI-CA is a valid measure of working alliance to be used with children and adolescents.

Keywords: adolescents, children, therapeutic alliance, Working Alliance Inventory

Therapeutic alliance is a key ingredient of psychotherapeutic process and outcomes, regardless of the theoretical model adopted in therapy (Blatt, Hawley, Ho, & Zuroff, 2006; Horvath & Bedi, 2002; Horvath & Symonds, 1991; Martin, Garske, & Davis, 2000). Therapeutic alliance comprises three principal dimensions in Bordin's (1979) formulation: Tasks, Goals, and Bond. The Tasks dimension concerns client and therapist's agreement about the relevance of activities developed in psychotherapy. The Goals dimension refers to the degree of consensus between client and therapist regarding the aims to be achieved with therapy. Finally, the Bond dimension includes topics such as trust and acceptance, considered to be core features of a positive interaction between client and therapist. Based on this conceptual framework, Horvath and Greenberg (1989) developed the Working Alliance Inventory (WAI) to assess therapeutic alliance. The inventory consists of a total of 36 items on a Likert scale of 7 points, divided into 3 subscales with 12 items each, corresponding to the above dimensions proposed by Bordin (1979; Tasks, Goals, and Bond). This measure has been widely used in

the context of psychotherapy research, in several countries, including Portugal (Machado & Horvath, 1999), and its suitability in different clinical contexts is clear (e.g., Couture et al., 2006; Florsheim, Shotorbani, Guest-Warnick, Barratt, & Hwang, 2000; Hersoug, Hoglend, Monsen, & Havik, 2001; Hersoug, Monsen, Havik, & Hoglend, 2002; Symonds & Horvath, 2004). Research on therapeutic alliance has been almost exclusively focused on the context of adult psychotherapy, with relatively few studies being conducted with children and adolescents (e.g., DiGiuseppe, Linscott, & Jilton, 1996; Foreman, Gibbins, Grienenberger, & Berry, 2000; Green, 2006; Shirk & Saiz, 1992). Similarly to research with adults, studies on therapeutic alliance with children/adolescents have shown a moderate impact of the alliance on the therapeutic outcomes (Diamond et al., 2006; Karver, Handelsman, Fields, & Bickman, 2006; Shirk & Karver, 2003). As opposed to what was observed with adults, research on therapeutic alliance with children/adolescents has shown that alliance assessed by the therapist has a better predictive value than alliance assessed by



the client (child/adolescent or parent; Shirk & Karver, 2003; Shirk, Karver, & Brown, 2011). Components of the therapeutic alliance with children and adolescents have also been described as different from those with adults. The therapeutic alliance with children and adolescents would have only two domains: affective and collaborative (Constantino, Castonguay, & Schut, in press; Creed & Kendall, 2005). Some authors have suggested that relational dimensions might be more relevant to the therapeutic alliance with children and adolescents (Karver et al., 2008). Difficulties in discriminating therapeutic goals may occur with children, making goals probably a less important dimension for understanding and assessing alliance with them (Zack, Castonguay, & Boswell, 2007). Moreover, relationship dimensions have been referred to as the key component of the therapeutic alliance with children and adolescents, as reported by both therapists and parents (Kazdin, Siegel, & Bass, 1990). Authors argue that these developmental differences could lead to difficulties in adopting Bordin's therapeutic alliance model with children and adolescents (e.g., Zack et al., 2007). Differential effects may be observed in different alliance dimensions according to participants' developmental stage and diagnoses (DiGiuseppe et al., 1996). Specifically, bonding seems to be more important to preschool children, whereas agreement regarding goals and tasks appears to play a more relevant role with adolescents (DiGiuseppe et al., 1996). Furthermore, therapeutic alliance with children and adolescents may be more complex and multidimensional than with adults, as it also involves the therapeutic alliance with the parents. For instance, in cases of depression and substance abuse, parents' alliance is assumed to be a better predictor of therapeutic outcomes (Diamond, Diamond, & Liddle, 2000). A wide variety of instruments are used to measure adults' therapeutic alliance and relationship, but only a few instruments have been proposed for measuring child and adolescent therapeutic alliance (Faw, Hogue, Johnson, Diamond, & Liddle, 2005). The available instruments have shown adequate psychometric properties; nevertheless, the often small sample sizes make it difficult to discriminate which instrument may be the most promising or comprehensive (Faw et al., 2005; Hartley & Strupp, 1983; McLeod & Weisz, 2005; Shirk & Saiz, 1992). Validated instruments commonly used in studies are the Working Alliance Inventory (WAI; adapted from adults, Shirk & Saiz, 1992), the Vanderbilt Therapeutic Alliance Scale (VTAS; adapted from adults, Bickman et al., 2004; Diamond, Liddle, Hogue, & Dakof, 1999; Robbins, Turner, Alexander, & Perez, 2003; Shelef, Diamond, Diamond, & Liddle, 2005), and the Therapeutic Alliance Scales (TAS) by Shirk and Saiz (1992), the only scale specifically developed for younger children.

Table 1. Children/adolescent socio-demographics (N = 109)

Variable                         %
Gender
  Male                           61.5
  Female                         38.5
Age (years)
  7–12                           68.8
  13–17                          31.2
Education (year of school)
  4–5                            32.1
  6–7                            33.9
  7–9                            28.4
  9–12                            5.5
Socioeconomic status (SES)
  Low                            16.4
  Medium                         72.1
  High                           11.5
Diagnosis
  Internalizing Problems         30.3
  Externalizing Problems         23.9
  School Problems                23.9
  No diagnosis                   21.9

The Adolescent Therapeutic Alliance Scale (ATAS; Faw et al., 2005) and the Therapy Process Observational Coding System-Alliance Scale (TPOCS-A; McLeod & Weisz, 2005) have also been used to assess children/adolescent therapeutic alliance. The purpose of this study was to validate a Portuguese version of the Working Alliance Inventory (WAI; Machado & Horvath, 1999) for children and adolescents (WAI-CA) aged between 7 and 17 years, appropriate to the developmental features of this age group.

Method

Participants

The sample of this study was composed of 109 children/adolescents (67 boys and 42 girls), aged between 7 and 17 years (M = 11.31, SD = 2.46), outpatients in a University Child and Adolescent Clinical Psychology outpatient unit in Northern Portugal. Children's mean age was 10.1 years (SD = 1.61) and adolescents' mean age was 14.18 years (SD = 1.34). The highest percentage of children/adolescents had a medium family socioeconomic status (n = 45). Diagnosis, as indicated prior to intervention, was almost equally divided into internalizing (e.g., anxiety disorders, mood disorders; n = 33), externalizing (e.g., oppositional defiant disorder, conduct disorder; n = 26), and school problems (n = 26), but 24 children/adolescents did not receive any diagnosis (see Table 1).



In 57 randomly selected cases, parents were invited to participate in the study (see Step 3, as described in Procedures). These parents were accompanying the child/adolescent and were involved in some sessions of the therapy. The parents' sample was composed of 33 fathers (57.9%) and 24 mothers (42.1%) aged between 24 and 53 years, slightly more than half of them older than 40 years (54.2%) and the rest aged below 40 years (45.8%), mostly married (71.9%), employed (89.5%), and medium to highly educated (54.4%).

Measures

Working Alliance Inventory (WAI; Horvath & Greenberg, 1986; Portuguese version by Machado & Horvath, 1999)
The WAI (Horvath & Greenberg, 1989) is one of the dominant instruments used in research on the therapeutic alliance with adults. The WAI is a self-report measure comprising 36 items rated on a 7-point Likert scale, organized into three subscales of 12 items each: (a) Goals: agreement about the goals of therapy; (b) Tasks: agreement about the tasks of therapy; and (c) Bonds: the bond between the client and the therapist. There are three versions of the WAI: therapist, client, and observer. Each version is designed to yield goal, task, and bond alliance ratings. The WAI showed good internal consistency (Cronbach's α between 0.85 and 0.93; Horvath & Greenberg, 1989). Several authors have chosen one or more of the WAI scales to assess alliance with adolescents or different adult family members (Florsheim et al., 2000; Hawke, Hennen, & Gallione, 2005; Kaufman, Rohde, Seeley, Clarke, & Stice, 2005; Shelef et al., 2005; Tetzlaff et al., 2005). In this domain, DiGiuseppe et al. (1996) adapted the WAI (Horvath & Greenberg, 1989) by lowering the reading level for adolescents (ages 11–18 years) and demonstrated adequate internal consistency (α > .90; DiGiuseppe et al., 1996).

Working Alliance Inventory for Children and Adolescents (WAI-CA)
In order to design a version of the WAI adequate for children and adolescents, an individual think-aloud procedure was carried out with the Portuguese version of the WAI (Machado & Horvath, 1999), followed by a discussion of the items' meaning with a small sample of 10 children aged between 8 and 10 years. Based on the participants' doubts and comments, items were shortened, advanced vocabulary was replaced by more developmentally appropriate synonyms, and sentence structure was simplified. For example, "My therapist and I are working towards mutually agreed upon goals" was replaced by "My psychologist and I work together to achieve mutually agreed goals" (Goals subscale).



"I believe the time my therapist and I are spending together is not spent efficiently" was replaced by "I think that the time spent with my psychologist is not well used" (Tasks subscale). "I feel uncomfortable with my therapist" was replaced by "I don't feel at ease with my psychologist" (Bond subscale). Because some participants had difficulty with the length of the response scale, its amplitude was reduced from 7 to 5 points, in order to simplify the response process for the target population. The new version was again presented to a small sample of five children aged between 8 and 10 years, who read the items to make sure they were understandable. Eight child and adolescent psychotherapists were invited to comment on the final version of the WAI-CA. The final version of the WAI-CA, like the adult version, includes 36 items rated on a 5-point Likert scale, organized into three subscales (12 items each): Bond, Tasks, and Goals.
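For illustration, scoring an instrument of this format can be sketched as follows. This is only a minimal sketch: the item-to-subscale mapping used below is hypothetical (the article does not list which items belong to which subscale), and any reverse-keyed items would need to be recoded before summing.

```python
# Minimal scoring sketch for a 36-item, 5-point WAI-CA-style questionnaire.
# The item-to-subscale mapping is hypothetical; replace it with the actual key.
from typing import Dict, List

SUBSCALE_ITEMS: Dict[str, List[int]] = {
    "Goals": list(range(1, 13)),   # hypothetical: items 1-12
    "Tasks": list(range(13, 25)),  # hypothetical: items 13-24
    "Bond":  list(range(25, 37)),  # hypothetical: items 25-36
}

def score_wai_ca(responses: Dict[int, int]) -> Dict[str, int]:
    """Sum 1-5 Likert responses into three 12-item subscales (range 12-60 each)
    and a total score (range 36-180)."""
    scores = {name: sum(responses[i] for i in items)
              for name, items in SUBSCALE_ITEMS.items()}
    scores["Total"] = sum(scores.values())
    return scores

# Example: answering 4 to every item gives 48 per subscale and 144 in total.
print(score_wai_ca({i: 4 for i in range(1, 37)}))
```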

Procedures

This study received approval from the Institutional Review Boards (IRBs). Randomly selected children/adolescents and their parents who agreed to participate in this study were asked to sign an informed consent form while waiting for their appointment. All the contacted children/adolescents agreed to participate in the study. The study design included three steps: (1) examining internal consistency, construct validity, normative data, and group differences of the Working Alliance Inventory – Children and Adolescents (n = 109); (2) comparing results from the Working Alliance Inventory – Children and Adolescents and the Working Alliance Inventory in a subsample of 30 adolescents; and (3) comparing results from children/adolescents (Working Alliance Inventory – Children and Adolescents) and their parents (Working Alliance Inventory) in a subsample of 57 children and adolescents and their accompanying parents. In Step 1, children and adolescents enrolled in a psychotherapy process were asked to fill out the WAI-CA between sessions 3 and 35 (M = 5.61, SD = 5.56). In Step 2, the two Portuguese versions of the WAI – adults (WAI) and children/adolescents (WAI-CA) – were administered to a sample of 30 children/adolescents aged between 10 and 14 years, between the 4th and 5th sessions of their psychotherapeutic process, with a minimum interval of one week. In order to avoid order-of-presentation effects, the two versions were counterbalanced.




Table 2. Pearson correlations between WAI-CA subscales and total score (N = 109)

                    1.       2.       3.       4.
1. Goals WAI-CA     1
2. Tasks WAI-CA     .77**
3. Bond WAI-CA      .64**    .75**
4. Total WAI-CA     .90**    .93**    .87**    1

Notes. *p < .05. **p < .001.

In Step 3, the Portuguese WAI versions were administered between sessions 3 and 10 to 57 children/adolescents (WAI-CA) and their accompanying parents (WAI). The mean session number was 4.12 (SD = 1.05).

Results

Internal Consistency

Good internal consistency values were obtained for the WAI-CA total scale (Cronbach's α = .89), as well as for the WAI-CA subscales (Cronbach's α of .71 for the Goals subscale, .79 for the Tasks subscale, and .73 for the Bond subscale).
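As an illustration of how such coefficients are obtained, the sketch below implements Cronbach's alpha for an item-by-respondent matrix; the simulated data stand in for the study's actual responses.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) matrix of item scores:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative call on simulated 5-point responses for one 12-item subscale.
rng = np.random.default_rng(0)
print(round(cronbach_alpha(rng.integers(1, 6, size=(109, 12))), 2))
```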

Construct Validity

As presented in Table 2, moderate to strong positive correlations were found between the children's/adolescents' WAI-CA subscales and the total score.
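A sketch of how such a correlation matrix can be computed is shown below; the DataFrame and its column names ('goals', 'tasks', 'bond', 'total') are assumptions, not the study's data file.

```python
import pandas as pd

def subscale_intercorrelations(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlations among WAI-CA subscale and total scores (cf. Table 2).
    Column names are hypothetical."""
    return df[["goals", "tasks", "bond", "total"]].corr(method="pearson")
```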

Table 3. WAI-CA descriptive statistics for children (7–12, n = 75), adolescents (13–17, n = 34), and total sample (N = 109)

                    M        SD      Median   Min–max    P25–P75
Goals
  7–12              50.92    6.02    52       33–60      47–56
  13–17             49.21    6.98    52       33–57      45–54
  Total Sample      50.39    6.35    52       33–60      47–55
Tasks
  7–12              53.73    5.12    55       36–60      51–58
  13–17             51.32    7.88    54       26–60      46–57
  Total Sample      52.96    6.20    55       26–60      50–57
Bond
  7–12              54.59    4.20    55       40–60      52–58
  13–17             51.44    6.90    52       28–60      48–57
  Total Sample      53.59    5.38    54       28–60      51–58
Total WAI-CA
  7–12              159.23   13.37   161      117–180    152–169
  13–17             152.09   20.59   161      87–174     142–168
  Total Sample      157.01   16.21   161      87–180     150–168

Table 4. MANOVA results on WAI-CA subscales according to participants' gender

                 Male (n = 67)    Female (n = 42)   F(1, 104)
                 M (SD)           M (SD)
Task WAI-CA      52.52 (6.19)     53.73 (6.26)       .95
Bond WAI-CA      52.68 (5.75)     55.15 (4.43)      5.50*
Goals WAI-CA     49.68 (5.89)     51.51 (6.95)      2.12

Notes. Multivariate results: gender, Wilks' Λ = .93, F(3, 102) = 2.49, p = .065. *p < .05.

Normative Data

Table 3 shows the descriptive statistics of the results obtained with the WAI-CA for the study sample.

Age, Gender, and Diagnosis Differences

A significant main effect for age was obtained in the WAI-CA results, Wilks' Λ = .93, F(3, 102) = 2.71, p = .049. Children obtained significantly higher values than adolescents in the Bond subscale, F(1, 104) = 7.88, p < .05, and marginally significantly higher scores in the Task subscale, F(1, 104) = 3.36, p < .10. A marginally significant main effect for gender on the WAI-CA subscales was found in the multivariate results, Wilks' Λ = .93, F(3, 102) = 2.49, p = .065. Table 4 shows the results for gender on the WAI-CA subscales. Girls had significantly higher results than boys in the Bond subscale, F(1, 104) = 5.50, p < .05. A significant main effect of diagnosis on the WAI-CA results was found in the multivariate analysis, Wilks' Λ = .84, F(6, 156) = 2.44, p = .028.

Table 5 shows the results for diagnosis on the WAI-CA subscales. Univariate results revealed that this significant difference was related to the Bond subscale, F(2, 80) = 4.84, p < .05. Post hoc results showed differences between school and externalizing problems (p = .014) and between school and internalizing problems (p = .078). Children without a diagnosis were excluded from this analysis. Multivariate and univariate analyses did not show significant effects of educational level on the WAI-CA subscale results, Wilks' Λ = .91, F(9, 243) = 1.04, p = .41.
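The analysis strategy used here (a multivariate test followed by univariate follow-ups) can be sketched with statsmodels as below; the DataFrame and its column names ('goals', 'tasks', 'bond', 'gender') are assumptions for illustration, not the study's data.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA

def alliance_manova(df: pd.DataFrame, factor: str = "gender") -> None:
    # Multivariate test (reports Wilks' lambda, among other statistics)
    manova = MANOVA.from_formula(f"goals + tasks + bond ~ {factor}", data=df)
    print(manova.mv_test())
    # Univariate follow-up ANOVAs for each subscale
    for dv in ("goals", "tasks", "bond"):
        fit = ols(f"{dv} ~ {factor}", data=df).fit()
        print(dv, sm.stats.anova_lm(fit, typ=2))
```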

Concurrent Validity

Correlations Between Adolescents' WAI and WAI-CA (n = 30)
Moderate to strong positive correlations were obtained between the adolescents' WAI-CA and WAI subscales (Goals, Tasks, and Bond) and total scores (see Table 6). In addition, positive correlations were also found between all WAI-CA and WAI items (61% with r > .50).
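A sketch of the corresponding computation with SciPy, pairing each WAI-CA score with its WAI counterpart; the column names are hypothetical.

```python
import pandas as pd
from scipy import stats

def concurrent_validity(df: pd.DataFrame) -> None:
    """Pearson correlations between corresponding WAI-CA and WAI scores (cf. Table 6)."""
    for scale in ("goals", "tasks", "bond", "total"):
        r, p = stats.pearsonr(df[f"{scale}_wai_ca"], df[f"{scale}_wai"])
        print(f"{scale}: r = {r:.2f}, p = {p:.3f}")
```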




Table 5. MANOVA results on WAI-CA subscales according to participants' diagnosis

          Externalizing (n = 26)   Internalizing (n = 33)   School Problems (n = 24)   F(2, 80)
          M (SD)                   M (SD)                   M (SD)
Task      53.38 (4.95)             52.94 (5.88)             51.38 (8.63)                .65
Bond      55.31 (3.67)             54.00 (4.92)             50.63 (7.51)               4.84*
Goals     49.94 (5.76)             49.21 (8.34)             49.87 (6.72)                .19

Notes. Multivariate results: diagnosis, Wilks' Λ = .84, F(6, 156) = 2.44, p = .028. Post hoc: differences between school and externalizing problems (p = .014) and between school and internalizing problems (p = .078). *p < .05.

Table 6. Pearson correlations between WAI and WAI-CA (N = 30)

             Goals WAI-CA   Tasks WAI-CA   Bond WAI-CA   Total WAI-CA
Goals WAI    .83**          .58**          .77**         .82**
Tasks WAI    .63**          .66**          .76**         .77**
Bond WAI     .55**          .63**          .86**         .76**
Total WAI    .76**          .65**          .84**         .84**

Note. **p < .001.

Correlations Between Children's/Adolescents' WAI-CA and Parents' WAI (n = 57)
As presented in Table 7, weak positive correlations were found between the children's/adolescents' WAI-CA and their accompanying parents' WAI subscales and total scores.

Discussion

Results from this study show good psychometric properties of the WAI-CA. The data indicate that adapting the wording and restricting the range of the response scale did not compromise the direction of responses reported by participants on the adult and children/adolescent versions of the WAI. The few variations in results may be due to changes over time in the therapeutic relationship, as mentioned in the literature (Kivlighan & Shaughnessy, 1995). Correlations between all WAI-CA subscales and total scores suggest that the three dimensions associated with the construct of therapeutic alliance are interrelated in the WAI-CA, similar to what was reported for the adult version (Machado & Horvath, 1999). Positive correlations between children's/adolescents' and parents' working alliance data were found, as reported in other studies (e.g., Foreman et al., 2000). This indicates that children/adolescents and their parents evaluate the therapeutic process in a similar way, underlining the relevance of establishing and maintaining a working alliance with the parents in this age group as well, particularly when both the child/adolescent and the parents are involved in the therapy. In our study, the correlations between children/adolescents and their parents were weak, in line with the literature on cross-informant agreement between different types of informants in child clinical psychology, which consistently shows modest correlations (e.g., Achenbach, 2006; De Los Reyes, Alfano, & Beidel, 2010).

Table 7. Pearson correlations between WAI-CA and accompanying parent's WAI subscales and total scores (N = 57)

                Goals WAI (Parent)   Tasks WAI (Parent)   Bond WAI (Parent)   Total WAI (Parent)
Goals WAI-CA    .29*                 .31*                 .16                 .28*
Tasks WAI-CA    .32*                 .29*                 .17                 .29*
Bond WAI-CA     .22                  .26                  .30*                .29*
Total WAI-CA    .30*                 .31*                 .21                 .31*

Note. *p < .05.

Our data suggest that Bordin's three dimensions of the therapeutic alliance might be relevant in children and adolescents. The assumption that children and adolescents would have difficulties discriminating therapeutic goals (Karver et al., 2008) was not supported by the correlations found between children's/adolescents' self-reports on both forms (WAI-CA and WAI), or between the children's/adolescents' WAI-CA and their parents' WAI. Furthermore, the internal consistency of the Goals subscale was similar to that of the other dimensions, supporting its relevance in this population. Significant differences in the WAI-CA Bond scale according to age, gender, and diagnosis show the sensitivity of the inventory to possible differences in therapeutic alliance. Higher values were found in the Bond subscale for children compared to adolescents, for girls compared to boys, and for participants with internalizing and externalizing problems compared to participants with school problems. These results are similar to those found in the literature, where significant differences according to gender, age, and diagnosis have been reported (e.g., Shirk et al., 2011). Future developments of the WAI-CA validation process (enabling more robust reliability and validity analyses) should consider the use of a larger and more gender-balanced sample from different clinical settings and the use of other instruments for assessing external validity (e.g., the TAS); this would also allow examination of the factor structure of the measure for comparison with DiGiuseppe et al.'s (1996) work. Moreover, information regarding diagnosis severity and psychotherapists' therapeutic orientation should be collected. The results suggest that this new version of the WAI for children and adolescents (WAI-CA) is a valid measure of the working alliance for use with clients aged between 7 and 17 years, particularly with younger children, due to the language and response scale adaptations. The WAI-CA seems to be a robust inventory, usable by both researchers and clinicians interested in the psychotherapeutic alliance with children and adolescents.



Researchers on psychotherapy with children and adolescents may use this measure to examine several key issues, such as the role of the working alliance in process and outcome monitoring, its relation to psychopathology, and parental involvement in psychotherapy.

References Achenbach, T. M. (2006). As others see us: Clinical and research implications for cross-informant correlations for psychopathology. Current Directions in Psychological Science, 15, 94–98. Bickman, L., Vides de Andrade, A. R., Lambert, E. W., Doucette, A., Sapyta, J., Boyd, A. S., . . . Rauktis, M. B. (2004). Youth therapeutic alliance in intensive treatment settings. The Journal of Behavioral Health Services & Research, 31, 134–149. Blatt, S. J., Hawley, L., Ho, M., & Zuroff, D. C. (2006). The relationship of perfectionism, depression, and therapeutic alliance during treatment for depression: Latent difference score analysis. Journal of Consulting and Clinical Psychology, 74, 930–942. Bordin, E. S. (1979). The generalizability of the psychoanalytic concept of the working alliance. Psychotherapy: Theory, Research and Practice, 16, 252–260. Constantino, M. J., Castonguay, L. G., & Schut, A. J. (in press). The working alliance: a flagship for the “scientist-practitioner” model in psychotherapy. In G. Tryon (Ed.), Counseling based on research. New York, NY: Allyn & Bacon. Couture, S. M., Roberts, D. L., Penn, D. L., Cather, C., Otto, M. W., & Goff, D. (2006). Do baseline client characteristics predict the therapeutic alliance in the treatment of schizophrenia? The Journal of Nervous and Mental Disease, 194, 10–14. Creed, T. A., & Kendall, P. C. (2005). Therapist alliance-building behavior within a cognitive-behavioral treatment for anxiety in youth. Journal of Consulting Clinical Psychology, 73, 498–505. De Los Reyes, A., Alfano, C., & Beidel, D. (2010). The relations among measurements of informant discrepancies within a multisite trial of treatments for childhood social phobia. Journal of Abnormal Child Psychology, 38, 395–404. Diamond, G., Liddle, H. A., Hogue, A., & Dakof, G. A. (1999). Alliance building interventions with adolescents in family therapy: A process study. Psychotherapy, 36, 355–368. Diamond, G. M., Diamond, G. S., & Liddle, H. A. (2000). The therapist-parent alliance in family-based therapy for adolescents. Journal of Clinical Psychology, 56, 1037–1050. Diamond, G. S., Liddle, H. A., Wintersteen, M. B., Dennis, M. L., Godley, S. H., & Tims, F. (2006). Early therapeutic alliance as a predictor of treatment outcome for adolescent cannabis users in outpatient treatment. American Journal of Addictions, 15, 26–33. DiGiuseppe, R., Linscott, J., & Jilton, R. (1996). Developing the therapeutic alliance in child-adolescent psychotherapy. Applied & Preventive Psychology, 5, 85–100. Faw, L., Hogue, A., Johnson, S., Diamond, G. M., & Liddle, H. A. (2005). The Adolescent Therapeutic Alliance Scale (ATAS): Initial psychometrics and prediction of outcome in family-based substance abuse prevention counseling. Psychotherapy Research, 15, 141–154. Florsheim, P., Shotorbani, S., Guest-Warnick, G., Barratt, T., & Hwang, W. C. (2000). Role of the working alliance in the treatment of delinquent boys in community-based programs. Journal of Clinical and Child Psychology, 29, 94–107. Foreman, S. A., Gibbins, J., Grienenberger, J., & Berry, J. W. (2000). Developing methods to study child psychotherapy using



new scales of therapeutic alliance and progressiveness. Psychotherapy Research, 10, 450–461. Green, J. (2006). Annotation: The therapeutic alliance – a significant but neglected variable in child mental health treatment studies. Journal of Child Psychology and Psychiatry, 47, 425–435. Hartley, D., & Strupp, H. (1983). The therapeutic alliance: Its relationship to outcome in brief psychotherapy. In J. Masling (Ed.), Empirical studies of psychoanalytic theories (Vol. 1, pp. 1–27). Hillsdale, NJ: Erlbaum. Hawke, J. M., Hennen, J., & Gallione, P. (2005). Correlates of therapeutic involvement among adolescents in residential drug treatment. American Journal of Drug and Alcohol Abuse, 31, 163–178. Hersoug, A. G., Hoglend, P., Monsen, J. T., & Havik, O. E. (2001). Quality of working alliance in psychotherapy: Therapist variables and patient/therapist similarity as predictors. Journal of Psychotherapy Practice and Research, 10, 205–216. Hersoug, A. G., Monsen, J. T., Havik, O. E., & Hoglend, P. (2002). Quality of early working alliance in psychotherapy: Diagnoses, relationship and intrapsychic variables as predictors. Psychotherapy and Psychosomatics, 71, 18–27. Horvath, A. O., & Bedi, R. P. (2002). The alliance. In J. C. Norcross (Ed.), Psychotherapy relationships that work: Therapist contributions and responsiveness to patients (pp. 37–69). New York, NY: Oxford University Press. Horvath, A. O., & Greenberg, L. (1986). The development of the Working Alliance Inventory: A research handbook. In L. Greenberg & W. Pinsoff (Eds.), Psychotherapeutic processes: A research handbook (pp. 529–556). New York, NY: Guilford Press. Horvath, A. O., & Greenberg, L. L. (1989). Development and validation of the Working Alliance Inventory. Journal of Counseling Psychology, 36, 223–233. Horvath, A. O., & Symonds, B. D. (1991). Relation between working alliance and outcome in psychotherapy: A meta-analysis. Journal of Counseling Psychology, 38, 139–149. Karver, M. S., Handelsman, J. B., Fields, S., & Bickman, L. (2006). Meta-analysis of therapeutic relationship variables in youth and family therapy: The evidence for different relationship variables in the child and adolescent treatment outcome literature. Clinical Psychology Review, 26, 50–65. Karver, M. S., Shirk, S., Handelsman, J. B., Fields, S., Crisp, H., Gudmundsen, G., & McMakin, D. (2008). Relationship processes in youth psychotherapy: Measuring alliance, alliance-building behaviors, and client involvement. Journal of Emotional and Behavioral Disorders, 16, 15–28. Kaufman, N. K., Rohde, P., Seeley, J. R., Clarke, G. N., & Stice, E. (2005). Potential mediators of cognitive-behavioral therapy for adolescents with comorbid major depression and conduct disorder. Journal of Consulting & Clinical Psychology, 73, 38–46. Kazdin, A. E., Siegel, T. C., & Bass, D. (1990). Drawing on clinical practice to inform research on child and adolescent psychotherapy: Survey of practitioners. Professional Psychology: Research and Practice, 21, 189–198. Kivlighan, D., & Shaughnessy, P. (1995). Analysis of the development of the working alliance using hierarchical linear modelling. Journal of Counseling Psychology, 42, 338–349. Machado, P. P., & Horvath, A. O. (1999). Inventário da Aliança Terapêutica – W.A.I [Working Alliance Inventory – WAI]. In M. Simões, M. M. Gonçalves, & L. S. Almeida (Eds.), Testes e Provas Psicológicas em Portugal (Vol. 2). Braga, Portugal: APPORT/SHO. Martin, D. J., Garske, J. P., & Davis, M. K. (2000). 
Relation of the therapeutic alliance with outcome and other variables: A metaanalytic review. Journal of Consulting Clinical Psychology, 68, 438–450.




McLeod, B., & Weisz, J. R. (2005). The Therapy Process Observational Coding System – Alliance Scale: Measure characteristics and prediction of outcome in usual clinical practice. Journal of Consulting and Clinical Psychology, 73, 323–333. Robbins, M. S., Turner, C. W., Alexander, J. F., & Perez, G. A. (2003). Alliance and dropout in family therapy for adolescents with behavior problems: Individual and systemic effects. Journal of Family Psychology, 17, 534–544. Shelef, K., Diamond, G. M., Diamond, G. S., & Liddle, H. A. (2005). Adolescent and parent alliance and treatment outcome in multidimensional family therapy. Journal of Consulting Clinical Psychology, 73, 689–698. Shirk, S. R., & Karver, M. (2003). Prediction of treatment outcome from relationship variables in child and adolescent therapy: A meta-analytic review. Journal of Consulting Clinical Psychology, 71, 452–464. Shirk, S. R., Karver, M., & Brown, R. (2011). The alliance in child and adolescent psychotherapy. Psychotherapy, 48, 17–24. Shirk, S. R., & Saiz, C. C. (1992). Clinical, empirical, and developmental perspectives on the therapeutic relationship in childpsychotherapy. Development and Psychopathology, 4, 713–728. Symonds, D., & Horvath, A. O. (2004). Optimizing the alliance in couple therapy. Family Process, 43, 443–455.



Tetzlaff, B. T., Kahn, J. H., Godley, S. H., Godley, M. D., Diamond, G. S., & Funk, R. R. (2005). Working alliance, treatment satisfaction, and patterns of posttreatment use among adolescent substance users. Psychology of Addictive Behaviors, 19, 199–207.
Zack, S. E., Castonguay, L. G., & Boswell, J. F. (2007). Youth working alliance: A core clinical construct in need of empirical maturity. Harvard Review of Psychiatry, 15, 278–288.

Received September 10, 2014
Revision received December 12, 2015
Accepted January 22, 2016
Published online November 7, 2016

Bárbara Figueiredo
School of Psychology
University of Minho
Campus de Gualtar
4700-057 Braga
Portugal
Tel. +35 19 3999-3937
E-mail bbfi@psi.uminho.pt



Original Article

Measurement Invariance of English and French Language Versions of the 20-Item Toronto Alexithymia Scale

Carolyn A. Watters,1 Graeme J. Taylor,2 Lindsay E. Ayearst,3 and R. Michael Bagby4

1 Department of Psychology, University of Toronto, Toronto, Ontario, Canada
2 Department of Psychiatry, University of Toronto and Mount Sinai Hospital, Toronto, Ontario, Canada
3 Toronto, Ontario, Canada
4 Departments of Psychology and Psychiatry, University of Toronto, Toronto, Ontario, Canada

Abstract: The alexithymia construct is commonly measured with the 20-Item Toronto Alexithymia Scale (TAS-20), with more than 20 different language translations. Despite replication of the factor structure, however, it cannot be assumed that observed differences in mean TAS-20 scores can be interpreted similarly across different languages and cultural groups. It is necessary to also demonstrate measurement invariance (MI) for language. The aim of this study was to evaluate MI of the English and French versions of the TAS-20 using data from 17,866 Canadian military recruits; 71% spoke English and 29% spoke French as their first language. We used confirmatory factor analyses (CFAs) to establish a baseline model of the TAS-20, and four increasingly restrictive multigroup CFA analyses to evaluate configural, metric, scalar, and residual error levels of MI. The best fitting factor structure in both samples was an oblique 3-factor model with an additional method factor comprised of negatively-keyed items. MI was achieved at all four levels of invariance. There were only small differences in mean scores across the two samples. Results support MI of English and French versions of the TAS-20, allowing meaningful comparisons of findings from investigations in Canadian French-speaking and English-speaking groups.

Keywords: alexithymia, confirmatory factor analysis, language equivalency, measurement invariance, 20-Item Toronto Alexithymia Scale

Alexithymia is a personality construct characterized by difficulty identifying subjective feelings, difficulty describing feelings to others, a restricted imagination, and an externally oriented cognitive style (Taylor & Bagby, 2012). The construct was formulated by Nemiah, Freyberger, and Sifneos (1976) and has since generated a large amount of research by investigators in many different countries, especially during the last two decades (for recent reviews see Luminet, Vermeulen, & Grynberg, 2013; Taylor & Bagby, 2012). The expansion of research on alexithymia can be attributed in large part to the development of the 20-item Toronto Alexithymia Scale (TAS-20; Bagby, Parker, & Taylor, 1994), as it provided researchers with a reliable, valid, and common metric to measure alexithymia. Validated in Canadian community, student, and psychiatric outpatient samples, the TAS-20 comprises three factors that map onto the theoretical conception of alexithymia – Difficulty identifying feelings (DIF), Difficulty describing feelings to others (DDF), and Externally oriented thinking (EOT; Bagby, Parker, et al., 1994; Parker, Taylor, & Bagby, 2003).


Although the TAS-20 does not include a factor for assessing restricted imagination, the EOT factor correlates negatively with measures of fantasy and imaginal processes, suggesting that this factor also indirectly assesses the restricted imagination facet of the alexithymia construct (Bagby, Taylor, & Parker, 1994; Taylor & Bagby, 2013). The TAS-20 has been translated into more than 20 different languages, and the theoretically informed 3-factor structure has been extracted with confirmatory factor analysis (CFA) in East Asian and Western and Eastern European countries with nonclinical and/or clinical samples (e.g., Meganck, Vanheule, & Desmet, 2008; Taylor, Bagby, & Parker, 2003; Tsaousis et al., 2010; Zhu, Yi, Ryder, Taylor, & Bagby, 2007). Despite replication of what is now regarded as the "standard" 3-factor structure with most translated versions of the TAS-20, it cannot be assumed that observed differences (or non-differences) in mean TAS-20 scores can be interpreted similarly across different languages and cultural groups. It is essential to also demonstrate measurement invariance (MI) for language (Chen, 2008).




An instrument or test can be said to show MI across groups when members of each group assign the same meanings to its constituent items, and when respondents who share the same level of the underlying construct obtain the same score regardless of group membership (Meredith, 1993). Establishing MI for language is important for comparing findings from countries with different languages, and it is particularly relevant in countries with more than one language, where data are commonly pooled for analyses. Demonstrating invariance allows for generalizing research findings across languages and for making group comparisons that are not inflated or attenuated due to measurement error.

Establishing Language Equivalency

Four levels of MI are generally considered (Meredith, 1993; Milfont & Fischer, 2010). First is configural invariance, which is met when a construct is made up of the same number of factors, with the same items associated with each factor, across groups; if this level is not met, it can be concluded that the assessment instrument is not measuring the same construct across groups. Second is metric invariance, which is met when the factor loadings of all items are also equivalent across the groups; equivalence at this level is required for meaningful comparison of predictive relationships across groups. Third is scalar invariance, which is met when individual items show the same point of origin (i.e., intercept) across the groups; this level of equivalence is necessary for comparing group means (Chen, 2008). Finally, residual error or uniqueness invariance (Vandenberg & Lance, 2000) is met when error variances are similar across groups; comparisons can then be made between raw observed scores rather than having to control for measurement error through methods such as latent variable modeling. Only three studies have investigated MI of the TAS-20; none investigated equivalency across different languages. One study demonstrated MI of the English version of the TAS-20 across predominantly Anglo-American and predominantly Hispanic student samples in the United States (Culhane, Morera, Watson, & Millsap, 2009). Another study demonstrated partial MI of the Dutch language version of the TAS-20 across university student and psychiatric outpatient samples (Meganck et al., 2008). Bagby, Ayearst, Morariu, Watters, and Taylor (2014) demonstrated MI of Internet-administered and paper-and-pencil versions of the TAS-20, indicating that outcomes generated by different studies using either format are generalizable to one another. Given that the French translation of the TAS-20 is used by researchers in France, Belgium, and Canada, the objective of the current investigation was to test MI across the French and English language versions of the TAS-20.


Method

Participants and Procedure

The data for this study were provided by the Canadian Government Department of National Defence (DND) and were collected as part of a larger research study being conducted by the DND. The participants in the larger study were 21,817 Canadian Armed Forces (CAF) recruits who completed the CAF Recruit Health Questionnaire (RHQ; Lee, Whitehead, & Dubiniecki, 2010) in the first few weeks of their basic training at the CAF Leadership and Recruit School between July 2003 and December 2009. The recruits gave written voluntary informed consent to their participation as approved by the Defence Research and Development Canada Research Ethics Committee. To be included in the current study sample, participants must have provided complete data on the TAS-20. Of the total number of recruits who met this criterion (N = 17,866), 71% (n = 12,706) spoke English as their first language; the remaining 29% (n = 5,160) spoke French as their first language. These two groups are hereafter referred to as English-speaking and French-speaking samples.

Measures

The RHQ was developed by investigators at the DND and is comprised of several questionnaires and scales, including the TAS-20 (Lee et al., 2010). It is administered to all recruits in order to collect socio-demographic data and information about a variety of factors associated with health and well-being, including overall health, alcohol consumption, personality traits, emotional awareness, and types of social support. The RHQ can be administered in either an English or French language version. The TAS-20 items are rated on a 5-point Likert scale ranging from 1 (= strongly disagree) to 5 (= strongly agree); five items are reverse scored (Bagby, Parker, et al., 1994). Scores range from 20 to 100; higher scores indicate higher levels of alexithymia. The French translation of the TAS-20 was developed in France two decades ago, and the standard 3-factor structure was subsequently replicated with CFA in both clinical and nonclinical samples (Loas et al., 2001). The quality of the French translation was verified by a back-translation made by a bilingual psychologist (Loas, Otmani, Verrier, Fremaux, & Marchand, 1996).
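A minimal scoring sketch for this format is shown below; the reverse-keyed item numbers (4, 5, 10, 18, 19) are taken from Table 3 of this article, and the recoding rule (6 minus the response) is the usual convention for a 1–5 scale rather than something stated in the text.

```python
from typing import Dict

REVERSE_KEYED = {4, 5, 10, 18, 19}  # reverse-scored TAS-20 items (see Table 3)

def tas20_total(responses: Dict[int, int]) -> int:
    """Total TAS-20 score (range 20-100) from 1-5 ratings keyed by item number."""
    return sum(6 - r if item in REVERSE_KEYED else r
               for item, r in responses.items())

# Example: rating every item 3 yields a total of 60.
print(tas20_total({i: 3 for i in range(1, 21)}))
```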

Statistical Analyses

Three sets of analyses were conducted. First, we tested a number of potential factor models to determine the optimal structure in each sample (i.e., the baseline model) for subsequent MI analyses.



We specifically tested models that have been examined in most previous investigations of the TAS-20. These included a unidimensional model, the standard oblique 3-factor model, a 4-factor model in which EOT is split into two distinct factors ("Pragmatic thinking" with three items and "Lack of importance of emotions" with five items), and the standard 3-factor model with an additional method factor comprised of the reverse-scored items (Bagby, Parker, et al., 1994; Mattila et al., 2010; Meganck et al., 2008; Müller, Bühner, & Ellgring, 2003; Parker et al., 2003; Tsaousis et al., 2010; Zhu et al., 2007). The second set of analyses involved MI testing of the optimal baseline model, evaluated through four increasingly restrictive multigroup confirmatory factor analyses (MG-CFA), which represent the four levels of invariance described earlier. Depending on the results of the MI testing, the third set of analyses involved group comparisons through either latent mean tests (i.e., if scalar invariance was met) or observed group comparisons (i.e., t-tests, if residual error invariance was met). Internal consistency and item-to-scale homogeneity of the TAS-20 and its factor scales were also evaluated by calculating alpha coefficients (with coefficient α ≥ .70 indicating adequate internal consistency) and mean inter-item correlations (MIC; with a range between .15 and .50 indicating acceptable homogeneity of item content). To determine the degree of association among the TAS-20 factor scales and the total scale, Pearson correlations were calculated. All CFA and MG-CFA models were tested with EQS 6.1 statistical software using maximum likelihood estimation (Bentler, 2005); all other analyses were conducted using SPSS 16.0 (SPSS Inc., 2008). The data were treated as continuous, as the alexithymia construct has been shown to be dimensional (Mattila et al., 2010). For all CFA models, factor variances were set to equal one. To assess goodness of model fit, several indices were utilized: the Satorra-Bentler chi-square statistic (S-Bχ²), the root-mean-square error of approximation (RMSEA) with its 90% confidence interval (90% CI), and the comparative fit index (CFI), all three of which were corrected for non-normality, and the standardized root-mean-square residual (SRMR; Browne & Cudeck, 1993; Byrne, 2006; Hu & Bentler, 1999; Satorra & Bentler, 1994).


Since we expected TAS-20 scores to violate normality assumptions (i.e., positive skewness) due to the nonclinical nature of the sample, a scaling correction offered in EQS was utilized, in which the S-Bχ² statistic and standard errors robust to non-normality were generated, resulting in the corrected statistics for S-Bχ², RMSEA, and CFI (Byrne, 2006). Given the known sensitivity of χ² statistics to sample size and the use of a large sample, it was expected that S-Bχ² would be significant for all models. The quality of each CFA model was evaluated according to the following widely accepted criteria: RMSEA ≤ .08, SRMR ≤ .10, and CFI ≥ .90 for acceptable fit; RMSEA ≤ .05, SRMR ≤ .08, and CFI ≥ .95 for good fit (Browne & Cudeck, 1993; Hu & Bentler, 1999). To compare the quality of competing baseline models, the Akaike Information Criterion (AIC; Akaike, 1987) was utilized, in which a smaller number represents the more optimal model. Two statistics were used as indicators of invariance: (1) the change in CFI (ΔCFI), where invariance is met based on ΔCFI ≤ .01 (Chen, 2007; Cheung & Rensvold, 2002); and (2) the change in RMSEA (ΔRMSEA), with a change ≤ .015 representing invariance between MI levels (Chen, 2007). The assumption of MI at each level was accepted if CFI and RMSEA did not change by more than .01 or .015, respectively, between increasingly restrictive MI models. Of note, although more stringent change criteria to determine MI have been proposed (see Meade, Johnson, & Braddy, 2008), we chose ΔCFI and ΔRMSEA due to the wide acceptance and use of these two particular criteria in the literature.
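For illustration, the fit and comparison criteria described above can be expressed as follows. This is a simplified sketch working from (scaled) chi-square values rather than a re-implementation of EQS; the AIC form shown does, however, reproduce the values in Table 2 (e.g., 6,513.61 − 2 × 170 = 6,173.61).

```python
import math

def cfi(chi2: float, df: int, chi2_null: float, df_null: int) -> float:
    """Comparative fit index relative to the independence (null) model."""
    return 1 - max(chi2 - df, 0) / max(chi2_null - df_null, chi2 - df, 0)

def rmsea(chi2: float, df: int, n: int) -> float:
    """Root-mean-square error of approximation for a single-group model."""
    return math.sqrt(max(chi2 - df, 0) / (df * (n - 1)))

def aic(chi2: float, df: int) -> float:
    """Model-comparison AIC as reported in Table 2: chi-square minus 2 * df."""
    return chi2 - 2 * df

def invariance_retained(cfi_prev: float, cfi_curr: float,
                        rmsea_prev: float, rmsea_curr: float) -> bool:
    """MI retained if CFI drops by no more than .01 and RMSEA rises by no more than .015."""
    return (cfi_prev - cfi_curr) <= .01 and (rmsea_curr - rmsea_prev) <= .015

print(aic(6513.61, 170))  # 6173.61, matching the 1-factor French model in Table 2
```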

Results

Sample Characteristics

Most recruits in both samples were male, aged 24 years or younger, and had completed postsecondary or high school education (see Table 1). Using a significance level set at p < .01 (Bonferroni correction for multiple comparisons, α = .05/5), there were some differences in education between the samples, but the effect sizes were small based on Cramér's V < .15. In the combined sample, 31 recruits did not report gender, 390 did not report age, and 182 did not report education.
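A sketch of one such comparison is given below, assuming 2 × 2 count tables (language group × category) rather than the raw recruit data; the Bonferroni adjustment and Cramér's V follow their standard definitions.

```python
import numpy as np
from scipy.stats import chi2_contingency

def compare_groups(counts_2x2: np.ndarray, alpha: float = .05, n_tests: int = 5):
    """Chi-square test with a Bonferroni-adjusted alpha and Cramér's V effect size."""
    chi2, p, dof, _ = chi2_contingency(counts_2x2, correction=False)
    n = counts_2x2.sum()
    cramers_v = np.sqrt(chi2 / (n * (min(counts_2x2.shape) - 1)))
    return chi2, p, p < alpha / n_tests, cramers_v
```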

Table 1. Demographics for French-speaking and English-speaking samples

                                 French (n = 5,160)   English (n = 12,706)   χ²(df)        p
% male                           86                   84                       6.23 (1)    .013
% 24 years and younger           66                   68                       6.42 (1)    .011
% not completed high school      20                   11.6                   211.11 (1)    <.001
% completed high school          51                   49                       8.33 (1)    .004
% completed postsecondary        23                   29                      82.67 (1)    <.001






Table 2. Goodness of fit indices for competing TAS-20 factor structures (French, n = 5,160; English, n = 12,706)

Model                                  S-Bχ²       df    CFI   RMSEA (90% CI)      SRMR   AIC
A. 1-Factor
  French                                6,513.61   170   .76   .085 (.083, .087)   .08     6,173.61
  English                              16,357.43   170   .78   .087 (.085, .088)   .08    16,017.43
B. 3-Factor (DIF, DDF, EOT)
  French                                5,477.35   167   .80   .079 (.077, .080)   .10     5,143.35
  English                              12,444.19   167   .83   .076 (.075, .077)   .10    12,110.19
C. 4-Factor (DIF, DDF, PR, IM)
  French                                5,204.84   164   .81   .077 (.075, .079)   .09     4,876.84
  English                              11,160.27   164   .85   .073 (.072, .074)   .08    10,832.27
D. 4-Factor (DIF, DDF, EOT, Method)
  French                                3,396.35   162   .88   .062 (.060, .064)   .05     3,072.35
  English                               7,239.43   162   .90   .059 (.057, .060)   .04     6,915.43

Notes. DIF = Difficulty Identifying Feelings; DDF = Difficulty Describing Feelings; EOT = Externally Oriented Thinking; PR = Pragmatic Thinking; IM = Lack of Importance of Emotions; S-Bχ² = Satorra-Bentler chi-square; df = degrees of freedom; CFI = comparative fit index; RMSEA = root-mean-square error of approximation; CI = confidence interval; SRMR = standardized root-mean-square residual; AIC = Akaike Information Criterion. All S-Bχ² values were significant at p < .001.

Item-level descriptive statistics (available on request) were positively skewed for both the French and English versions of the scale; item mean scores and standard deviations were similar across languages.

Confirmatory Factor Analysis

The values of the goodness-of-fit indices are displayed in Table 2. The optimal fitting model for both the English-speaking and French-speaking samples was the standard oblique 3-factor model (DIF, DDF, EOT) with an orthogonal method factor comprised of the reverse-scored items. Although the CFI was slightly less than .90 in the French-speaking sample, it was acceptable in the English-speaking sample, and the other two indices were acceptable in both samples. The standardized parameter estimates of this model are displayed in Table 3.

Alpha Coefficients, Mean Inter-Item Correlations, and Comparison of Mean Scores

Descriptive statistics for the TAS-20 and its factor scales are displayed in Table 5. The French-speaking sample scored higher than the English-speaking sample on the total TAS-20 and lower on EOT, although the effect sizes of these differences (Cohen's d) were small. There was a small effect size for the difference on DIF, with the French-speaking sample scoring slightly higher than the English-speaking sample. The alpha coefficients for the two samples were very similar and exceeded the criterion of .70 for adequate reliability for the total scale and for DIF and DDF, but were below this standard for EOT. The MICs were within the recommended range of .15–.50 for both samples, except for EOT.
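The two descriptive statistics reported here can be sketched as follows (a pooled-SD Cohen's d and the mean off-diagonal inter-item correlation); the arrays stand in for the actual French- and English-sample data.

```python
import numpy as np

def cohens_d(x: np.ndarray, y: np.ndarray) -> float:
    """Cohen's d using the pooled standard deviation of the two groups."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def mean_inter_item_correlation(items: np.ndarray) -> float:
    """Average off-diagonal Pearson correlation among item columns (the MIC)."""
    corr = np.corrcoef(items, rowvar=False)
    return corr[~np.eye(corr.shape[0], dtype=bool)].mean()
```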

Measurement Invariance


The fit indices and difference statistics used to test the MI assumptions are displayed in Table 4. As the CFI and RMSEA values did not decline by more than .01 and .015, respectively, the results suggest that MI was achieved at all four levels of invariance, implying that the item composition of the factors was equal across the two samples, that the items did not function differently based on language, and that the means of the two samples are interpretively comparable using raw, observed scores. Comparison of factor covariances also showed invariance across languages, with inter-factor latent correlations ranging from .58 to .84 (French [English]: DIF-DDF = .84 [.84], DIF-EOT = .58 [.58], DDF-EOT = .70 [.66]).

Factor Scale Intercorrelations

The magnitudes of the intercorrelations among the three TAS-20 factor scales and the total scale were similar for the two language versions of the scale, and all correlations were significant at p < .001 (French [English]: DIF-DDF = .66 [.65], DIF-EOT = .35 [.38], DIF-Total = .86 [.87]; DDF-EOT = .44 [.43], DDF-Total = .86 [.85]; EOT-Total = .70 [.71]).


Discussion

The findings of this study support MI of the English and French language versions of the TAS-20.




Table 3. Standardized parameter estimates for CFA of the optimal TAS-20 model: French [English]

Item                                                 DIF          DDF          EOT            Method
Difficulty Identifying Feelings
  1. I am often confused about...                    .72 [.74]
  3. I have physical sensations that...              .51 [.60]
  6. When I am upset, I don't know if...             .68 [.73]
  7. I am often puzzled by...                        .48 [.71]
  9. I have feelings that...                         .77 [.80]
  13. I don't know what's going on...                .79 [.80]
  14. I often don't know why...                      .65 [.70]
Difficulty Describing Feelings
  2. It is difficult for me to find...                            .78 [.76]
  4-R. I am able to describe...                                   .48 [.40]                   .38 [.36]
  11. I find it hard to describe...                               .74 [.72]
  12. People tell me to describe...                               .63 [.62]
  17. It is difficult for me to reveal...                         .63 [.62]
Externally Oriented Thinking
  5-R. I prefer to analyze problems rather...                                  .02a [.02a]    .49 [.37]
  8. I prefer just to let things...                                            .40 [.44]
  10-R. Being in touch with...                                                 .22 [.26]      .61 [.50]
  15. I prefer talking to...                                                   .64 [.60]
  16. I prefer to watch "light" entertainment...                               .43 [.43]
  18-R. I can feel close to someone, even...                                   .05 [.06]      .46 [.51]
  19-R. I find examination of my feelings...                                   .15 [.23]      .57 [.70]
  20. Looking for hidden meanings...                                           .33 [.38]

Notes. R = reverse-scored item. a Nonsignificant parameter estimate. DIF = Difficulty Identifying Feelings; DDF = Difficulty Describing Feelings; EOT = Externally Oriented Thinking.

Table 4. Goodness of fit indices and difference statistics of MI models for French and English versions of the TAS-20

Model            S-Bχ²       df    CFI    RMSEA (90% CI)      SRMR   ΔCFI    ΔRMSEA
Configural       10,668.99   324   .896   .060 (.059, .061)   .049
Metric           10,960.98   349   .893   .058 (.057, .059)   .054   .003    .002
Covariances      10,973.02   352   .893   .058 (.057, .059)   .053   <.001   <.001
Scalar           13,236.39   372   .895   .059 (.058, .060)   .054   .002    .001
Residual error   13,848.64   392   .891   .059 (.058, .060)   .055   .004    <.001

Notes. S-Bχ² = Satorra-Bentler chi-square; df = degrees of freedom; CFI = comparative fit index; RMSEA = root-mean-square error of approximation; CI = confidence interval; SRMR = standardized root-mean-square residual; ΔCFI = change in CFI (all values not significant); ΔRMSEA = change in RMSEA (all values not significant).

Our finding that the 3-factor model with an additional method factor was the optimal fitting model in both the English- and French-speaking samples (configural invariance) implies that the two language groups conceptualize the alexithymia construct in a similar way. The finding that the factor loadings of the items were similar across the two samples (metric invariance) indicates that the TAS-20 items function similarly whether presented in English or French, and that predictive relationships can be meaningfully compared across groups.


The demonstration of scalar invariance implies that the meaning of the construct and the levels of the underlying items are the same for the English- and French-speaking groups, and that mean scores can be meaningfully compared across groups. The finding of residual error/uniqueness invariance means that the error variance for each of the items is similar across the two language groups, implying that raw observed scores can be compared across groups without concern for the unequally distributed influence of measurement error (Vandenberg & Lance, 2000).





Table 5. Descriptive statistics and mean difference tests for French and English versions of the TAS-20 (French [English])

               Mean            SD              t(17,864)   pa      d     α           MIC
TAS-20 total   48.44 [47.65]   11.80 [12.03]    4.01       <.001   .07   .85 [.86]   .22 [.23]
DIF            14.97 [13.98]    5.60 [5.93]    10.55       <.001   .17   .84 [.89]   .44 [.53]
DDF            13.19 [13.01]    4.61 [4.48]     2.37        .018         .79 [.77]   .43 [.40]
EOT            20.27 [20.65]    4.25 [4.38]     5.30        .001   .09   .50 [.54]   .11 [.13]

Notes. DIF = Difficulty Identifying Feelings; DDF = Difficulty Describing Feelings; EOT = Externally Oriented Thinking. d = Cohen's d; α = Cronbach's α; MIC = mean inter-item correlation. a Significant at p < .01; Bonferroni correction (α = .05/4 comparisons).

Our findings allow for meaningful comparisons of results obtained from investigations in French-speaking and English-speaking groups in Canada, as any observed differences in mean TAS-20 scores are not likely due to measurement artifacts. Although the language equivalency established in this study likely generalizes to findings from the numerous investigations of alexithymia that have been conducted with the same French translation of the TAS-20 in France and Belgium (e.g., Baezo-Velasco, Carton, & Almohsen, 2012; Berthoz et al., 2002; Grynberg, Luminet, Corneille, Grèzes, & Berthoz, 2010), further research is needed to demonstrate MI across French, Belgian, and Canadian samples, as cultural differences between Europe and Canada may be larger than those between two language groups within Canada. Consistent with previous studies (Gignac, Palmer, & Stough, 2007; Mattila et al., 2010; Meganck et al., 2008), which reported that the fit of the standard 3-factor model of the TAS-20 or a nested model (the three factors nested within a first-order global alexithymia factor) improved after adding the same method factor, we found that the best fitting factor model included a method factor comprised of the five reverse-scored items. Although the CFI did not quite meet the criterion value in the French-speaking sample, researchers generally agree that cutoff values should not be treated as "golden rules" or replace sound judgement (Chen, 2007; Marsh, Hau, & Wen, 2004). The low reliability of the EOT factor in our study is consistent with findings in previous investigations of the French and several other translated versions of the TAS-20 (Eid & Boucher, 2012; Loas et al., 2001; Meganck et al., 2008; Tsaousis et al., 2010; Zhu et al., 2007). Given that the four reverse-scored EOT items had higher loadings on the method factor than on the EOT factor in our study, these items may not adequately represent the EOT facet of the alexithymia construct, or they may be unduly influenced by the negatively keyed structure of the items, in turn causing a response bias; further research is needed to examine the influence of these issues on assessing the EOT component of the alexithymia construct.


Despite this, the EOT factor shows similar or higher magnitude correlations than the DIF and DDF factors with the total score on the English and various translated versions of the Toronto Structured Interview for Alexithymia, implying that EOT does contribute to the assessment of the multifaceted alexithymia construct (e.g., Bagby, Taylor, Parker, & Dickens, 2006; Grabe et al., 2009). Limitations of our study include the use of a sample of military recruits only, the comparatively young age (vis-à-vis the general population) of these recruits, and the small percentage of female participants. MI may not be supported across gender or in older adults or patient groups. The strengths of the study include the large sample size and the demonstration of language equivalence at four different, increasingly restrictive levels of invariance.

Acknowledgments

This research was supported by a Joseph-Armand Bombardier Doctoral Scholarship awarded to Carolyn A. Watters by the Social Sciences and Humanities Research Council of Canada (Grant: 767-2012-1272). The authors express thanks to Jeff Whitehead, MSc, MD, FRCPC and Jennifer E. C. Lee, PhD at the Canadian Government Department of National Defence for providing the TAS-20 data collected from Canadian Armed Forces military recruits.

References Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317–332. doi: 10.1007/BF02294359 Baezo-Velasco, C., Carton, S., & Almohsen, C. (2012). Alexithymia and emotional awareness in females with painful rheumatic conditions. Journal of Psychosomatic Research, 73, 398–400. doi: 10.1016/j.jpsychores.2012.08.008 Bagby, R. M., Ayearst, L. E., Morariu, R. A., Watters, C., & Taylor, G. J. (2014). The internet administration version of the 20-item Toronto Alexithymia Scale. Psychological Assessment, 26, 16–22. doi: 10.1037/a0034316 Bagby, R. M., Parker, J. D. A., & Taylor, G. J. (1994). The Twenty-Item Toronto Alexithymia Scale – I. Item selection and cross-validation of the factor structure. Journal of Psychosomatic Research, 38, 23–32. doi: 10.1016/0022-3999 (94)90005-1




Bagby, R. M., Taylor, G. J., & Parker, J. D. A. (1994). The twentyitem Toronto Alexithymia Scale-II. Convergent, discriminant, and concurrent validity. Journal of Psychosomatic Research, 38, 33–40. doi: 10.1016/0022-3999(94)90006-X Bagby, R. M., Taylor, G. J., Parker, J. D. A., & Dickens, S. E. (2006). The development of the Toronto Structured Interview for Alexithymia: Item selection, factor structure, reliability and concurrent validity. Psychotherapy and Psychosomatics, 75, 25–39. doi: 10.1159/000089224 Bentler, P. M. (2005). ESQ 6.1 Structural equations program manual. Los Angeles, CA: Multivariate Software. Berthoz, S., Artiges, E., Van de Moortele, P.-F., Poline, J.-B., Rouquette, S., Consoli, S. M., & Martinot, J.-L. (2002). Effect of impaired recognition and expression of emotions on frontocingulate cortices: An fMRI study of men with alexithymia. American Journal of Psychiatry, 159, 961–967. Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage. Byrne, B. M. (2006). Structural equation modeling with EQS (2nd ed.). New York, NY: Psychology Press. Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504. doi: 10.1080/10705510701301834 Chen, F. F. (2008). What happens if we compare chopsticks with forks? The impact of making inappropriate comparisons in cross-cultural research. Journal of Personality and Social Psychology, 95, 1005–1018. doi: 10.1037/a0013193 Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodnessto-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. doi: 10.1207/ S15328007SEM0902_5 Culhane, S. E., Morera, O. F., Watson, P. J., & Millsap, R. E. (2009). Assessing measurement and predictive invariance of the Toronto Alexithymia Scale-20 in U.S. Anglo and U.S. Hispanic student samples. Journal of Personality Assessment, 91, 387–395. doi: 10.1080/00223890902936264 Eid, P., & Boucher, S. (2012). Alexithymia and dyadic adjustment in intimate relationships: Analyses using the actor partner interdependence model. Journal of Social and Clinical Psychology, 31, 1095–1111. doi: 10.1521/jscp.2012.31.10.1095 Gignac, G. E., Palmer, B. R., & Stough, C. (2007). A confirmatory factor analytic investigation of the TAS-20: Corroboration of a five-factor model and suggestions for improvement. Journal of Personality Assessment, 89, 247–257. doi: 10.1080/ 00223890701629730 Grabe, H. J., Löbel, S., Dittrich, D., Bagby, R. M., Taylor, G. J., Quilty, L. C., . . . Rufer, M. (2009). The German version of the Toronto Structured Interview for Alexithymia: Factor structure, reliability, and concurrent validity in a psychiatric patient sample. Comprehensive Psychiatry, 50, 424–430. doi: 10.1016/j.comppsych.2008.11.008 Grynberg, D., Luminet, O., Corneille, O., Grèzes, J., & Berthoz, S. (2010). Alexithymia in the interpersonal domain: A general deficit in empathy? Personality and Individual Differences, 49, 845–850. doi: 10.1016/j.paid.2010.07.013 Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55. doi: 10.1080/10705519909540118 Lee, J. E. C., Whitehead, J., & Dubiniecki, C. (2010). Descriptive analyses of the Recruit Health Questionnaire: 2003–2004 (DGMPRA TM 2010–010). Ottawa, Canada: Director General



Military Personnel Research and Analysis, Department of National Defence. Loas, G., Corcos, M., Stephan, P., Pellet, J., Bizouard, P., & Venisse, J. L., The Réseau INSERM no. 494013. (2001). Factorial structure of the 20-item Toronto Alexithymia Scale. Confirmatory factorial analyses in nonclinical and clinical samples. Journal of Psychosomatic Research, 50, 255–261. doi: 10.1016/S0022-3999(01)00197-0. Loas, G., Otmani, O., Verrier, A., Fremaux, D., & Marchand, M. P. (1996). Factor analysis of the French version of the 20-Item Toronto Alexithymia Scale (TAS-20). Psychopathology, 29, 139–144. doi: 10.1207/s15327906mbr3201_3 Luminet, O., Vermeulen, N., & Grynberg, D. (2013). L’Alexithymie. Bruxelles, Belgium: De Boeck. Marsh, H. W., Hau, K.-T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Structural Equation Modeling, 11, 320–341. doi: 10.1207/s15328007sem1103_2 Mattila, A. K., Keefer, K. V., Taylor, G. J., Joukamaa, M., Jula, A., Parker, J. D. A., & Bagby, R. M. (2010). Taxometric analysis of alexithymia in a general population sample from Finland. Personality and Individual Differences, 49, 216–221. doi: 10.1016/j.paid.2010.03.038 Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592. doi: 10.1037/0021-9010.93.3.568 Meganck, R., Vanheule, S., & Desmet, M. (2008). Factorial validity and measurement invariance of the 20-item Toronto Alexithymia Scale in clinical and nonclinical samples. Assessment, 15, 36–47. doi: 10.1177/1073191107306140 Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525–543. doi: 10.1007/ BF02294825 Milfont, T. L., & Fischer, R. (2010). Testing measurement invariance across groups: Applications in cross-cultural research. International Journal of Psychological Research, 3, 112–131. Müller, J., Bühner, M., & Ellgring, H. (2003). Is there a reliable factorial structure in the 20-item Toronto Alexithymia Scale? A comparison of factor models in clinical and normal adult samples. Journal of Psychosomatic Research, 55, 561–568. doi: 10.1016/S0022-3999(03)00033-3 Nemiah, J. C., Freyberger, H., & Sifneos, P. E. (1976). Alexithymia: A view of the psychosomatic process. In O. W. Hill (Ed.), Modern trends in psychosomatic medicine (Vol. 3, pp. 430–439). London, UK: Butterworths. Parker, J. D. A., Taylor, G. J., & Bagby, R. M. (2003). The 20-item Toronto Alexithymia Scale III. Reliability and factorial validity in a community population. Journal of Psychosomatic Research, 55, 269–275. doi: 10.1016/S00223999(02)00578-0 Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399–419). Thousand Oaks, CA: Sage. SPSS Inc. (2008). SPSS Version 16.0 [Computer software]. Chicago, IL: Author. Taylor, G. J., & Bagby, R. M. (2012). The alexithymia personality dimension. In T. A. Widiger (Ed.), The Oxford handbook of personality disorders (pp. 648–673). New York, NY: Oxford University Press.

European Journal of Psychological Assessment (2019), 35(1), 29–36


36

Taylor, G. J., & Bagby, R. M. (2013). Alexithymia and the five-factor model of alexithymia. In T. A. Widiger & P. T. Costa Jr. (Eds.), Personality disorders and the five-factor model of personality (pp. 193–207). Washington, DC: American Psychological Association. Taylor, G. J., Bagby, R. M., & Parker, J. D. A. (2003). The 20-item Toronto Alexithymia Scale IV. Reliability and factorial validity in different languages and cultures. Journal of Psychosomatic Research, 55, 277–283. doi: 10.1016/S00223999(02)00601-3 Tsaousis, I., Taylor, G., Quilty, L., Georgiades, S., Stavrogiannopoulos, M., & Bagby, R. M. (2010). Validation of a Greek adaptation of the 20-item Toronto Alexithymia Scale. Comprehensive Psychiatry, 51, 443–448. doi: 10.1016/ j.comppsych.2009.09.005 Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–70. doi: 10.1177/ 109442810031002

European Journal of Psychological Assessment (2019), 35(1), 29–36

C. A. Watters et al., Measurement Invariance of the TAS-20

Zhu, X., Yi, J., Ryder, A. G., Taylor, G. J., & Bagby, R. M. (2007). Cross-cultural validation of a Chinese translation of the 20-item Toronto Alexithymia Scale. Comprehensive Psychiatry, 48, 489–496. doi: 10.1016/j.comppsych.2007.04.007 Received August 4, 2015 Revision received January 7, 2016 Accepted January 22, 2016 Published online November 7, 2016 R. Michael Bagby Department of Psychology University of Toronto 1265 Military Trail Toronto, Ontario M1C 1A4 Canada Tel. (+1) 416 508-4134 Fax (+1) 416 586-8654 E-mail rmichael.bagby@utoronto.ca

Ó 2016 Hogrefe Publishing


Original Article

Psychometric Properties of the Basic Psychological Need Satisfaction and Frustration Scale – Intellectual Disability (BPNSFS-ID)

Noud Frielink,1,2 Carlo Schuengel,3 and Petri J. C. M. Embregts1,2,4

1 Department Tranzo, Tilburg School of Social and Behavioral Sciences, Tilburg University, The Netherlands
2 Dichterbij Innovation and Science, Gennep, The Netherlands
3 Section of Clinical Child and Family Studies and EMGO+ Institute for Health and Care Research, Vrije Universiteit Amsterdam, The Netherlands
4 Department Medical and Clinical Psychology, Tilburg School of Social and Behavioral Sciences, Tilburg University, The Netherlands

Abstract: The Basic Psychological Need Satisfaction and Frustration Scale – Intellectual Disability (BPNSFS-ID), an adapted version of the original BPNSFS (Chen, Vansteenkiste, et al., 2015), operationalizes satisfaction and frustration with the three basic psychological needs according to self-determination theory (SDT): autonomy, relatedness, and competence. The current study examined the psychometric properties of the BPNSFS-ID in a group of 186 adults with mild to borderline intellectual disability (MBID). The results indicated an adequate factorial structure of the BPNSFS-ID, comprising the satisfaction and frustration of each of the three needs. The associations between the BPNSFS-ID subscales autonomy, relatedness, and competence and the self-determination subscale of the Personal Outcome Scale (POS), the De Jong Gierveld Loneliness Scale, and the General Self-Efficacy Scale – 12 (GSES-12) supported the construct validity. In addition, the BPNSFS-ID demonstrated high internal consistency (α = .92) and 2-week test-retest reliability (r = .81 for the composite subscale autonomy, r = .69 for the composite subscale relatedness, and r = .85 for the composite subscale competence). Overall, the BPNSFS-ID proved to be a valid and reliable measure of basic psychological need satisfaction and need frustration among people with MBID.

Keywords: basic psychological need satisfaction and need frustration, needs universality, self-determination theory, intellectual disability, psychometric properties

Over the past three decades the importance of the quality of life concept of people with intellectual disabilities (ID) has been highlighted. According to Schalock and his colleagues (2002), subjective well-being is a key component of quality of life in this population. Subjective well-being can be described as a positive global perception of one's life, consisting of cognitive (e.g., life satisfaction) and affective (the presence of happiness and absence of negative feelings) components (Diener, 2000). Self-determination theory (SDT) posits that individuals have three innate, universal psychological needs, whose satisfaction is crucial for subjective well-being (Ryan & Deci, 2000). These are the needs for autonomy (i.e., perceiving that people can make their own decisions and choices), relatedness (i.e., feeling that one is connected to and cared for by other people), and competence (i.e., feeling effective in achieving valued outcomes). Consequently, if the needs for autonomy, relatedness, and competence are fulfilled, one should

experience subjective well-being (Howell, Chenot, Hill, & Howell, 2011; Tay & Diener, 2011), regardless of level of intellectual functioning (Deci, 2004). Although it has been argued that the basic psychological needs are universally important (Deci, 2004; Deci & Ryan, 2000), there is a dearth of research on these needs in people with ID. Studying these basic psychological needs in people with ID is important from SDT's perspective as it may provide additional support for the universality claim of SDT (i.e., the theory is applicable to all people, regardless of intellectual functioning). Moreover, studying these needs is critical for the ID field as it may provide insight into how to support people with ID to achieve optimal well-being. Based on their study among students with learning disabilities, Deci, Hodges, Pierson, and Tomassone (1992) concluded that students function more positively when teachers support their autonomy rather than control and pressure them. In addition, Grolnick and Ryan (1990)



found that many of the motivation and self-evaluative problems that children with learning disabilities have may be nonspecific; they may be apparent in other children who have difficulties in learning as well. It should be mentioned, however, that the vast majority of the participants in both studies had a below average IQ (< 80) but not an ID. There are few large-scale studies because of a lack of psychometrically adequate instruments to quantify the extent to which the three psychological needs are fulfilled among people with ID. Therefore, valid and reliable instruments for the assessment of autonomy, relatedness, and competence are urgently needed for people with ID. The current study, which focuses on the psychometric properties of such an instrument, is therefore an essential first step. Self-determination theory researchers have developed several valid and reliable global and domain-specific scales for need satisfaction and need frustration for the nonintellectually disabled population, including (a) the Basic Psychological Need Satisfaction Scale (BPNS; Ilardi, Leone, Kasser, & Ryan, 1993), (b) the Balanced Measurement of Psychological Needs (BMPN; Sheldon & Hilpert, 2012), (c) the Relationship Need Satisfaction Scale (RNSS; La Guardia, Ryan, Couchman, & Deci, 2000), (d) the Basic Psychological Needs Satisfaction and Frustration Scale (BPNSFS; Chen, Vansteenkiste, et al., 2015), (e) the Psychological Need Thwarting Scale (PNTS; Bartholomew, Ntoumanis, Ryan, & Thøgersen-Ntoumani, 2011), (f) the Work-related Basic Need Satisfaction Scale (W-BNS; van den Broeck, Vansteenkiste, Witte, Soenens, & Lens, 2010), and (g) the Psychological Need Satisfaction in Exercise (PNSE; Wilson, Rogers, Rodgers, & Wild, 2006). The BMPN and BPNSFS differ from the other instruments in that they measure both need frustration and need satisfaction. This distinction between need satisfaction and need frustration is consistent with recent theorizing (Vansteenkiste & Ryan, 2013) and empirical research (e.g., Bartholomew, Ntoumanis, Ryan, Bosch, & Thøgersen-Ntoumani, 2011), underlining the distinct role of need frustration in predicting ill-being. That is, a low score on need satisfaction ("dissatisfaction") is conceptually not equivalent to need frustration (e.g., "I do not feel related" vs. "I feel I am rejected"). For example, people might already feel lonely because their need for relatedness with their colleagues is deprived ("dissatisfaction") or because attempts to establish contact are thwarted, resulting in a more intense frustration (i.e., need frustration). Such frustrations of basic needs may engender specific emotions, such as defeat and humiliation in the case of


rejection by others, depending on context (Bartholomew, Ntoumanis, Ryan, & Thøgersen-Ntoumani, 2011). Differential emotional responses to need frustration and low need satisfaction may predict differential associations with adaptive and maladaptive developmental outcomes. That is, in a study among athletes, Bartholomew, Ntoumanis, Ryan, Bosch, et al. (2011) found that need satisfaction was associated with positive outcomes regarding sport participation (i.e., positive affect and vitality), whereas need frustration was associated with maladaptive developmental outcomes such as negative affect, depression, and burnout. Moreover, need satisfaction was associated with athletes' perceptions of autonomy support, while need frustration was related to coach control. Because Chen, Vansteenkiste, et al. (2015) provided evidence for the measurement equivalence of the BPNSFS, this questionnaire is preferred over the BMPN. Although recently developed, the BPNSFS has already been applied in several studies in a range of domains, including the examination of the role of psychological need satisfaction in sleep behavior of adults (Campbell et al., 2015) and the role of environmental and financial safety in need satisfaction (Chen, van Assche, Vansteenkiste, Soenens, & Beyers, 2015). As the BPNSFS looked more promising, this questionnaire was chosen for the current study. That is, in the current study, the psychometric properties of an adapted version of the BPNSFS, the Basic Psychological Need Satisfaction and Frustration Scale – Intellectual Disability (BPNSFS-ID), were examined in people with mild ID (defined as IQ between 50 and 70) and with borderline intellectual functioning (IQ between 70 and 85), hereafter designated as people with mild to borderline ID (MBID). The first hypothesis was that, using confirmatory factor analyses (CFAs), the structure of six correlated but distinct factors of the BPNSFS-ID (i.e., the satisfaction and frustration of the needs for autonomy, relatedness, and competence) would fit the data from people with MBID. This was important not only to test whether the basic psychological needs are adequately operationalized, but also to test whether the theoretical distinction between the needs is applicable to people with ID too. To investigate this, a series of CFAs was conducted based on theory (Vansteenkiste & Ryan, 2013) and the results of Chen, Vansteenkiste, et al. (2015). That is, four models were tested:
– Model 1 (the null model): a six-factor model differentiating between need satisfaction and need frustration within each of the three needs;
– Model 2: the same six-factor model using two higher-order constructs representing psychological need satisfaction and need frustration;
– Model 3: the same six-factor model with three higher-order constructs representing the basic



psychological needs for autonomy, relatedness, and competence; and
– Model 4: a three-factor model consisting of the three needs for autonomy, relatedness, and competence.
It was also hypothesized that the three basic needs of the BPNSFS-ID would be strongly associated with convergent operationalizations of these needs. That is, based on the nomological web of SDT, satisfaction and frustration of the need for autonomy would be associated with the subscale self-determination of the Personal Outcome Scale (POS; Van Loon, Van Hove, Schalock, & Claes, 2008a), the need for relatedness would be associated with the De Jong Gierveld Loneliness Scale (de Jong-Gierveld & Kamphuls, 1985), and the need for competence would be associated with the General Self-Efficacy Scale-12 (GSES-12; Sherer et al., 1982). In addition, the internal consistency and test-retest reliability of the BPNSFS-ID were tested. The internal consistency, measured with Cronbach's α, was used to gauge how well a priori defined items of the questionnaire measured the same construct, whereas the test-retest reliability indicated the stability of the measure in the absence of systematic attempts to induce change, which is a critical characteristic if the measure is to be used in effectiveness research in the future.

Materials and Methods

Participants and Procedures

After ethical approval by the Ethics Committee of Tilburg University, participants were selected at random from four healthcare organizations for people with ID in the southern part of the Netherlands. All four organizations support individuals with ID living in residential homes and 24-hr community residences, receiving ambulant support, or attending day care centers. Inclusion criteria for participation were: aged above 18 years, mild to borderline ID (IQ score between 50 and 85), and at least weekly contact for a minimum of three months with a professional caregiver. A total of 368 individuals were invited to participate in the study; 165 declined, resulting in 203 participants. After participation, 17 participants were excluded because they did not meet the inclusion criteria, leaving a total of 186. The mean age was 40.3 years (range 18.1–84.8); 110 were male. The mean IQ on file was 67; 109 participants had a mild ID (range 50–70) and 77 had a borderline level of intellectual functioning (range 71–85). During each measurement, all items of each questionnaire were read aloud to the participants, while they could also read along with all items. The participants verbally


indicated their response by giving the answer (mostly from 1 to 5), which was then recorded and logged by the researchers. The vast majority of the participants understood all items; for those who needed help, a standardized explanation was given. If a participant did not understand an item after this standardized clarification, the item was left blank and became a missing value.

Measures

Need Satisfaction and Frustration

The Basic Psychological Need Satisfaction and Frustration Scale (BPNSFS), originally developed by Chen, Vansteenkiste, et al. (2015), was adapted here as the BPNSFS-ID to improve comprehension by people with MBID. The BPNSFS-ID assesses both satisfaction and frustration of the three basic psychological needs defined in SDT: autonomy, relatedness, and competence. The BPNSFS-ID has 24 items (eight for each basic need; four for satisfaction and four for frustration). Examples are: "In my life, I can do whatever I want when I want" (satisfaction of the need for autonomy), "In my life, I feel excluded by the people who I would like to belong to" (frustration of the need for relatedness), and "In my life, I think that I can do things well" (satisfaction of the need for competence). All items were rated on a 5-point Likert scale (1 = completely untrue and 5 = completely true). Chen, Vansteenkiste, et al. (2015) employed a CFA to validate the factor structure of the original BPNSFS and found that a six-factor model differentiating between need satisfaction and need frustration within the three needs yielded the best fit (SBS-χ²(231) = 372.71, CFI = .97, RMSEA = .03, SRMR = .04). The internal consistency ranged from .64 to .89 for the six factors across university students in four countries (Belgium, China, USA, and Peru). To adapt the questionnaire to people with MBID, two researchers familiar with both SDT and people with MBID reworded each of the 24 BPNSFS items independently, ensuring that the items were comprehensible for people with MBID while safeguarding the meaning according to SDT. The two researchers and an experienced professional working with people with MBID developed a consensus version based on these two adaptations. This consensus version was discussed with all authors of the present study, resulting in small adaptations. For example, the original item "I feel that people who are important to me are cold and distant towards me" was replaced by "Important people in my life keep me at a distance." In addition, the original item "I feel competent to achieve my goals" was modified into "In my life, I have the feeling that I can reach my goals." Finally, five persons with MBID were invited to complete this adapted BPNSFS-ID. They found


the BPNSFS-ID easy to comprehend and a few minor adaptations to the phrasing and grammar were made to improve clarity, based on their recommendations.

Self-Determination

The subscale self-determination of the Personal Outcomes Scale (POS; Van Loon et al., 2008a) was used to assess whether participants felt free to make their own choices and decisions. This subscale consists of six items, rated on a 3-point Likert scale (1 = always, 2 = sometimes, and 3 = seldom or never). The subscale has good internal consistency (Cronbach's α = .75), and its convergent validity with a similar domain of another instrument (GENCAT; Verdugo, Arias, Gomez, & Schalock, 2008) was supported by a correlation of .79 (Van Loon, Van Hove, Schalock, & Claes, 2008b). In the current study, the internal consistency was .66 (Cronbach's α).

Loneliness

The De Jong Gierveld Loneliness Scale (de Jong-Gierveld & Kamphuls, 1985) was used to measure loneliness. The scale consists of five positively formulated items (e.g., "There are many people I can trust completely") and six negatively formulated items (e.g., "I miss having people around me"), which were rated on a 5-point Likert scale (1 = completely untrue and 5 = completely true). This scale has been applied in several studies in a range of populations, including a study in people with psychiatric and intellectual disabilities (Broer, Nieboer, Strating, Michon, & Bal, 2011), and showed sufficient reliability and validity (de Jong-Gierveld & van Tilburg, 1999). To ensure comprehension by people with MBID, five persons with MBID were invited to complete the De Jong Gierveld Loneliness Scale. Based on their recommendations on the phrasing and grammar to improve item clarity, six items were slightly rephrased for the current study. In the current study, the internal consistency was .89 (Cronbach's α).

General Self-Efficacy

The General Self-Efficacy Scale-12 (GSES-12), originally developed by Sherer and colleagues (1982) and enhanced to 12 items by Woodruff and Cashman (1993), was used to measure self-efficacy. To ensure comprehension by people with MBID, five persons with MBID were invited to complete the GSES-12. Based on their recommendations on the phrasing and grammar to improve item clarity, three items were slightly rephrased for the current study. All items were rated on a 5-point Likert scale (1 = completely untrue and 5 = completely true). The original scale has been used previously with people who have ID (Forte, Jahoda, & Dagnan, 2011), revealing good internal consistency (Cronbach's α = .69); in the current study, the internal consistency was .84 (Cronbach's α).

Data Analysis


The analysis, performed using IBM SPSS for Windows (version 22) and AMOS (version 22), comprised three stages: (1) confirmatory factor analyses, (2) convergent and discriminant validity, and (3) reliability. Firstly, to investigate the factorial validity, a series of CFAs was conducted based on theory (Vansteenkiste & Ryan, 2013) and the results of Chen, Vansteenkiste, et al. (2015). That is, four models were tested in CFA using AMOS:
– Model 1 (the null model): a six-factor model differentiating between need satisfaction and need frustration within each of the three needs;
– Model 2: a six-factor model using higher-order constructs, in which the three need satisfaction factors and the three need frustration factors are the six first-order factors and two higher-order constructs represent psychological need satisfaction and need frustration;
– Model 3: a six-factor model with the same six first-order factors as Models 1 and 2, in which three higher-order constructs represent the psychological needs for autonomy, relatedness, and competence; and
– Model 4: a three-factor model consisting of the three needs for autonomy, relatedness, and competence.
Because AMOS requires all variables of interest to have complete data, the Expectation Maximization (EM) estimation in SPSS was used to impute the missing values (0.72% of all values were missing). This could be done because data were found to be missing completely at random (MCAR), as indicated by Little's MCAR test [χ²(141, N = 186) = 136.40, p = .59]. The four models were evaluated using the normed chi-square, the root mean square error of approximation (RMSEA), the Bentler Comparative Fit Index (CFI), and the standardized root mean square residual (SRMR; Kline, 2005; Schweizer, 2010). A normed χ² < 2 is considered a good model fit and a value < 3 an acceptable model fit (Bollen, 1989). Consistent with Browne and Cudeck (1993), RMSEA values < .05 are considered good, whereas values between .05 and .08 are considered acceptable. CFI signifies a good model fit for values > .95, whereas values between .90 and .95 indicate an acceptable fit (Hu & Bentler, 1999). Finally, SRMR values < .10 are considered acceptable (Kline, 2005). However, although these traditional fit indices with fixed critical values are useful to evaluate models, they have


important drawbacks as they cannot control for Type I and Type II errors, resulting in the rejection of correct models and the acceptance of incorrect models (Marsh, Hau, & Wen, 2004). Therefore, Saris, Satorra, and Van der Veld (2009) suggested the "detection of misspecification" procedure, using the Modification Index (MI), the Expected Parameter Change (EPC), and the power of the MI test. To interpret the MI test for each of the restricted parameters of the model, the minimum size of the misspecification that one would like to detect by the MI test with a high likelihood (power) was chosen to be .1, and the power was ranked high when it was > .75 (Saris et al., 2009). Because this "detection of misspecification" procedure is relatively new, both approaches (i.e., the traditional fit indices and the detection of misspecifications) are reported in the current study. Next, in addition to the traditional chi-square difference test, which may reject reasonable models (Marsh et al., 2004), the Bayesian Information Criterion (BIC) and the CFI were used for choosing the best model. Models with the lowest BIC are preferred, and a nonsignificant chi-square difference test suggests that the reduced model is the better fitting model. In addition, to evaluate invariance constraints, the CFI indices were compared; Cheung and Rensvold (2002) suggested that decreases in fit < 0.01 support the more restricted model. Secondly, to evaluate the convergent validity, the BPNSFS-ID subscales autonomy, relatedness, and competence were correlated with the self-determination subscale of the POS, the De Jong Gierveld Loneliness Scale, and the GSES-12, respectively. The discriminant validity was measured by correlating the autonomy subscale of the BPNSFS-ID with the convergent operationalizations of the other two needs: the GSES-12 and the De Jong Gierveld Loneliness Scale. In a similar vein, the relatedness subscale of the BPNSFS-ID was correlated with the GSES-12 and the self-determination subscale of the POS, and the competence subscale of the BPNSFS-ID was correlated with the self-determination subscale of the POS and the De Jong Gierveld Loneliness Scale. Regarding the discriminant validity, dependent correlations derived from the cross-construct and the within-construct comparisons were compared using Steiger's Z-test (Steiger, 1980). Correlations < .29 were considered weak, between .30 and .49 moderate, and > .49 strong (Cohen, 1988). Finally, the reliability of the BPNSFS-ID was determined by computing Cronbach's α. Also, the 2-week test-retest reliability was determined by reinterviewing 20% of the participants (N = 40). According to Nunnally, Bernstein, and Berge (1967), a value > .60 is sufficient for early-stage research, but values > .80 should be pursued. The test-retest reliability was gauged by computing Pearson correlations between the first and second measurement.
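For readers who want to re-express this model comparison outside SPSS/AMOS, the sketch below shows how the competing factor structures could be specified and compared in R with lavaan. It is only an illustration under stated assumptions: the data frame bpnsfs and the item names (as1–as4 for autonomy satisfaction, af1–af4 for autonomy frustration, and so on) are hypothetical placeholders, and the EM-imputed data set is assumed to be complete.

```r
library(lavaan)

# Model 1: six correlated first-order factors (four items per factor)
model1 <- '
  aut_sat =~ as1 + as2 + as3 + as4
  aut_fru =~ af1 + af2 + af3 + af4
  rel_sat =~ rs1 + rs2 + rs3 + rs4
  rel_fru =~ rf1 + rf2 + rf3 + rf4
  com_sat =~ cs1 + cs2 + cs3 + cs4
  com_fru =~ cf1 + cf2 + cf3 + cf4
'

# Model 2: the same first-order factors plus higher-order need satisfaction
# and need frustration constructs
model2 <- paste(model1, '
  satisfaction =~ aut_sat + rel_sat + com_sat
  frustration  =~ aut_fru + rel_fru + com_fru
')

fit1 <- cfa(model1, data = bpnsfs)
fit2 <- cfa(model2, data = bpnsfs)

# Chi-square, CFI, RMSEA, SRMR, and BIC, as reported in Table 1
fitMeasures(fit1, c("chisq", "df", "cfi", "rmsea", "srmr", "bic"))
fitMeasures(fit2, c("chisq", "df", "cfi", "rmsea", "srmr", "bic"))

# Chi-square difference test between the nested models
anova(fit2, fit1)
```

Because the original analyses were run in AMOS with a different estimator setup, the numbers obtained from such a sketch would not be expected to match Table 1 exactly; the point is only to make the comparison procedure concrete.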


Results

Confirmatory Factor Analyses

The global fit measures of the four models are presented in Table 1. Based on these fit measures, all four models yield an acceptable to good fit. Although Models 1 and 3 yield a significantly better fit than the other two models, Model 2 is theoretically important given the importance of the distinction between need satisfaction and need frustration. As Model 2 has an acceptable fit, this model appears to be the best fitting model based on theory and the traditional fit indices. The "detection of misspecification" output, as measured with the Modification Index (MI), the Expected Parameter Change (EPC), and the power of the MI test, indicated that there were no serious misspecifications for Model 2 (see Electronic Supplementary Material, ESM 1); therefore, the model is acceptable. For Model 2 (six factors with higher-order constructs representing psychological need satisfaction and need frustration, see Figure 1), all factor loadings were significant at a p < .001 level. The standardized factor loadings varied as follows: between .45–.87 for the latent variable autonomy satisfaction and .72–.80 for autonomy frustration, between .84–.88 for relatedness satisfaction and .59–.77 for relatedness frustration, and between .60–.77 for competence satisfaction and .61–.79 for competence frustration.
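The quantities discussed here, standardized loadings on the one hand and the MI/EPC values used in the misspecification check on the other, can be extracted from a fitted lavaan model as shown below. This is a hedged illustration continuing the earlier sketch (object fit2); the published output comes from AMOS.

```r
# Standardized factor loadings and residuals for the retained model
standardizedSolution(fit2)

# Modification indices and expected parameter changes for the restricted parameters
modindices(fit2)[, c("lhs", "op", "rhs", "mi", "epc")]
```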

Convergent and Discriminant Validity

The autonomy satisfaction and frustration subscales showed strong convergence with the self-determination scale, r = .65, p < .001 and r = .60, p < .001, respectively. The convergent validity of the competence satisfaction and frustration subscales was assessed by correlating these subscales with the self-efficacy scale; the correlations were r = .66, p < .001 and r = .62, p < .001, respectively. The convergent validity of the relatedness satisfaction and frustration subscales was measured by correlating the subscales with the loneliness scale; the correlations were r = .65, p < .001 and r = .71, p < .001. Discriminant validity of the BPNSFS-ID was measured by assessing the correlations between the six subscales and the convergent operationalizations of the two other basic needs (i.e., two of the following three questionnaires: the self-determination scale, the self-efficacy scale, and the loneliness scale). The correlations for each subscale are reported in Table 2; they ranged between .32 and .55. A Steiger's Z-test was conducted to compare the dependent correlations derived from the cross-construct and the within-construct comparisons. Results indicated that all within-construct


Figure 1. Visual representation of Model 2 with six factors and higher-order factors representing psychological need satisfaction and need frustration (N = 186). The ellipses represent the factors and the higher-order constructs, and the rectangles represent items. Numbers to the left of the rectangles represent residuals (expressed as covariances). Numbers on the single-headed arrows connecting constructs and items indicate a hypothesized direct effect (expressed as standardized regression coefficients). The number on the bidirectional arrow connecting the higher-order constructs implies a relationship between factors (expressed as covariance).


associations were significantly stronger than the cross-construct associations at a p < .001 level, except the comparison between the correlation of the competence satisfaction subscale and the self-efficacy scale (r = .65) and the competence satisfaction subscale and the loneliness scale (r = .55); this resulted in ZH = 2.13, p = .033.
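A comparison of two dependent correlations of this kind can be run in R with psych::r.test, which implements the Williams-type test recommended by Steiger (1980) for correlations sharing one variable. Note that this is an illustration only: the statistic returned is a t rather than the ZH value reported here, and the correlations plugged in below are simply the rounded values given in the text and Table 2.

```r
library(psych)
# Competence satisfaction correlates r = .65 with self-efficacy and r = .55 with
# loneliness; self-efficacy and loneliness correlate r = .62 (Table 2, n = 186).
r.test(n = 186, r12 = .65, r13 = .55, r23 = .62)
```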

Reliability

The internal consistency of the BPNSFS-ID was high (Cronbach's α = .92). The internal consistency for each scale is reported in Table 3; alphas ranged between .78 and .92. The 2-week test-retest reliabilities (M = 14.6 days, SD = 2.0, range = 11.0–21.0) of the BPNSFS-ID factors ranged between .68 and .85 (see Table 3).
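Both reliability indices are straightforward to compute; the sketch below shows the idea for a single subscale with hypothetical item and data-frame names (time1 and time2 standing for the two measurement occasions), not the authors' actual scoring script.

```r
library(psych)
# Cronbach's alpha for one subscale (here: autonomy satisfaction)
alpha(time1[, c("as1", "as2", "as3", "as4")])

# 2-week test-retest reliability as the Pearson correlation between subscale
# scores at the two measurement occasions
score_t1 <- rowMeans(time1[, c("as1", "as2", "as3", "as4")], na.rm = TRUE)
score_t2 <- rowMeans(time2[, c("as1", "as2", "as3", "as4")], na.rm = TRUE)
cor(score_t1, score_t2, use = "pairwise.complete.obs")
```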

Discussion

This study provides evidence for the reliability and validity of the Basic Psychological Need Satisfaction and Frustration Scale – Intellectual Disability (BPNSFS-ID). Similar to the results of the original BPNSFS (Chen, Vansteenkiste, et al., 2015), the BPNSFS-ID shows good to excellent internal consistency and test-retest reliability, for both the total scale and the separate subscales. Confirmatory factor analyses confirmed a six-factor structure of the BPNSFS-ID, comprising the satisfaction and frustration of the needs for relatedness, autonomy, and competence. In addition, similar to the original BPNSFS (Chen, Vansteenkiste, et al., 2015), a supplementary higher-order analysis supported the distinction between need satisfaction and need frustration. That is, based on



Table 1. Comparison of the four tested models (N = 186)

Model                                                          χ²        df    χ²/df   RMSEA (90% CI)      CFI   SRMR   BIC      χ²Δ (df)#
1. Six factors                                                 319.30*   237   1.34    .043 (.030; .055)   .96   .055   648.53
2. Six factors with need satisfaction and need frustration
   as higher-order constructs                                  481.29*   245   1.96    .072 (.063; .082)   .90   .099   768.70   161.99 (8)*
3. Six factors with autonomy, relatedness, and competence
   as higher-order constructs                                  330.42*   243   1.36    .044 (.031; .056)   .96   .059   628.28   11.12 (6)
4. Three factors                                               457.45*   249   1.84    .067 (.058; .077)   .91   .076   723.97   127.03 (12)*

Notes. df = degrees of freedom; RMSEA = Root Mean Square Error of Approximation; CFI = Comparative Fit Index; SRMR = Standardized Root Mean Square Residual; BIC = Bayes Information Criterion. #χ²Δ (df) = chi-square difference test comparing the fit of Models 2, 3, and 4 with Model 1; df is the difference in degrees of freedom between the two compared models. *p < .05.

Table 2. Correlationsª among study variables (N = 186)

Measure                       1      2      3      4      5      6      7      8      9
Need satisfaction
1. Autonomy                   1
2. Relatedness                .25**  1
3. Competence                 .40**  .38**  1
Need frustration
4. Autonomy                   .64**  .17*   .35**  1
5. Relatedness                .31**  .76**  .47**  .33**  1
6. Competence                 .46**  .33**  .65**  .44**  .52**  1
7. Self-determination scale   .65**  .32**  .37**  .60**  .41**  .50**  1
8. Loneliness scale           .35**  .65**  .55**  .38**  .71**  .52**  .49**  1
9. Self-efficacy scale        .35**  .33**  .66**  .39**  .45**  .62**  .40**  .62**  1

Notes. ªAs the needs for autonomy, relatedness, and competence are separate but related factors, additional partial correlation analyses were used to control for the covariance with the other two needs. Similar to the Pearson correlations, all partial convergent correlations were strong (between .49 and .57) and significant at a p < .001 level, except the correlation between competence frustration and the self-efficacy scale; this partial correlation was moderate (r = .45, p < .001). *p < .05. **p < .01.
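The partial correlations mentioned in the table note can be reproduced by residualizing both variables on the scores of the other two needs before correlating them. The sketch below uses hypothetical column names; which composite scores the authors partialled out exactly is not detailed in the text.

```r
# Partial correlation of competence frustration with the self-efficacy scale,
# controlling for the autonomy and relatedness frustration scores (illustrative)
res_x <- resid(lm(com_frustration ~ aut_frustration + rel_frustration, data = dat))
res_y <- resid(lm(self_efficacy   ~ aut_frustration + rel_frustration, data = dat))
cor(res_x, res_y)   # compare with the partial r = .45 reported in the note
```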

the current data, need satisfaction and need frustration appear to be two dimensions. This finding is consistent with recent studies (Bartholomew, Ntoumanis, Ryan, Bosch, et al., 2011) and theory (Vansteenkiste & Ryan, 2013), suggesting that need satisfaction and need frustration are best viewed as independent concepts with separate antecedents that predict distinct outcomes. For example, Chen and colleagues (2015) found that need satisfaction was related positively to life satisfaction but unrelated to depressive symptoms. In contrast, need frustration was related positively to depressive symptoms and negatively to life satisfaction. Future research is needed to address these associations among people with MBID. In addition to the factorial validity, the study showed strong correlations between the three basic needs of the BPNSFS-ID (i.e., the needs for autonomy, relatedness, and competence) and the convergent operationalizations of these needs (i.e., self-determination, loneliness, and self-efficacy, respectively). In addition, the discriminant validity of the BPNSFS-ID appeared to be adequate. An exception applies to the divergent correlations between the competence satisfaction and frustration subscales of the BPNSFS-ID and the De Jong Gierveld Loneliness Scale and between the competence

frustration subscale of the BPNSFS-ID and the POS. That is, these correlations were, contrary to expectation, found to be strong. However, all within-construct associations were significantly higher than the cross-construct associations. The present results should be interpreted in light of the limitations of the study. Firstly, of the 368 individuals who were invited to participate in the study, 165 declined. The potential nonresponse bias could not be calculated by comparing participants with nonparticipants because no demographics were available for the nonparticipants. The nonparticipants (45%) mainly said that they declined to participate due to the time investment of 1.5 hr or because professional caregivers argued it would be too stressful for them. In addition, only a small number of participants took part in the test-retest assessment, and the results need to be replicated with larger sample sizes. Lastly, as no measures of adaptive and maladaptive psychosocial functioning were included in the current study, it was not possible to actually test the notion that need satisfaction and need frustration have differential outcomes among people with MBID. Overall, the results of the present study provide support for the psychometric properties of the BPNSFS-ID in a



Table 3. Internal consistencies and test-retest correlations of the composite need scores, need satisfaction, and need frustration (N = 186)

                Internal consistencies*                              Test-retest reliabilities**
Factor          Composite scores   Satisfaction   Frustration       Composite scores   Satisfaction   Frustration
Autonomy        .87                .78            .85               .81                .72            .79
Relatedness     .91                .92            .79               .69                .76            .83
Competence      .86                .79            .81               .85                .68            .71

Notes. *Internal consistencies are measured as Cronbach's α; **Test-retest reliabilities are measured as Pearson correlations.

group of people with MBID in the Netherlands. This is an important first step in testing the universality of the theoretical premises across populations of people with and without ID, because a reliable and valid measure of the fulfillment of autonomy, relatedness, and competence is urgently needed. Future research might focus on the evaluation of the predictive validity to further confirm the validity of the BPNSFS-ID. That is, the link between need satisfaction and need frustration and subjective well-being and ill-being among people with MBID should be examined in a longitudinal design. This is not only theoretically interesting but also practically useful, as it may provide valuable insights to enhance subjective well-being and thus the quality of life of people with MBID.

Acknowledgments

We would like to thank clients of Dichterbij, Lunet Zorg, S&L Zorg, and Zuidwester who participated in this study. We are also grateful to Luciënne Heerkens, Jan Willem Schuurman, Teresa Furtado Plácido, and Gert Stigter for their assistance with participant selection. In addition, we thank Lex Hendriks, Wobbe Zijlstra, and Daniel Oberski for their statistical support. The research was funded by Dichterbij. Dichterbij has not imposed any restrictions on free access to or publication of the research data. This manuscript has not been previously published and is not under consideration in the same or substantially similar form in any other (peer-reviewed) media. All authors listed have contributed sufficiently to the project to be included as authors, and all those who are qualified to be authors are listed in the author byline.

Electronic Supplementary Materials

The electronic supplementary material is available with the online version of the article at http://dx.doi.org/10.1027/1015-5759/a000366
ESM 1. Table 1 (Excel). The test on misspecifications in the six-factor model with higher-order constructs representing need satisfaction and need frustration, based on Saris et al. (2009).


References

Bartholomew, K. J., Ntoumanis, N., Ryan, R. M., Bosch, J. A., & Thøgersen-Ntoumani, C. (2011). Self-determination theory and diminished functioning: The role of interpersonal control and psychological need thwarting. Personality and Social Psychology Bulletin, 37, 1459–1473.
Bartholomew, K., Ntoumanis, N., Ryan, R. M., & Thøgersen-Ntoumani, C. (2011). Psychological need thwarting in the sport context: Assessing the darker side of athletic experience. Journal of Sport and Exercise Psychology, 33, 75–102.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: Wiley.
Broer, T., Nieboer, A., Strating, M., Michon, H., & Bal, R. (2011). Constructing the social: An evaluation study of the outcomes and processes of a "social participation" improvement project. Journal of Psychiatric and Mental Health Nursing, 18, 323–332.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.
Campbell, R., Vansteenkiste, M., Delesie, L. M., Mariman, A. N., Soenens, B., Tobback, E., . . . Vogelaers, D. P. (2015). Examining the role of psychological need satisfaction in sleep: A Self-Determination Theory perspective. Personality and Individual Differences, 77, 199–204.
Chen, B., van Assche, J., Vansteenkiste, M., Soenens, B., & Beyers, W. (2015). Does psychological need satisfaction matter when environmental or financial safety are at risk? Journal of Happiness Studies, 16, 745–766.
Chen, B., Vansteenkiste, M., Beyers, W., Boone, L., Deci, E. L., Duriez, B., . . . Verstuyf, J. (2015). Basic psychological need satisfaction, need frustration, and need strength across four cultures. Motivation and Emotion, 39, 216–236.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
de Jong-Gierveld, J., & Kamphuls, F. (1985). The development of a Rasch-type loneliness scale. Applied Psychological Measurement, 9, 289–299.
de Jong-Gierveld, J., & van Tilburg, T. (1999). Manual of the Loneliness Scale. Amsterdam, The Netherlands: VU University.
Deci, E. L. (2004). Promoting intrinsic motivation and self-determination in people with mental retardation. International Review of Research in Mental Retardation, 28, 1–29.
Deci, E. L., Hodges, R., Pierson, L., & Tomassone, J. (1992). Autonomy and competence as motivational factors in students with learning disabilities and emotional handicaps. Journal of Learning Disabilities, 25, 457–471.
Deci, E. L., & Ryan, R. M. (2000). The "what" and "why" of goal pursuits: Human needs and the self-determination of behavior. Psychological Inquiry, 11, 227–268.


Diener, E. (2000). Subjective well-being: The science of happiness and a proposal for a national index. The American Psychologist, 55, 34–43.
Forte, M., Jahoda, A., & Dagnan, D. (2011). An anxious time? Exploring the nature of worries experienced by young people with a mild to moderate intellectual disability as they make the transition to adulthood. The British Journal of Clinical Psychology, 50, 398–411.
Grolnick, W. S., & Ryan, R. M. (1990). Self-perceptions, motivation, and adjustment in children with learning disabilities: A multiple group comparison study. Journal of Learning Disabilities, 23, 177–184.
Howell, R. T., Chenot, D., Hill, G., & Howell, C. J. (2011). Momentary happiness: The role of psychological need satisfaction. Journal of Happiness Studies, 12, 1–15.
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6, 1–55.
Ilardi, B. C., Leone, D., Kasser, T., & Ryan, R. M. (1993). Employee and supervisor ratings of motivation: Main effects and discrepancies associated with job satisfaction and adjustment in a factory setting. Journal of Applied Social Psychology, 23, 1789–1805.
Kline, R. B. (2005). Principles and practice of structural equation modeling. New York, NY: Guilford Press.
La Guardia, J. G., Ryan, R. M., Couchman, C. E., & Deci, E. L. (2000). Within-person variation in security of attachment: A self-determination theory perspective on attachment, need fulfillment, and well-being. Journal of Personality and Social Psychology, 79, 367–384.
Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11, 320–341.
Nunnally, J. C., Bernstein, I. H., & Berge, J. M. T. (1967). Psychometric theory. New York, NY: McGraw-Hill.
Ryan, R. M., & Deci, E. L. (2000). Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. The American Psychologist, 55, 68–78.
Saris, W. E., Satorra, A., & van der Veld, W. M. (2009). Testing structural equation models or detection of misspecifications? Structural Equation Modeling, 16, 561–582.
Schalock, R. L., Brown, I., Brown, R., Cummins, R. A., Felce, D., Matikka, L., . . . Parmenter, T. (2002). Conceptualization, measurement, and application of quality of life for persons with intellectual disabilities: Report of an international panel of experts. Mental Retardation, 40, 457–470.
Schweizer, K. (2010). Some guidelines concerning the modeling of traits and abilities in test construction. European Journal of Psychological Assessment, 26, 1–2.


Sheldon, K. M., & Hilpert, J. C. (2012). The balanced measure of psychological needs (BMPN) scale: An alternative domain general measure of need satisfaction. Motivation and Emotion, 36, 439–451.
Sherer, M., James, E. M., Mercandante, B., Prentice-Dunn, S., Jacobs, B., & Rogers, W. R. (1982). The self-efficacy scale: Construction and validation. Psychological Reports, 51, 663–671.
Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87, 245–251.
Tay, L., & Diener, E. (2011). Needs and subjective well-being around the world. Journal of Personality and Social Psychology, 101, 354–365.
van den Broeck, A., Vansteenkiste, M., Witte, H., Soenens, B., & Lens, W. (2010). Capturing autonomy, competence, and relatedness at work: Construction and initial validation of the Work-related Basic Need Satisfaction scale. Journal of Occupational and Organizational Psychology, 83, 981–1002.
Van Loon, J., Van Hove, G., Schalock, R., & Claes, C. (2008a). Personal Outcomes Scale. Middelburg, The Netherlands: Arduin.
Van Loon, J., Van Hove, G., Schalock, R., & Claes, C. (2008b). Personal Outcomes Scale: Administration and standardization manual. Antwerp, Belgium: Garant.
Vansteenkiste, M., & Ryan, R. M. (2013). On psychological growth and vulnerability: Basic psychological need satisfaction and need frustration as a unifying principle. Journal of Psychotherapy Integration, 23, 263–280.
Verdugo, M. A., Arias, B., Gomez, L. E., & Schalock, R. L. (2008). The GENCAT Scale of quality of life: Standardization manual. Barcelona, Spain: Generalitat of Catalonia.
Wilson, P. M., Rogers, W. T., Rodgers, W. M., & Wild, T. C. (2006). The psychological need satisfaction in exercise scale. Journal of Sport and Exercise Psychology, 28, 231–251.
Woodruff, S. L., & Cashman, J. F. (1993). Task, domain, and general efficacy: A reexamination of the self-efficacy scale. Psychological Reports, 72, 423–432.

Received April 1, 2015
Revision received January 13, 2016
Accepted January 22, 2016
Published online October 7, 2016

Noud Frielink
Department Tranzo
Tilburg School of Social and Behavioral Sciences
Tilburg University
P.O. Box 90153
5000 LE Tilburg
The Netherlands
E-mail n.frielink@tilburguniversity.edu



Original Article

Psychometric Properties of the Online Arabic Versions of BDI-II, HSCL-25, and PDS

Pirko Selmo,1 Tobias Koch,2 Janine Brand,1 Birgit Wagner,3 and Christine Knaevelsrud3

1 Research Department, Treatment Center for Torture Victims, Berlin, Germany
2 Department of Methods and Evaluation, Free University of Berlin, Germany
3 Department of Clinical Psychology and Psychotherapy, Medical School of Berlin, Germany

Abstract: The Beck Depression Inventory (BDI-II), Hopkins Symptom Checklist (HSCL-25), and Posttraumatic Diagnostic Scale (PDS) are three widely applied clinical instruments for assessing depression, anxiety, and posttraumatic stress symptoms, respectively. Use of online-based psychological help and assessment is rapidly growing, which necessitates the validation of online assessment. To address these needs, data from 1,544 Arabic mother tongue treatment-seeking participants, who filled in the Arabic versions of these instruments online, were analyzed in two steps. In the first step, exploratory structural equation modeling (ESEM) was used to scrutinize factorial validity and eliminate items. In the second step, we examined the interrelationships between the latent factors (dimensions) using confirmatory factor analysis (CFA) of multitrait-multimethod (MTMM) data. Results show an acceptable to good fit of the hypothesized model, providing some first insights into the factorial and construct validity of the Arabic versions of the BDI-II, HSCL-25, and PDS under consideration of culture-specific aspects. The present evidence speaks for the construct validity of the three instruments and the reliability and usefulness of online assessment.

Keywords: psychometrics, online assessment, ESEM, CFA, cross-cultural

A large number of Arab countries have in recent years been torn by war-related violence and the latest violent upheavals of the Arab Spring. Exposure to war-related violence, migration, torture, and bereavement is associated with a higher risk of psychological disorders such as depression, anxiety, and posttraumatic stress disorder (PTSD). Provision of mental health care in the Arab world, as well as in countries of refuge, is hampered by a lack of the reliable instruments required for accurate diagnosis. For this study we therefore selected the Beck Depression Inventory (BDI-II), the Hopkins Symptom Checklist (HSCL-25), and the Posttraumatic Diagnostic Scale (PDS), as they are three widely applied self-report measures for the assessment of depression, anxiety, and PTSD. Despite ongoing controversies, self-report measures are a useful tool to support the role of structured and semistructured clinical interviews, as they eliminate the cost and burden associated with the latter (Norris & Aroian, 2008) and open up the possibility of online-based assessment (e.g., Wagner, Brand, Schulz, & Knaevelsrud, 2012). The usefulness of online assessment should not be restricted only to online-based treatment. Computer-assisted data collection is more accurate and undoubtedly allows a faster and more reliable evaluation.

As well as providing a higher quality of collected data (Weeks, 1992), a digital questionnaire can also be programmed to be more interactive, for example, paying attention to unanswered items or refusing answers with latencies below the expected minimum required for reading the relevant item. The BDI-II (Beck, Steer, & Brown, 1996) consists of 21 items rated on a 4-point scale that assess the severity of different symptoms of depression. In the original study by Beck et al. (1996), the two dimensions based on an undergraduate sample were labeled cognitive-affective and somatic, whereas for outpatients somatic-affective and cognitive factors were recognized. The latter was replicated in Steer et al. (1999). Cross-cultural validation of the BDI-II in an Arab context has been repeatedly confirmed (Abdel-Khalek, 1998, 2001; Al-Musawi, 2001; Al-Turkait & Ohaeri, 2010; Ghareeb, 2000). The two dimensions in both varieties were replicated for Kuwaiti undergraduates in Al-Turkait and Ohaeri (2010). On the other hand, Al-Musawi (2001), based on a sample of Bahraini students, found the best fit for a three-factor solution labeled cognitive-affective, overt emotional upset, and somatic-vegetative. According to the author, this did not contradict the originally postulated factor solution of Beck et al. (1996).



The HSCL-25 consists of two clusters: anxiety is assessed with the first 10 items and depression with the remaining 15, both rated on a 4-point Likert scale. In the HSCL-90 (Lipman, Covi, & Shapiro, 1979), however, 3 of the 10 anxiety items are part of a construct known as phobic anxiety. The Arabic version of the HSCL-25 has been administered in various studies (Al-Turkait, Ohaeri, El-Abbasi, & Naguy, 2011; Caspi, Saroff, Suleimani, & Klein, 2008; Kobeissi et al., 2011). Al-Turkait et al. (2011) investigated the relationship between anxiety and depression by conducting confirmatory factor analyses (CFA) on the Arabic version of the HSCL-25 and found evidence for a bifactor model including a specific anxiety factor and a specific depression factor that share a common general trait. This finding is in line with the tripartite theory of the relation between anxiety and depression (Clark & Watson, 1991), which is an attempt to account for the high comorbidity between symptoms of anxiety and mood disorders (Andrews, 1996), and the high convergence found across various studies between measures of depression and anxiety (Clark & Watson, 1991). The Posttraumatic Diagnostic Scale (PDS; Foa, Cashman, Jaycox, & Perry, 1997) is one of the most commonly used measures for the assessment of PTSD (Elhai, Gray, Kashdan, & Franklin, 2005). Section III of the PDS, which is relevant for this study, consists of 17 items that correspond to the DSM–IV–TR (APA, 1994) criteria B (intrusions), C (avoidance & emotional numbing), and D (hyperarousal). This structure of PTSD criteria has changed in DSM-5 (APA, 2013), one of the major outcomes being the split of criterion C into avoidance and negative affect. Norris and Aroian (2008) have studied the psychometric properties of the PDS-Arabic and confirmed its validity and clinical utility, albeit restricted to a sample of immigrant women in the USA. In this study, we investigated the psychometric properties of three self-rating questionnaires in Arabic, measuring depression, anxiety, and PTSD. For the purposes of cross-cultural validation, we used exploratory structural equation modeling (ESEM) to examine the factor structure of the three instruments. This formed the basis for the CFA-MTMM model, used to scrutinize convergent and discriminant validity. Subscales of the three instruments were treated as either similar traits that belong to different instruments (monotrait-heteromethod) or heterogeneous traits that belong to either the same instrument (heterotrait-monomethod) or different instruments (heterotrait-heteromethod). Accordingly, a higher correlation is expected between the subscales of the BDI and depression, as measured by the HSCL (monotrait-heteromethod), than between anxiety and depression. However, we expect to find relatively high correlations between depression and anxiety, because previous research has shown that both


symptoms overlap to a considerable extent (e.g., Clark, Steer, & Beck, 1994; Löwe et al., 2008; Mineka, Watson, & Clark, 1998). In contrast, PTSD comprises different symptoms/criteria of heterogeneous mechanisms (Suvak & Barrett, 2011), some of which represent nonspecific psychiatric distress (Spitzer, First, & Wakefield, 2007). While symptoms of intrusion and avoidance seem to be more specific to PTSD, symptoms of emotional numbing and hyperarousal overlap with those of mood and anxiety disorders (e.g., difficulty concentrating, loss of interest; see Grubaugh, Long, Elhai, Frueh, & Magruder, 2010; Spitzer et al., 2007). Due to this obvious overlap and for purposes of validation, we chose to include emotional numbing and depression in one monotrait block and hyperarousal with anxiety in another. In these monotrait blocks, relatively higher correlations are expected than in the heterotrait blocks, for example, hyperarousal and emotional numbing. This has important implications for studying construct validity and helps us to gain crucial insights toward a better understanding of these concepts and their relations.

Methods

Sample and Procedure

The BDI-II-Arabic developed by Ghareeb (2000) was used in this study. No previous translations were available at the time when the PDS and HSCL-25 were adopted by the treatment center for torture victims, a Berlin-based organization that offers survivors of torture and war-related violence medical and psychological help. Hence, the scales were translated into standard Arabic and then back-translated by a team of professionals and have been in use ever since. Descriptive statistics of the sample are provided in Table 1. In total, our sample consisted of N = 1,544 (65.6% female) Arabic mother tongue treatment-seeking individuals who completed the instruments provided online at a virtual treatment center for PTSD and depression, one of the projects offered by the treatment center for torture victims. After registration, each participant underwent a diagnostic assessment in order to establish eligibility. Possibly because recruitment took place over the Internet for a writing therapy, the sample is quite young (M = 25.79, SD = 6.69) and educated (70% have a university degree). Although the majority of participants come from Egypt or Saudi Arabia, when subcultures (Egypt & Sudan, the Gulf, North Africa, the Levant & Iraq) are considered instead of states, these are all well represented. Average total scores of the measurements indicate a high prevalence of psychopathology among the sample;



Table 1. Sample demographics of n = 1,544 treatment-seeking participants

Characteristic                                      n (%)
Gender
  Women                                             1,013 (65.6)
Family status
  Single                                            1,055 (68.3)
  Married                                           318 (20.6)
Country
  Egypt                                             392 (25.4)
  Saudi Arabia                                      370 (24.0)
  Morocco                                           119 (7.7)
  Algeria                                           117 (7.6)
  Iraq                                              77 (5.0)
  Jordan                                            75 (4.9)
  Syria                                             49 (3.2)
Education
  University                                        962 (62.3)
  High school degree                                302 (19.6)
  2 years training                                  115 (7.4)
  Postgraduate                                      111 (7.2)
Primary trauma^a
  Sexual assault/abuse                              329 (24.1)
  Sudden death of family member/friend              96 (7.0)
  Violent assault                                   85 (6.2)
  War & conflict zone/torture/imprisonment          54 (4.0)

                        M (SD)             Cut off (%)
Age                     25.79 (6.69)
Traumata^b              3.11 (3.17)
BDI-II                  33.67 (12.56)      14 (93.3), 20 (84.1), 29 (62.5)
HSCL                    72.43 (14.97)      > 1.75 (94.4)
PDS-SSS^a               30.03 (10.73)      11 (93.6), 21 (78.9), 36 (30.8)

Notes. ^a n = 1,368; ^b unspecified stressful event excluded.

Average total scores on the measures indicate a high prevalence of psychopathology in the sample: the mean BDI-II score was 33.67 (SD = 12.56), which indicates severe depression. This value is significantly higher than those of the outpatient samples of Beck et al. (1996) and Steer, Ball, Ranieri, and Beck (1999). The mean total score on the HSCL-25 was 72.43 (SD = 14.97), and over 94% of participants scored above the cutoff of 1.75, which indicates clinically relevant psychopathology (Lavik, Laake, Hauff, & Solberg, 1999). On average, each participant reported experiencing 3.11 traumata (SD = 3.17). The average PDS score of 30.03 (SD = 10.73) indicates moderate to severe symptomatology. A quarter of all reported traumata were related to sexual abuse, substantially more than the number of war-related traumata. This could be due to two factors: (1) victims of sexual abuse in such conservative societies are more likely to seek anonymous therapeutic help over the Internet, and (2) instability and lack of Internet access in countries afflicted by war prevent potential PTSD patients from registering in the program.

Statistical Analysis

The statistical analysis proceeded in two steps. First, ESEM, an integration of CFA and exploratory factor analysis (EFA; Marsh et al., 2009), was used to scrutinize the factorial validity of each questionnaire. ESEM was performed using categorical observed variables as indicators. One advantage of ESEM is that it is less restrictive than ordinary CFA models while still allowing researchers to test the underlying factor structure of the given measures (Asparouhov & Muthén, 2009). Second, we included all relevant subscales of the three instruments in a single CFA model. The subscales were formed on the basis of the results and factor solution of the ESEM. The single CFA model allowed us to examine the interrelations among the subscales free of measurement-error influences. This analysis provides first insights into the construct validity of the subscales and may shed light on the controversies regarding distinctness, comorbidity, and overlap among depression, anxiety, PTSD, and their subscales. The selection of homogeneous items is required for proper parameter estimation (Marsh, Hau, Balla, & Grayson, 1998), which led to the exclusion of some items (marked with a in Tables 3, 4, and 5). Each component was then split into two parcels to reduce the number of freely estimated parameters in the model (i.e., model complexity) and to ease estimation burdens. The item parcels were computed following the recommendations of Little, Cunningham, Shahar, and Widaman (2002); that is, we assigned the items to two parcels according to the factor-loading pattern from a separately conducted EFA. Means, standard deviations, item-total correlations, and intercorrelations of the three instruments can be found in the Electronic Supplementary Material, ESM 1. All analyses were performed using Mplus (Version 7.2; Muthén & Muthén, 2012) and R (Version 3.0.0; R Core Team, 2013).
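To illustrate the parceling step in concrete form, the following R sketch assigns the items of one subscale to two parcels according to their loadings from a separately run EFA, roughly in the spirit of Little et al. (2002). The data frame and item names (dat, item1 ... item8) are hypothetical, and the published analyses were carried out in Mplus, so this is an illustrative approximation rather than the authors' actual code.

```r
# Minimal sketch (R): build two item parcels from EFA loadings.
# Hypothetical data frame `dat` with items item1 ... item8 of one subscale.
library(psych)

items <- paste0("item", 1:8)

# One-factor EFA on polychoric correlations (items are 4-point ratings)
efa <- fa(dat[, items], nfactors = 1, fm = "pa", cor = "poly")

# Rank items by loading and alternate them over two parcels so that
# both parcels contain high- and low-loading items
ranked <- items[order(abs(efa$loadings[, 1]), decreasing = TRUE)]
parcel1_items <- ranked[seq(1, length(ranked), by = 2)]
parcel2_items <- ranked[seq(2, length(ranked), by = 2)]

# Parcel scores as item means
dat$parcel1 <- rowMeans(dat[, parcel1_items], na.rm = TRUE)
dat$parcel2 <- rowMeans(dat[, parcel2_items], na.rm = TRUE)
```

Parcel scores computed in this way can then serve as the two indicators of each trait-method unit in the CFA model described above.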

Results

Factor Analysis

Fits of the three ESEMs and the CFA model are presented in Table 2. The chi-square values are significant, indicating that there was still residual variance to be explained. However, the other fit indices support the plausibility of the hypothesized models.



Table 2. Model fits of the ESEM and CFA models

Model        χ² (N, df)               CFI    RMSEA
ESEM
  BDI-II     1,259.78 (1,544, 169)*   .95    .07
  HSCL-25    1,449.72 (1,544, 228)*   .96    .06
  PDS-3      363.79 (1,374, 88)*      .98    .05
CFA          271.33 (1,544, 76)*      .98    .04

Notes. CFA = Confirmatory factor analysis; RMSEA = Root mean square error of approximation; CFI = Comparative fit index. *p < .0001.

RMSEAs were < .06 and CFIs were ≥ .96, with the exception of the BDI-II (RMSEA = .07 and CFI = .95).

The factor solution of the BDI-II, reported in Table 3, corresponds closely to the hypothesized factor structure and thus replicates the two-dimensional structure of the BDI-II. The highest loadings on the somatic-affective dimension were found for the items Tiredness or Fatigue (.81), Loss of Energy (.78), and Changes in Appetite (.69). This pattern of loadings is identical to that found in Steer et al. (1999). The highest loadings on the cognitive dimension were found for the items Self-Criticalness (.75) and Guilty Feelings (.71). The cognitive items Pessimism, Worthlessness, and Suicidal Thoughts also loaded on BDI-Somatic. A moderate correlation of .60 between the two factors is compatible with correlations reported in previous studies.

Analyses indicated the best fit for three HSCL-25 factors instead of two, as shown in Table 4. In addition to depression and anxiety, a third factor was identified, encompassing the three items that originally comprised the phobic anxiety dimension of the HSCL-90 (Lipman et al., 1979). Most items loaded saliently on their intended factor, with the exception of items 6, 15, and 16, which were bivocal, while item 10 did not achieve saliency. Moderate intercorrelations ranging between .42 and .50 reflected the conceptual similarity between the factors and, at the same time, indicated the ability of the scale to differentiate between them. The highest loadings were for Feelings of Worthlessness (.89), Feeling Lonely (.87), and Feeling Blue (.83) on depression, and for Trembling (.67) and Heart Pounding or Racing (.66) on anxiety. All three loadings on the third, phobic anxiety factor were high (.62, .85, and .92).

The current analysis supports the separation of the symptoms of avoidance and emotional numbing. The factor structure of the German PDS, reported by Griesel, Wessa, and Flor (2006), was partly confirmed in the current study. The first factor, including symptoms of intrusion and active avoidance, was replicated here, as shown in Table 5. Dissociative Amnesia showed similar loadings on all three factors and did not achieve saliency. Of the further seven items expected to load on the emotional numbing/hyperarousal factor (Griesel et al., 2006), three items were bivocal and loaded on hyperarousal as well.


Table 3. Standardized loadings of the two-factor solution of the exploratory structural equation modeling (ESEM) of the Beck Depression Inventory-II (BDI-II), n = 1,544

BDI-II item                          Factor I             Factor II
                                     (somatic-affective)  (cognitive)
1. Sadness                           .59                  .17
4. Loss of pleasure                  .63                  .09
10. Crying                           .37                  .07
11. Agitation                        .45                  .13
12. Loss of interest                 .66                  .01
13. Indecisiveness                   .48                  .28
15. Loss of energy                   .78                  –.01
16. Changes in sleeping pattern      .61                  –.05
17. Irritability                     .56                  .11
18. Changes in appetite              .69                  –.07
19. Concentration difficulty         .65                  .06
20. Tiredness or fatigue             .81                  –.07
21. Loss of interest in sex          .47                  –.04
2. Pessimism (a)                     .43                  .35
3. Past failure                      .23                  .51
5. Guilty feelings                   –.07                 .71
6. Punishment feelings               .09                  .57
7. Self-dislike                      .19                  .56
8. Self-criticalness                 –.00                 .75
9. Suicidal thoughts or wishes (a)   .34                  .28
14. Worthlessness (a)                .33                  .51

Factor correlations
Factor I                             1.00
Factor II                            .60                  1.00

Notes. A priori target loadings are shaded in gray. Values in bold type refer to salient (≥ .30) non-a priori target loadings. (a) Items excluded in the CTC(M–1) model.

These three items (13, 14, and 15) belong to the hyperarousal cluster as defined by DSM–IV–TR (APA, 1994). However, a third factor, containing only two unique hyperarousal items with high factor loadings, was identified. Moderate and low factor correlations ranging from .27 to .51 are in line with the conceptual functionality of these symptom clusters in PTSD.
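Although the reported ESEM was estimated in Mplus with categorical indicators, a rough approximation of this item-level analysis can be obtained in R with an oblique exploratory factor analysis on polychoric correlations. The sketch below assumes a hypothetical data frame hscl holding the 25 HSCL items and is meant only to show how salient (≥ .30) loadings and factor intercorrelations of the kind reported in Tables 3–5 can be inspected.

```r
# Rough approximation (R): oblique EFA on polychoric correlations,
# used to inspect loading patterns and factor intercorrelations.
library(psych)
library(GPArotation)  # needed for the oblimin rotation

# `hscl` is a hypothetical data frame with the 25 HSCL items (1-4 ratings)
efa3 <- fa(hscl, nfactors = 3, fm = "pa", rotate = "oblimin", cor = "poly")

print(efa3$loadings, cutoff = 0.30)  # show salient loadings only
round(efa3$Phi, 2)                   # factor intercorrelations
```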

CFA Model

Latent correlations of the trait-method units are presented in Table 6. Values in bold type refer to monotrait-heteromethod correlations. The correlation coefficients in this block were large, ranging between .52 and .84, which, according to Campbell and Fiske's (1959) criteria, indicates convergent validity. These correlations were mostly larger than the italicized correlations in the heterotrait-heteromethod block, which, with the exception of the correlation of .62 between HSCL-anxiety and BDI-somatic, ranged between .33 and .50.



Table 4. Standardized loadings of the three-factor solution of the exploratory structural equation modeling (ESEM) of the Hopkins Symptom Checklist (HSCL-25), n = 1,544

HSCL-25 item                                        Factor I    Factor II      Factor III
                                                    (anxiety)   (depression)   (phobic anxiety)
1. Suddenly scared for no reason                    .01         –.06           .92
2. Feeling fearful                                  –.01        .07            .85
9. Spells of terror or panic                        .28         .21            .62
3. Faintness, dizziness, or weakness                .51         .19            .11
4. Nervousness or shakiness inside                  .54         –.03           .08
5. Heart pounding or racing                         .66         < .01          .25
6. Trembling                                        .67         .43            .18
7. Feeling tense or keyed up (a)                    .31         .26            .12
8. Headaches                                        .44         .02            –.01
10. Feeling restless, can't sit still (a)           .23         .28            .27
11. Feeling low in energy, slowed down              .25         .52            –.03
12. Blaming yourself for things                     .10         .48            .02
13. Crying easily                                   .10         .38            –.04
14. Loss of sexual interest or pleasure             .13         .35            –.02
15. Poor appetite (a)                               .33         .37            –.12
16. Difficulty falling asleep, staying asleep (a)   .32         .41            –.04
17. Feeling hopeless about the future               –.03        .81            .01
18. Feeling blue                                    < .01       .83            .05
19. Feeling lonely                                  –.07        .87            –.02
20. Feeling trapped or caught                       –.01        .62            .05
21. Worrying too much about things                  –.05        .73            .09
22. Feeling no interest in things                   .11         .59            .27
23. Thoughts of ending your life                    –.00        .75            –.02
24. Feeling everything is an effort                 .06         .78            .03
25. Feelings of worthlessness                       –.12        .89            < .01

Factor correlations
Factor I                                            1.00
Factor II                                           .42         1.00
Factor III                                          .50         .44            1.00

Notes. A priori target loadings are shaded in gray. Values in bold type refer to salient (≥ .30) non-a priori target loadings. (a) Items excluded in the CTC(M–1) model.

According to Campbell and Fiske (1959), discriminant validity is supported, for example, when the heterotrait-monomethod correlations, shown in parentheses, are smaller than the monotrait-heteromethod correlations. In the present study, the heterotrait-monomethod correlations ranged between .40 and .52, with one exception: a high positive correlation of .62 between the HSCL-anxiety and HSCL-depression subscales (see Table 6).
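As an illustration of this step, the following R/lavaan sketch specifies a CFA in which each trait-method unit is a latent factor measured by its two item parcels and then extracts the latent correlation matrix underlying the Campbell and Fiske (1959) comparisons in Table 6. The parcel variable names (bdi_so_p1, ..., pds_hy_p2) are hypothetical, and lavaan stands in for the Mplus model actually used.

```r
# Minimal sketch (R/lavaan): CFA with two parcels per trait-method unit,
# followed by inspection of the latent correlations (cf. Table 6).
library(lavaan)

model <- '
  bdi_so  =~ bdi_so_p1  + bdi_so_p2    # BDI somatic-affective
  bdi_co  =~ bdi_co_p1  + bdi_co_p2    # BDI cognitive
  hscl_de =~ hscl_de_p1 + hscl_de_p2   # HSCL depression
  hscl_an =~ hscl_an_p1 + hscl_an_p2   # HSCL anxiety
  hscl_pa =~ hscl_pa_p1 + hscl_pa_p2   # HSCL phobic anxiety
  pds_in  =~ pds_in_p1  + pds_in_p2    # PDS intrusion
  pds_en  =~ pds_en_p1  + pds_en_p2    # PDS emotional numbing
  pds_hy  =~ pds_hy_p1  + pds_hy_p2    # PDS hyperarousal
'

fit <- cfa(model, data = dat, std.lv = TRUE, estimator = "MLR")

# Latent correlation matrix: monotrait-heteromethod cells (e.g., bdi_so with
# hscl_de) should exceed the heterotrait cells for convergent and discriminant validity
round(lavInspect(fit, "cor.lv"), 2)
```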

Discussion

This is the first study to examine the psychometric properties of online versions of three major instruments for depression, anxiety, and PTSD for use in an Arabic population. The measures were scrutinized at the item level using ESEM and at the scale level using a CFA-MTMM model. Findings from the present study support the usefulness of online assessment and the validity of the BDI-II, HSCL-25, and PDS for Arabic populations.

A two-factor solution for the BDI-II encompassing cognitive and somatic-affective symptoms has been consistently reproduced in clinical as well as student samples (English: Beck et al., 1996; Steer et al., 1999; Whisman, Perez, & Ramel, 2000; Japanese: Kojima et al., 2002; Arabic: Al-Musawi, 2001; Turkish: Kapci, Uslu, Turkcapar, & Karaoglan, 2008). Although in Steer et al. (1999) Crying loaded on the cognitive factor and Sadness was bivocal, these two items clearly belong to the somatic-affective dimension in the present sample, although Beck et al. (1996) predicted a shift between factors for such affective symptoms.


Table 5. Standardized loadings of the three-factor solution of the exploratory structural equation modeling (ESEM) of the Posttraumatic Diagnostic Scale (PDS), n = 1,368

PDS item                                                      Factor I   Factor II   Factor III
                                                              (intrusion (emotional  (hyper-
                                                              & active   numbing &   arousal)
                                                              avoidance) hyperarousal)
1. Recurrent and intrusive recollections of the event         .78        .03         –.01
2. Recurrent distressing dreams of the event                  .60        –.05        .27
3. Acting/feeling as if traumatic event were recurring        .66        .04         .09
4. Psychological distress at exposure                         .82        .12         –.02
5. Physiological reactivity on exposure                       .76        –.01        .16
6. Efforts to avoid thoughts associated with trauma           .59        .14         < .01
7. Efforts to avoid activities that remind of the trauma (a)  .41        .30         .07
8. Dissociative amnesia (a)                                   .13        .14         .10
9. Diminished interest in significant activities              .28        .51         –.07
10. Feeling of detachment or estrangement from others         .01        .83         < .01
11. Restricted range of affect                                .01        .55         .10
12. Sense of a foreshortened future                           –.03       .52         .19
13. Difficulty falling or staying asleep (a)                  .13        .33         .32
14. Irritability or outbursts of anger (a)                    .04        .38         .40
15. Difficulty concentrating (a)                              –.05       .46         .35
16. Hypervigilance                                            .01        .14         .63
17. Exaggerated startle response                              –.07       .01         .80

Factor correlations
Factor I      1.00
Factor II     .51     1.00
Factor III    .27     .38     1.00

Notes. A priori target loadings are shaded in gray. Values in bold type refer to salient (≥ .30) non-a priori target loadings. (a) Items excluded in the CTC(M–1) model.

Table 6. Latent correlation coefficients of the CFA model

              1      2      3      4      5      6      7      8
1. BDI-So
2. BDI-Co     .76
3. HSCL-De    .84    .73
4. HSCL-An    .62    .45    (.62)
5. HSCL-PA    .48    .38    (.49)  .69
6. PDS-In     .48    .33    .46    .46    .40
7. PDS-EN     .70    .57    .73    .47    .36    .66
8. PDS-Hy     .50    .49    .49    .52    .52    (.40)  (.52)

Notes. CFA = Confirmatory factor analysis; So = Somatic-affective; Co = Cognitive; De = Depression; An = Anxiety; PA = Phobic anxiety; In = Intrusion; EN = Emotional numbing; Hy = Hyperarousal. All reported correlations are significant at p < .001. Values in bold type refer to monotrait-heteromethod correlations. Italicized values refer to the heterotrait-heteromethod correlations. Heterotrait-monomethod correlations are shown in parentheses.

Suicidal Thoughts or Wishes was a poorly functioning item among Turkish outpatients (Kapci et al., 2008), which might be due to its formulation including the word "suicide," a negatively charged word in Islamic societies. This might explain the different result for the well-functioning HSCL-25 item Thoughts of Ending Your Life. A further culturally sensitive issue concerns item 21, Loss of Interest in Sex. In the Bahraini study (Al-Musawi, 2001), the author attributed the low loading of this item to an unwillingness to reveal feelings on such a sensitive issue, although the same item functioned well both in Al-Turkait and Ohaeri (2010) and in the current study. An explanation for the very good functioning of this item in the present study may be the highly anonymous nature of online assessment, which reduces inhibition in reporting on sensitive issues (Tourangeau & Smith, 1996). We conclude that both items, suicidal ideation and loss of interest in sex, are good predictors of depression even in a conservative society, although one should remain aware of the potential impact of item formulation and anonymity.



Closer investigation of the loading patterns reveals that, in contrast to the mostly homogeneous somatic items, only the cognitive items are bivocal, showing a general tendency toward reporting somatic symptoms in the current sample. This finding is in line with the presumed commonness of experiencing and expressing depression as a somatic phenomenon in non-Western societies (Kleinman, 1977; Ryder, Yang, & Heine, 2002). The cognitive factor underlies items corresponding to those in the HSCL-25 defined by Al-Turkait et al. (2011) as "core depression symptoms," characterized by high negative and low positive affectivity (Clark & Watson, 1991). Furthermore, Steer et al. (1999) suggested relying on this cognitive dimension for measuring depression in medical and psychiatric populations in order to avoid misattribution of somatic symptoms. Such misattributions can, of course, also occur for somatic symptoms of other psychological disorders, which could partly explain the high rates of comorbidity between depression and other disorders. Based on these findings, future approaches may consider relying on cognitive symptoms to detect depression and on somatic ones to measure its severity. It may be easier for patients to objectify and scale symptoms such as changes in appetite (where frequency of occurrence denotes severity) than to scale pessimism or guilty feelings (which are rather categorical). However, further research is needed to verify this claim.

The factor solution supports the factorial validity of the Arabic version of the HSCL-25. It is worth noting that items 17, 23, and 25 of the HSCL-25, corresponding to the three bivocal BDI-Cognitive items 2, 9, and 14, loaded highly and unidimensionally on depression. Conversely, items 11, 16, and 18 of the BDI-II, corresponding to the three bivocal HSCL-25 items 10, 15, and 16, loaded highly and unidimensionally on BDI-Somatic. These two observations indicate a strong link between anxiety and the somatic-affective symptoms of depression and, as mentioned above, a core depression factor representing cognitive symptoms. Furthermore, the lower correlation of emotional numbing with BDI-Cognitive (.57), compared with its correlations with BDI-Somatic (.70) and HSCL-Depression (.73), leads us to speculate that the overlap between PTSD and depression is essentially a somatic phenomenon.

The present data support the argument that dissociative amnesia might not be a key feature of PTSD (Griesel et al., 2006; Merckelbach, Dekkers, Wessel, & Roefs, 2003). The factor structure of the Arabic PDS corresponds closely to that of the German version reported by Griesel et al. (2006). Both studies are in line with the new conceptualization of PTSD symptom dimensionality in DSM-5 (APA, 2013). Although considered a monotrait unit, the correlation between hyperarousal and anxiety (.52) was lower than that between anxiety and depression (.62). This rather unexpected result may be explained by two considerations.

First, despite being a component of anxiety, hyperarousal might be a distinctive phenomenon caused by a different mechanism (Joiner et al., 1999). Second, with only two items, hyperarousal was probably underrepresented in the model.

High correlations between anxiety, depression, and PTSD have been interpreted as indicators of convergent validity in several studies (e.g., Al-Musawi, 2001; Norris & Aroian, 2008). Within these highly convergent constructs, we could demonstrate several indicators of discriminant validity by contrasting the underlying subscales in a CFA model. Specifically, correlations between similar traits measured by different methods were larger than correlations between different traits (convergent validity), even when the latter were measured by the same method (discriminant validity).

All in all, we conclude that online assessment is a reliable and practical means of data collection, which opens up new possibilities for clinical care facilities. Furthermore, the BDI-II, HSCL-25, and PDS are reliable and valid measures of depression, anxiety, and PTSD throughout the Arab world, demonstrating the cross-cultural validity of these psychological concepts. The importance of the findings of this study rests on several factors. First, the study was based on a relatively large sample that was not specific to one Arabic country or dialect but rather represented the main Arabic subcultures about equally. Second, it included several constructs and measurements in a single model, which made it possible both to study their relations to one another and to consider more than one indicator of construct validity.

Limitations

The method of data collection in this study is a double-edged consideration. As the instruments were completed online, it was impossible to control or observe how, when, and where patients filled in the questionnaires. Future studies may examine whether paper-and-pencil data collection yields similar results. A further limitation of the current study is the lack of consideration of gender differences and other demographic factors. However, our primary aim here was to confirm global assumptions concerning the reliability and validity of the instruments. A more detailed investigation of gender and other subgroup effects, along with other cultural differences, is beyond the scope of this study and must be the focus of separate work, for example, using multiple-group analysis. Furthermore, as the current sample consisted mainly of younger participants, the study is primarily representative of younger generations.


Electronic Supplementary Material

ESM 1. Tables (PDF). Psychometric properties of the online Arabic BDI-II, HSCL-25, and PDS.

References

Abdel-Khalek, A. M. (1998). Internal consistency of an Arabic adaptation of the Beck Depression Inventory in four Arab countries. Psychological Reports, 82, 264–266.
Abdel-Khalek, A. M. (2001). A short version of the Beck Depression Inventory without omission of clinical indicators. European Journal of Psychological Assessment, 17, 233–240.
Al-Musawi, N. M. (2001). Psychometric properties of the Beck Depression Inventory-II with university students in Bahrain. Journal of Personality Assessment, 77, 568–579.
Al-Turkait, F. A., & Ohaeri, J. U. (2010). Dimensional and hierarchical models of depression using the Beck Depression Inventory-II in an Arab college student sample. BMC Psychiatry, 10, 60.
Al-Turkait, F. A., Ohaeri, J. U., El-Abbasi, A. M., & Naguy, A. (2011). Relationship between symptoms of anxiety and depression in a sample of Arab college students using the Hopkins Symptom Checklist 25. Psychopathology, 44, 230–241.
American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: American Psychiatric Press.
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA: American Psychiatric Publishing.
Andrews, G. (1996). Comorbidity in neurotic disorders: The similarities are more important than the differences. In R. M. Rapee (Ed.), Current controversies in the anxiety disorders. New York, NY: Guilford Press.
Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling, 16, 397–438.
Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Manual for the Beck Depression Inventory-II. San Antonio, TX: Psychological Corporation.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Caspi, Y., Saroff, O., Suleimani, N., & Klein, E. (2008). Trauma exposure and posttraumatic reactions in a community sample of Bedouin members of the Israel Defense Forces. Depression and Anxiety, 25, 700–707.
Clark, D. A., Steer, R. A., & Beck, A. T. (1994). Common and specific dimensions of self-reported anxiety and depression: Implications for the cognitive and tripartite models. Journal of Abnormal Psychology, 103, 645–654.
Clark, L. A., & Watson, D. (1991). Tripartite model of anxiety and depression: Psychometric evidence and taxonomic implications. Journal of Abnormal Psychology, 3, 316–336.
Elhai, J. D., Gray, M. J., Kashdan, T. B., & Franklin, C. L. (2005). Which instruments are most commonly used to assess traumatic event exposure and posttraumatic effects? A survey of traumatic stress professionals. Journal of Traumatic Stress, 5, 541–545.
Foa, E. B., Cashman, L., Jaycox, L., & Perry, K. (1997). The validation of a self-report measure of posttraumatic stress disorder: The Posttraumatic Diagnostic Scale. Psychological Assessment, 9, 445–451.


Ghareeb, A. G. (2000). Manual of the Arabic BDI-II. Cairo, Egypt: Angle Press.
Griesel, D., Wessa, M., & Flor, H. (2006). Psychometric qualities of the German version of the Posttraumatic Diagnostic Scale (PTDS). Psychological Assessment, 18, 262–268.
Grubaugh, A. L., Long, M. E., Elhai, J. D., Frueh, B. C., & Magruder, K. M. (2010). An examination of the construct validity of posttraumatic stress disorder with veterans using a revised criterion set. Behaviour Research and Therapy, 48, 909–914.
Joiner, T. E. Jr., Steer, R. A., Beck, A. T., Schmidt, N. B., Rudd, M. D., & Catanzaro, S. J. (1999). Physiological hyperarousal: Construct validity of a central aspect of the tripartite model of depression and anxiety. Journal of Abnormal Psychology, 108, 290–298.
Kapci, E. G., Uslu, R., Turkcapar, H., & Karaoglan, A. (2008). Beck Depression Inventory II: Evaluation of the psychometric properties and cut-off points in a Turkish adult population. Depression and Anxiety, 25, 104–110.
Kleinman, A. (1977). Depression, somatization, and the new cross-cultural psychiatry. Social Science & Medicine, 11, 3–10.
Kobeissi, L., Araya, R., El Kak, F., Ghantous, Z., Khawaja, M., Khoury, B., . . . Zurayk, H. (2011). The relaxation exercise and social support trial (RESST): Study protocol for a randomized community based trial. BMC Psychiatry, 11, 142.
Kojima, M., Furukawa, T. A., Takahashi, H., Kawai, M., Nagaya, T., & Tokudome, S. (2002). Cross-cultural validation of the Beck Depression Inventory-II in Japan. Psychiatry Research, 110, 291–299.
Lavik, N. J., Laake, P., Hauff, E., & Solberg, O. (1999). The use of self-reports in psychiatric studies of traumatized refugees: Validation and analysis of HSCL-25. Nordic Journal of Psychiatry, 53, 17–20.
Lipman, R. S., Covi, L., & Shapiro, A. K. (1979). The Hopkins Symptom Checklist (HSCL). Journal of Affective Disorders, 1, 9–24.
Little, T. D., Cunningham, W. A., Shahar, G., & Widaman, K. F. (2002). To parcel or not to parcel: Exploring the question, weighing the merits. Structural Equation Modeling, 9, 151–173.
Löwe, B., Spitzer, R. L., Williams, J. B., Mussell, M., Schellberg, D., & Kroenke, K. (2008). Depression, anxiety, and somatization in primary care: Syndrome overlap and functional impairment. General Hospital Psychiatry, 30, 191–199.
Marsh, H. W., Hau, K.-T., Balla, J. R., & Grayson, D. (1998). Is more ever too much? The number of indicators per factor in confirmatory factor analysis. Multivariate Behavioral Research, 33, 181–220.
Marsh, H. W., Muthén, B., Asparouhov, T., Lüdtke, O., Robitzsch, A., Morin, A. J. S., & Trautwein, U. (2009). Exploratory structural equation modeling, integrating CFA and EFA: Application to students' evaluations of university teaching. Structural Equation Modeling, 16, 439–476.
Merckelbach, H., Dekkers, T., Wessel, I., & Roefs, A. (2003). Amnesia, flashbacks, nightmares and dissociation in aging concentration camp survivors. Behaviour Research and Therapy, 41, 351–360.
Mineka, S., Watson, D., & Clark, L. A. (1998). Comorbidity of anxiety and unipolar mood disorders. Annual Review of Psychology, 49, 377–412.
Muthén, L., & Muthén, B. (2012). Mplus user's guide version 6.11. Los Angeles, CA: Muthén & Muthén.
Norris, A. E., & Aroian, K. J. (2008). Assessing reliability and validity of the Arabic language version of the Post-traumatic Diagnostic Scale (PDS) symptom items. Psychiatry Research, 160, 327–334.


R Core Team. (2013). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Ryder, A. G., Yang, J., & Heine, S. J. (2002). Somatization vs. psychologization of emotional distress: A paradigmatic example for cultural psychopathology. In W. J. Lonner, D. L. Dinnel, S. A. Hayes, & D. N. Sattler (Eds.), Online readings in psychology and culture (9th ed.). Bellingham, WA: Center for Cross-Cultural Research, Western Washington University.
Spitzer, R. L., First, M. B., & Wakefield, J. C. (2007). Saving PTSD from itself in DSM-V. Journal of Anxiety Disorders, 21, 233–241.
Steer, R. A., Ball, R., Ranieri, W. F., & Beck, A. T. (1999). Dimensions of the Beck Depression Inventory-II in clinically depressed outpatients. Journal of Clinical Psychology, 55, 117–128.
Suvak, M. K., & Barrett, L. F. (2011). Considering PTSD from the perspective of brain processes: A psychological construction analysis. Journal of Traumatic Stress, 24, 3–24.
Tourangeau, R., & Smith, T. (1996). Asking sensitive questions: The impact of data collection mode, question format, and question context. Public Opinion Quarterly, 60, 275–304.
Wagner, B., Brand, J., Schulz, W., & Knaevelsrud, C. (2012). Online working alliance predicts treatment outcome for posttraumatic stress symptoms in Arab war-traumatized patients. Depression and Anxiety, 29, 646–651.


Weeks, M. F. (1992). Computer-assisted survey information collection: A review of CASIC methods and their implications for survey operations. Journal of Official Statistics, 4, 445–465.
Whisman, M. A., Perez, J. E., & Ramel, W. (2000). Factor structure of the Beck Depression Inventory-Second Edition (BDI-II) in a student sample. Journal of Clinical Psychology, 56, 545–551.

Received November 17, 2014
Revision received January 12, 2016
Accepted January 22, 2016
Published online November 7, 2016

Pirko Selmo
Research Department
Treatment Centre for Torture Victims
Turmstrasse 21
10559 Berlin
Germany
Tel. +49 30 303 906 23
Fax +49 30 306 143 71
E-mail p.selmo@bzfo.de





Original Article

The Effect of Alternative Scoring Procedures on the Measurement Properties of a Self-Administered Depression Scale: An IRT Investigation on the CES-D Scale

Noboru Iwata (1), Akizumi Tsutsumi (2), Takafumi Wakita (3), Ryuichi Kumagai (4), Hiroyuki Noguchi (5), and Naotaka Watanabe (6)

(1) Department of Psychology, Hiroshima International University, Higashi-Hiroshima, Japan
(2) Department of Public Health, Kitasato University, Sagamihara, Japan
(3) Faculty of Sociology, Kansai University, Suita, Japan
(4) Graduate School of Education, Tohoku University, Sendai, Japan
(5) Department of Psychology and Human Developmental Sciences, Nagoya University, Japan
(6) Graduate School of Business Administration, Keio University, Yokohama, Japan

Abstract: To investigate the effect of response alternatives/scoring procedures on the measurement properties of the Center for Epidemiologic Studies Depression Scale (CES-D), which has four response alternatives, a polytomous item response theory (IRT) model was applied to the responses of 2,061 workers and university students (1,640 males, 421 females). Test information functions derived from the polytomous IRT analyses of the CES-D data under various scoring procedures indicated that: (1) the CES-D with its standard (0-1-2-3) scoring procedure should be useful for screening to detect subjects at high risk of depression, provided that the θ point showing the highest information corresponds to the cut-off point, because of its extremely high information at that point; (2) the CES-D with the 0-1-1-2 scoring procedure covers a wider range of depressive severity, suggesting that this scoring procedure might be useful in cases where a more exhaustive discrimination of symptomatology is of interest; and (3) the revised version of the CES-D, in which the original positive items are replaced by negatively reworded items, outperformed the original version. These findings could not have been demonstrated by classical test theory analyses; the utility of this kind of psychometric testing therefore warrants further investigation for standard measures of psychological assessment.

Keywords: Center for Epidemiologic Studies Depression Scale, item response theory, measurement accuracy, rating scale, Japanese

Major depression is one of the most common diseases in industrialized countries and the second most debilitating disease worldwide (Ferrari et al., 2013). The detection of depression in community settings, as well as prevention and treatment strategies, will increasingly become a major public health issue warranting heightened attention. Many self-administered questionnaires have been developed as conventional measures for this purpose. The Center for Epidemiologic Studies Depression Scale (CES-D), developed in the US for use in community surveys to identify groups at high risk of depression in the general population (Radloff, 1977, 1989), is one of the most popular instruments. The CES-D has been widely used as a measure of depressive symptomatology for persons in the US (e.g., Culp, Clyman, & Culp, 1995; Myers & Weissman, 1980), Europe (e.g., Fava, 1983; Fuhrer & Rouillon, 1989), and Asia (e.g., Cho & Kim, 1998; Iwata & Saito, 1987; Mackinnon, McCallum, Andrews, & Anderson, 1998). The common strategy of these conventional measures is to assess respondents on the basis of a total score obtained by summing the scores assigned to individual item responses (e.g., 0-1-2-3 for four response alternatives). The scoring properties should therefore be regarded as one of the most critical characteristics influencing the validity and reliability of such an assessment tool.


Although the item contents of this kind of assessment tool have received particular attention, the appropriateness of response alternatives and their scoring procedures has seldom been addressed. One scientifically sound approach is the application of item response theory (IRT). IRT is a modern psychometric theory that provides a foundation for scaling persons and items based on responses to assessment items (Hambleton, Swaminathan, & Rogers, 1991). Compared with classical test theory (CTT), IRT generally provides more sophisticated information regarding the psychometric properties of individual assessment items, enabling researchers to improve measurement accuracy (Embretson & Reise, 2000).

Several studies have applied IRT techniques to the CES-D (Jones & Fonda, 2004; Orlando, Sherbourne, & Thissen, 2000; Pickard, Dalal, & Bushnell, 2006; Stansbury, Ried, & Velozo, 2006). These studies have employed IRT to link different versions of the CES-D and other mental health measures (Jones & Fonda, 2004; Orlando et al., 2000) and to investigate differential item functioning (Pickard et al., 2006). Stansbury et al. (2006) applied the Rasch model to CES-D data with its four response alternatives and demonstrated that the positively worded items performed poorly overall and that their removal reduced the scale's bandwidth only slightly. Nevertheless, no previous study has addressed the appropriateness of the response alternatives/scoring procedures of the CES-D by applying a polytomous IRT model.

This study employs polytomous IRT models to examine the effect of response alternatives/scoring procedures on the measurement properties of the CES-D, using data obtained from Japanese workers and college students. Particular attention is paid to differences in the test information functions across several scoring procedures for the CES-D response alternatives. A well-known example of an alternative scoring procedure is the General Health Questionnaire (GHQ; Goldberg & Williams, 1988), with its 0-0-1-1 GHQ scoring for four response alternatives. The GHQ scoring can be rationalized by its primary purpose of detecting "probable cases of psychological morbidity" in a community setting, whereas an alternative 0-1-1-1 CGHQ scoring has been recommended for respondents with chronic symptoms (Goodchild & Duncan-Jones, 1985). It should be noted, however, that the GHQ/CGHQ scoring was validated with CTT, not with IRT. IRT outperforms CTT in terms of evidence because of its sample-independent nature and the various kinds of measurement information it provides along the latent continuum of interest, such as item characteristics and test information (Hambleton et al., 1991). These advantages are exploited in this study.


Methods

Sample

The participants were 2,143 workers and students in total, recruited from two organizational settings and two universities in Japan. Workers were asked to respond to a self-administered questionnaire, including items on demographics, the CES-D scale, the State-Trait Anxiety Inventory (STAI; Spielberger, 1989), the social support scale of the Brief Job Stress Questionnaire (Shimomitsu, Yokoyama, Ono, Maruta, & Tanigawa, 1998), and some originally developed job stress items, prior to receiving their annual health check. The survey questionnaires were distributed through the health and safety division of each organization. Workers responded to the questionnaire at home and brought it to the annual health check. Of the 1,871 adult workers, complete responses to the demographics (gender and age) and the CES-D were obtained from 1,796 (96.0%; 1,480 males and 316 females). University students were invited to participate at the end of classes and voluntarily agreed to respond anonymously to the CES-D in the classrooms. Of the 271 undergraduates, 265 (97.8%; 160 males and 105 females) completed the questionnaire. We did not employ any exclusion criteria. The complete CES-D data obtained from the total of 2,061 workers and students (1,640 males, 421 females) were analyzed in this study. Mean age was 37.4 years (SD = 11.7; range 18–60) for males and 28.8 years (SD = 10.0; range 18–59) for females, a significant but medium-sized difference (t = 15.2, p < .001, Cohen's d = 0.79).

We obtained permission from the directors of the occupational organizations and the department heads of the universities to conduct the survey. All candidates received overall information about the study, which emphasized the confidentiality of their answers and their freedom to decline participation. All procedures were in accordance with the ethical standards of the responsible committee on human subjects and with the Helsinki Declaration of 1975, as revised in 2000.

Measurement

The CES-D is a 20-item self-administered questionnaire that assesses the frequency of depressive symptoms during the past week (Radloff, 1977). The measure comprises 16 negative (NEG) items, such as "I felt depressed," and four positive affect (POS) items, such as "I was happy." The NEG items include seven items each for "Depressed affect" (DEP) and "Somatic and retarded activities," and two items for "Interpersonal relations."


Subjects select one of four response alternatives: "rarely or none of the time (experienced less than 1 day during the past week)," "some or a little of the time (1–2 days)," "occasionally or a moderate amount of the time (3–4 days)," and "most or all of the time (5–7 days)." These are usually scored as 0, 1, 2, and 3, respectively. The four POS items are reverse-scored so that higher scores indicate a greater level of depressive symptomatology, that is, "low positive affect." We used the Japanese version of the CES-D, developed and validated by Shima, Shikano, Kitamura, and Asai (1985). We also added the four negatively revised versions (Iwata et al., 1998) of the original POS items, because the inappropriateness of the POS items for the Japanese population has been pointed out by cross-cultural studies (Iwata & Buka, 2002; Iwata & Higuchi, 2000; Iwata, Saito, & Roberts, 1994) and in a clinical setting (Iwata et al., 1998). Although the STAI and other scales were used in the worker survey, descriptions of these scales are omitted because their data are beyond the scope of this analysis.
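As a small illustration of the scoring just described, the following R snippet reverse-scores the POS items and computes the conventional total score. The data frame cesd and the item numbering (POS items assumed here to be items 4, 8, 12, and 16) are illustrative assumptions.

```r
# Small illustration (R): reverse-score the four positive affect items
# (0-1-2-3 -> 3-2-1-0) and compute the CES-D total score.
# `cesd` is a hypothetical data frame with columns item1 ... item20 scored 0-3.
pos_items <- c("item4", "item8", "item12", "item16")  # assumed positive affect items
cesd[pos_items] <- 3 - cesd[pos_items]
cesd$total <- rowSums(cesd[paste0("item", 1:20)])
```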

Data Analysis

We employed factor analysis to assess the dimensionality of the CES-D data (Takane & de Leeuw, 1987). A principal axis factoring based on polychoric correlations among the CES-D items was conducted using Mplus (Muthén & Muthén, 2006). We then applied the generalized partial credit model for polytomous IRT analyses (Muraki, 1992) using PARSCALE (Muraki & Bock, 2003) to the original and the revised CES-D items separately. The revised CES-D (CESD-R) consisted of the 16 NEG items and the four negatively revised POS items, instead of the original POS items. Using the estimated parameters, we drew the item response category characteristic curves (IRCCCs) and test information curves. After the IRT analyses of these two versions with the standard rating procedure (i.e., 0-1-2-3), we repeated the analyses for alternative ratings to identify a more appropriate rating procedure.
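For readers who wish to reproduce this type of analysis with current open-source tools, the following R sketch fits a generalized partial credit model with the mirt package as a stand-in for PARSCALE and plots the IRCCCs and information curves. The data frame cesd with items item1 ... item20 is hypothetical.

```r
# Hedged sketch (R): GPCM via the mirt package as a stand-in for PARSCALE.
library(mirt)

fit <- mirt(cesd[paste0("item", 1:20)], model = 1, itemtype = "gpcm")

plot(fit, type = "trace")      # item response category characteristic curves
plot(fit, type = "infotrace")  # item information curves
plot(fit, type = "info")       # test information curve
coef(fit, IRTpars = TRUE, simplify = TRUE)  # discrimination and category parameters
```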

Results

Dimensionality of the Original and Revised CES-D

For the original CES-D, a principal axis factoring extraction revealed that the eigenvalue (% explained) of the first factor was 9.14 (45.7%), and those of subsequent factors were 1.77 (8.9%), 1.10 (5.5%), 0.85 (4.3%), 0.81 (4.0%), and so on. The first eigenvalue was five times as large as the second. For the CESD-R, the corresponding eigenvalues (% explained) were 10.85 (54.2%), 1.13 (5.7%), 0.88 (4.4%), 0.80 (4.0%), 0.74 (3.7%), and so on; here the first eigenvalue was nine times as large as the second. Although the CESD-R showed more obvious unidimensionality than the original CES-D, and taking into consideration that the scale is traditionally used as a total score, both versions of the CES-D were regarded as essentially unidimensional.
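A minimal R sketch of this dimensionality check, using the psych package on a hypothetical item data frame cesd, could look as follows; it mirrors the principal axis factoring on polychoric correlations reported above.

```r
# Minimal sketch (R): eigenvalues of the polychoric correlation matrix as a
# check of (essential) unidimensionality, plus a one-factor PAF solution.
library(psych)

poly <- polychoric(cesd[paste0("item", 1:20)])            # polychoric correlations
round(eigen(poly$rho)$values, 2)                          # eigenvalues of the correlation matrix
fa(poly$rho, nfactors = 1, fm = "pa", n.obs = nrow(cesd)) # principal axis factoring
```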

IRT Analyses of the CES-D With Standard Likert Scoring

Based on IRT, item characteristics can be expressed by item response category characteristic curves (IRCCCs), which plot the probability of choosing a category of the item as a function of the latent trait measured by the scale containing the item. Figure 1A shows the IRCCCs for a DEP item, #6 "I felt depressed," which appeared to be the best item with respect to psychometric properties. The narrow and peaked category response curves observed for this item indicate that the response categories differentiate among trait levels fairly well (i.e., the item discrimination parameter). The intersection point of two adjacent category response curves indicates where on the latent-trait scale the response in one category becomes relatively more likely than in the previous category (i.e., the category intersection parameters). Figure 1B displays the item information curve of the same item. The item information function indicates the amount of information an item contains at all points along the latent-trait continuum. Highly discriminating items have "peaked" information curves, providing much information in a narrow range of trait values (meaning a small standard error of measurement), whereas low discriminating items have flatter and more spread-out information curves. This item performs well at relatively higher levels of the latent trait and involves a fairly large amount of information as a whole.

In contrast, Figure 2 shows the worst case with respect to item performance, the POS item #4 "(Not) as good as others." Its IRCCCs were extremely flat without any peak (Figure 2A), and almost no information could be derived from this item (Figure 2B). Similar features were found for the remaining three POS items, but not for the NEG items (figures not shown). On the other hand, when this POS item was negatively revised, both the IRCCCs and the item information curve became much better (Figure 3). Similarly, the negatively revised versions of the remaining three POS items showed much more favorable features than their corresponding POS items.

Summation of the individual item information functions equals the test information function of the scale. The information curves represent the standard error of measurement at any chosen trait level (Hambleton & Swaminathan, 1985).




Figure 1. The item response category characteristic curves (A) and item information curve (B) of a depressed affect item, #6 “I felt depressed.”

Thus, the precision at which a test performs at various ranges of the latent trait can be determined: for example, a conditional information value of 9 corresponds to a conditional standard error (SE) of 0.33 under maximum likelihood estimation (MLE). Given that ρ = 1/(1 + SE²), an SE of 0.33 under the assumption of a true score variance of 1 corresponds to a reliability of ρ = 0.90 in CTT. That is, I(θ) = 9 under MLE indicates a reliability of ρ = 0.90. Here, we could not employ MLE and used EAP (expected a posteriori) estimation instead, because a fair number of respondents scored 0, for whom MLE cannot estimate θ. As described above, a conditional information value corresponds to a conditional SE value under MLE. Under EAP, however, such a one-to-one correspondence does not exist, and the information curve is therefore expressed as a scatter plot rather than a line. The measurement precision of the various scoring procedures could not easily be compared using such scatter plots. Therefore, only for the purpose of making the comparison easier, a strategy similar to that used with MLE was employed here; that is, the θ levels at which the test information equals 9 were calculated, and the percentages of the population measurable at this "higher reliability" (or smaller measurement error) level were estimated (Table 1).
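The conversion used here can be reproduced in a few lines of R; the numbers below simply restate the worked example of I(θ) = 9.

```r
# Worked example (R): conditional information -> standard error -> reliability (MLE).
info <- 9
se   <- 1 / sqrt(info)   # SE(theta) = 1/sqrt(I(theta)) = 0.33
rho  <- 1 / (1 + se^2)   # reliability = 1/(1 + SE^2) = 0.90
c(se = se, reliability = rho)
```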


Figure 2. The item response category characteristic curves (A) and item information curve (B) of a positive affect item, #4 “(Not) as good as others.”

The original CES-D with the standard 0-1-2-3 scoring showed a sufficient amount of test information at latent-trait (θ) levels above 0.01, and the test discriminates best at relatively high levels of the trait, around 1.8 ("CES-D 0-1-2-3" in Figure 4). The test information function of the CESD-R with 0-1-2-3 scoring (denoted "CESD-R 0-1-2-3") appeared comparable in shape to that of the original CES-D, although clearly more information was obtained from the CESD-R.

IRT Analyses of the CES-D With Alternative Scoring Procedures

We then repeated the IRT analyses for the CES-D with alternative scoring procedures in order to compare the test information functions (also displayed in Figure 4). The 0-1-2-2 scoring means that the most severe response alternative, "most or all of the time," is lumped with the second most severe alternative, "occasionally or a moderate amount of the time." For the original and revised CES-D with 0-1-2-2 scoring, the test information function curves were similar to, but slightly flatter than, those of the standard scoring procedure.




Figure 4. The test information curves of various scoring procedures on the standard CES-D scale (solid lines) and on the revised CES-D scale (CESD-R: dotted lines), in which the four positive affect items had been revised negatively.


Figure 3. The item response category characteristic curves (A) and item information curve (B) of a negatively-revised positive affect item, #4 “I felt I was inferior to others.”

Table 1. θ values at the point of 9 on the information functions* by various scoring procedures

Scale (scoring)      Lower θ    Higher θ    Width    Expected population covered (%)
CES-D
  (0-1-2-3)          0.01       3.10        3.09     (49.5)
  (0-1-2-2)          –0.13      2.48        2.61     (54.4)
  (0-1-1-2)          –0.10      3.38        3.48     (54.1)
  (0-1-1-1)          –0.24      1.28        1.52     (49.3)
CESD-R†
  (0-1-2-3)          –0.35      3.15        3.50     (63.6)
  (0-1-2-2)          –0.45      2.59        3.04     (66.8)
  (0-1-1-2)          –0.53      3.49        4.02     (70.1)
  (0-1-1-1)          –0.60      1.39        1.99     (64.2)

Notes. *A convenience point for comparison purposes. †Revised CES-D in which the four positive affect items have been negatively reworded.

The top peak of the curve was also slightly shifted toward a lower level, around 1.3. Again, the information curves were larger for the CESD-R than for the original CES-D.

For the original and revised CES-D with 0-1-1-2 scoring, in which the two middle response alternatives, "some or a little of the time" and "occasionally or a moderate amount of the time," are lumped together, the test information curves differed considerably from the previous ones, showing a "trapezoid-like" distribution, while both versions peaked at θ = 2.4. Although the slopes were almost identical at the lower θ levels, the curves maintained a higher information level up to higher levels of the latent trait. The width between the lower and higher θ's was the largest of all scoring procedures, including the standard one. Thus, the 0-1-1-2 scoring procedure showed more favorable measurement properties than the previous ones: a more peaked curve among the more depressed respondents, with a certain amount of test information preserved across a wider latent-trait range. This was particularly marked for the CESD-R. Additional analyses for the 0-1-1-1 scoring procedure showed similar but smaller curves compared with those found for the standard and 0-1-2-2 scoring procedures.
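The following R sketch shows one way to implement such an alternative scoring and compare test information with the standard scoring, again using mirt as a stand-in for PARSCALE; the recoding function, data frame cesd, and item names are illustrative assumptions.

```r
# Hedged sketch (R/mirt): collapse the two middle categories (0-1-2-3 -> 0-1-1-2)
# and compare test information with the standard scoring.
library(mirt)

items <- paste0("item", 1:20)
recode_0112 <- function(x) c(0, 1, 1, 2)[x + 1]          # 0->0, 1->1, 2->1, 3->2
cesd_0112   <- as.data.frame(lapply(cesd[items], recode_0112))

fit_std  <- mirt(cesd[items], model = 1, itemtype = "gpcm")
fit_0112 <- mirt(cesd_0112,   model = 1, itemtype = "gpcm")

theta <- matrix(seq(-4, 4, by = 0.1))
cbind(theta         = theta,
      info_standard = testinfo(fit_std,  theta),
      info_0112     = testinfo(fit_0112, theta))
```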

Internal Consistencies of the CES-D With Alternative Scoring Procedures

Table 2 shows the corrected item-total correlations and Cronbach's α's of the original and revised CES-D for the various scoring procedures. Items yielding the lowest and highest values are listed per subscale, together with the subscale average. For the NEG items, the correlations were comparable across the three scoring procedures, although detailed inspection revealed that they tended to decrease slightly from the standard scoring to the 0-1-1-2 scoring. The correlations of the CESD-R were consistently higher than their original-version counterparts. The correlations of the POS items, #4 and #8 "(Not) felt hopeful" in particular, were much lower than those of the NEG items.



Table 2. Corrected item-total correlations by different scoring procedures

                                       Standard scoring    Alternative scoring
                                       (0-1-2-3)           (0-1-2-2)           (0-1-1-2)
Items                                  CES-D   CESD-R      CES-D   CESD-R      CES-D   CESD-R
Depressed affect
  Mean value                           .59     .62         .58     .60         .56     .58
  Lowest: 17 crying spells             .52     .54         .50     .53         .50     .52
  Highest: 6 felt depressed            .70     .73         .68     .70         .64     .66
Somatic and retarded activities
  Mean value                           .55     .57         .54     .56         .51     .53
  Lowest: 2 poor appetite              .44     .45         .43     .44         .42     .43
  Highest: 7 everything an effort      .63     .65         .62     .64         .57     .59
Interpersonal relations
  Mean value                           .54     .58         .54     .58         .54     .57
  Lowest: 19 people disliked me        .52     .57         .52     .56         .52     .55
  Highest: 15 people unfriendly        .56     .59         .56     .59         .56     .58
(Low) Positive affect
  Mean value                           .32     .64         .34     .63         .26     .60
  Lowest: 4 (Not) as good as others    .21     .60         .25     .59         .14     .57
  Highest: 16 (Not) enjoyed life       .46     .69         .42     .67         .39     .64
Cronbach's α                           .88     .93         .89     .92         .87     .92

Note. CESD-R: Four positive affect items of the original CES-D have been negatively reworded.

In contrast, their negatively revised counterparts showed considerably higher correlations, mostly equivalent to those of the DEP items. Cronbach's α ranged from .874 for the 0-1-1-2 scoring of the original CES-D to .927 for the standard scoring of the CESD-R. The CESD-R showed α's above .91 for all scoring procedures, while the original CES-D also reached a satisfactory level of internal consistency.
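As an illustration, corrected item-total correlations (r.drop) and Cronbach's α of the kind reported in Table 2 can be obtained in R with the psych package; the data frame cesd is again hypothetical.

```r
# Minimal sketch (R): Cronbach's alpha and corrected item-total correlations.
library(psych)

res <- psych::alpha(cesd[paste0("item", 1:20)])
res$total$raw_alpha               # Cronbach's alpha
round(res$item.stats$r.drop, 2)   # corrected item-total correlations
```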

Discussion

In this study we investigated the effect of response alternatives/scoring procedures on the measurement properties of the CES-D by applying a polytomous IRT model to data obtained from Japanese workers and college students. According to the mathematical equation for the test information of a polytomous IRT model, it seems obvious that the more response categories there are, the more information the test holds. The test information curve indicates how much information the instrument can obtain at a given θ point on the latent trait of interest. In other words, it indicates how sensitive the instrument is to changes around a certain θ level. Therefore, the narrower but higher information curve obtained with the standard 0-1-2-3 scoring procedure, as compared to the other scoring procedures, suggests that this scoring should be best suited for screening purposes, that is, detecting those in the "depressed" category, provided that the θ point showing the highest information corresponds to the cut-off point (see Figure 4). If we employ the 0-1-1-2 scoring procedure, the CES-D could cover a wider range of the latent trait (i.e., depressive severity) with a satisfactory level of reliability. Therefore, this 0-1-1-2 scoring procedure might be useful in cases where more exhaustive discrimination in symptomatology is of interest. We also compared two versions of the CES-D: the CES-D and the CESD-R. All the results revealed that the revised version outperformed the original version (Figure 4, Tables 1 and 2). One typical difference is exemplified in Figures 2 and 3: the response alternatives of the original POS item did not function in discriminating a respondent's level on the latent trait (Figure 2), while those of the revised item functioned much better (Figure 3). As to the internal consistencies, the latter type of item made a much better contribution (Table 2). These findings appear to provide additional evidence supporting Stansbury et al. (2006), who, employing a Rasch IRT model, demonstrated the poor psychometric performance of POS items and proposed a revised scale containing only the 16 NEG items, removing the four POS items. The present study, utilizing a polytomous IRT model, suggests that manipulating the response alternatives, for example by lumping categories together or by replacing the current wording with more appropriate wording, could lead to better discrimination and more accurate information. Evaluating the item characteristics and the test information based on IRT can provide a test-statistical justification for such improvements (Hambleton & Swaminathan, 1985). The findings obtained in this study could not have been demonstrated by CTT analyses, and thus this kind of sophisticated psychometric testing warrants further application to standard measures of mental health and psychological assessment. However, we should note that our expedient use of the information function value of 9 was only for comparison purposes (Table 1) and is not necessarily recommended when estimates are based on EAP estimation. Another fundamental assumption of most psychosocial measures is that the constructs are measured on an interval scale (Cook et al., 2001); this property is required for the score calculations of most psychological measures. Polytomous IRT is also applicable to the examination of this equidistance issue, although we have not addressed it here to avoid confusion. Finer analyses including such an investigation may provide a clue to further improving the accuracy of this kind of psychological assessment tool.
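For reference, the information benchmark of 9 mentioned above translates into a conditional standard error of about 0.33 and, by the conventional 1 − 1/I rule of thumb, a reliability of roughly .89; a short sketch of that conversion, independent of any particular data set, is given below.

```python
import math

def sem_from_information(info: float) -> float:
    """Conditional standard error of measurement at a theta point: 1 / sqrt(I(theta))."""
    return 1.0 / math.sqrt(info)

def reliability_from_information(info: float) -> float:
    """Conventional rule-of-thumb reliability index at a theta point: 1 - 1 / I(theta)."""
    return 1.0 - 1.0 / info

for info in (4, 9, 16, 25):
    print(f"I = {info:>2}: SE = {sem_from_information(info):.2f}, "
          f"reliability approx. {reliability_from_information(info):.2f}")
```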

Conclusions

This study represents the first rigorous psychometric investigation of the effect of alternative scoring procedures on the measurement properties of a self-administered scale, the CES-D, the most popular measure of depressive symptoms. The polytomous IRT analyses of the CES-D data with various scoring procedures indicated that: (1) the CES-D with its standard (0-1-2-3) scoring procedure should be useful as a screening tool to detect subjects at high risk of depression if the θ point showing the highest information corresponds to the cut-off point, because of its markedly higher information; (2) the CES-D with the 0-1-1-2 scoring procedure could cover a wider range of depressive severity, suggesting that this scoring procedure might be useful in cases where more exhaustive discrimination in symptomatology is of interest; and (3) the revised version of the CES-D, in which the original positive items were replaced with negatively reworded items, outperformed the original version, although this may hold specifically for Japanese and/or East Asian populations. As exemplified in this study, this kind of rigorous psychometric testing warrants further application to standard measures of mental health and psychological assessment.

Acknowledgments
This study was supported partly by a Grant-in-Aid for Scientific Research (C) from the Japan Ministry of


Education, Culture, Sports, Science and Technology (Project Number 14570367) and Health and Labour Sciences Research Grants (Research on Occupational Safety and Health; H17-Rodo-5), from the Japan Ministry of Health, Labour and Welfare.

References Cho, M. J., & Kim, K. H. (1998). Use of the Center for Epidemiologic Studies Depression (CES-D) Scale in Korea. Journal of Nervous & Mental Disease, 186, 304–310. doi: 10.1097/00005053199805000-00007 Cook, K. F., Ashton, C. M., Byrne, M. M., Brody, B., Geraci, J., Giesler, R. B., . . . Wray, N. P. (2001). A psychometric analysis of the measurement level of the rating scale, time trade-off, and standard gamble. Social Science & Medicine, 53, 1275–1285. doi: 10.1097/00005392-200208000-00163 Culp, A. M., Clyman, M. M., & Culp, R. E. (1995). Adolescent depressed mood, reports of suicide attempts, and asking for help. Adolescence, 30, 827–837. PMID:8588519. Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum. Fava, G. A. (1983). Assessing depressive symptoms across cultures: Italian validation of the CES-D self-rating scale. Journal of Clinical Psychology, 39, 249–251. doi: 10.1002/ 1097-4679(198303)39:23.0.CO;2-Y Ferrari, A. J., Charlson, F. J., Norman, R. E., Patten, S. B., Freedman, G., Murray, C. J. L., . . . Whiteford, H. A. (2013). Burden of depressive disorders by country, sex, age, and year: Findings from the Global Burden of Disease Study 2010. PLoS Medicine, 10, e1001547. doi: 10.1371/journal. pmed.1001547 Fuhrer, R., & Rouillon, F. (1989). La version francaise de l’echelle CES-D (Center for Epidemiologic Studies-Depression Scale). Description et traduction de l’echelle d’autoevaluation [The French version of the CES-D (Center for Epidemiologic Studies-Depression Scale)]. European Psychiatry, 4, 163–166. [In French]. Goldberg, D., & Williams, P. (1988). A user’s guide to the General Health Questionnaire. NFER-Nelson Publishing: Windsor. Goodchild, M. E., & Duncan-Jones, P. (1985). Chronicity and the General Health Questionnaire. British Journal of Psychiatry, 146, 55–61. doi: 10.1192/bjp.146.1.55 Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage Press. Iwata, N., & Buka, S. (2002). Race/ethnicity and depressive symptoms: A cross-cultural/ethnic comparison among university students in East Asia, North and South America. Social Science and Medicine, 55, 2243–2252. doi: 10.1016/ S0277-9536(02)00003-5 Iwata, N., & Higuchi, H. R. (2000). Responses of Japanese and American university students to the STAI items that assess the presence or absence of anxiety. Journal of Personality Assessment, 74, 48–62. doi: 10.1207/ S15327752JPA740104 Iwata, N., & Saito, K. (1987). Relationships of the Todai Health Index to the General Health Questionnaire and the Center for Epidemiologic Studies Depression Scale. Japanese Journal of Hygiene, 42, 865–873.


Iwata, N., Saito, K., & Roberts, R. E. (1994). Responses to a selfadministered depression scale among younger adolescents in Japan. Psychiatry Research, 53, 275–287. doi: 10.1016/01651781(94)90055-8 Iwata, N., Umesue, M., Egashira, K., Hiro, H., Mizoue, T., Mishima, N., & Nagata, S. (1998). Can positive affect items be used to assess depressive disorders in the Japanese population? Psychological Medicine, 28, 153–158. doi: 10.1017/ S0033291797005898 Jones, N. S., & Fonda, S. J. (2004). Use of an IRT-based latent variable model to link different forms of the CES-D from the Health and Retirement Study. Social Psychiatry and Psychiatric Epidemiology, 39, 828–835. doi: 10.1007/s00127-004-0815-8 Mackinnon, A., McCallum, J., Andrews, G., & Anderson, I. (1998). The Center for Epidemiological Studies Depression Scale in older community samples in Indonesia, North Korea, Myanmar, Sri Lanka, and Thailand. Journal of Gerontology: Psychological Sciences, 53B, 343–352. PMID: 9826965. Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176. doi: 10.1177/014662169201600206 Muraki, E., & Bock, R. D. (2003). PARSCALE: Parameter scaling of rating data [Computer program]. Chicago, IL: Scientific Software, Inc. Muthén, L. K., & Muthén, B. O. (2006). Mplus user’s guide. Los Angeles, CA: Muthén & Muthén. Myers, J. K., & Weissman, M. M. (1980). Use of a self-report symptom scale to detect depression in a community sample. American Journal of Psychiatry, 137, 1081–1084. PMID: 7425160. Orlando, M., Sherbourne, C. D., & Thissen, D. (2000). Summedscore linking using item response theory: Application to depression measurement. Psychological Assessment, 12, 354–359. doi: 10.1037/1040-3590.12.3.354 Pickard, A. S., Dalal, M. R., & Bushnell, D. M. (2006). A comparison of depressive symptoms in stroke and primary care: Applying Rasch models to evaluate the Center for Epidemiologic Studies Depression Scale. Value in Health, 9, 59–64. doi: 10.1111/ j.1524-4733.2006.00082.x Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. doi: 10.1177/ 014662167700100306


Radloff, L. S. (1989). The use of the Center for Epidemiologic Studies Depression Scale in adolescents and young adults. Journal of Youth and Adolescence, 20, 149–166. doi: 10.1007/ BF01537606 Shima, S., Shikano, T., Kitamura, T., & Asai, M. (1985). New selfrating scale for depression. Clinical Psychiatry, 27, 717–723. [In Japanese]. Shimomitsu, T., Yokoyama, K., Ono, Y., Maruta, T., & Tanigawa, T. (1998). Development of a novel Brief Job Stress Questionnaire. In S. Kato (Ed.), Report of the research grant for the prevention of work-related diseases from the Ministry of Labour (pp. 107–115) [In Japanese]. Tokyo, Japan: Ministry of Labour. Spielberger, C. D. (1989). State-Trait Anxiety Inventory: Bibliography (2nd Ed.). Palo Alto, CA: Consulting Psychologists Press. Stansbury, J. P., Ried, L. D., & Velozo, C. A. (2006). Unidimensionality and bandwidth in the Center for Epidemiologic Studies Depression (CES-D) Scale. Journal of Personality Assessment, 86, 10–22. doi: 10.1207/s15327752jpa8601_03 Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393–408. doi: 10.1007/ BF02294363

Received August 27, 2015 Revision received January 27, 2016 Accepted February 16, 2016 Published online November 7, 2016

Noboru Iwata Department of Psychology Hiroshima International University 555-36 Kurose-Gakuendai, Higashi-Hiroshima Hiroshima 724-0695 Japan Tel. (+81) 823-70-4919 Fax (+81) 823-70-4852 E-mail iwatan@he.hirokoku-u.ac.jp; n0b0ru1wata515@gmail.com



Original Article

Is SMS APPropriate? Comparative Properties of SMS and Apps for Repeated Measures Data Collection

Erin I. Walsh1 and Jay K. Brinker2

1 Research School of Psychology, The Australian National University, Canberra, Australia
2 Department of Psychology, University of Alberta, Alberta, Canada

Abstract: The ubiquity of mobile telephones worldwide offers a unique opportunity for bidirectional communication between researchers and participants. There are two ways mobile phones could be used to collect self-report data: via Short Message Service (SMS) or app (mobile telephone software applications). This study examined the comparative data quality offered by SMS and app, when mobile phone type, self-report instrument, and sampling schedule are controlled. One hundred ten undergraduate students used their own iPhones to complete the same repeated measures instrument on 20 occasions, responding either by SMS or by app. There were no differences between SMS and app respondents in terms of response rates or response delay. However, data from those responding via SMS was significantly less complete than from app respondents. App respondents rated their respondent experience as more convenient than SMS respondents. Though findings are only generalizable to an undergraduate sample, this suggests that researchers should consider using apps rather than SMS for repeated measures self-report data collection.

Keywords: Short Message Service, app, mobile telephone, methodology

Over three quarters of the global population own a mobile telephone (The World Bank, 2012). As either a supplement or a replacement to traditional research modes such as telephone or postal surveys, mobile telephones offer an unprecedented opportunity for researchers to communicate with participants in self-report research. Though uptake of mobile technology in self-report research is gaining momentum, there remains little structured investigation into the optimal way to use mobile phones in self-report research (Haller, Sanci, Sawyer, Coffey, & Patton, 2006). Two of the ways mobile telephones can support self-report data collection are Short Message Service (SMS) and mobile telephone applications (apps). SMS is a text-only messaging system available on even the most basic mobile telephone handset and a very common communication method in people's daily lives (Anhoj & Moldrup, 2009). Despite the rise of other text-based mobile communication technologies (such as Multimedia Messaging Service, platform-specific services like iMessage, or services like WhatsApp), SMS remains a dominant communication medium worldwide, and in Australia (ACMA, 2013; Mackay & Weidlich, 2009). Its widespread nature may provide an important opportunity for researchers to communicate with their participants (Haller et al., 2006; Lehman, 2011). Some research using SMS involves sending messages through a mobile handset,

but a more common approach is to manage scheduling, sending, and receiving of SMS through online databases. Some do this through preexisting SMS aggregation services (as in Walsh & Brinker, 2012), and others write a computer program of their own to manage the SMS (as in Reimers & Stewart, 2009). Apps are downloadable software programs that are common to all smart mobile telephones or smartphones (Miller, 2012). They are typically tied to a particular mobile operating system, such as Android or iOS, though there has been a move toward cross-system app compatibility (Ribeiro & da Silva, 2012). There are millions of apps used for different purposes, from communication to games, and many are designed specifically for self-report data collection. With over a thousand self-report survey apps and at least six thousand different health-related apps, use of apps for health and medical research and intervention is gaining traction (Rosser & Eccleston, 2011). Self-report apps can be designed to mimic the web browsing experience (and thus involve a user experience similar to online surveys) or can have their own aesthetic more in line with mobile telephone interfaces (Kojo, Heiskala, & Virtanen, 2014). Researchers using apps may choose to use preexisting software, such as iSurvey, or design their own apps to meet their specific research goals (e.g., Fukuoka & Kamitani, 2011; Morris et al., 2010). Others have combined SMS



with apps, by routing everyday SMS usage through apps for the purposes of data collection (e.g., Montag et al., 2014). Recognizing the global saturation of mobile phones, and the potential use of both apps and SMS as platforms for self-report data collection, it is important to establish how SMS and app compare as a data collection method. Complete and timely responses are important for building a high-quality dataset, and so response completeness and response delay are useful metrics for comparing how SMS and app perform as data collection tools. Two sources of data incompleteness are complete nonresponses, and item skipping resulting in an only partially complete instrument (Sax, Gilmartin, & Bryant, 2003). Nonresponses threaten the total sample size available for analyses (Fox, Crask, & Kim, 1988) and can lead to an unrepresentative portion of a given population being sampled, threatening the validity of research (Flick, 1988). Skipping items can result in small levels of incompleteness. This is problematic because score totals cannot be calculated (Mogensen, 1963), and item missingness causes difficulties for many methods of statistical analysis (Van Buuren, 2010). Meta-analyses suggest that the average response rate in academic research is roughly 50% (Baruch & Holtom, 2008). This can depend on the specific mode used for data collection, with comparative studies indicating mail surveys obtain a higher response rate than voice calls (Dillman et al., 2009a), and online surveys a higher response rate than mail surveys (Cook, Heath, & Thompson, 2000). A comparison of participants responding via app and via paper diary has found a higher response rate in app respondents (Tsai et al., 2007). Repeated measures research using apps has reported roughly 80% response rates (Fukuoka & Kamitani, 2011), suggesting that a relatively high response rate may be expected from apps. Many apps follow the lead of online surveys by prompting participants to complete skipped items, and only allowing them to submit their response when every item in the survey has been satisfactorily completed. For online data collection, some studies have found this has led to significantly less item skipping in online surveys in comparison to paper surveys where no such prompts are possible (Van de Vijver & Harsveldt, 1994), though others have found the opposite (Richardson & Johnson, 2009). Response rates to research using SMS to communicate with participants vary from 20% (Chib, Wilkin, Ling, Hoefman, & Van Biejma, 2012) to 100% (Donaldson, Fallows, & Morris, 2014). SMS has no provision for automatically detecting and prompting participants to complete skipped items in a larger questionnaire, so it provides no barrier to incomplete submission. In a comparison of completeness of SMS, paper, and online diaries,


Lim, Sacks-Davis, Aitken, Hocking, and Hellard (2010) found that participants responding via SMS were more likely to return diaries, but provided more incomplete data, than those responding using paper or online diaries. Together, this literature suggests that data collected via SMS may offer higher response rates, but lower response completeness, than data collected via app. As the time between an event or experience increases, so does the likelihood of recall bias distorting self-report (Raphael, 1987). Minimizing the delay between when a response is required, and provision of that response would likely improve the accuracy of the data. Mode can impact on both how quickly people begin their response and how long it takes to complete it. For example, web surveys are quicker to complete than paper surveys with the same content (Richardson & Johnson, 2009). Participants tend to respond more promptly when using SMS, in comparison to paper (Asiimwe et al., 2011; Broderick et al., 2012). Response delays in SMS research range from 2 min (Conner & Reid, 2012) up to an hour (De Lepper, Eijkemans, Van Beijma, Loggers, & Tuijn, 2013). Response delays in app research have been around 8 min (Hofmann & Patel, 2014). Although range and median are informative for forming response delay expectations, they have limited usefulness for direct comparison of the response delays that may be expected when collecting self-report data via SMS and app. To date, no research has directly compared the response delays associated with SMS and app self-report responses. The way participants perceive a particular research mode can impact upon how they engage with it (Dillman et al., 2009b). Positive perceptions of convenience can lessen the perceived burden of responding (Sharp & Frankel, 1983), and lead to deeper engagement with research, and thus more honest and thoughtful responses (Naughton, Jamison, & Sutton, 2013). Negative perceptions regarding data privacy can be a barrier to using mobile phones for research purposes (Déglise, Suggs, & Odermatt, 2012; Ranney et al., 2014). Reflecting on their participation experience, across a number of studies participants have reported that they felt responding via SMS (Akamatsu, Mayer, & Farrelly, 2006; Lim et al., 2010; Matthews, Doherty, Sharry, & Fitzpatrick, 2008) and app (Fernandez, Johnson, & Rodebaugh, 2013; Marshall, Medvedev, & Antonov, 2008) were convenient and private. To date, there has been no research directly contrasting perceived privacy and convenience of SMS and apps being used for self-report research. The aim of the current paper is to directly contrast SMS and app in terms of response rate, response completeness, response delay, and participant evaluation of privacy and convenience. Findings will be used to discuss the potentially different utility of apps and SMS for researchers.



Method

Participants

Sample
This study was only open to individuals who owned an iPhone, because the end-user experience can be markedly different even with very similar mobile phones due to different screen sizes and user interface layouts (Keijzers, Ouden, & Lu, 2008). One hundred fifteen undergraduate students in Australia participated in return for course credit. Aged 17–52 years (M = 22), 84% of participants were female. The ethical aspects of this research were approved by the Australian National University Human Research Ethics Committee, and all participants provided written informed consent prior to participation.

Materials

Entry Questionnaire
This was a computer-administered questionnaire consisting of demographic and mobile ownership questions.

Ongoing Questionnaire
This short questionnaire formed part of a larger project on the topic of mental time travel. All questions were self-report, on the topic of the respondent's current state of mind and surroundings. It consisted primarily of categorical question choices, with one Likert scale and one open-ended response. Five of the questions were mandatory, and one was optional. Specifically, Question 1 was a categorical choice between six categories, and Question 2 was an optional open-ended request for elaboration on the response to Question 1. Question 3 was a bipolar Likert rating scale. Questions 4 through 6 were categorical choices, with Question 4 being a binary choice, Question 5 a choice from six categories, and Question 6 a choice from three categories. For those responding via app, this questionnaire was preloaded into the iPhone survey app, iSurvey. For those responding via SMS, the questionnaire was sent in full via SMS.

Exit Questionnaire
This was a computer-administered questionnaire regarding the participation experience. Participants rated the privacy and convenience of their response experience on a 3-point scale of poor, neutral, or good.
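Purely for illustration, the structure of the ongoing questionnaire described above can be written down as a small schema; the field identifiers and the Likert endpoints are assumptions, since the exact items are not reproduced in this article.

```python
# A hypothetical, minimal representation of the ongoing questionnaire's structure:
# five mandatory questions plus one optional open-ended elaboration (Question 2).
ONGOING_QUESTIONNAIRE = [
    {"id": "q1", "type": "categorical", "n_options": 6, "required": True},
    {"id": "q2", "type": "open_ended",  "required": False},                  # elaboration on q1
    {"id": "q3", "type": "likert",      "scale": (1, 7), "required": True},  # endpoints assumed
    {"id": "q4", "type": "categorical", "n_options": 2, "required": True},   # binary choice
    {"id": "q5", "type": "categorical", "n_options": 6, "required": True},
    {"id": "q6", "type": "categorical", "n_options": 3, "required": True},
]

REQUIRED_IDS = [q["id"] for q in ONGOING_QUESTIONNAIRE if q["required"]]
print(REQUIRED_IDS)  # ['q1', 'q3', 'q4', 'q5', 'q6']
```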

Procedure

Study Groups
This study manipulated whether participants responded via an app or SMS. This was between-subjects, as participants

65

could only undertake this study once, responding via app or SMS, but not both. Due to a limited licensing time frame associated with the survey app, assignment to responding via app or SMS was not random. Instead, participants recruited prior to the end of the license time frame provided responses via app, and those recruited afterwards responded via SMS. To minimize the potential for this nonrandom assignment to bias participant behavior, participants were not aware of the two different response conditions. Recruitment materials indicated that mobile telephones would be used to collect survey data, but did not specify how that data was to be collected. Fifty-four participants responded via app, while 61 responded via SMS.

Study Procedure
Participants attended a physical meeting with the researcher and completed the computer-administered entry questionnaire. A test SMS prompt was sent during this meeting to confirm the researcher had the appropriate contact details. The mental time travel questionnaire was then provided to participants. Those recruited first, thus assigned to responding via app, were guided through the app installation process. Those recruited later, thus assigned to responding via SMS, were sent the questionnaire via SMS. To ensure the task was clear and the mobile systems were functioning correctly, a test run of the ongoing questionnaire was completed during this physical meeting. In the two days following the physical meeting with the researcher, all participants received a total of 20 SMS prompts (10 per day) to complete the ongoing questionnaire. Those responding via app opened iSurvey and entered their answers, while those responding via SMS replied to the prompt SMS with their answers. Participants then attended a follow-up appointment, where they completed the computer-administered exit survey. Those responding via app were guided through the process of submitting their responses and then deleting the iSurvey app. Those responding via SMS, where applicable, were reimbursed for the cost of sending SMS for the purposes of participation.
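The prompt schedule itself is not specified beyond 10 prompts per day over two days (randomized, as noted in the Discussion); a sketch of one way such a schedule could be generated is given below, with the waking-hours window and minimum spacing as assumptions rather than reported parameters.

```python
import random
from datetime import datetime, timedelta

def random_prompt_schedule(start_day: datetime, days: int = 2, per_day: int = 10,
                           window=(9, 21), min_gap_minutes: int = 30):
    """Draw `per_day` random prompt times per day within a waking-hours window,
    at least `min_gap_minutes` apart.  Window and gap are illustrative assumptions."""
    schedule = []
    for d in range(days):
        day = start_day + timedelta(days=d)
        while True:
            minutes = sorted(random.sample(range(window[0] * 60, window[1] * 60), per_day))
            if all(b - a >= min_gap_minutes for a, b in zip(minutes, minutes[1:])):
                break  # resample until the minimum spacing constraint is met
        schedule.extend(day.replace(hour=m // 60, minute=m % 60, second=0, microsecond=0)
                        for m in minutes)
    return schedule

for t in random_prompt_schedule(datetime(2014, 3, 3))[:5]:
    print(t.strftime("%a %H:%M"))
```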

Results

Participants in the app group (n = 54) were aged 17–52 years (M = 22), and 70% were female. Participants in the SMS group (n = 61) were aged 18–46 years (M = 21), and 75% were female. T-tests and chi-square tests did not indicate that the two samples differed significantly in terms of age (t = 1.2, p = .90), gender (χ² = 0.15, p = .69), or number of SMS sent in daily life (t = 1.4, p = .13).



SMS and app responses were compared in terms of response completeness and response delay. Response completeness was classified using increasingly stringent criteria. A partially complete response consisted of one to four questions answered. A complete response was the five required questions answered, but not the optional sixth question. An overcomplete response was an attempt at all six questions (where the sixth was specified as optional). These categories are mutually exclusive. The outcome variable is therefore the count of partial, complete, and overcomplete responses each participant provided, with a maximum of 20 possible responses. In both response conditions, participants provided an average of 15 responses. T-tests revealed that this did not significantly differ between SMS and app respondents. However, compared with SMS respondents, app respondents provided significantly fewer partial responses, t(144) = 8.47, p < .01 (per person, app mean = 1, SMS mean = 8), and significantly fewer complete responses, t(144) = 2.21, p = .02 (per person, app mean = 3, SMS mean = 1). Conversely, app respondents provided significantly more overcomplete responses, t(114) = 5.14, p < .01 (per person, app mean = 12, SMS mean = 6). This pattern of results suggests that response mode did not affect whether responses were attempted, but that people using an app were significantly more likely to provide complete responses. Response completeness can also be examined in terms of the number of questions answered within responses, removing complete nonresponses (where none of the six questions were attempted) from the analysis. Viewed in this way, those responding via app completed an average of six questions per sampling occasion (SD = 0.58), while those responding via SMS completed an average of five questions per sampling occasion, but this was more variable (SD = 0.86). A multilevel model was fit, with responses nested by participant and response mode specified as a predictor of the number of questions answered. Response mode significantly predicted the number of questions answered, b = 0.64, 95% CI [0.42, 0.85]. The slope suggests that, for every three questions answered via SMS, an app respondent is likely to answer four. This supports the assertion that mode is significantly associated with response completeness. While coding the data, it was clear that SMS respondents were not completing one question in particular as required. When asked to rate their mood on a Likert scale, many SMS respondents instead provided a qualitative mood descriptor such as "frustrated" or "bored." Though some manner of response had been provided, this was coded as a missing response as it did not conform to the required response format.
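A minimal sketch of the completeness coding described above (partial = 1-4 required questions, complete = all 5 required questions only, overcomplete = all 6), including the rule that a free-text mood descriptor in place of the required numeric rating is treated as missing; the question identifiers and the validation pattern are assumptions for illustration.

```python
import re

REQUIRED = ["q1", "q3", "q4", "q5", "q6"]   # five mandatory questions (hypothetical ids)
OPTIONAL = "q2"                              # open-ended elaboration

def is_valid_likert(text: str) -> bool:
    """q3 must be a numeric rating; words such as 'bored' are coded as missing."""
    return bool(re.fullmatch(r"\s*-?\d+\s*", text or ""))

def classify_response(answers: dict) -> str:
    """Classify one sampling occasion as nonresponse / partial / complete / overcomplete."""
    answered = {k for k, v in answers.items() if v not in (None, "")}
    if "q3" in answered and not is_valid_likert(answers["q3"]):
        answered.discard("q3")               # nonconforming format counted as missing
    n_required = len(answered & set(REQUIRED))
    if n_required == 0:
        return "nonresponse"
    if n_required < len(REQUIRED):
        return "partial"
    return "overcomplete" if OPTIONAL in answered else "complete"

print(classify_response({"q1": "b", "q3": "frustrated", "q4": "yes", "q5": "home", "q6": "alone"}))
# -> 'partial' (the mood rating does not conform to the required numeric format)
```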


Table 1. Ratings of convenience and privacy by mode

                  Counts (percentages)           Model properties
                  App          SMS               χ² (p)             χ² power   Fisher's p
Convenience                                      5.956 (p = .05)    .58        .05
  Poor            1 (2%)       4 (7%)
  Neutral         8 (15%)      18 (31%)
  Good            43 (83%)     36 (62%)
Privacy                                          2.909 (p = .203)   .31        .24
  Poor            0 (0%)       2 (4%)
  Neutral         7 (13%)      11 (19%)
  Good            46 (87%)     43 (77%)

Notes. Counts reflect what was included in the model; percentages are included to give context due to the differing sample sizes of app and SMS respondents. N is slightly smaller than the total sample in either group due to some missing data in the exit survey.
Response delay was evaluated as the number of minutes between a prompt and the corresponding response, with the shortest possible delay set at 1 min. As can be expected for a response time variable, response delay was strongly bounded and skewed. Given that this data shape is theoretically expected, rather than transform the data to meet model assumptions, models were fitted using a Poisson distribution. The median response delay for responses completed via app was 3 min, while that for responses completed via SMS was 4 min. A multilevel model was fit, with mode as a predictor of response delay (in minutes), nested by participant. This model did not reveal a significant association between response mode and response delay (b = 0.15, 95% CI [−0.19, 0.51]), suggesting that response delay was unaffected by mode. Summarized in Table 1, two chi-square tests were completed to explore differences in participant perceptions of convenience and privacy, based on whether they participated by way of SMS or app. While the two groups did not significantly differ in their perceptions of privacy, those using apps were significantly more likely to rate their data collection mode as having "good" convenience than those using SMS.
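A sketch in the same spirit as these analyses (not the authors' code): a Poisson model of response delay with responses clustered within participants, fitted here with a generalized estimating equation as a stand-in for the multilevel model, plus a chi-square test on the convenience counts from Table 1; the data frame and column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2_contingency

# Hypothetical long-format data: one row per prompt, delay in minutes (>= 1)
rng = np.random.default_rng(2)
n_participants, n_prompts = 115, 20
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_participants), n_prompts),
    "mode": np.repeat(rng.choice(["app", "sms"], n_participants), n_prompts),
})
df["delay"] = rng.poisson(lam=np.where(df["mode"] == "app", 3, 4)) + 1

# Poisson regression of delay on mode, with an exchangeable working correlation
# within participants (a GEE used here in place of the participant-nested multilevel model)
model = sm.GEE.from_formula("delay ~ mode", groups="participant", data=df,
                            family=sm.families.Poisson(),
                            cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())

# Chi-square test on the 3 x 2 table of convenience ratings (counts as in Table 1)
convenience = np.array([[1, 4], [8, 18], [43, 36]])   # rows: poor / neutral / good
chi2, p, dof, _ = chi2_contingency(convenience)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.3f}")
```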

Discussion

This study examined whether app or SMS provided superior data completeness, response delay, and participant evaluation of privacy and convenience. Collecting data by app or SMS did not impact upon whether or not a response was attempted, whether the response was extraneous or a duplicate, or how promptly participants responded. The response rate for SMS and app respondents was equivalent, promisingly exceeding the average response


rate in academic research estimated by Baruch and Holtom (2008). However, mode did significantly impact on response completion. Following the same pattern as in Lim et al. (2010), SMS data was significantly less complete than app data. This may be due to two factors caused by the uncontrolled response format of SMS. Firstly, while app respondents had fixed forms in which to provide their answers, the free-text nature of SMS responses allowed participants to respond in a nonstandard format (i.e., providing qualitative mood descriptors such as “fine” rather than requested Likert ratings). Though participants technically answered the question, this data must be considered missing as it cannot be confidently reconciled with the required numeric format. Secondly, apps offer item skipping prevention akin to online surveys, while SMS does not. This allows more accidental response omissions to occur in SMS. Given the almost identical overall response rates, this indicates that data collection via app provides superior data completeness, particularly when the usability of the data is contingent on participants following specific response format instructions. Minimizing response delays minimizes potential data distortion due to retrospective recall bias (Raphael, 1987). The median response delay of under four min for both modes was consistent with the literature using SMS (Conner & Reid, 2012; De Lepper et al., 2013), and was better than what may be expected from the literature using apps. This may be because the current study had a more compressed sampling schedule (10 times in a day) than those reviewed in Hofmann and Patel (2014, three to seven times in a day), thus engendering a greater sense of rush to respond, lest a late response become a missed response. Another possibility is that the current study sampled only from university undergraduates, a population particularly likely to have their mobile telephones nearby at all times, while the studies in Hofmann and Patel (2014) were a mixture of undergraduates and members of the general population. These short response delays are particularly promising for ecological momentary assessment, where researchers seek to tap transient, current thoughts and feelings, as problems of recall bias are minimized when responses are prompt. These results suggest that either app or SMS may be a viable method of data collection where prompt responses are particularly important. As in previous research using SMS and apps as a means for communicating with participants, perceptions of the privacy and convenience of both modes were generally positive (Akamatsu, Mayer, & Farrelly, 2006; Lim et al., 2010; Matthews et al., 2008). Here, participants who responded via apps were significantly more likely to rate their data collection mode as having “good” convenience than those using SMS. This difference cannot be due to the response platform (as all participants were using Ó 2016 Hogrefe Publishing


iPhones), or the response schedule (which was randomized), suggesting that something may be more convenient about responding via app than SMS. One possibility is that respondents participating via SMS received the questions in an initial SMS, and only prompts when it came time to respond. This resulted in the questions and the input space for answers being separated, thus necessitating scrolling. Conversely, those responding via app were presented with the questions directly next to the answer input. This could be clarified in future research, by sending the full SMS questionnaire on each response occasion, rather than just a prompt referring participants to an earlier SMS containing the questionnaire. This was the first study to directly compare SMS and app response behavior for self-report psychological research. The difference between the two response modes was made clear by controlling the demographic to only undergraduate students, and the response platform to only iPhones. However, this limits the generalizability of findings. Further investigation is warranted to see how SMS and apps compare in a wider population sample, likely to own different types of mobile telephones, and importantly, across a wider range of ages. Engagement with mobile telephones differs on the basis of age (Devitt & Roker, 2009; Ling, 2010; Mante & Piris, 2010), which may in turn impact on the viability of using SMS or apps for data collection with a particular age group. For example, teenagers and young adults use SMS heavily in their daily lives (Charlton, Panting, & Hannan, 2002; Pain et al., 2005), and have experience with apps – only a tenth of individuals aged 18–35 years have never downloaded an app (Deloitte, 2013). Conversely, older adults use SMS more sparingly (Lobet-Maris & Henin, 2002; Mallenius, Rossi, & Tuunainen, 2007), and almost a third of those aged 65 and over have never downloaded an app (Deloitte, 2013). It would be educative to establish whether the relative efficacy of apps and SMS reflects these differing levels of preexisting mastery. A particularly useful set of tools for pursuing these questions is psychoinformatics. In brief, psychoinformatics applies computer science tools to psychological data collection, often via data mining and collection from multiple digital and behavioral sources (for a more detailed explanation, see Yarkoni, 2012). Here, simultaneous behavioral and software monitoring of the interaction between participant and mobile telephone during everyday life, together with self-report data collection, could clarify behavioral differences in communication via app and SMS. This could be achieved using a similar technique to Montag et al. (2014), who collected data on smartphone usage behavior via a bespoke monitoring app. This paper directly contrasted SMS and app in terms of response rate, response completeness, response delay,



and participant evaluation of privacy and convenience. In a self-report, repeated measures paradigm, apps outperformed SMS in terms of data completeness and positive participant perceptions of the research experience. All else being equal, this suggests that researchers should consider using apps rather than SMS for repeated measures self-report data collection.

Acknowledgments
The authors would like to acknowledge the support and access to participants provided by Janie Busby-Grant.

References ACMA. (2013). ACMA Communications report 2012–2013. Australia: The Australian Media Communications Authority. Akamatsu, C. T., Mayer, C., & Farrelly, S. (2006). An investigation of two-way text messaging use with deaf students at the secondary level. Journal of Deaf Studies and Deaf Education, 11, 120–131. doi: 10.1093/deafed/enj013 Anhoj, J., & Moldrup, C. (2009). Feasibility of collecting diary data from asthma patients through mobile phones and SMS (short message service): Response rate analysis and focus group evaluation from a pilot study. Journal of Medical Internet Research, 6, e42. doi: 10.2196/jmir.6.4.e42 Asiimwe, C., Gelvin, D., Lee, E., Ben Amor, Y., Quinto, E., Katureebe, C., . . . Berg, M. (2011). Use of an innovative, affordable, and open-source short message service-based tool to monitor malaria in remote areas of Uganda. The American Journal of Tropical Medicine and Hygiene, 85, 26–33. doi: 10.4269/ajtmh.2011.10-0528 Baruch, Y., & Holtom, B. C. (2008). Survey response rate levels and trends in organizational research. Human Relations, 61, 1139–1160. doi: 10.1177/0018726708094863 Broderick, C. R., Herbert, R. D., Latimer, J., Mathieu, E., van Doorn, N., & Curtin, J. A. (2012). Feasibility of short message service to document bleeding episodes in children with haemophilia. Haemophilia: The Official Journal of the World Federation of Hemophilia, 18, 906–910. doi: 10.1111/j.13652516.2012.02869.x Charlton, T., Panting, C., & Hannan, A. (2002). Mobile telephone ownership and usage among 10- and 11-year-olds. Emotional and Behavioural Difficulties, 7, 37–41. Chib, A., Wilkin, H., Ling, L. X., Hoefman, B., & Van Biejma, H. (2012). You have an important message! Evaluating the effectiveness of a text message HIV/AIDS campaign in Northwest Uganda. Journal of Health Communication, 17(Suppl 1(April 2014)), 146–157. doi: 10.1080/10810730.2011.649104 Conner, T. S., & Reid, K. A. (2012). Effects of intensive mobile happiness reporting in daily life. Social Psychological and Personality Science, 3, 315–323. Cook, C., Heath, F., & Thompson, R. L. (2000). A meta-analysis of response rates in Web- or Internet-based surveys. Educational and Psychological Measurement, 60, 821–836. doi: 10.1177/ 00131640021970934 Déglise, C., Suggs, L. S., & Odermatt, P. (2012). Short message service (SMS) applications for disease prevention in developing countries. Journal of Medical Internet Research, 14, e3. doi: 10.2196/jmir.1823


De Lepper, A. M., Eijkemans, M. J. C., Van Beijma, H., Loggers, J. W., & Tuijn, C. J. (2013). Response patterns to interactive SMS health education quizzes at two sites in Uganda: A cohort study. Tropical Medicine and International Health, 18, 516–521. doi: 10.1111/tmi.12059 Deloitte Global Media Consumer Survey, Developed Countries. (2013). United Kingdom: Deloitte Touche Tohmatsu Ltd. Devitt, K., & Roker, D. (2009). The role of mobile phones in family communication. Children & Society, 23, 189–202. doi: 10.1111/ j.1099-0860.2008.00166.x Dillman, D., Phelps, G., Tortora, R., Swift, K., Kohrell, J., Berck, J., & Messer, B. L. (2009a). Response rate and measurement differences in mixed-mode surveys using mail, telephone, interactive voice response (IVR) and the Internet. Social Science Research, 38, 1–18. doi: 10.1016/j.ssresearch.2008.03.007 Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009b). Internet, Mail, and Mixed-Mode Surveys (3rd ed.). Hoboken, NJ: Wiley. Donaldson, E. L., Fallows, S., & Morris, M. (2014). A text message based weight management intervention for overweight adults. Journal of Human Nutrition and Dietetics: The Official Journal of the British Dietetic Association, 27(Suppl 2), 90–97. doi: 10.1111/jhn.12096 Fernandez, K. C., Johnson, M. R., & Rodebaugh, T. L. (2013). TelEMA: A low-cost and user-friendly telephone assessment platform. Behavior Research Methods, 45, 1279–1291. doi: 10.3758/s13428-012-0287-9 Flick, S. N. (1988). Managing attrition in clinical research. Clinical Psychology Review, 8, 499–515. doi: 10.1016/0272-7358(88) 90076-1 Fox, R., Crask, M., & Kim, J. (1988). Mail survey response rate a meta-analysis of selected techniques for inducing response. Public Opinion Quarterly, 52, 467–491. Fukuoka, Y., & Kamitani, E. (2011). New insights into compliance with a mobile phone diary and pedometer use in sedentary women. Journal of Physical Activity and Health, 8, 398–403. Haller, D., Sanci, L., Sawyer, S., Coffey, C., & Patton, G. (2006). R U OK 2 TXT 4 RESEARCH? – Feasibility of text message communication in primary care research. Australian Family Physician, 35, 175–176. Hofmann, W., & Patel, P. V. (2014). SurveySignal: A convenient solution for experience sampling research using participants’ own smartphones. Social Science Computer Review, 33, 235–253. doi: 10.1177/0894439314525117 Keijzers, J., Ouden, E. D., & Lu, Y. (2008). Usability benchmark study of commercially available smart phones: Cell phone type platform, PDA type platform and PC type platform. Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services (pp. 265–272). ACM. Retrieved from http://dl.acm.org/citation.cfm? id=1409269 Kojo, I., Heiskala, M., & Virtanen, J. (2014). Customer journey mapping of an experience-centric service by mobile selfreporting: Testing the Qualiwall Tool. In A. Marcus (Ed.), Design, user experience, and usability. Theories, methods, and tools for designing the user experience (pp. 261–272). Bern, Switzerland: Springer International. Retrieved from http://link.springer.com/ chapter/10.1007/978-3-319-07668-3_26 Lehman, B. J. (2011). Getting started: Launching a study in daily life. In M. R. Mehl & T. S. Conner (Eds.), Handbook of research methods for studying daily life (pp. 89–107). New York, NY: The Guilford Press. Lim, M. S. C., Sacks-Davis, R., Aitken, C. K., Hocking, J. S., & Hellard, M. E. (2010). 
Randomised controlled trial of paper, online and SMS diaries for collecting sexual behavior information from young people. Journal of Epidemiology and


Community Health, 64, 885–889. doi: 10.1136/jech.2008. 085316 Ling, R. (2010). Texting as a life phase medium. Journal of Computer-Mediated Communication, 15, 277–292. doi: 10.1111/ j.1083-6101.2010.01520.x Lobet-Maris, C., & Henin, L. (2002). Talking without communicating or communicating without talking: From the GSM to the SMS. Estudios de Juventud, 57, 101–114. Mackay, M. M., & Weidlich, O. (2009). Austrailan Mobile Phone lifestyle index. Specialist. Australian Interactive Media Industry Association Mobile Industry Group. Retrieved from http://www. aimia.com.au/enews/mobile/090929AIMIA_Report_FINAL.pdf Mallenius, S., Rossi, M., & Tuunainen, V. (2007). Factors affecting the adoption and use of mobile devices and services by elderly people – Results from a pilot study. 6th Annual Global Mobility Roundtable, 31. Retrieved from http://citeseerx.ist.psu.edu/ viewdoc/download?doi=10.1.1.130.2463&rep=rep1&type=pdf Mante, E. A., & Piris, D. (2010). SMS use by young people in the Netherlands. Revista de Estudios de Juventud, 52, 47–58. Marshall, A., Medvedev, O., & Antonov, A. (2008). Use of a smartphone for improved self-management of pulmonary rehabilitation. International Journal of Telemedicine and Applications,1–5. doi: 10.1155/2008/753064 Matthews, M., Doherty, G., Sharry, J., & Fitzpatrick, C. (2008). Mobile phone mood charting for adolescents. British Journal of Guidance & Counselling, 36, 113–129. doi: 10.1080/ 03069880801926400 Miller, G. (2012). The smartphone psychology manifesto. Perspectives on Psychological Science, 7, 221–237. doi: 10.1177/ 1745691612441215 Mogensen, A. (1963). Item-skipping and right and wrong solutions in a preliminary version of a multiple-choice vocabulary test. Acta Psychologica, 21, 49–54. Montag, C., Błaszkiewicz, K., Lachmann, B., Andone, I., Sariyska, R., Trendafilov, B., . . . Markowetz, A. (2014). Correlating personality and actual phone usage: Evidence from psychoinformatics. Journal of Individual Differences, 35, 158–165. Morris, M. E., Kathawala, Q., Leen, T. K., Gorenstein, E. E., Guilak, F., Labhard, M., & Deleeuw, W. (2010). Mobile therapy: Case study evaluations of a cell phone application for emotional self-awareness. Journal of Medical Internet Research, 12, e10. doi: 10.2196/jmir.1371 Naughton, F., Jamison, J., & Sutton, S. (2013). Attitudes towards SMS text message smoking cessation support: A qualitative study of pregnant smokers. Health Education Research, 28, 911–922. doi: 10.1093/her/cyt057 Pain, R., Grundy, S. U. E., Gill, S., Towner, E., Sparks, G., & Hughes, K. (2005). “So long as i take my mobile”: Mobile phones, urban life and geographies of young people‘s safety. International Journal of Urban and Regional Research, 29, 814–830. Ranney, M. L., Choo, E. K., Cunningham, R. M., Spirito, A., Thorsen, M., Mello, M. J., & Morrow, K. (2014). Acceptability, language, and structure of text message-based behavioral interventions for high-risk adolescent females: A qualitative study. Journal of Adolescent Health, 55, 1–8. doi: 10.1016/ j.jadohealth.2013.12.017 Raphael, K. (1987). Recall bias: A proposal for assessment and control. International Journal of Epidemiology, 16, 167–170. Reimers, S., & Stewart, N. (2009). Using SMS text messaging for teaching and data collection in the behavioral sciences.


Behavior Research Methods, 41, 675–681. doi: 10.3758/ BRM.41.3.675 Ribeiro, A., & da Silva, A. R. (2012). Survey on cross-platforms and languages for mobile apps. In 2012 Eighth International Conference on the Quality of Information and Communications Technology (pp. 255–260). Washington, DC: IEEE. doi: 10.1109/ QUATIC.2012.56 Richardson, C., & Johnson, J. (2009). The influence of web- versus paper-based formats on the assessment of tobacco dependence: Evaluating the measurement invariance of the Dimensions of Tobacco Dependence Scale. Substance Abuse: Research and Treatment, 3, 1–14. Rosser, B. A., & Eccleston, C. (2011). Smartphone applications for pain management. Journal of Telemedicine and Telecare, 17, 308–312. doi: 10.1258/jtt.2011.101102 Sax, L., Gilmartin, S., & Bryant, A. (2003). Assessing response rates and nonresponse bias in web and paper surveys. Research in Higher Education, 44, 409–432. doi: 10.1023/ A:1024232915870 Sharp, L. M., & Frankel, J. (1983). Respondent burden: A test of some common assumptions. Public Opinion Quarterly, 47, 36–53. The World Bank. (2012). 2012 Information and Communications for Development: Maximizing Mobile. Washington, DC: World Bank. Tsai, C. C., Lee, G., Raab, F., Norman, G. J., Sohn, T., Griswold, W. G., & Patrick, K. (2007). Usability and feasibility of PmEB: A mobile phone application for monitoring real time caloric balance. Mobile Networks and Applications, 12, 173–184. doi: 10.1007/s11036-007-0014-4 Van Buuren, S. (2010). Item imputation without specifying scale structure. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6, 31–36. doi: 10.1027/ 1614-2241/a000004 Van de Vijver, F., & Harsveldt, M. (1994). The incomplete equivalence of the paper-and-pencil and computerized versions of the General Aptitude Test Battery. Journal of Applied Psychology, 79, 852–859. Walsh, E. I., & Brinker, J. K. (2012). Evaluation of a Short Message Service diary methodology in a nonclinical, naturalistic setting. Cyberpsychology, Behavior and Social Networking, 15, 615–618. doi: 10.1089/cyber.2012.0189 Yarkoni, T. (2012). Psychoinformatics new horizons at the interface of the psychological and computing sciences. Current Directions in Psychological Science, 21, 391–397. Received June 24, 2015 Revision received February 23, 2016 Accepted March 14, 2016 Published online December 29, 2016

Erin I. Walsh The Australian National University Room 218, Research School of Psychology Building 39 Science Road Canberra, ACT, 0200 Australia erin.walsh@anu.edu.au



Original Article

Psychometric Properties of the Borderline Personality Features Scale for Children-11 (BPFSC-11) in a Sample of Community Dwelling Italian Adolescents

Andrea Fossati,1 Carla Sharp,2 Serena Borroni,3 and Antonella Somma1

1 Department of Human Studies, LUMSA University, Rome, Italy, and San Raffaele Hospital, Milan, Italy
2 Department of Psychology, University of Houston, and The Menninger Clinic, Houston, TX, USA
3 Faculty of Psychology, Vita-Salute San Raffaele University, and San Raffaele Hospital, Milan, Italy

Abstract: The aims of the current study were to assess the psychometric properties of the Borderline Personality Features Scale for Children-11 (BPFSC-11) in adolescence. In particular, we aim at evaluating the internal consistency and six-month test-retest reliability of the Italian translation of the BPFSC-11, its factor structure, and its convergent validity. Eight hundred five community dwelling adolescents were administered the Italian translations of the BPFSC-11 and the Personality Diagnostic Questionnaire-4+ (PDQ-4+) Borderline Personality Disorder (BPD) scale. The BPFSC-11 showed adequate internal consistency (Cronbach's α = .78) and moderate six-month test-retest stability. Although confirmatory factor analysis did not support a one-factor model of the BPFSC-11 items, a bi-factor model (RMSEA = .04) showed that all BPFSC-11 items loaded significantly onto a general common factor, with two specific factors capturing largely residual variance due to distribution artifacts. In this study, the bivariate correlation between the BPFSC-11 and the PDQ-4+ BPD scale was .64 (p < .001). Finally, the BPFSC-11 showed gender invariance across items. In summary, our findings support the reliability and validity of the BPFSC-11 as a measure of self-reported borderline personality features in community dwelling adolescents.

Keywords: BPFSC-11, Italian version, reliability, validity, adolescence

Borderline Personality Disorder (BPD) is a debilitating disorder that occurs in approximately 1–3% of the general population (Leichsenring, Leibing, Kruse, New, & Leweke, 2011) and is associated with heightened risk for a number of self-destructive behaviors (American Psychiatric Association, 2013; Leichsenring et al., 2011). Given the significant costs associated with adult BPD, there has been an increase in research examining BPD in adolescence (e.g., Sharp, Ha, Michonski, Venta, & Carbone, 2012). Psychometric data clearly indicate that BPD can be reliably diagnosed in adolescence using descriptive diagnostic criteria (e.g., Michonski, Sharp, Steinberg, & Zanarini, 2013). Valid and reliable instruments that are both time and cost effective would greatly assist clinicians in the assessment of BPD features in adolescence (Sharp et al., 2012). Crick, Murray-Close, and Woods (2005) developed the Borderline Personality Features Scale for Children (BPFSC); originally,

the BPFSC was a 24-item, Likert-type self-report questionnaire which was developed by modifying the borderline scale of the Personality Assessment Inventory (PAI; Morey, 1991), which is a reliable and valid tool used to assess borderline personality features among adults. Although an adolescent version of the PAI was developed, its items remained largely unchanged from the adult version. The BPFSC items were age-appropriate in terms of item content and wording, and aimed to reflect the original four domains of the PAI (affective instability, identity problems, negative relationships, and self-harm). BPFSC scores were shown to converge with interview-based measures of BPD in adolescent inpatients (Chang, Sharp, & Ha, 2011) and to have concurrent validity in a community sample of boys (Sharp, Mosko, Chang, & Ha, 2011). Despite these encouraging findings, Sharp, Steinberg, Temple, and Newlin (2014) did not find support for the hypothesized 4-factor structure of the 24-item BPFSC in a community sample of



964 adolescents. Item Response Theory (IRT) analysis showed instances of local dependence among selected item pairs; as a consequence, items were eliminated, creating an 11-item version of the BPFSC (BPFSC-11). Sharp and colleagues (2014), using a different sample of 371 inpatient adolescents, demonstrated similar indices of construct validity as observed for the BPFSC total score with the BPFSC-11 scores and found evidence for good criterion validity. To our knowledge, no study on the psychometric properties of the BPFSC-11 in adolescent samples has been carried out beyond Sharp and colleagues’ validation study, and no study tested the BPFSC-11 in a cultural context different from the U.S. Moreover, although the most discriminating BPD features (i.e., core diagnostic features) in adolescence may be different between female and male adolescents (e.g., Fossati, 2014), few studies examined the invariance of the factor model of BPD scales across subgroups based on gender (e.g., Michonski et al., 2013), and none of them relied on the BPFSC-11. Reliable and valid self-report measures for the assessment of BPD during adolescence are currently available in their Italian translations: for example, the Personality Diagnostic Questionnaire-4+BPD Scale (PDQ-4+; Fossati, Gratz, Maffei, & Borroni, 2013; Hyler, 1994) and the Borderline Personality Inventory (BPI; Fossati, Feeney, Maffei, & Borroni, 2014; Leichsenring, 1999). However, none of these instruments currently validated in Italy were developed specifically to assess BPD in adolescence, and the availability of measures specifically designed to capture BPD features as they manifest themselves during adolescence would be of significant help to clinicians for detecting BPD (Sharp et al., 2014). Against this background, the major aim of the present study was to assess the psychometric properties of the BPFSC-11 in a large sample of Italian community dwelling adolescents. In particular, in the present study, we aim at evaluating: (a) the internal consistency and six-month test-retest reliability of the Italian translation of the BPFSC-11; (b) the factor structure of the BPFSC-11; (c) the gender invariance of the BPFSC-11 factor structure; (d) the convergent validity of the BPFSC-11 total score with another self-report measure of BPD features based on DSM-IV/DSM-5 Section II (APA, 2000, 2013) criteria.

Method

Participants

In order to participate in the present study, participants needed to be adolescent high school students.


Participants were 817 adolescents attending a public high school with specialization in teacher training or social sciences in the Rome, Italy, metropolitan area; 525 participants (64.3%) were female (mean age = 16.43 years, SD = 1.40 years, range: 14–20 years), and 292 (35.7%) were male (mean age = 16.41 years, SD = 1.48 years, range: 14–20 years). Data were incomplete for 12 participants (1.5%; questionnaires were considered incomplete if any of the items of a given scale was not answered), and these participants were excluded from the final sample. Participants with incomplete questionnaires did not differ from participants with complete questionnaires on gender, χ²(1) = 1.93, p > .10, φ = .05, or age, t(815) = 0.22, p > .70, d = 0.06. The final sample comprised 805 high school students: 515 (64.0%) were female (mean age = 16.42 years, SD = 1.47 years, range: 14–20 years) and 290 (36.0%) were male (mean age = 16.43 years, SD = 1.39 years, range: 14–20 years). Participants' mean age was 16.43 years, SD = 1.42 years, range: 14–20 years. The gender composition of the sample closely reflected the gender-based preference of the school. Participants were also required to speak Italian as their first language, in order to avoid cultural and lexical bias in questionnaire responses. After obtaining Institutional Review Board approval from the university and the principals of the schools, researchers recruited adolescents from classrooms. Written informed parent consent and adolescent assent were obtained prior to study participation. The BPFSC-11 six-month test-retest reliability was evaluated in a subsample of 471 (58.5%) adolescent participants (male = 285, female = 186, mean age = 16.30 years, SD = 1.39 years). Adolescents in the retest subsample were significantly younger than adolescents who did not participate in the test-retest study, t(803) = 2.97, p < .01, d = 0.21, and showed a significantly higher proportion of female participants, χ²(1) = 5.92, p < .05, φ = .09, although the effect size indices for these differences were small by conventional standards (Cohen, 1988). Interestingly, test-retest participants did not differ significantly from participants who did not take part in the test-retest study on the BPFSC-11 total score at baseline, t(803) = 1.02, p > .30, d = 0.07.

Measures

Borderline Personality Features Scale-11 (BPFSC-11; Sharp et al., 2014)
The BPFSC-11 consists of 11 items measuring borderline personality features in childhood (for ages 9 and older, including adolescents). Items in the BPFSC-11 comprise



behavior reflective of core BPD features, namely, affective instability, identity problems, and negative relationships. No "self-harm" item has been included in the BPFSC-11. Sample items include "How I feel about myself changes a lot" and "I want to let some people know how much they've hurt me." These items assess how participants feel about themselves and other people, and are rated on a 5-point Likert-type scale ranging from not true at all to always true. The BPFSC-11 yields a total score (range: 11–55) measuring the overall level of borderline characteristics; the higher the BPFSC-11 total score, the greater the intensity of BPD features. The BPFSC-11 has shown adequate psychometric properties (Cronbach's α = .85) in a sample of adolescent inpatients (Sharp et al., 2014). Participants were administered the BPFSC-11 in its Italian translation. Equivalence with the original meaning of the items was the guiding principle in the translation process (Denissen, Geenen, van Aken, Gosling, & Potter, 2008).

Personality Diagnostic Questionnaire-4+ Borderline Personality Disorder (BPD) Scale (PDQ-4+; Hyler, 1994)
The PDQ-4+ is a self-report questionnaire with 99 true/false items designed to measure the 10 personality disorders included in DSM-IV Axis II/DSM-5 Section II and the two personality disorders (PDs) proposed for further research in DSM-IV. The PDQ-4+ has one item for each DSM-IV/DSM-5 personality disorder criterion, and item scores are summed separately to generate a total score for each scale. Since the present study focused on BPD, participants were administered only the 9-item BPD scale. The Italian translation of the PDQ-4+ BPD scale has been found to have adequate psychometric properties (Cronbach's α = .70) among Italian adult clinical participants (Fossati et al., 1998) and Italian high school students (e.g., Fossati et al., 2013). Owing to space considerations, we included a detailed description of the translation procedures and an extensive description of the data analyses as Electronic Supplementary Material, ESM 1.

Results

BPFSC-11 Descriptive Statistics and Internal Consistency Analyses

BPFSC-11 descriptive statistics, the Cronbach's α value, and item analyses in the whole sample and by gender, as well as gender comparisons, are listed in Table 1.



BPFSC-11 Six-Month Test-Retest Reliability

Both the Pearson r coefficient and the intraclass correlation (ICC) coefficient for absolute agreement based on a one-way random effects ANOVA were computed in order to evaluate the six-month test-retest reliability of the BPFSC-11 total score. Among the 471 adolescents who agreed to participate in the six-month test-retest study, no significant difference was observed between the mean BPFSC-11 total score at baseline (M = 28.08, SD = 6.47) and the mean BPFSC-11 total score at follow-up (M = 27.94, SD = 6.56), paired-sample t(470) = 0.46, p > .50, d = 0.02; moreover, the six-month test-retest correlation between the two sets of BPFSC-11 scores was highly significant, r = .50, p < .001. When the ICC coefficient for absolute agreement between baseline and six-month retest scores was computed based on a one-way random effects ANOVA, almost identical findings were observed; the ICC value was .50, 95% confidence interval (CI) [.43, .57], p < .001.
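For illustration, the following minimal sketch shows how an intraclass correlation of this kind (one-way random effects, absolute agreement, single measures, often labeled ICC(1)) can be computed from two columns of scores. The function and the simulated data are purely illustrative and are not taken from the original analyses, which were presumably run in standard statistical software.

```python
import numpy as np

def icc_oneway(scores):
    """ICC(1): one-way random-effects, absolute-agreement, single-measure ICC.

    scores: (n_subjects, k_occasions) array, e.g., baseline and six-month
    BPFSC-11 totals stacked column-wise.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()
    subj_means = scores.mean(axis=1)
    # Between-subjects and within-subjects mean squares from a one-way ANOVA
    ms_between = k * np.sum((subj_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - subj_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Illustrative call with simulated (not real) data: a stable "trait" plus
# occasion-specific noise at two occasions for 471 hypothetical respondents.
rng = np.random.default_rng(0)
trait = rng.normal(28, 5, size=471)
sim = np.column_stack([trait + rng.normal(0, 5, 471),
                       trait + rng.normal(0, 5, 471)])
print(round(icc_oneway(sim), 2))
```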

Factor Structure of the Italian Translation of the BPFSC-11

Confirmatory factor analysis (CFA) was used in order to test the hypothesis that the covariation observed among the BPFSC-11 items could be explained by a single latent factor. Since the BPFSC-11 items are measured on an ordinal scale, a polychoric correlation matrix was computed; accordingly, a weighted least square mean and variance adjusted (WLSMV) algorithm was used in the CFA. WLSMV CFA results provided evidence of only marginal fit for the one-factor model of the BPFSC-11 items, χ²(44) = 268.01, p < .001, RMSEA = .08, 90% CI for RMSEA [.07, .09], test of close fit (i.e., RMSEA ≤ .05) p < .001, TLI = .89, CFI = .91, WRMR = 1.39. The minimum average partial statistic (MAP; Zwick & Velicer, 1986) and quasi-inferential parallel analysis (Buja & Eyuboglu, 1992) were computed in order to assess the dimensionality of the BPFSC-11 item polychoric correlation matrix. In this sample, the MAP statistic values after the extraction of the first four principal components of the BPFSC-11 item polychoric correlation matrix were .02, .05, .14, and 1.00, respectively, thus supporting a one-factor model of the BPFSC-11 items. The first four eigenvalues of the BPFSC-11 item polychoric correlation matrix were 3.59, 1.24, 1.07, and 0.90, respectively, whereas the 95th percentile values of the corresponding random eigenvalues were 1.26, 1.19, 1.14, and 1.10, respectively; thus, the first two eigenvalues of the BPFSC-11 exceeded the 95th percentile of the distribution of the corresponding random eigenvalues.
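As a rough illustration of the parallel-analysis logic used here (observed eigenvalues are retained while they exceed the 95th percentile of eigenvalues obtained from random data of the same size), a minimal sketch follows. It uses Pearson rather than polychoric correlations to stay self-contained, so it is only a simplified stand-in for the procedure reported above; the function name and defaults are illustrative.

```python
import numpy as np

def parallel_analysis(data, n_iter=1000, percentile=95, seed=0):
    """Horn-style parallel analysis on a correlation matrix.

    data: (n_observations, n_items) array. The study used polychoric
    correlations for the ordinal BPFSC-11 items; Pearson correlations are
    used here only to keep the sketch self-contained.
    """
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs_eigs = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    rand_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        random_data = rng.normal(size=(n, p))
        rand_eigs[i] = np.linalg.eigvalsh(np.corrcoef(random_data, rowvar=False))[::-1]
    thresholds = np.percentile(rand_eigs, percentile, axis=0)
    # Count leading components whose observed eigenvalue exceeds the threshold
    n_factors = 0
    for obs, thr in zip(obs_eigs, thresholds):
        if obs > thr:
            n_factors += 1
        else:
            break
    return obs_eigs, thresholds, n_factors
```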


Table 1. Borderline Personality Features Scale-11: Cronbach's α and average inter-item polychoric correlation values, descriptive statistics, and item-total correlations corrected for part-whole overlap in the whole sample and broken down by gender (N = 805)

BPFSC-11 total score: whole sample (N = 805), M = 29.07, SD = 6.81, Cronbach's α = .78 (average inter-item r = .24); male adolescents (n = 290), M = 26.76a, SD = 6.80, α = .79 (.25); female adolescents (n = 515), M = 30.37b, SD = 6.55, α = .76 (.22).

Notes. BPFSC-11 = Borderline Personality Features Scale for Children-11; average inter-item r = average inter-item polychoric correlation; Mdn = Median; ri-t = item-total correlation corrected for part-whole overlap. Means with different superscripts are significantly different at a Bonferroni-corrected p-value (i.e., p < .0045) in male participants and in female participants; the Mann-Whitney U test was used for testing gender differences in BPFSC-11 item scores, whereas the Student t-test was used to assess the presence of significant gender differences on the BPFSC-11 mean total score. Item-total correlation coefficients with different superscripts are significantly different at a Bonferroni-corrected p-value (i.e., p < .0045) in the male subgroup and in the female subgroup according to the z-test for correlation coefficient homogeneity.

When we extracted the first two factors from the BPFSC-11 item polychoric correlation matrix (using the unweighted least squares [ULS] method for factor extraction), all BPFSC-11 items showed substantial loadings (i.e., factor loadings > .30) on the first unrotated factor (mean factor loading value = .53, median factor loading value = .52, SD = .10, range: .38–.70), with the exception of item 3, which showed modest factor loading values on both unrotated factors (BPFSC-11 item 3 loaded .24 and .11 on Factor 1 and Factor 2, respectively). In other words, according to the ULS factor analysis, all BPFSC-11 items seemed to tap a single latent dimension, although with different levels of accuracy. Based on the dimensionality analysis results and on Sharp and colleagues' (2014) findings, a bi-factor model with a general latent dimension and two specific factors was fitted using WLSMV Exploratory Structural Equation Modeling (ESEM; Marsh, Morin, Parker, & Kaur, 2014). The bi-factor model (Gibbons & Hedeker, 1992) specifies a general factor measured by all test items as well as specific factors accounting for the residual variance shared by subsets of items. Indeed, the BPFSC-11 was not designed to yield theoretically (or clinically) meaningful information on sub-domains of the BPD realm. The bi-factor model showed adequate goodness-of-fit indices, χ²(25) = 64.47, p < .001, RMSEA = .04, 90% CI for RMSEA [.03, .06], test of close fit (i.e., RMSEA ≤ .05) p > .70, TLI = .97, CFI = .98, WRMR = 0.60.
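In schematic terms (the notation is introduced here only for illustration and is not taken from the original papers), and under the standard assumption that the general and specific factors are mutually orthogonal, the bi-factor measurement model described above can be written for the latent response variate underlying each ordinal item j as:

```latex
y_j^{*} \;=\; \lambda_{Gj}\,G \;+\; \lambda_{1j}\,F_1 \;+\; \lambda_{2j}\,F_2 \;+\; \varepsilon_j ,
\qquad
\operatorname{Cov}(G,F_1) = \operatorname{Cov}(G,F_2) = \operatorname{Cov}(F_1,F_2) = 0 ,
```

where G is the general borderline-features factor loading on all 11 items and F1 and F2 are specific factors absorbing residual covariance shared by subsets of items.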

Standardized factor loadings for the bi-factor model of the BPFSC-11 items are listed in Table 2; for ease of presentation, only significant (i.e., p < .05) factor loadings are displayed.

Gender Invariance of the BPFSC-11 Factor Structure

When we tested the invariance of the bi-factor model across subgroups based on gender using multigroup WLSMV ESEM, we observed adequate values of the fit statistics even for the most restrictive model (invariance of thresholds and invariance of factor loadings), χ²(104) = 188.26, p < .001, RMSEA = .05, 90% CI for RMSEA [.03, .06], test of close fit p > .70, TLI = .96, CFI = .97. When we relaxed the assumption of equality of factor loadings and thresholds in male adolescents and female adolescents, the model fit improved significantly, difference testing χ²(24) = 56.63, p < .001, goodness-of-fit χ²(69) = 126.82, p < .001, RMSEA = .05, 90% CI for RMSEA [.03, .06], test of close fit p > .70, TLI = .96, CFI = .98, although the RMSEA, TLI, and CFI values were not markedly different from those observed for the competing model. Standardized factor loadings of the BPFSC-11 items for the general factor and the two specific factors in male adolescents and female adolescents are listed in Table 2.



Table 2. Bi-factor model of the Borderline Personality Features Scale for Children-11 based on weighted least square mean and variance adjusted exploratory structural equation modeling: Standardized factor loadings in the whole sample (N = 805) and broken down by gender (male adolescents, n = 290; female adolescents, n = 515)

Notes. BPFSC-11 = Borderline Personality Features Scale for Children-11; G = general factor; F1 = specific factor 1; F2 = specific factor 2. For ease of presentation, only significant (i.e., p < .05) factor loadings are displayed.

The standardized factor loadings for the general factor were consistent across the two subgroups based on gender (congruence coefficient [CC]1 value = .96), although item 3 showed a substantial loading on the general factor only among male adolescents. In contrast, the specific factors showed factor loading patterns that were not consistently replicated across male participants and female participants (CC values were .39 and .68 for Factor 1 and Factor 2, respectively).

1 The replicability of the factor solution across subgroups defined by participants' gender was evaluated by computing congruence coefficients (Gorsuch, 1983). Lorenzo-Seva and ten Berge (2006) suggested that CC values in the range .85–.94 correspond to a fair similarity, with values higher than .95 implying that the two factors compared can be considered equal.

Convergent Validity of the BPFSC-11 Total Score With the PDQ-4+ BPD Scale Score

In the present study, the Cronbach's α coefficient of the PDQ-4+ BPD scale was .74 (average inter-item tetrachoric r = .24); the average score on the PDQ-4+ BPD scale was 3.80, SD = 1.97 (range: 0–9). The PDQ-4+ BPD scale did not correlate significantly with participants' age, r = .01, p > .80. Female adolescents (M = 4.12, SD = 1.90) scored on average significantly higher than male adolescents (M = 3.22, SD = 1.98) on the PDQ-4+ BPD scale, t(803) = 6.36, p < .001, d = 0.45. For the full sample, the BPFSC-11 total score showed a significant correlation with the PDQ-4+ BPD scale score, r = .64, p < .001; when it was corrected for the attenuation due to measurement error, the value of the correlation coefficient between the two self-report scales became .84. Almost identical raw bivariate correlations between the

BPFSC-11 total score and the PDQ-4+ BPD scale score were observed in the male subgroup, r = .62, p < .001, and in the female subgroup, r = .62, p < .001, z = .02, p > .80. The correlation coefficients corrected for measurement error between the BPFSC-11 total score and the PDQ-4+ BPD scale score were .80 and .85 among male adolescents and female adolescents, respectively.
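For reference, the correction for attenuation referred to above follows the classical Spearman formula. Plugging in the internal consistency values reported in this article (α = .78 for the BPFSC-11 and α = .74 for the PDQ-4+ BPD scale, assuming these were the reliability estimates used) reproduces the corrected full-sample value:

```latex
r_{\mathrm{corrected}} \;=\; \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\;=\; \frac{.64}{\sqrt{.78 \times .74}} \;\approx\; .84
```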

Discussion

Confirming and extending previous findings (Sharp et al., 2014), the BPFSC-11 seemed to represent a reliable self-report measure of borderline personality features when administered to Italian community-dwelling adolescents. Indeed, the current study is the first to provide evidence for the utility of the Italian translation of the BPFSC-11.

Reliability of the Italian Translation of the BPFSC-11

The internal consistency of the BPFSC-11 total score was adequate also in the Italian translation of the scale, although the Cronbach's α value that was observed in this study for the BPFSC-11 total score (i.e., .78) was somewhat lower than the .85 value that was reported by Sharp and



colleagues (2014) in a clinical adolescent sample. Sample characteristics (i.e., high school students vs. inpatients) may be one of the reasons for the lower Cronbach's α value of the Italian translation of the BPFSC-11. Despite the significant, moderate (d = 0.53) difference in the BPFSC-11 average total score that was observed between female adolescents and male adolescents, Cronbach's α values for the BPFSC-11 total score were fairly consistent across gender subgroups. In our study, median values for the BPFSC-11 items were usually 2.00 (i.e., "Hardly ever true") or 3.00 (i.e., "Sometimes true"), with the exception of BPFSC-11 item 3, which evidenced a higher median value (i.e., "Often true"). Thus, with the exception of item 3 ("My feelings are very strong. For instance, when I get mad, I get really really mad. When I get happy, I get really really happy"), the BPFSC-11 items seem to assess ways of feeling, relating, and thinking that are not commonly observed in well-adjusted adolescents. In our sample, all BPFSC-11 items performed moderately well in terms of item-total correlations (corrected for part-whole overlap) in both male adolescents and female adolescents, with the partial exception of item 3. The relatively poor performance of item 3 was somewhat expected, considering that our sample was composed of high school students who reported a high frequency of high scores (i.e., "Often true" or "Always true") on this item. In other words, this BPFSC-11 item seemed to discriminate better at the lower end of the distribution of BPD features than at the upper end; that is, if an individual scores low on item 3, he or she is highly unlikely to be at the high end of the distribution of borderline traits, but scoring high on item 3 does not imply scoring high on the BPD latent trait. It may be argued that we should have relied on IRT for item analyses, consistent with Sharp and colleagues' (2014) approach. It should be observed, however, that our study did not aim to refine the measure by identifying the items that performed best in terms of discriminant validity; rather, we aimed at testing the reliability of a sample of fallible observable indicators (i.e., the BPFSC-11 items) of borderline personality pathology. In this case, the adoption of a psychometric approach based on the domain-sampling model (i.e., classical test theory; Nunnally & Bernstein, 1994) seemed appropriate. Although six-month test-retest reliability was assessed only in a subgroup (n = 471) of adolescents who differed significantly on age and female-to-male ratio from adolescents who did not participate in the test-retest study, our findings suggest high mean-level consistency (i.e., lack of significant changes in mean score) and moderate rank-order consistency of the BPFSC-11 total score among Italian community-dwelling adolescents. The moderate six-month temporal stability of the BPFSC-11 total score that was observed in this study is consistent with the available literature indicating that the BPD diagnosis itself is likely to be less stable


(even over short periods) than previously thought, both in adolescence and in adulthood (e.g., Venta, Herzoff, Cohen, & Sharp, 2014). In our study, test-retest correlation values were replicated across subgroups of male adolescents and female adolescents.

Factor Structure and Gender Invariance of the Italian Translation of the BPFSC-11

Factor analyses of the BPFSC-11 items yielded findings that were largely consistent with Sharp and colleagues' (2014) results. Although WLSMV CFA provided only marginal support for a unidimensional structure of the BPFSC-11 items, WLSMV ESEMs showed that all BPFSC-11 items belonged to a common latent dimension (i.e., they loaded on a single general factor); residual covariances (actually, residual polychoric correlations) among selected BPFSC-11 items could be explained in terms of distribution artifacts (e.g., skewness values) rather than in terms of a shared latent construct. Indeed, descriptively, the BPFSC-11 item loadings on the general factor were highly similar in the male subgroup and in the female subgroup (the CC value was .96), although item 3 loaded significantly on the general factor only in the male subgroup; in contrast, none of the specific factors seemed to be safely replicated across subgroups based on participants' gender. Thus, consistent with previous results (Sharp et al., 2014), our study suggests that the BPFSC-11 represents a unidimensional measure of a latent variable putatively assessing BPD features. The differences in the structure of the specific factors observed in our study when compared to Sharp and colleagues' (2014) study may reflect a number of methodological factors, ranging from sampling issues to the well-known fact that full-information maximum likelihood IRT factoring and ESEM differ at both theoretical and practical levels (e.g., Reise, Widaman, & Pugh, 1993). The invariance of the factor structure of the BPFSC-11 items across subgroups based on gender suggests that the BPFSC-11 total score can be used with both boys and girls without gender-specific adaptation.

Convergent Validity of the BPFSC-11

In our study, when the BPFSC-11 total score was correlated with the PDQ-4+ BPD scale score, we observed a positive and significant correlation with a "large" effect size (Cohen, 1988); when this raw bivariate correlation coefficient was corrected for attenuation due to measurement error, the Pearson r estimate rose even further. In our opinion, these findings support the convergent validity of the BPFSC-11 as a measure of BPD features with respect to a self-report



measure of DSM-IV/DSM-5 Section II BPD symptoms. Indeed, none of the correlation coefficients was large enough to suggest that the BPFSC-11 and PDQ-4+ BPD scales represent linearly interchangeable measures of the same construct; however, it should be observed that many BPD symptoms which are measured by selected PDQ-4+ BPD items – for example, suicidal behaviors – have no counterpart in the BPFSC-11 items. In a sense, the BPFSC-11 seems to allow for assessing BPD without confronting adolescents with psychiatric symptoms or "socially undesirable" behaviors.

Limitations and Future Directions

Despite these positive findings, several limitations should be acknowledged. Although the current study included a moderately large number of participants (N = 805), it was based on community-dwelling adolescents; our sample represented a convenience sample rather than a sample representative of the Italian population. Because all participants in our study were community-dwelling adolescents, our findings should not be extended to adolescents from clinical or forensic settings. Most importantly, mental health status, previous treatments, and further demographic and personal information (e.g., family status, alcohol and drug use, medications) were not formally assessed. In our study, the sample was characterized by an unequal female-to-male ratio; unequal sample sizes might affect changes in goodness-of-fit indices in measurement invariance analysis (e.g., Chen, 2007). In the present study, we relied only on self-report measures; this may have led to a spurious increase in the associations between the BPFSC-11 total score and the external measures because of shared method variance. We administered only the PDQ-4+ BPD scale in order to assess the convergent validity of the BPFSC-11; thus, our findings should not be extended to other BPD measures, particularly to those based on semi-structured interviews. As a whole, these considerations limit the generalizability of our findings and stress the need for further studies before our conclusions can be accepted. For instance, future studies should involve additional instruments measuring borderline personality features (e.g., interview-based measures) and/or include different cultures. Moreover, the inclusion of different samples (e.g., clinical samples, samples from different cultures) may also be useful in better understanding the role of item 3. With these limitations in mind, our data confirm and extend Sharp and colleagues' (2014) results, suggesting that the BPFSC-11 shows adequate psychometric properties also in a sample of Italian community-dwelling adolescents.


Electronic Supplementary Material

The electronic supplementary material is available with the online version of the article at http://dx.doi.org/10.1027/1015-5759/a000377
ESM 1. Text (PDF). Additional procedures and results.

References American Psychiatric Association. (2000). Diagnostic and statistical manual of mental disorders (4th ed., text. rev.). Washington, DC: Author. American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC: Author. Buja, A., & Eyuboglu, N. (1992). Remarks on parallel analysis. Multivariate Behavioral Research, 27, 509–540. doi: 10.1207/ s15327906mbr2704_2 Chang, B., Sharp, C., & Ha, C. (2011). The criterion validity of the Borderline Personality Feature Scale for Children in an adolescent inpatient setting. Journal of Personality Disorders, 25, 492–503. doi: 10.1521/pedi.2011.25.4.492 Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504. doi: 10.1080/10705510701301834 Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York, NY: Academic Press. Crick, N. R., Murray-Close, D., & Woods, K. (2005). Borderline personality features in childhood: A short-term longitudinal study. Development and Psychopathology, 17, 1051–1070. doi: 10.1017/S0954579405050492 Denissen, J. J., Geenen, R., van Aken, M. A., Gosling, S. D., & Potter, J. (2008). Development and validation of a Dutch translation of the Big Five Inventory (BFI). Journal of Personality Assessment, 90, 152–157. doi: 10.1080/00223890701845229 Fossati, A. (2014). Borderline personality disorder in adolescence: Phenomenology and construct validity. In C. Sharp & J. Tackett (Eds.), Handbook of borderline personality disorder in children and adolescents (pp. 19–34). New York, NY: Springer Science. doi: 10.1007/978-1-4939-0591-1_3 Fossati, A., Feeney, J., Maffei, C., & Borroni, S. (2014). Thinking about feelings: Affective state mentalization, attachment styles, and borderline personality disorder features among Italian nonclinical adolescents. Psychoanalytic Psychology, 31, 41–67. doi: 10.1037/a0033960 Fossati, A., Gratz, K. L., Maffei, C., & Borroni, S. (2013). Emotion dysregulation and impulsivity additively predict borderline personality disorder features in Italian nonclinical adolescents. Personality and Mental Health, 7, 320–333. doi: 10.1002/ pmh.1229 Fossati, A., Maffei, C., Bagnato, M., Donati, D., Donini, M., Fiorilli, M., . . . Ansoldi, M. (1998). Brief communication: Criterion validity of the Personality Diagnostic Questionnaire-4+(PDQ-4+) in a mixed psychiatric sample. Journal of Personality Disorders, 12, 172–178. doi: 10.1521/pedi.1998.12.2.172 Gibbons, R. D., & Hedeker, D. R. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436. doi: 10.1007/ BF02295430 Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum.


Hyler, S. E. (1994). PDQ-4+ Personality Questionnaire. New York, NY: New York State Psychiatric Institute. Leichsenring, F. (1999). Development and first results of the Borderline Personality Inventory: A self-report instrument for assessing borderline personality organization. Journal of Personality Assessment, 73, 45–63. doi: 10.1207/ S15327752JPA730104 Leichsenring, F., Leibing, E., Kruse, J., New, A. S., & Leweke, F. (2011). Borderline personality disorder. Lancet, 377, 74–84. doi: 10.1016/S0140-6736(10)61422-5 Lorenzo-Seva, U., & ten Berge, J. M. F. (2006). Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology, 2, 57–64. Marsh, H. W., Morin, A. J., Parker, P. D., & Kaur, G. (2014). Exploratory structural equation modeling: An integration of the best features of exploratory and confirmatory factor analysis. Annual Review of Clinical Psychology, 10, 85–110. doi: 10.1146/annurev-clinpsy-032813-153700 Michonski, J. D., Sharp, C., Steinberg, L., & Zanarini, M. C. (2013). An item response theory analysis of the DSM-IV borderline personality disorder criteria in a population-based sample of 11- to 12-year-old children. Personality Disorders: Theory, Research, and Treatment, 4, 15–22. doi: 10.1037/a0027948 Morey, L. (1991). Personality Assessment Inventory. Odessa, FL: Psychological Assessment Resources. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill. Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566. doi: 10.1037/0033-2909.114.3.552 Sharp, C., Ha, C., Michonski, J., Venta, A., & Carbone, C. (2012). Borderline personality disorder in adolescents: Evidence in support of the Childhood Interview for DSM-IV Borderline


Personality Disorder in a sample of adolescent inpatients. Comprehensive Psychiatry, 53, 765–774. doi: 10.1016/ j.comppsych.2011.12.003 Sharp, C., Mosko, O., Chang, B., & Ha, C. (2011). The crossinformant concordance and concurrent validity of the Borderline Personality Features Scale for Children in a sample of male youth. Clinical Child Psychology and Psychiatry, 16, 335–349. doi: 10.1177/1359104510366279 Sharp, C., Steinberg, L., Temple, J., & Newlin, E. (2014). An 11-item measure to assess borderline traits in adolescents: Refinement of the BPFSC using IRT. Personality Disorders: Theory, Research, and Treatment, 5, 70–78. doi: 10.1037/per0000057 Venta, A., Herzoff, K., Cohen, P., & Sharp, C. (2014). The longitudinal course of borderline personality disorder in youth. In C. Sharp & J. Tackett (Eds.), Handbook of borderline personality disorder in children and adolescents (pp. 229–246). New York, NY: Springer. doi: 10.1007/978-1-4939-0591-1_16 Zwick, W. R., & Velicer, W. F. (1986). Factors influencing five rules for determining the number of components to retain. Psychological Bulletin, 99, 432–442. doi: 10.1037/0033-2909. 99.3.432 Received February 1, 2015 Revision received March 6, 2016 Accepted March 14, 2016 Published online December 29, 2016 Antonella Somma Department of Human Studies Piazza delle Vaschette, 101 00193 Rome Italy a.somma@lumsa.it



Original Article

Applying the Latent State-Trait Analysis to Decompose State, Trait, and Error Components of the Self-Esteem Implicit Association Test

Francesco Dentale, Michele Vecchione, Valerio Ghezzi, and Claudio Barbaranelli
Department of Psychology, Sapienza University of Rome, Italy

Abstract: In the literature, self-report scales of Self-Esteem (SE) often showed a higher test-retest correlation and a lower situational variability compared to implicit measures. Moreover, several studies showed a close-to-zero implicit-explicit correlation. Applying a latent state-trait (LST) model to a sample of 95 participants (80 females, mean age: 22.49 ± 6.77 years) assessed at five measurement occasions, the present study aims at decomposing the latent trait, latent state residual, and measurement error components of the SE Implicit Association Test (SE-IAT). Moreover, in order to compare implicit and explicit variance components, a multi-construct LST model was analyzed across two occasions, including both the SE-IAT and the Rosenberg Self-Esteem Scale (RSES). Results revealed that: (1) the amounts of state and trait variance in the SE-IAT were rather similar; (2) explicit SE showed a higher consistency, a lower occasion-specificity, and a lower proportion of error variance than the SE-IAT; (3) the latent traits of explicit and implicit SE showed a positive and significant correlation of moderate size. Theoretical implications for the implicit measurement of self-esteem are discussed.

Keywords: latent state-trait, self-esteem, implicit measurement, IAT, Rosenberg

European Journal of Psychological Assessment (2019), 35(1), 78–85. DOI: 10.1027/1015-5759/a000378

More than 100 years ago, James (1890) defined Self-Esteem (SE) as a relationship between the perceived self and the ideal self, focusing attention on the subjective expectations of success (and failure) that characterize human life. However, an empirical tradition of studies on this topic developed only in more recent years, within a theoretical perspective that conceived SE as the evaluative component of the self-concept (e.g., Markus, 1977). In this regard, Zeigler-Hill and Jordan (2010) noted that, although there is general agreement on this conceptualization of SE, some problematic issues and open questions persist. For instance, an open question regards the degree to which SE should be conceived as a stable personality characteristic or as a state that depends on situational factors (Buhrmester, Blanton, & Swann, 2011). In order to tap both state and trait self-esteem, different instruments, such as the State Self-Esteem Scale (e.g., Heatherton & Polivy, 1991) and the Rosenberg Self-Esteem Scale (e.g., RSES; Rosenberg, 1965), have been developed.


Limits of Self-Report Measures of SE and the Self-Esteem Implicit Association Test

Self-report SE scales have shown robust psychometric properties in terms of internal consistency, test-retest stability, and convergent and criterion validity (Buhrmester, Blanton, & Swann, 2011). However, like many other self-report measures, they show two important limitations: proneness to self-enhancement response strategies (e.g., Cai et al., 2011) and difficulty in tapping all self-concept-related information using introspection, due to both self-deception effects (e.g., Hofmann, Gschwendner, & Schmitt, 2005) and cognitive factors (e.g., Dentale, San Martini, De Coro, & Di Pomponio, 2010; Dentale, Vecchione, De Coro, & Barbaranelli, 2012). Recently, mono- and dual-process models of social cognition have been developed and empirically supported (see Gawronski & Creighton, 2013, for a review). These models provide an interesting conceptual framework to address the factors that may threaten the validity of self-report measures,


among which are impression management responding and introspective limits. Many attempts have been made to develop reliable and valid implicit measures of psychological constructs, such as the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998). The Self-Esteem IAT (SE-IAT) is a reaction-time task that assesses the degree to which respondents associate two target categories (Me vs. Others) with two target attributes (Positive vs. Negative adjectives). A number of empirical studies have shown that IAT and self-report measures of the self-concept are: (1) weakly or not correlated (e.g., Greenwald & Farnham, 2000); (2) differently prone to faking effects (e.g., Vecchione, Dentale, Alessandri, & Barbaranelli, 2014); and (3) predictive of different types of criteria (see Buhrmester et al., 2011).

Situational Variability of the SE-IAT Scores

The internal consistency of the SE-IAT, estimated using both split-half and Cronbach's α indices, is generally lower than that of explicit self-esteem measures, ranging from .49 to .88 across different studies (e.g., Bosson, Swann, & Pennebaker, 2000). Interestingly, test-retest correlations of the SE-IAT (which range from .31 to .69 across studies) tend to be substantially lower than its internal consistencies (Bosson et al., 2000; Greenwald & Farnham, 2000; DeHart, Pelham, & Tennen, 2006). This might suggest that SE-IAT scores include a considerable amount of variability that is due to the situation and to the interaction between persons and situations (latent state residuals, in the LST formulation; see Steyer, Schmitt, & Eid, 1999). In line with this reasoning, several studies (see Buhrmester et al., 2011, for a review) showed that SE-IAT scores can be affected by a number of contextual factors, such as evaluative conditioning with both normal and subliminal presentation, subtle social signals, personal threats, or academic feedback. Other studies demonstrated that different cognitive factors can contextually affect IAT scores, such as test-taking strategies (Egloff, Schwerdtfeger, & Schmukle, 2005), attentional foci when completing the IAT (Gawronski, Deutsch, LeBel, & Peters, 2008), learning effects (Schmukle & Egloff, 2004), and other components of the response process that do not reflect associations per se (e.g., Sherman et al., 2008). In sum, SE-IAT scores seem to include a substantial latent state residual, which is likely to depend both on factors related to implicit evaluations of the self and on cognitive effects linked to the IAT experimental paradigm (Teige-Mocigemba, Klauer, & Sherman, 2010).


Latent State-Trait Analysis on Personality Measures

Drawing on these findings, one may ask whether researchers can disentangle the state and trait components of self-esteem. As mentioned before, a first strategy has been to develop separate self-report measures for state (e.g., Heatherton & Polivy, 1991) and trait (e.g., Rosenberg, 1965) SE. Along this line, longitudinal studies on self-report measures of trait self-esteem, such as the RSES, have clearly demonstrated the presence of a substantive self-esteem factor that is stable over time (e.g., Marsh, Scalas, & Nagengast, 2010). In contrast, even if SE-IAT scores seem to include both state and trait components, no studies to date have been conducted to disentangle these different sources of variation. In order to evaluate the impact of trait, state, and error components on measures of psychological constructs, Latent State-Trait (LST) analysis has been developed (e.g., Steyer, Ferring, & Schmitt, 1992; Steyer, Mayer, Geiser, & Cole, 2015). Using longitudinal designs with multiple time points, these models allow researchers to assess the consistency and the occasion-specificity of the target measures, and also to estimate their reliability. Explicit measures of personality traits, when tested with LST models, revealed a substantial amount of trait variance, but also a small proportion of occasion-specific variance (Deinzer et al., 1995). In more recent years, Schmukle and Egloff (2005) applied LST analysis to estimate the consistency and occasion-specificity of two IATs assessing anxiety and extraversion, comparing them with those of structurally similar self-report scales. Results showed that LST models fitted the data for both implicit and explicit measures, with adequate reliability coefficients (above .80 for all measures), and with proportions of trait variance that were substantially higher (ranging from .56 to .81) than the occasion-specificities (ranging from .02 to .26). Most importantly, occasion-specificities were higher for the IATs (.26 for anxiety and .15 for extraversion) than for the self-report scales (.09 for anxiety and .02 for extraversion), indicating that situational variables affect the implicit self-concept of personality more than the explicit one. Finally, latent state residuals of implicit and explicit measures were not significantly correlated, either for anxiety or for extraversion, suggesting that the IAT and self-ratings are differently affected by situational factors.

Aim of the Study

The present study aimed at estimating the proportion of variance of a classical SE-IAT attributable to latent trait, latent state residual, and measurement error, comparing these components with the corresponding components of a traditional


Figure 1. (A) LST NM model on the SE-IAT across five measurement occasions. (B) LST M-1 model on the SE-IAT across five measurement occasions. Y = observed indicators of implicit SE; ISE = implicit SE; O = occasion-specific components; MISE = method-factor of the implicit measure.


explicit self-esteem scale, such as the RSES. In order to do this, two different LST models were analyzed: the first was a mono-construct model including five observations of the SE-IAT, each separated by two weeks (see Figures 1A and 1B); the second was a multi-construct model including two occasions of measurement, separated by two months, of both the SE-IAT and the RSES (see Figures 2A and 2B). On the basis of Schmukle and Egloff's (2005) study, both models were expected to fit the data. Moreover, it was hypothesized that (1) both the SE-IAT and the RSES would show an adequate level of reliability; (2) the SE-IAT would show a lower consistency than the RSES; and (3) the SE-IAT would show a higher occasion-specificity than the RSES.

Method

Participants and Procedure

Ninety-five students (14 males, 80 females, and 1 who did not report gender) of Sapienza University of Rome, with a mean age of 22.49 years (SD = 6.77), were recruited for the study in exchange for course credit. The SE-IAT was administered to participants on five occasions, with a time lag of two weeks between them. In the first and the fifth sessions, participants were administered a battery including a self-report measure of self-esteem (i.e., the RSES) and other scales not relevant to the aim of the study. In both sessions, as in Schmukle and Egloff (2005), participants first performed the SE-IAT and then the RSES, in order to minimize possible order effects between implicit and explicit measures. Self-report scales,

indeed, are assumed to be only slightly or not at all affected by the IAT.

Measures

Explicit Self-Esteem
Explicit self-esteem was measured with the RSES (Rosenberg, 1965). The Italian version of the scale (Prezza, Trombaccia, & Armento, 1997) has proved to be reliable, with a Cronbach's α coefficient of .84. For both measurement occasions, two parallel halves of the RSES were computed by summing up even and odd items separately.

Implicit Self-Esteem
A classical SE-IAT procedure was used (Greenwald & Farnham, 2000), in which participants performed a series of categorization tasks including "Me" versus "Other" as target categories, and "Positive" versus "Negative" as attributes, with five stimulus words for each category. Similarly to Greenwald and Farnham (2000), the following words were used as stimuli: I, self, me, my, and mine for the "Me" category; other, they, those, their, and others for the "Other" category; honest, competent, strong, clever, and beautiful for the "Positive" attribute; and dishonest, incompetent, weak, stupid, and ugly for the "Negative" attribute. The words were presented in random order within each block of trials. As described by Greenwald et al. (1998), the entire procedure consisted of seven blocks of trials: blocks 1 (Me vs. Other) and 2 (Positive vs. Negative) were single categorization sessions of 20 trials, block 5 (Other vs. Me) was a single categorization task of 40 trials, and blocks 3–4 and 6–7 were combined sessions (Me or Positive vs. Other or Negative, and Me or Negative vs. Other or Positive)



of 40 trials. Participants were requested to respond as quickly and accurately as possible to the stimulus words that appeared on the monitor. Following the D2 scoring algorithm (see Greenwald, Nosek, & Banaji, 2003), data from blocks 3–4 and 6–7 were used to compute SE-IAT difference scores, according to the built-in error penalty procedure. Positive scores indicate high implicit self-esteem, while negative scores indicate low implicit self-esteem. For all five measurement sessions, two scores representing the two test-halves (D1 and D2) were computed by applying the D2 algorithm to blocks 3–6 and 4–7 separately.
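For readers unfamiliar with IAT scoring, the following is a deliberately simplified sketch of the general D-score logic (the difference between mean latencies in the incompatible and compatible combined blocks, divided by the pooled standard deviation of all latencies). It omits the error-penalty, latency-trimming, and block-handling details of the Greenwald et al. (2003) algorithm used in the study; the function name and example latencies are purely illustrative.

```python
import numpy as np

def iat_d_score(compatible_rt, incompatible_rt):
    """Simplified IAT D score.

    compatible_rt:   latencies (ms) from the Me+Positive / Other+Negative blocks
    incompatible_rt: latencies (ms) from the Me+Negative / Other+Positive blocks
    Returns (mean incompatible - mean compatible) / pooled SD of all latencies.
    Positive values indicate faster Me+Positive responding (higher implicit SE).
    """
    compatible_rt = np.asarray(compatible_rt, dtype=float)
    incompatible_rt = np.asarray(incompatible_rt, dtype=float)
    pooled_sd = np.concatenate([compatible_rt, incompatible_rt]).std(ddof=1)
    return (incompatible_rt.mean() - compatible_rt.mean()) / pooled_sd

# Illustrative call with made-up latencies (not data from the study)
print(round(iat_d_score([650, 700, 620, 680], [780, 820, 760, 800]), 2))
```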

Results

Descriptive Statistics

Means, standard deviations, and correlations among the IAT test-halves across the five measurement occasions are shown in Table 1. Since an average of 7.5% of participants across the five measurement occasions did not complete all SE-IATs, we tested for sex and age differences between participants with and without missing data, and no significant differences were found (p > .05 for all tests). Regarding the SE-IAT scores, Little's missing completely at random (MCAR) test was used to test for the randomness of missing data. A nonsignificant chi-square (χ² = 52.52, p = .45) revealed that the data were missing completely at random. In the following analyses, we handled missing data through full information maximum likelihood (FIML), using Mplus 6.1 (Muthén & Muthén, 2010). As expected, SE-IAT means were positive for all halves and measurement occasions, indicating that respondents

Figure 2. (A) LST NM model on the SE-IAT and RSES across two measurement occasions. (B) LST M-1 model on the SE-IAT and RSES across two measurement occasions. Y = observed indicators of explicit SE; X = observed indicators of implicit SE; ISE = implicit SE; ESE = explicit SE; OI = occasion-specific components of the implicit measure; OE = occasion-specific components of the explicit measure; MISE = method-factor of the implicit measure; MESE = method-factor of the explicit measure.


were faster in the Me-Positive versus Other-Negative categorization task than in the Me-Negative versus Other-Positive one. The mean scores tended to be higher for the first half (D1) than for the second one (D2), but only in the first two measurement occasions was the difference statistically significant, T1: t(94) = 3.46, p < .001; T2: t(94) = 2.34, p < .05. These results might be ascribed to a learning effect found in other studies (Greenwald et al., 2003). Importantly, the degree of this effect may vary between subjects, determining an occasion-specific variance that is not due to actual fluctuations of automatic self-associations, but rather to individual differences related to learning or cognitive factors. Regarding explicit SE, no significant mean differences were found between the RSES halves (Table 2).

LST Models

In order to separate the consistency and occasion-specificity of the SE-IAT, and to compare these components with those of an explicit measure of SE, two LST models were estimated (Steyer & Schmitt, 1990). Since the SE-IAT test-halves (D1 and D2) showed significant mean differences and lower intercorrelations than the RSES test-halves, an LST model with no method factors was tested and compared with a model that includes method effects (Eid, 1996; Eid, Schneider, & Schwenkmezger, 1999; Pohl & Steyer, 2010; Pohl, Steyer, & Krause, 2008). Recently, different approaches that account for method effects in LST models were compared using simulation studies and actual data sets (Geiser & Lockhart, 2012). Among them, the model with M-1 method factors (M-1; Eid et al., 1999), which includes one method factor fewer than the number of methods used in the study, and the model with no method factors (NM) showed



Table 1. Descriptive statistics for IAT test-halves in the five measurement occasions

            IAT1-D1  IAT1-D2  IAT2-D1  IAT2-D2  IAT3-D1  IAT3-D2  IAT4-D1  IAT4-D2  IAT5-D1  IAT5-D2      M     SD
IAT1-D1        1                                                                                         0.59   0.31
IAT1-D2      .53**      1                                                                                 0.48   0.32
IAT2-D1      .29**    .29**      1                                                                        0.56   0.39
IAT2-D2      .33**    .27*     .66**      1                                                               0.48   0.38
IAT3-D1      .39**    .28**    .53**    .55**      1                                                      0.51   0.45
IAT3-D2      .29**    .26*     .46**    .48**    .70**      1                                             0.48   0.42
IAT4-D1      .47**    .41**    .45**    .43**    .45**    .38**      1                                    0.49   0.40
IAT4-D2      .31**    .34**    .27*     .30**    .34**    .29**    .70**      1                           0.47   0.40
IAT5-D1      .32**    .19      .30**    .31**    .29**    .22      .39**    .32**      1                  0.56   0.36
IAT5-D2      .23**    .21*     .15      .40**    .23*     .25*     .32**    .26**    .56**      1         0.50   0.35

Notes. D1 = a test-half formed by block 3 and block 6 latencies; D2 = a test-half formed by block 4 and block 7 latencies. **p < .01; *p < .05.

Table 2. Descriptive statistics for RSES and IAT test-halves in the first and last measurement occasion

            RSES 1-1  RSES 1-2  RSES 2-1  RSES 2-2  IAT1-D1  IAT1-D2  IAT5-D1  IAT5-D2      M      SD
RSES 1-1        1                                                                          16.46   1.87
RSES 1-2      .79**       1                                                                15.80   1.87
RSES 2-1      .69**     .69**       1                                                      16.72   2.19
RSES 2-2      .61**     .67**     .82**       1                                            16.17   2.06
IAT1-D1       .16       .18       .29**     .20         1                                   0.59   0.31
IAT1-D2       .11       .16       .13       .14       .53**      1                          0.48   0.32
IAT5-D1       .03       .13       .19       .20       .32**    .19         1                0.56   0.36
IAT5-D2       .11       .19       .26*      .21*      .23*     .21*      .56**      1       0.50   0.35

Notes. The first index refers to the occasion, the second refers to the test-half. **p < .01; *p < .05.

optimal properties, both in terms of goodness of fit and parameter estimates. In particular, both approaches showed unbiased parameter estimates, even for models more complex than the one estimated in the present study (≈ 20 estimated parameters) and for similar sample sizes (N ≈ 100). For these reasons, in the present study the M-1 model was tested and compared with the more parsimonious NM model, in order to select the best fitting model. LST models allow the estimation of three different components of variance: (1) consistency [Con(Y) = Var(T)/Var(Y)], namely the proportion of trait variance; (2) occasion-specificity [OccSpec(Y) = Var(O)/Var(Y)], namely the proportion of occasion-specific variance; and (3) method-specificity [MetSpec(Y) = Var(M)/Var(Y)], namely the proportion of method variance. These variance components permit the estimation of reliability coefficients for the measures, using the following formula: Rel = 1 – Var(E)/Var(Y) = Con(Y) + OccSpec(Y) + MetSpec(Y), where E represents the measurement error. Since this reliability coefficient refers to one test-half only, aggregation equations, which may be considered a generalization of the Spearman-Brown formula, were applied (Steyer & Schmitt, 1990). Figures 1A and 1B illustrate, respectively, the mono-construct NM and M-1 models, including implicit

self-esteem scores across five occasions. Figures 2A and 2B illustrate, respectively, the multi-construct NM and M-1 models, including both implicit and explicit SE scores across two occasions. These models allow a comparison of the LST components of the two measures, and an estimation of the correlation between implicit and explicit traits. To this end, each score of implicit and explicit self-esteem was split into two parts based on half of the reaction times (computing two separate D measures for blocks 3–6 and 4–7) and half of the items (computing two separate scores using even and odd items), respectively. These scores, Yij for explicit SE and Xij for implicit SE (Figures 2A and 2B), where i represents the measurement occasion and j represents the test-half, are included as observed indicators. The latent traits (ISE in Figures 1A and 1B, ISE and ESE in Figures 2A and 2B) refer to stable individual differences in implicit and explicit self-esteem. Latent state residuals reflect occasion-specific individual differences in self-esteem (O in Figures 1A and 1B, OI and OE in Figures 2A and 2B); the latent method factors (MISE in Figures 1A and 1B, MISE and MESE in Figures 2A and 2B) assess the method-specificity of the test-halves. Finally, residual terms refer to random fluctuations in the measurement of the observed variables.
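Collecting the definitions given above in one place, and adding a worked line based on the mono-construct SE-IAT values reported in Table 3 (a reading of that table, not an additional analysis), the variance decomposition of an observed test-half Y can be summarized as:

```latex
\mathrm{Con}(Y)=\frac{\operatorname{Var}(T)}{\operatorname{Var}(Y)},\qquad
\mathrm{OccSpec}(Y)=\frac{\operatorname{Var}(O)}{\operatorname{Var}(Y)},\qquad
\mathrm{MetSpec}(Y)=\frac{\operatorname{Var}(M)}{\operatorname{Var}(Y)},
```
```latex
\mathrm{Rel}(Y)=1-\frac{\operatorname{Var}(E)}{\operatorname{Var}(Y)}
              =\mathrm{Con}(Y)+\mathrm{OccSpec}(Y)+\mathrm{MetSpec}(Y).
```

For the mono-construct SE-IAT model in Table 3 (an NM model, so MetSpec = 0): Con = 0.045/0.114 ≈ .395, OccSpec = 0.043/0.114 ≈ .377, and Rel ≈ .395 + .377 = .772.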



Table 3. Results of the latent state-trait (LST) analyses (corrected with the aggregation formulae)

                                        Model fit                     Latent variances                       LST coefficients
                               No     χ²       df     p       Trait   Occasion   Error    Total      Con    OccSpe   Error    Rel
Mono-construct LST NM model
  Implicit self-esteem          5   51.587     52   .490      0.045     .043      .026    0.114     .395     .377     .228    .772
Multi-construct LST NM model
  Implicit self-esteem          2   32.795     29   .286      0.026     .035      .026    0.087     .299     .402     .299    .701
  Explicit self-esteem          2                              2.616     .573      .384    3.573     .732     .160     .107    .892

Notes. N = 95; No = number of observations; Con = Consistency; OccSpe = Occasion-Specificity; Rel = Reliability.

Mono-Construct LST Analysis on SE-IAT

In order to estimate the amounts of trait, state, and error variance in the SE-IAT scores, two alternative mono-construct models (i.e., the NM and the M-1) were tested. These models included five occasions of measurement, with an interval of two weeks between them. As illustrated in Table 1, some of the observed measures revealed a slight deviation from normality. Therefore, maximum likelihood estimation robust to non-normality (MLR) was used. In order to increase the ratio between the number of participants and the number of estimated parameters (10:1 is usually considered an acceptable ratio; Kline, 2011), all factor loadings between latent factors and observed variables were fixed to 1. Moreover, we constrained the occasion-specific variances and the error variances to be equal. With this approach, the ratio was higher than 10:1 for the alternative models tested in the following analyses. In order to evaluate the adequacy of these models, they were compared with other, less restricted models in which factor loadings were freely estimated. No significant differences in the chi-square values of these models were found, indicating the more parsimonious models as the best fitting ones. Both the NM and the M-1 model fitted the data adequately. However, the difference in the chi-square values of the two models, corrected with the appropriate formula for MLR parameter estimation (see Satorra & Bentler, 2001), was not significant, Δχ²(1) = 1.86, p = .17. This indicates that the more parsimonious NM model is to be preferred to the M-1 model. Goodness-of-fit indices and parameter estimates of this model are reported in Table 3. As illustrated, the consistency and occasion-specificity of this model were of similar size. This suggests that the SE-IAT includes both trait and situational effects. Importantly, aggregating the results of the two halves, an adequate reliability for the SE-IAT was found (see Table 3).

Multi-Construct LST Analysis on SE-IAT and RSES

In order to investigate the relationship between implicit and explicit latent traits, two alternative models (i.e., the NM

and the M-1) were estimated. These models included two measurement occasions for both SE-IAT and RSES test-halves, separated by an interval of 2 months. This represents an approach called multi-construct LST analysis (e.g., Steyer, Majcen, Schwenkmezger, & Buchner, 1989). As in the LST model described above, the MLR robust estimator was used. As for the mono-construct analysis, an NM model that includes trait, state, and error components was compared with an M-1 model, which also includes implicit and explicit method-specific factors in order to control for differences between the IAT and RSES test-halves. No significant difference was found between the chi-square values of the two models, Δχ²(2) = 4.20, p = .12, indicating the more parsimonious NM model as the best fitting one. Goodness-of-fit and parameter estimates of this model are reported in Table 3. The comparison between the variance components of the implicit and explicit measures revealed: (1) a lower variance of the trait component for the SE-IAT; (2) a higher variance of the latent state residual for the SE-IAT; and (3) a higher error variance for the SE-IAT. These results suggest that the implicit measure is more influenced by situational factors than the explicit one. Interestingly, the correlation between the implicit and explicit latent traits was significant (r = .42, p < .05) and considerably higher than the implicit-explicit relationship usually reported in the literature between the SE-IAT and the RSES. Moreover, no significant correlations emerged between the SE-IAT and RSES latent state residuals when these parameters were relaxed.
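The chi-square difference tests reported here and in the mono-construct analysis were corrected for MLR estimation following Satorra and Bentler (2001). Below is a minimal sketch of that scaled difference test; it assumes that the robust chi-square values, degrees of freedom, and scaling correction factors are available from the software output, and the numeric inputs are placeholders rather than values from this study.

```python
# Minimal sketch of the Satorra-Bentler (2001) scaled chi-square difference test
# used to compare nested models under MLR estimation. Inputs are the robust
# (scaled) chi-square values, degrees of freedom, and scaling correction factors
# for each model; the example numbers are placeholders, not study values.

def scaled_chisq_difference(t0, df0, c0, t1, df1, c1):
    """Nested (more restricted) model: t0, df0, c0; comparison model: t1, df1, c1."""
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)   # scaling factor for the difference
    trd = (t0 * c0 - t1 * c1) / cd             # scaled difference statistic
    return trd, df0 - df1

if __name__ == "__main__":
    trd, ddf = scaled_chisq_difference(t0=51.6, df0=52, c0=1.05,
                                       t1=49.7, df1=51, c1=1.04)
    print(f"scaled delta chi-square({ddf}) = {trd:.2f}")
```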

Discussion

Overall, the LST analyses showed an adequate level of reliability and a significant portion of trait variance for both implicit and explicit measures. However, a higher consistency along with a lower occasion-specificity emerged for the RSES with respect to the SE-IAT, suggesting that situational factors affect implicit measures of SE more than explicit ones. A similar pattern of results was found in a study on the implicit measurement of personality traits



(Schmukle & Egloff, 2005), which revealed a higher occasion-specificity for implicit measures of anxiety and extraversion compared to self-ratings of the same constructs. Moreover, findings from the present study are consistent with the typical pattern of high internal consistency and moderate stability observed in different self-esteem IAT studies (e.g., DeHart et al., 2006), as well as with experimental studies demonstrating the proneness of the SE-IAT to contextual factors (e.g., Rudman, Dohn, & Fairchild, 2007). Moreover, as illustrated above, the SE-IAT and RSES latent state residuals are not correlated, indicating that different situational factors may affect implicit and explicit self-esteem scores. A noteworthy finding is that, despite the low relationship between the SE-IAT and RSES reported in many studies (see Buhrmester et al., 2011), we observed a correlation of moderate size between the trait components of the two self-esteem measures. This may suggest that the implicit-explicit correlation has been underestimated in previous studies, which have failed to disentangle trait SE from random error and the latent state residual. Recently, it has been hypothesized that the SE-IAT taps mere ephemeral states determined by situational factors, such as nonconscious mood states of participants, conditioned responses, or idiosyncratic associations to the particular stimulus words they have been asked to consider (Buhrmester et al., 2011). However, both the size of the trait variance and the implicit-explicit trait correlation we found suggest that the SE-IAT measures not only state and error components but also true stable individual differences, supporting the hypothesis that it is not merely influenced by ephemeral states determined by situational factors. As regards the limitations of the study, despite the adequate ratio between the number of cases and estimated parameters, as well as the robustness of the tested models (Geiser & Lockhart, 2012), further studies with larger samples are necessary to evaluate the replicability of the present results. A second limitation is the exclusive focus on the classical IAT, which necessarily includes a comparison between “Me” and “Other.” Further studies should be conducted to confirm the results and generalize them to other IAT variants (e.g., the SC-IAT; Karpinski & Steinman, 2006) and to experimental paradigms based on different mechanisms (e.g., the Affect Misattribution Procedure; Payne, Cheng, Govorun, & Stewart, 2005).

References

Bosson, J., Swann, W. B. Jr., & Pennebaker, J. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79, 631–643. doi: 10.1037/0022-3514.79.4.631
Buhrmester, M. D., Blanton, H., & Swann, W. (2011). Implicit self-esteem: Nature, measurement, and a new way forward. Journal of Personality and Social Psychology, 100, 365–385. doi: 10.1037/a0021341
Cai, H., Sedikides, C., Gaertner, L., Wang, C., Carvallo, M., Xu, Y., ... Eckstein-Jackson, L. (2011). Tactical self-enhancement in China: Is modesty at the service of self-enhancement in East-Asian culture? Social Psychological and Personality Science, 2, 59–64. doi: 10.1177/1948550610376599
DeHart, T., Pelham, B. W., & Tennen, H. (2006). What lies beneath: Parenting style and implicit self-esteem. Journal of Experimental Social Psychology, 42, 1–17. doi: 10.1016/j.jesp.2004.12.005
Deinzer, R., Steyer, R., Eid, M., Notz, P., Schwenkmezger, P., Ostendorf, F., & Neubauer, A. (1995). Situational effects in trait assessment: The FPI, NEO-FFI, and EPI questionnaires. European Journal of Personality, 9, 1–23. doi: 10.1002/per.2410090102
Dentale, F., San Martini, P., De Coro, A., & Di Pomponio, I. (2010). Alexithymia increases the discordance between implicit and explicit self-esteem. Personality and Individual Differences, 49, 762–767. doi: 10.1016/j.paid.2010.06.022
Dentale, F., Vecchione, M., De Coro, A., & Barbaranelli, C. (2012). On the relationship between implicit and explicit self-esteem: The moderating role of dismissing attachment. Personality and Individual Differences, 52, 173–177. doi: 10.1016/j.paid.2011.10.009
Egloff, B., Schwerdtfeger, A., & Schmukle, S. C. (2005). Temporal stability of the Implicit Association Test-Anxiety. Journal of Personality Assessment, 84, 82–88. doi: 10.1207/s15327752jpa8401_14
Eid, M. (1996). Longitudinal confirmatory factor analysis for polytomous item responses: Model definition and model selection on the basis of stochastic measurement theory. Methods of Psychological Research Online, 1, 65–85.
Eid, M., Schneider, C., & Schwenkmezger, P. (1999). Do you feel better or worse? The validity of perceived deviations of mood states from mood traits. European Journal of Personality, 13, 283–306. doi: 10.1002/(SICI)1099-0984(199907/08)
Gawronski, B., & Creighton, L. A. (2013). Dual-process theories. In D. E. Carlston (Ed.), The Oxford handbook of social cognition (pp. 282–312). New York, NY: Oxford University Press.
Gawronski, B., Deutsch, R., LeBel, E., & Peters, K. (2008). Response interference as a mechanism underlying implicit measures: Some traps and gaps in the assessment of mental associations with experimental paradigms. European Journal of Psychological Assessment, 24, 218–228. doi: 10.1027/1015-5759.24.4.218
Geiser, C., & Lockhart, G. (2012). A comparison of four approaches to account for method effects in latent state-trait analyses. Psychological Methods, 17, 255–283. doi: 10.1037/a0026977
Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit Association Test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79, 1022–1038. doi: 10.1037/0022-3514.79.6.1022
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480. doi: 10.1037/0022-3514.74.6.1464
Greenwald, A. G., Nosek, B. A., & Banaji, M. R. (2003). Understanding and using the Implicit Association Test: I. An improved scoring algorithm. Journal of Personality and Social Psychology, 85, 197–216. doi: 10.1037/0022-3514.85.2.197
Heatherton, T. F., & Polivy, J. (1991). Development and validation of a scale for measuring state self-esteem. Journal of Personality and Social Psychology, 60, 895–910. doi: 10.1037/0022-3514.60.6.895
Hofmann, W., Gschwendner, T., & Schmitt, M. (2005). On implicit-explicit consistency: The moderating role of individual differences in awareness and adjustment. European Journal of Personality, 19, 25–49. doi: 10.1002/per.537
James, W. (1890). The principles of psychology (Vol. 1). New York, NY: Holt.
Karpinski, A., & Steinman, R. B. (2006). The Single Category Implicit Association Test as a measure of implicit social cognition. Journal of Personality and Social Psychology, 91, 16–32. doi: 10.1037/0022-3514.91.1.16
Kline, R. B. (2011). Principles and practice of structural equation modeling. New York, NY: Guilford Press.
Markus, H. (1977). Self-schemas and processing information about the self. Journal of Personality and Social Psychology, 35, 63–78.
Marsh, H. W., Scalas, L. F., & Nagengast, B. (2010). Longitudinal tests of competing factor structures for the Rosenberg Self-Esteem Scale: Traits, ephemeral artifacts, and stable response styles. Psychological Assessment, 22, 366–381. doi: 10.1037/a0019225
Muthén, L. K., & Muthén, B. O. (2010). Mplus user's guide (6th ed.). Los Angeles, CA: Muthén & Muthén.
Payne, B. K., Cheng, C. M., Govorun, O., & Stewart, B. D. (2005). An inkblot for attitudes: Affect misattribution as implicit measurement. Journal of Personality and Social Psychology, 89, 277–293. doi: 10.1037/0022-3514.89.3.277
Pohl, S., & Steyer, R. (2010). Modelling common traits and method effects in multitrait-multimethod analysis. Multivariate Behavioral Research, 45, 1–28. doi: 10.1080/00273170903504729
Pohl, S., Steyer, R., & Krause, K. (2008). Modelling method effects as individual causal effects. Journal of the Royal Statistical Society, Series A, 171, 41–63. doi: 10.1111/j.1467-985X.2007.00517.x
Prezza, M., Trombaccia, F. R., & Armento, L. (1997). La scala dell'autostima di Rosenberg: Traduzione e validazione italiana [The Rosenberg Self-Esteem Scale: Italian translation and validation]. Bollettino di Psicologia Applicata, 223, 35–44.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Rudman, L. A., Dohn, M. C., & Fairchild, K. (2007). Implicit self-esteem compensation: Automatic threat defense. Journal of Personality and Social Psychology, 93, 798–813. doi: 10.1037/0022-3514.93.5.798
Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure analysis. Psychometrika, 66, 507–514. doi: 10.1007/BF02296192
Schmukle, S. C., & Egloff, B. (2004). Does the Implicit Association Test for assessing anxiety measure trait and state variance? European Journal of Personality, 18, 483–494. doi: 10.1002/per.525
Schmukle, S. C., & Egloff, B. (2005). A latent state-trait analysis of implicit and explicit personality measures. European Journal of Psychological Assessment, 21, 100–107. doi: 10.1027/1015-5759.21.2.100
Sherman, J. W., Gawronski, B., Gonsalkorale, K., Hugenberg, K., Allen, T. J., & Groom, C. J. (2008). The self-regulation of automatic associations and behavioral impulses. Psychological Review, 115, 314–335. doi: 10.1037/0033-295X.115.2.314
Steyer, R., Ferring, D., & Schmitt, M. J. (1992). States and traits in psychological assessment. European Journal of Psychological Assessment, 8, 79–98.
Steyer, R., Majcen, A. M., Schwenkmezger, P., & Buchner, A. (1989). A latent state-trait anxiety model and its application to determine consistency and specificity coefficients. Anxiety Research, 1, 281–299. doi: 10.1080/08917778908248726
Steyer, R., & Schmitt, M. (1990). The effects of aggregation across and within occasions on consistency, specificity, and reliability. Methodika, 4, 58–94.
Steyer, R., Schmitt, M., & Eid, M. (1999). Latent state-trait theory and research in personality and individual differences. European Journal of Personality, 13, 389–408. doi: 10.1002/(SICI)1099-0984
Steyer, R., Mayer, A., Geiser, C., & Cole, D. (2015). A theory of states and traits – Revised. Annual Review of Clinical Psychology, 11, 71–98. doi: 10.1146/annurev-clinpsy-032813-153719
Teige-Mocigemba, S., Klauer, K. C., & Sherman, J. W. (2010). Practical guide to Implicit Association Test and related tasks. In B. Gawronski & B. K. Payne (Eds.), Handbook of implicit social cognition: Measurement, theory, and applications (pp. 117–139). New York, NY: Guilford Press.
Vecchione, M., Dentale, F., Alessandri, G., & Barbaranelli, C. (2014). Fakability of implicit and explicit measures of the Big Five: Research findings from organizational settings. International Journal of Selection and Assessment, 22, 211–218. doi: 10.1111/ijsa.12070
Zeigler-Hill, V., & Jordan, C. H. (2010). Two faces of self-esteem: Implicit and explicit forms of self-esteem. In B. Gawronski & B. K. Payne (Eds.), Handbook of implicit social cognition: Measurement, theory, and applications (pp. 392–407). New York, NY: Guilford Press.

Received August 7, 2015
Revision received March 16, 2016
Accepted March 29, 2016
Published online December 29, 2016

Francesco Dentale
Department of Psychology
"Sapienza" University of Rome
via dei Marsi, 78
00185 Rome
Italy
francesco.dentale@libero.it



New edition of the popular text that separates the facts from the myths about drug and substance use “I highly recommend this book for an accurate, very readable, and useful overview of drug use problems and their treatment.” Stephen A. Maisto, PhD, ABPP, Professor of Psychology, Syracuse University, NY

Mitch Earleywine

Substance Use Problems (Series: Advances in Psychotherapy – Evidence-Based Practice – Volume 15) 2nd ed. 2016, viii + 104 pp. US $29.80 / € 24.95 ISBN 978-0-88937-416-4 Also available as eBook The literature on diagnosis and treatment of drug and substance abuse is filled with successful, empirically based approaches, but also with controversy and hearsay. Health professionals in a range of settings are bound to meet clients with troubles related to drugs – and this text helps them separate the myths from the facts. It provides trainees and professionals with a handy, concise guide for helping problem drug users build enjoyable, multifaceted lives using approaches based on decades of research.

www.hogrefe.com

2nd edition

Readers will improve their intuitions and clinical skills by adding an overarching understanding of drug use and the development of problems that translates into appropriate techniques for encouraging clients to change behavior themselves. This highly readable text explains not only what to do, but when and how to do it. Seasoned experts and those new to the field will welcome the chance to review the latest developments in guiding self-change for this intriguing, prevalent set of problems.


Hogrefe OpenMind Open Access Publishing? It’s Your Choice! Your Road to Open Access Authors of papers accepted for publication in any Hogrefe journal can now choose to have their paper published as an open access article as part of the Hogrefe OpenMind program. This means that anyone, anywhere in the world will – without charge – be able to read, search, link, send, and use the article for noncommercial purposes, in accordance with the internationally recognized Creative Commons licensing standards.

The Choice Is Yours 1. Open Access Publication: The final “version of record” of the article is published online with full open access. It is freely available online to anyone in electronic form. (It will also be published in the print version of the journal.) 2. Traditional Publishing Model: Your article is published in the traditional manner, available worldwide to journal subscribers online and in print and to anyone by “pay per view.” Whichever you choose, your article will be peer-reviewed, professionally produced, and published both in print and in electronic versions of the journal. Every article will be given a DOI and registered with CrossRef.

www.hogrefe.com

How Does Hogrefe’s Open Access Program Work? After submission to the journal, your article will undergo exactly the same steps, no matter which publishing option you choose: peer-review, copy-editing, typesetting, data preparation, online reference linking, printing, hosting, and archiving. In the traditional publishing model, the publication process (including all the services that ensure the scientific and formal quality of your paper) is financed via subscriptions to the journal. Open access publication, by contrast, is financed by means of a one-time article fee (€ 2,500 or US $3,000) payable by you the author, or by your research institute or funding body. Once the article has been accepted for publication, it’s your choice – open access publication or the traditional model. We have an open mind!


Multistudy Report

Validation of the Adult Substance Abuse Subtle Screening Inventory-4 (SASSI-4)

Linda E. Lazowski (1) and Brent B. Geary (2)

(1) The SASSI Institute, Springville, IN, USA
(2) Independent Clinical Practice, Phoenix, AZ, USA

European Journal of Psychological Assessment (2019), 35(1), 86–97. DOI: 10.1027/1015-5759/a000359

Abstract: The study objective was to develop a revision of the adult Substance Abuse Subtle Screening Inventory-3 that includes new items to identify nonmedical use of prescription medications, as well as additional subtle and symptom-related identifiers of substance use disorders (SUDs), and to evaluate its psychometric properties and screening accuracy against a criterion of DSM-5 diagnoses of SUD. Clinical professionals throughout the nine US Census Bureau regions and two Canadian provinces who used the SASSI Online screening tool submitted 1,284 completed administrations of the provisional SASSI-4 along with their independent DSM-5 diagnoses of SUD. Validation sample findings demonstrated SASSI-4 sensitivity of 93% and specificity of 90%, AUC = .91. Items added to identify respondents who were abusing prescription medications showed 94% overall screening accuracy. Logistic regression showed no significant effects of client demographic characteristics or type of screening setting on the accuracy of SASSI-4 screening outcomes. In Study 2, 120 adults in recovery from SUD completed the SASSI-4 under instructions to fake good. Sensitivity of 79% was demonstrated for the full scoring protocol and was 47% when only face valid scales were utilized. Clinical utility is discussed.

Keywords: alcohol and drug screening, prescription drug abuse, substance use disorders, screening accuracy

Substance use disorders (SUDs) have received extensive attention as a significant public health concern (Bouchery, Harwood, Sacks, Simon, & Brewer, 2011; US Department of Justice, 2011). Costs of SUD to the US economy are profound: abuse of illicit substances is estimated to cost $193 billion annually in lost work productivity, healthcare, early mortality, and crime, and alcohol-related costs exceed $220 billion (National Institute on Drug Abuse, 2015). Recent investigations estimate that more than 24.6 million Americans over age 12 use illegal drugs, 16.5 million drink heavily, and 60.1 million are past-month binge drinkers (Substance Abuse and Mental Health Services Administration, 2014). Additionally, the misuse of prescription medication has steadily increased and is a major contributing factor in drug overdose deaths, rising emergency room visits, and neonatal opioid withdrawal syndrome, among other serious consequences (US Department of Health & Human Services, 2013). Screening instruments that facilitate detection of SUDs are desirable to address their significant human, economic, and societal costs. Validated SUD screening instruments are integral to the work of substance use assessment professionals and treatment providers. Increasingly, screening for SUD has

been advanced among best practices in primary healthcare settings (Babor et al., 2007), treatment of chronic pain (Compton, Darakjian, & Miotto, 1998), and care of military service members (Santiago, 2014). Such tools also play a significant role in evaluations done in criminal justice probation and reintegration programs (Belenko, 2006), interventions for domestic violence (Easton, Swan, & Sinha, 2000), treatment of patients with spinal cord and traumatic brain injuries (Andelic et al., 2010; Hawkins & Heinemann, 1998), and identification of depression severity risk factors (Williams et al., 2014). Early identification and treatment for SUD can improve outcomes in many domains including family cohesion, crime reduction, and vocational rehabilitation (e.g., Heinemann, Moore, Lazowski, Huber, & Semik, 2014). The Substance Abuse Subtle Screening Inventory (SASSI; Miller, 1985) was designed to be an easily administered and objective tool that would assist practitioners in identifying persons with a high probability of a substance use disorder so that additional evaluation and treatment could be initiated when appropriate. Key to the design of the inventory was the inclusion of subtle items. Such items make no obvious reference to substance use but are effective in identifying individuals likely to have an SUD, Ă“ 2016 Hogrefe Publishing



despite inability or unwillingness to acknowledge substance use behaviors (Laux, Piazza, Salyers, & Roseman, 2012; Miller, 1985). Modifications to the scale composition of the inventory during two revisions, SASSI-2 (Miller, 1994) and SASSI-3 (Lazowski, Miller, Boye, & Miller, 1998; Miller & Lazowski, 1999), enhanced its screening accuracy and clinical utility. In 2013, the American Psychiatric Association (APA) published revised criteria for diagnosing substance use disorders (5th ed.; DSM-5; APA, 2013). The present study was designed to formulate a revision of the SASSI and to validate screening accuracy estimates for the instrument in a heterogeneous sample of adults diagnosed with and without SUD according to the most current and widely accepted diagnostic standards, the DSM-5. An additional aim was to improve the utility of the instrument by adding a brief new scale to identify prescription medication misuse and to evaluate its screening efficacy. Subtle items that were unscored on the SASSI-3, as well as new items designed to represent DSM-5 symptoms of SUD, also were evaluated for screening accuracy and improved clinical utility. In a second study, an independent sample of respondents completed the instrument under honest and fake good instructions to examine sensitivity when respondents attempted to conceal evidence of SUD.

Methods

Clinical assessment professionals who were qualified users of the SASSI Institute SUD web-based screening application were recruited to administer a research version of the adult SASSI-4 to their clients. They were also asked to submit an independent DSM-5 SUD diagnostic evaluation for each client who completed the screening instrument, based on the proposed diagnostic criteria previously published online (http://www.dsm5.org) by the APA in 2012. Scoring and screening reports for each client were provided by the Institute. Study enrollment lasted from September 2012 through December 2013. Eligibility criteria were being at least 18 years of age and English speaking. Assessment professionals from 38 states in all nine US Census Bureau regions, as well as two Canadian provinces, participated. Using the SASSI Online web application, after counselors provided a client’s demographic and available history data (e.g., arrest history, blood alcohol content [BAC] level, prior alcohol or drug treatment instances), a link to the questionnaire was provided for client responses, which were submitted directly to the secure web server upon completion. Following submission of the counselor diagnostic evaluation for the client, the screening responses were scored, and results were made available to the counselor in an online dashboard. Counselors were able to opt out


of the study at any time or exclude client data from the investigation. Figure 1 indicates data collection steps for client data, including rates of exclusions, diagnoses, SUD screening, and randomization of the sample into development and validation groups.

Participants Clinical Sample Counselors in 163 practices and organizations throughout the US and Canada administered the screening instrument. Descriptive statistics for cases screened by assessment setting type and client demographic characteristics are shown in Table 1. Diagnosed incidence of SUD in these six assessment setting types ranged from 60% to 86%. The frequency of criterion positive cases submitted by the individual practices and organizations ranged from 0 to 100%. Retest Sample Retest reliability of SASSI-4 scale scores was examined with an independent sample of 40 participants who were not receiving treatment for substance use. Project staff recruited volunteers aged 18 and older in community organizations (e.g., nonprofits, jobs programs, philanthropic organizations) to complete the screening questionnaire anonymously on two occasions, 1–8 weeks apart (M = 22.8 days, SD = 12.3, range 8–60 days) and offered participants a $10 honorarium. The sample included 12 (30%) men and 28 (70%) women. Their mean age was 61.9 years (SD = 14.5, range 24–85 years). Ethnicities of sample participants included 38 (95%) Caucasians, 1 (2.5%) African American, and 1 (2.5%) person of Hispanic origin. Forty percent of the sample (n = 16) had a high school diploma or equivalent, 27% (n = 11) had a vocational or 2-year college degree, and 33% (n = 13) had 3 or more years of postsecondary education. Study 2 Testing resistance to faking good. To replicate earlier work on the development of the SASSI in which respondents were instructed to hide signs of unfavorable characteristics and substance misuse (Miller, 1985) and to test the utility of new subtle items as indicators of SUD, an independent sample of 120 respondents was recruited to complete the SASSI-4. Participants were recruited from community organizations that provide support programs (e.g., housing or job search assistance) for adults in recovery from substance use disorders. Adult volunteers were invited to complete the screening questionnaire anonymously on two occasions, approximately 2 weeks apart (M = 11.8 days, SD = 8.5), under two types of instructions. In the first administration, participants were asked to complete the European Journal of Psychological Assessment (2019), 35(1), 86–97



Figure 1. SASSI-4 clinical sample flow diagram.

questionnaire honestly, based on their actual attitudes and experiences with alcohol, illicit drugs, and nonmedical use of prescription medications. For the second administration, participants were directed to respond as someone would if they were attempting to conceal unfavorable characteristics and substance misuse while still appearing to be believable and forthcoming in their responses. A $10 honorarium was offered. The sample’s mean age was 42.2 years (SD = 13.2, range 19–71 years). Seventy-nine percent (n = 95) of the sample participants were men. Respondents’ race/ethnicity included White (77%; n = 92), Black (18%; n = 22), and Native American, biracial, or other races (5%; n = 6). Highest education included 15% (n = 18) with less than a European Journal of Psychological Assessment (2019), 35(1), 86–97

high school diploma, 31% (n = 37) with a high school diploma or equivalent, and 54% (n = 65) with one or more years of vocational or other postsecondary education.

Measures

SUD Diagnoses
In DSM-IV, SUDs had been classified into categories of substance dependence and substance abuse. In DSM-5, the former abuse and dependence symptom criteria are combined into a single diagnostic set. The 11 specific symptoms remain the same as in DSM-IV, with the exception that the former symptom “substance-related



Table 1. SASSI-4 clinical sample participant characteristics

Characteristic                                                      n = 1,245    %
Assessment setting
  Substance use treatment programs                                  564          46
  Criminal justice: drug courts, probation and parole,
    community corrections                                           209          18
  Private practice                                                  190          16
  Behavioral health facilities                                      107           9
  DOT and DUI screening and education                                75           6
  Social service programs                                            59           5
Clinical SUD diagnosis
  Criterion positive                                                945          76
  Criterion negative                                                300          24
Age (range 18–79)
  M                                                                  34.4
  SD                                                                 12.1
Gender
  Male                                                              781          63
  Female                                                            464          37
Education
  Less than high school diploma                                     199          17
  High school diploma                                               479          41
  One or more years postsecondary education                         490          42
Race/Ethnicity
  White                                                             900          75
  Black                                                             133          11
  Hispanic                                                           83           7
  Native American or Alaska Native                                   45           4
  Asian, Native Hawaiian, or Pacific Islander                        11           1
  Biracial or other                                                  26           2
Employment status
  Full time                                                         445          38
  Not employed                                                      381          32
  Part time                                                         146          12
  Retired due to disability                                          85           7
  Student                                                            81           7
  Homemaker                                                          25           2
  Retired due to age                                                 13           1
Marital status
  Single                                                            596          49
  Married or cohabiting with partner                                330          28
  Divorced or separated                                             270          22
  Widowed                                                            16           1

legal problems” was removed and the new symptom “craving or strong desire or urge to use a substance” was added. In DSM-5, specifiers for SUD severity are delineated by the number of symptoms evidenced: 2–3, mild; 4–5, moderate; 6 or more, severe. Counselors indicated the presence or absence of the 11 DSM-5 SUD symptom criteria and specified for which Ó 2016 Hogrefe Publishing

drug class the symptom was evidenced, within the time period (past 12 months, lifetime) for which they conducted each diagnostic evaluation. In addition to those with current SUDs, cases with SUD in remission were considered criterion positive. Counselors also indicated diagnoses of any non-substance related psychological disorders. Cases without sufficient diagnostic information were excluded from analyses. Logistic regression was used to evaluate whether excluded cases differed in client demographic characteristics or type of assessment setting from clinical cases that did include sufficient diagnostic information. Age, gender, ethnicity, education level, employment status, marital status, and type of screening setting were used as predictor variables of case inclusion status. The omnibus test of the model coefficients showed no significant effects of the demographic or setting variables on case inclusion status, χ²(29, n = 1,320) = 28.1, p = .45. The incidence of diagnosed SUD in this screening sample was 75.7%. Sixteen percent of these cases met criteria for mild SUD, 12% for moderate SUD, and 48% for severe SUD. Three hundred eighty cases (29.6%) included diagnoses of non-substance related psychiatric disorders, both with and without coexisting SUD. Non-substance related diagnoses in the clinical sample included: depression (17.1%), anxiety (14.3%), bipolar disorder (5.8%), attention-deficit hyperactivity disorder (5.0%), posttraumatic stress disorder (4.0%), and other mental health diagnoses (3.5%).

Research Version of the SASSI-4
The SASSI-3 consists of 67 true-false items that identify SUD through obvious and subtle content. The instrument also includes 26 face valid alcohol and other drug frequency items. Face valid and subtle items are organized into seven scales that are utilized in a series of decision rules to produce a dichotomous SUD screening classification. The Face Valid Alcohol (FVA) and Face Valid Other Drug (FVOD) scales measure how often (0 = never to 3 = repeatedly) respondents have engaged in and experienced effects from the use of alcohol and other drugs within a specified time frame (e.g., lifetime, past 12 months). All other SASSI-3 scales utilize a true-false response format. The Symptoms (SYM) scale contains face valid items that assess substance use history and consequences. The Obvious Attributes (OAT) scale is compiled empirically of items shown to discriminate between SUD criterion groups under standard instructions to answer honestly. The Subtle Attributes (SAT) scale consists of items found to discriminate between individuals with and without SUDs when respondents answered honestly and when they attempted to conceal signs of substance misuse. The Defensiveness (DEF) scale discriminates responses given under honest



versus fake good instructions. The measure indicates the extent to which respondents are willing to acknowledge minor, socially acceptable limitations or attempt to deny such flaws. The SAM (Supplemental Addictions Measure) scale consists of items that discriminate between SUD criterion groups and is used in the SASSI-3 decision rules. These seven scales are utilized in the dichotomous screening outcome. The inventory also contains two supplementary clinical scales that are not used to screen for SUD but provide information that can be useful in evaluation and treatment planning. Finally, the Random Answering Pattern (RAP) scale is used to identify profile invalidity that might be due to deliberate noncompliance, insufficient reading comprehension, inattention, or other processes. Fifteen new items were added to the research version of the SASSI-4 to assess their accuracy in identifying persons with SUDs based on subtle identifiers, nonmedical use of prescription drugs, and symptom criteria for a DSM-5 diagnosis of SUD (e.g., craving for a substance). The Lexile Framework for Reading was used to assess the readability of the questionnaire (Stenner, Burdick, Sanford, & Burdick, 2007). The Lexile measure for the provisional instrument was 740L, which corresponds to the reading text complexity band for 4th and 5th grade students (Nelson, Perfetti, Liben, & Liben, 2012). Average time to complete the inventory was approximately 15 min (M = 14.8 min, SD = 5.9). SASSI responses were evaluated for profile invalidity. Thirty-nine cases indicated elevated Random Answering Pattern scores (2 or higher) and were excluded from analyses. A total of 1,245 complete cases, including a valid and complete SASSI-4, a diagnostic evaluation for SUD, and client demographic information, served as the primary dataset (see Figure 1).

Analysis Strategy Overview Five psychometric indices of the screening inventory scores and their performance were evaluated: retest reliability; internal consistency reliability; screening accuracy; impacts of assessment setting and client demographic characteristics on accuracy; and robustness of screening performance when clients attempt to fake good.

Data Analyses Reliability Retest reliability was evaluated with Pearson correlations. Because these correlation coefficients measure the reproducibility of the rank order of participant scores on retest


but do not allow one to assess the magnitude of change in raw scale scores, paired sample t-tests also were conducted on participant responses to assess raw scale score stability.

Internal Consistency
The SASSI is composed of multiple subscales that were compiled empirically on the basis of whether items discriminated between known criterion groups under various instructional sets (Miller, 1985, 1994). SASSI items were not designed or selected based on a criterion of constant item variances. Noting how infrequently the assumptions of essential tau-equivalence are met, although they are necessary for the appropriate use of coefficient alpha as an estimate of internal consistency reliability, recent papers in the psychometric literature have suggested the use of coefficient omega as a preferred measure of internal consistency (Dunn, Baguley, & Brunsden, 2014; Gignac, 2014; McDonald, 1999; Revelle & Zinbarg, 2009). Because the congeneric measurement model underlying the use of omega does not require homogeneity of item variances, the use of coefficient omega as an estimate of internal consistency is appropriate for the current data. Internal consistency reliability was estimated for the SASSI-4 by calculating omega coefficients (McDonald, 1999) for the item set composed of all items included on the screener as indicators of SUD likelihood (Table 2: “SASSI-4 overall”) and for each scale individually. DEF and RAP scale items, as well as items in the two supplementary clinical scales, FAM and COR, were excluded from the overall analysis because they were not included on the instrument as measures of SUD likelihood. Omega coefficients were computed using R open source software statistical functions (R Development Core Team, 2014). In contrast to coefficient omega, omega hierarchical estimates how reliable test scores are as indicators of the target construct of interest (here SUD) and as such is a measure of general factor saturation (McDonald, 1999; Zinbarg, Revelle, Yovel, & Li, 2005). Omega hierarchical for the SASSI-4 was calculated using the omega function implemented in the psych package (Revelle, 2014) of R statistical software (R Development Core Team, 2014).

Criterion Validity
The clinical sample was divided randomly into a development sample and a reserve, cross-validation sample. The development sample was used to establish a scoring protocol that maximized correspondence with clinicians’ diagnoses of SUD while balancing the false positive and false negative error rates. Screening accuracy of the new scoring protocol was validated on the reserve set of clinical cases. We computed various measures of accuracy,


Table 2. Test-retest reliability and internal consistency reliability estimates for the SASSI-4

                                   Test-retest reliability (a)       Internal consistency reliability (b)
Scale                              Pearson r    95% CI               Omega     95% CI
SASSI-4 overall (c)                .99          [.98, .99]           .97       [.96, .97]
Face valid alcohol                 .99          [.98, .99]           .93       [.93, .94]
Face valid other drug              .99          [.98, .99]           .96       [.96, .97]
Symptoms                           .97          [.94, .98]           .90       [.89, .90]
Obvious attributes                 .91          [.83, .95]           .74       [.71, .76]
Subtle attributes                  .84          [.70, .91]           .70       [.67, .72]
Defensiveness                      .78          [.61, .88]           .72       [.71, .75]
Supplemental addiction measure     .91          [.83, .95]           .83       [.81, .84]
Family vs. controls (d)            .89          [.80, .94]           .70       [.67, .72]
Correctional (d)                   .95          [.90, .97]           .81       [.80, .83]

Notes. CI = confidence interval. (a) N = 40. (b) N = 1,245. (c) Only items that are utilized as indicators of SUD likelihood were included in this analysis. (d) Scale is not used to classify respondents regarding likelihood of SUD.

including: sensitivity (true positives/criterion positives), specificity (true negatives/criterion negatives), positive predictive value (true positives/test positives), and negative predictive value (true negatives/test negatives). Percent correctly classified and areas under the curve (AUCs) of receiver operating characteristic (ROC) curves were utilized as summary indices of agreement between screening outcomes and DSM-5 SUD diagnoses. Likelihood ratio chi-square statistics tested whether the observed agreement differed from chance.

To examine whether the accuracy of SASSI-4 screening outcomes was impacted differentially by respondents’ demographic characteristics or the type of assessment setting in which they were screened, we conducted a logistic regression analysis on screening accuracy in the validation sample using age, gender, ethnicity, education level, employment status, marital status, and screening setting as predictor variables. Gender was the only predictor that had been utilized when formulating the scoring rules with the development sample and was included as a predictor in the logistic regression to assess whether any remaining gender-associated score variance impacted screening accuracy.

To assist practitioners in identifying individuals likely to have an SUD related to the misuse of prescription medications, six prescription drug abuse items were included on the SASSI-4. The items have face valid content such as “Used prescription drugs that were not prescribed for you” and “Took a higher dose or different medications than your doctor prescribed in order to get the relief you need.” Internal consistency and retest reliability estimates were calculated for this new scale. Accuracy of this screening measure was estimated using a diagnosis of SUD for opioid or sedative drugs as the criterion for presence of the disorder. Cases without an SUD served as the criterion negative sample.
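As a rough illustration of the accuracy indices defined above, the following Python fragment computes them from a binary diagnostic criterion and a binary screening outcome. The toy arrays are invented for illustration; this code is not part of the study's analyses.

```python
# Minimal sketch of the screening-accuracy indices described above, computed
# from a binary diagnostic criterion and a binary screening outcome.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

criterion = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])   # DSM-5 SUD diagnosis (1 = positive)
screen    = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1, 0])   # screening outcome (1 = positive)

tn, fp, fn, tp = confusion_matrix(criterion, screen).ravel()
sensitivity = tp / (tp + fn)            # true positives / criterion positives
specificity = tn / (tn + fp)            # true negatives / criterion negatives
ppv = tp / (tp + fp)                    # true positives / test positives
npv = tn / (tn + fn)                    # true negatives / test negatives
accuracy = (tp + tn) / len(criterion)   # percent correctly classified
auc = roc_auc_score(criterion, screen)  # AUC for a dichotomous screen

print(sensitivity, specificity, ppv, npv, accuracy, auc)
```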

Results

Modifications to SASSI-3 Scales
Discriminant analyses of development sample responses indicated a 90.5% correct classification rate based on the discriminant function equation. Items that discriminated between SUD criterion groups were retained in the SASSI-4 item pool. Three items that were scored in the SASSI-3 decision rules and four subtle, previously unscored SASSI-3 research items on the published instrument were not significant criterion group discriminators and were excluded from consideration in the new scoring rules. The six remaining previously unscored subtle research items and the 15 newly added items were significant discriminators of SUD criterion groups and were added to the FVA, FVOD, SYM, or SAT scales based on item content. To evaluate the effectiveness of the subtle items, we conducted an additional analysis that utilized only the SAT items as discriminators of criterion negative and fake good respondents. The discriminant function equation indicated the SAT items alone yielded a correct classification rate of 82.1%, indicating effectiveness of the item set in identifying persons with SUD, even when respondents attempted to conceal signs of substance use disorder.
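The analyses above are classification analyses of item responses against SUD criterion groups. A hedged sketch of this type of analysis, using scikit-learn's linear discriminant analysis on made-up data rather than the SASSI item pool, is shown below; it illustrates the general technique, not the study's exact procedure.

```python
# Minimal sketch of a discriminant-function classification analysis:
# items (X) are used to discriminate between criterion groups (y).
# X and y are random placeholder data, not SASSI responses.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 10)).astype(float)   # 10 true/false items
y = rng.integers(0, 2, size=200)                        # criterion group (e.g., SUD vs. no SUD)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
correct_rate = lda.score(X, y)   # proportion correctly classified by the discriminant function
print(f"Correct classification rate: {correct_rate:.1%}")
```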

Reliability Test-Retest Reliability As shown in Table 2, Pearson correlation estimates of SASSI-4 score reliability in this sample ranged from .78 to .99 and indicate high consistency in scores retested over the 8 to 60-day interval. DEF scores were more variable, as would be expected based on the likelihood that defensive


Table 3. Validation sample estimates of SASSI-4 screening accuracy by DSM-5 diagnostic features

Criterion status                              Prev    AUC [95% CI]      Sens   Spec   PPV   NPV   %CC
Any SUD (n = 625)                             76      .91 [.88, .95]    93     90     97    80    92
Lifetime SUD (n = 480)                        79      .92 [.89, .96]    92     92     98    76    92
Past year SUD (n = 145)                       66      .90 [.84, .97]    95     86     93    90    92
SUD severity (n = 625)
  Mild (n = 97)                               16      .84 [.79, .90]    78     90     84    87    86
  Moderate (n = 77)                           12      .91 [.87, .96]    92     90     83    96    91
  Severe (n = 300)                            48      .94 [.91, .97]    98     90     95    95    95
  No SUD (n = 151)                            24
Other mental health diagnoses (n = 380)               .96 [.92, .99]    98     93     99    89    97
  Co-occurring SUD (n = 320)                  84
  Non-substance related disorder (n = 60)     16
Rx screen (n = 246)                                    .93 [.89, .97]    87     98     97    93    94
  Opioid or sedative SUD (n = 95)             39
  No SUD (n = 151)                            61

Notes. All figures other than AUC are percentages. Prev = Prevalence of SUD within the diagnostic category; AUC = area under the curve; CI = confidence interval; Sens = Sensitivity; Spec = Specificity; PPV = Positive Predictive Value; NPV = Negative Predictive Value; %CC = percent correctly classified (i.e., accuracy); SUD = substance use disorder; Rx Screen = SASSI-4 prescription drug abuse scale.

responding is impacted by psychological state and situational factors (Flett, Besser, & Hewitt, 2005; Knee, Porter, & Rodriguez, 2014). Raw Scale Score Stability Overall magnitude of change was very small on all scales. The largest mean change occurred in OAT scores and indicated less than a 1-point increase (Δ .38) in the mean OAT scale score at Time 2. None of the differences were significant for any of the SASSI-4 scale scores (all ps > .05), indicating high temporal stability in raw scale scores in the absence of intervention. Internal Consistency Omega coefficients are shown in Table 2 and indicate overall internal consistency of .97 with scale reliability estimates ranging from .70 to .97. Findings also indicated an omega hierarchical coefficient of .78. Dividing omega hierarchical by omega total (here .78/.97) estimates 80.4% of the reliable variance in the overall measure is attributable to SUD likelihood, the construct of interest, while the remaining reliable variance is due to unique variance associated with the subscales (see also, Reise, Bonifay, & Haviland, 2013).
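For readers who want to reproduce this style of retest analysis on their own data, the sketch below computes a Pearson retest correlation (rank-order stability) and a paired-samples t-test for mean-level change with SciPy. The two score vectors are invented for illustration and are not SASSI-4 data.

```python
# Minimal sketch of the retest analyses described above: a Pearson correlation
# for rank-order stability and a paired-samples t-test for raw-score change.
import numpy as np
from scipy import stats

time1 = np.array([12.0, 15.0, 9.0, 22.0, 18.0, 11.0, 25.0, 14.0])
time2 = np.array([13.0, 14.0, 10.0, 21.0, 19.0, 12.0, 24.0, 15.0])

r, r_p = stats.pearsonr(time1, time2)    # retest reliability coefficient
t, t_p = stats.ttest_rel(time1, time2)   # mean-level (raw-score) stability
print(f"Pearson r = {r:.2f} (p = {r_p:.3f}); paired t = {t:.2f} (p = {t_p:.3f})")
```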

Criterion Validity

Correspondence of SASSI-4 Screening Outcomes With DSM-5 SUD Diagnoses
Beginning with the SASSI-3 decision rules and the newly compiled SASSI-4 scales, an iterative process was used to adjust

scale cutoffs in the development sample of clinical cases (N = 620). At each iteration we assessed the impact on screening sensitivity and specificity in this sample with the aim of optimizing overall accuracy while balancing both types of screening errors. Via this process, we formulated a scoring protocol that demonstrated overall development sample classification accuracy of 92.6%: sensitivity 94.1%, specificity 87.9%, positive predictive value (PPV) 91.2%, and negative predictive value (NPV) 82.4%; likelihood ratio (1, N = 620) = 383.8, p < .001. ROC analysis indicated an AUC of .91 (SE = .02), p < .001, 95% CI [.88, .94]. Screening accuracy indices resulting from the application of the new scoring rules to the reserve validation sample (N = 625) are shown in Table 3, as are accuracy findings for lifetime and past year SUD, and by SUD severity. Cases diagnosed with mild SUD reveal that 15 of the 21 (71%) cases with negative screening outcomes were diagnosed based on evidence of only two DSM-5 symptoms; the remaining six cases were diagnosed with mild SUD based on evidence of three symptom criteria. The most frequent pattern observed in test misses with two SUD symptoms showed the individual had been arrested for DUI and the clinician indicated symptom criteria of “used more than intended” and “used in hazardous situations.” Logistic Regression Using Demographic and Screening Setting Variables as Predictors of SASSI-4 Screening Accuracy Screening accuracy by assessment setting types ranged from 87% in government and community social service Ó 2016 Hogrefe Publishing



programs to 97% in criminal justice settings. The omnibus test of the model coefficients showed no significant effects of the demographic or screening setting variables on the accuracy of the screening classification, χ²(29, N = 554) = 28.1, p = .51. The null model, which included the accuracy of the SASSI-4 screening outcome as the constant, correctly predicted the classification of 510 cases (92.1%), and the full model that included the six client demographic variables and type of assessment setting showed no improvement in accuracy prediction; 510 cases were correctly classified in this model.

Accuracy in Detecting Likely SUD in Persons Diagnosed With Non-Substance Related Disorders
Clinicians’ assessments of individuals in the SASSI-4 validation sample indicated that 380 individuals had been diagnosed with non-substance related psychiatric disorders, either in addition to SUD (co-occurring disorders, n = 320) or with a non-substance related psychiatric disorder only (n = 60), whereby they were considered criterion negative with respect to SUD. SASSI-4 screening accuracy for these cases is also presented in Table 3 under the heading “Other Mental Health Diagnoses.” These findings provide evidence of construct validity and indicate that the SASSI-4 is effective in distinguishing identifiers associated with likely SUD from those associated with other behavioral health needs.

Identification of Likely Prescription Medication Abuse
Reliability analyses for the SASSI-4 Rx scale indicated a Pearson retest reliability coefficient of .95, 95% CI [.90, .97], and an internal consistency reliability coefficient omega of .88, 95% CI [.87, .89]. With cases in the development sample, a cutoff score of 3 or more on the prescription drug scale produced sensitivity of 83.3% and specificity of 96.0%, likelihood ratio (1, N = 251) = 186.5, p < .001; AUC = .90 (SE = .02), p < .001, 95% CI [.85, .94]. When this rule was applied to validation sample cases diagnosed with or without opioid or sedative SUDs, screening accuracy findings for the SASSI-4 Rx scale, shown in Table 3 under the heading “Rx Screen,” indicated strong correspondence between the prescription drug scale screening classification and clinicians’ diagnoses. To further specify parameters for the effective functioning of the SASSI-4 prescription drug scale, we assessed its screening sensitivity in identifying cases diagnosed with any type of SUD, using the same cutoff score of 3 used for opioid and sedative related SUDs. Thirty-five percent (165/474) of all SUDs in the validation sample were identified with this rule. Even when the cutoff score was lowered to 2, screening sensitivity for any type of SUD was 45% (214/474 cases), indicating that SASSI-4 prescription drug scores are not effective as a stand-alone screening measure for all SUDs.
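Both the case-inclusion analysis in the Methods and the screening-accuracy analysis above rely on an omnibus likelihood-ratio test of the logistic model coefficients. The sketch below shows one way such a test can be computed with statsmodels; the data frame, its column names, and all values are hypothetical placeholders, not the study data set.

```python
# Minimal sketch of an omnibus likelihood-ratio test: an intercept-only logistic
# model of screening accuracy is compared with a full model containing
# demographic and setting predictors. All data here are simulated placeholders.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "accurate": rng.integers(0, 2, 200),            # 1 = screening outcome matched diagnosis
    "age": rng.normal(35, 12, 200),
    "gender": rng.choice(["male", "female"], 200),
    "setting": rng.choice(["treatment", "justice", "private"], 200),
})

null_fit = smf.logit("accurate ~ 1", data=df).fit(disp=0)
full_fit = smf.logit("accurate ~ age + C(gender) + C(setting)", data=df).fit(disp=0)

lr = 2 * (full_fit.llf - null_fit.llf)              # omnibus likelihood-ratio statistic
df_diff = full_fit.df_model - null_fit.df_model
p = stats.chi2.sf(lr, df_diff)
print(f"chi-square({int(df_diff)}) = {lr:.2f}, p = {p:.3f}")
```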


Study 2 Results SASSI-4 Resistance to Faking Good Prior alcohol and drug treatment was reported by 89% of this sample of adults recovering from substance use disorders. Screening outcomes for responses in the honest condition indicated all 120 individuals tested positive on the SASSI-4; five of these individuals (4%) had an elevated score on the SASSI RAP scale, signaling that responses might not be valid, and were excluded in subsequent analyses. At Time 2 when asked to fake good, 23 additional respondents (20%) had elevated RAP scores and were excluded. The rate of invalid profiles in the fake good condition was significantly elevated above the 3% rate observed in the clinical sample and the 4% rate of these same participants in the honest condition, providing evidence that RAP scores can reflect deliberate response styles. Faking participants also omitted responses on one or more SASSI-4 scales (n = 18), which were sufficient to prevent calculation of a conclusive screening outcome for only one of these cases. Mean scores for cases with complete pairwise responses are shown by instructional set in Table 4. As shown in Table 4, under instructions to fake good, participants’ mean scores on the face valid scales decreased significantly – between one and two standard deviations. In addition, participants’ Defensiveness scores increased significantly, consistent with early research on the instrument (Miller, 1985) and the intended design to provide practitioners a way of identifying possible response minimization. Scores on the SAT scale decreased significantly, although on average, less than one standard deviation. Screening outcomes based on the full SASSI-4 scoring protocol, which utilizes both subtle and face valid scale scores, indicated sensitivity of 79.1% (72/91 cases), 95% CI [.70, .86] in identifying likely SUD in individuals who attempted to hide signs of substance misuse. Further, 15 of the 19 respondents (79%) who were able to dissimulate a negative screening outcome had elevated DEF scores that met author-recommended guidelines for further evaluation regarding possible response minimization. In contrast to the full SASSI-4 scoring protocol, when only the face valid scale scores, FVA, FVOD, and SYM, were used to produce screening outcomes, sensitivity was 47.3%, (43/91 cases), 95% CI [.37, .57], indicating vulnerability of face valid SUD screening measures to attempts to minimize signs of SUD.

Discussion

The study objective was to formulate a revision of the adult SASSI screening tool that would demonstrate high


Table 4. Paired samples t-tests: SASSI-4 mean scale scores as a function of instructional set

                                   Honest              Fake good
SASSI-4 scale                      M       SD          M       SD          t         df
Face valid alcohol                 25.5    10.1        10.6    10.8        10.26*    81
Face valid other drugs             36.0    17.1        12.3    14.8        10.71*    74
Symptoms                           12.8     3.8         6.0     5.6         9.59*    83
Obvious attributes                  7.3     2.2         4.2     3.0         8.99*    83
Subtle attributes                   7.1     2.7         5.0     2.7         6.39*    83
Defensiveness                       3.9     1.8         6.8     2.5         9.16*    84
Supplemental addiction measure     10.9     1.8         5.6     4.2        10.68*    86
Family vs. controls                 5.6     2.1         8.7     2.9         8.39*    88
Correctional                       10.0     2.5         4.3     4.0        11.18*    84

Note. *p < .001.

correspondence with the current diagnostic criteria for substance use disorders. An additional aim was to balance sensitivity and specificity so the screener can be used in a variety of settings. The study sample included SUD screenings from a diverse sample of practices throughout the nine US Census Bureau regions and two Canadian provinces. Analyses indicated 92% overall correspondence with clinicians’ DSM-5 diagnoses of SUD, with no significant variation from this accuracy level across a range of assessment setting types and respondent demographic characteristics. The data provide evidence of SASSI-4 screening accuracy in detecting lifetime and past year presence of SUDs, which enhances the utility of the inventory for use in programs with varying screening objectives. Findings also provide evidence of screening sensitivity across a range of SUD symptom severity, improving use of the instrument in programs that serve individuals with varying levels of the disorder. Less sensitivity was observed in a sample of respondents with mild SUDs, particularly in cases where diagnoses evidenced only two of the 2–3 criteria necessary for this level of SUD severity. Examination of client data for those who screened negative when clinicians had indicated mild SUD severity revealed 57% had a DWI violation. Other research has found that individuals who received a DSM-IV diagnosis of alcohol abuse based solely on the criterion of driving while intoxicated differed from nonsubstance abusers on approximately half of the external diagnostic validating criteria. In contrast, persons who met substance abuse criteria by other symptoms differed from non-substance abusers on all external validating criteria (Hasin, Paykin, Endicott, & Grant, 1999). For example, the drinking-driver abuser group did not differ from those without abuse diagnoses on drinking during the week, depressed mood, or self or others’ perceptions that they needed treatment. Whether the hazardous and illicit substance-related behaviors in which they engage European Journal of Psychological Assessment (2019), 35(1), 86–97

meet the standard of a psychiatric substance use disorder merits further attention. In the current study, it is possible that the legal consequences associated with clients’ alcohol or drug use affected their diagnoses, even though legal consequences have been removed as a diagnostic symptom of SUD in DSM-5. This phenomenon might be especially likely when a legal consequence is the precipitating event for the assessment or when it is salient in a profile with otherwise low evidence of SUD. Further validation research is needed to gather additional estimates of SASSI-4 screening sensitivity for mild SUD severity. In addition, research on whether legal consequences resulting from individuals’ substance use play a role in clinicians’ DSM-5 SUD diagnoses might also be informative about the ways in which current diagnostic standards are utilized clinically. Excellent SASSI-4 sensitivity in identifying SUD in individuals experiencing co-occurring psychiatric disorders, and high specificity for individuals without SUD who instead were experiencing cognitive, mood, somatic, or related impairments associated with other psychiatric disorders was also demonstrated. Given that screening for SUD occurs in settings where individuals are likely to present with a variety of behavioral health needs, it is important that the screener effectively discriminate between symptoms associated with SUDs versus those indicative of problematic functioning in other areas. New items added to assist practitioners in identifying individuals likely to be abusing prescription medications had an overall accuracy rate of 94% in a criterion sample diagnosed with opioid or sedative related SUDs. This extends the utility of the SASSI-4 to practitioners seeking to address the escalating prevalence of misuse of prescription medications and can be useful in settings where clients may be at high risk for medication abuse (e.g., disability and pain management populations). Since this scale is a face valid measure, individuals who are unwilling to acknowledge prescription drug misuse can avoid detection Ó 2016 Hogrefe Publishing



on this scale. In such instances, respondents’ scores on the SASSI-4 DEF scale can be a useful tool in discerning response minimization. It is also important to note that in the current study clinicians specified the class of drugs for each diagnosed client symptom but were not asked to indicate the specific drugs clients used. Since prescription opioid pain medications (e.g., Vicodin, OxyContin, and Hydrocodone) and sedatives used to treat anxiety and sleep disorders (e.g., Valium, Xanax, and Ambien) are among the most widely abused prescription medications (US Department of Health & Human Services, 2014; Volkow, 2014) we reasoned that a diagnosis of opioid or sedative related SUD had utility as a criterion variable for validating the SASSI-4 prescription drug abuse scale. Ninety-seven percent of clients whose reported nonmedical use of prescription medications met the Rx scale cutoff were independently diagnosed with an opioid or sedative related SUD (PPV), and 93% of clients who screened negative on the Rx scale were diagnosed as not having an SUD (NPV). It is possible, given the use of this more general criterion variable, that false negative cases in this analysis included some who were diagnosed with an opioid use disorder based on their heroin use, a nonprescription opiate, and thus should not be considered legitimate false negatives. Indeed, findings indicated that of the 12 criterion positive “misses,” seven screened positive on the overall SASSI-4 screening outcome for any SUD. Thus, this evaluation of the Rx scale screening accuracy may overestimate its false negative rate. Future validation studies of the SASSI-4 Rx scale that include diagnostic specification of the prescription and other drugs evidenced in criterion positive cases are needed to provide more precise estimates of screening error rates for this scale.
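For readers who want to see how the reported indices relate to one another, the sketch below expresses a screener’s sensitivity, specificity, PPV, and NPV in terms of the four cells of a screening-by-diagnosis cross-classification, written in R (the environment cited in this article’s reference list). It is only an illustration; the function name and the example counts are placeholders, not the study’s data.

```r
# Minimal sketch (not the authors' code): screening accuracy indices from a
# 2 x 2 cross-classification of screening outcome against the diagnostic criterion.
screening_accuracy <- function(tp, fp, fn, tn) {
  list(
    sensitivity = tp / (tp + fn),              # criterion-positive cases screened positive
    specificity = tn / (tn + fp),              # criterion-negative cases screened negative
    ppv         = tp / (tp + fp),              # positive predictive value
    npv         = tn / (tn + fn),              # negative predictive value
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
  )
}

# Placeholder counts for illustration only; substitute the observed frequencies.
screening_accuracy(tp = 80, fp = 5, fn = 10, tn = 105)
```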

SASSI-4 Resistance to Faking Good

The SASSI was designed to identify individuals in need of diagnostic evaluation for SUD, including individuals who may be unable or unwilling to acknowledge their substance misuse. Findings demonstrated SASSI-4 sensitivity of 79% even when respondents deliberately attempted to conceal their substance use. Moreover, only four participants out of 120 were successful in faking a negative screening outcome without producing elevations on the RAP or DEF scales. Elevated scores on these scales alert practitioners that responses are atypical and that further evaluation is warranted to determine whether an invalid profile was produced deliberately and whether elevated defensiveness is attributable to unwillingness to acknowledge SUD, lack of insight into the causes of any substance-related


consequences, or to situational or dispositional factors unrelated to substance misuse. In each of these cases, elevated RAP and/or DEF scores provide valuable information to practitioners and have utility in directing the course of additional assessment beyond that which is available through face-valid screening instruments.

Study Limitations

The SASSI is a screening instrument, designed to be used as one source of information in clinicians’ decision-making regarding individuals in need of diagnostic evaluation for the presence of an SUD and the potential need for treatment; it does not provide a diagnosis. Respondents in early or sustained remission were among the criterion-positive cases in this study. Therefore, persons who screen positive on the SASSI-4 might already be in remission. Data used to validate the screening instrument were submitted by practitioners engaged in ongoing programs of substance use screening. In one respect, this is a study advantage in that we measured clients’ responses in actual settings where decisions regarding further evaluation for treatment needs would ensue from clients’ self-reports. Additional validation in practices that serve populations with lower rates of SUD would extend the generalizability of the current findings.

Conclusions

We formulated a revision of the SASSI SUD screening instrument for alcohol, illicit drugs, and nonmedical use of prescription medications that demonstrates high sensitivity and specificity, and consistent results across heterogeneous samples of adults in substance use treatment, criminal justice, behavioral health, and social service programs. Results also demonstrate the accuracy of screenings for lifetime and past-year presence of SUD. The SASSI-4 can have considerable clinical utility in forensic, vocational, and psychological evaluations. The inventory can assist practitioners in identifying individuals in need of further assessment for SUD, and treatment recommendations can be enhanced by information regarding clients’ defensiveness or acknowledgment of their substance use, and the specific substance use-related consequences they have experienced. Accurate detection of an SUD can facilitate differential diagnosis, client supervision and rehabilitation processes, and decisions regarding case management service delivery, particularly where substance misuse can adversely affect the potential benefits of other medical and behavioral health treatment interventions.


Acknowledgments

The authors gratefully acknowledge the contributions of Jennifer Cullen Meyer for consultation on analyses; Tim Baker, Carl Briggs, Allen Heinemann, Jonathan Kaplan, Kristin Kimmell, and Nelson Tiburcio for their many helpful comments on earlier drafts of this manuscript; Tom Cox, Anne Hazeltine, David Helton, Melissa Renn, and Lewis Seward for their assistance with onsite data collection; Adrian Hosey and Lauren Nelson for assistance with online data collection; and Scarlett Baker for study coordination and manuscript preparation. Finally, we express our sincere gratitude to the assessment professionals and community service organizations that were instrumental in respondent recruitment and diagnostic evaluations for this study.

References

American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Washington, DC: Author.
Andelic, N., Jerstad, T., Sigurdardottir, S., Schanke, A., Sandvik, L., & Roe, C. (2010). Effects of acute substance use and pre-injury substance abuse on traumatic brain injury severity in adults admitted to a trauma centre. Journal of Trauma Management & Outcomes, 4, 1–12. doi: 10.1186/1752-2897-4-6
Babor, T. F., McRee, B. G., Kassebaum, P. A., Grimaldi, P. L., Ahmed, K., & Bray, J. (2007). Screening, brief intervention, and referral to treatment (SBIRT): Toward a public health approach to the management of substance abuse. Substance Abuse, 28, 7–30. doi: 10.1300/J465v28n03_03
Belenko, S. (2006). Assessing released inmates for substance-abuse-related service needs. Crime & Delinquency, 52, 94–113. doi: 10.1177/0011128705281755
Bouchery, E. E., Harwood, H. H., Sacks, J. J., Simon, C. J., & Brewer, R. D. (2011). Economic costs of excessive alcohol consumption in the US, 2006. American Journal of Preventive Medicine, 41, 516–524. doi: 10.1016/j.amepre.2011.06.045
Compton, P., Darakjian, J., & Miotto, K. (1998). Screening for addiction in patients with chronic pain and “problematic” substance use: Evaluation of a pilot assessment tool. Journal of Pain and Symptom Management, 16, 355–363. doi: 10.1016/S0885-3924(98)00110-9
Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105, 399–412. doi: 10.1111/bjop.12046
Easton, C., Swan, S., & Sinha, R. (2000). Motivation to change substance use among offenders of domestic violence. Journal of Substance Abuse Treatment, 19, 1–5. doi: 10.1016/S0740-5472(99)00098-7
Flett, G. L., Besser, A., & Hewitt, P. L. (2005). Perfectionism, ego defense styles, and depression: A comparison of self-reports versus informant ratings. Journal of Personality, 73, 1355–1396. doi: 10.1111/j.1467-6494.2005.00352.x
Gignac, G. E. (2014). On the inappropriateness of using items to calculate total scale score reliability via coefficient alpha for multidimensional scales. European Journal of Psychological Assessment, 30, 130–139. doi: 10.1027/1015-5759/a000181
Hasin, D., Paykin, A., Endicott, J., & Grant, B. (1999). The validity of DSM-IV alcohol abuse: Drunk drivers versus all others. Journal of Studies on Alcohol, 60, 746–755.
Hawkins, D., & Heinemann, A. W. (1998). Substance abuse and medical complications following spinal cord injury. Rehabilitation Psychology, 43, 219–231. doi: 10.1037/0090-5550.43.3.219
Heinemann, A. W., Moore, D., Lazowski, L. E., Huber, M., & Semik, P. (2014). Benefits of substance use disorder screening on employment outcomes in state-federal vocational rehabilitation programs. Rehabilitation Counseling Bulletin, 57, 144–158. doi: 10.1177/0034355213503908
Knee, C. R., Porter, B. A., & Rodriguez, L. M. (2014). Self-determination and regulation of conflict in romantic relationships. In N. E. Weinstein (Ed.), Human motivation and interpersonal relationships: Theory, research and applications (pp. 139–158). Amsterdam, Netherlands: Springer. doi: 10.1007/978-94-017-8542-6_7
Laux, J. M., Piazza, N. J., Salyers, K., & Roseman, C. P. (2012). The Substance Abuse Subtle Screening Inventory-3 and stages of change: A screening validity study. Journal of Addictions & Offender Counseling, 33, 82–92.
Lazowski, L. E., Miller, F. G., Boye, M. W., & Miller, G. A. (1998). Efficacy of the Substance Abuse Subtle Screening Inventory-3 (SASSI-3) in identifying substance dependence disorders in clinical settings. Journal of Personality Assessment, 71, 114–128. doi: 10.1207/s15327752jpa7101_8
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Erlbaum.
Miller, F. G., & Lazowski, L. E. (1999). The Substance Abuse Subtle Screening Inventory-3 (SASSI-3) manual. Springville, IN: The SASSI Institute.
Miller, G. A. (1985). The Substance Abuse Subtle Screening Inventory (SASSI) manual. Spencer, IN: Spencer Evening World.
Miller, G. A. (1994). The Substance Abuse Subtle Screening Inventory (SASSI): Adult SASSI-2 manual supplement. Spencer, IN: Spencer Evening World.
National Institute on Drug Abuse. (2015). Trends & statistics. Bethesda, MD: National Institute on Drug Abuse. Retrieved from http://www.drugabuse.gov/related-topics/trends-statistics
Nelson, J., Perfetti, C., Liben, D., & Liben, M. (2012). Measures of text difficulty: Testing their predictive value for grade levels and student performance [Supplemental information for Appendix A]. Retrieved from http://achievethecore.org/page/642/text-complexity-collection
R Development Core Team. (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
Reise, S. P., Bonifay, W. E., & Haviland, M. G. (2013). Scoring and modeling psychological measures in the presence of multidimensionality. Journal of Personality Assessment, 95, 129–140. doi: 10.1080/00223891.2012.725437
Revelle, W. (2014). psych: Procedures for personality and psychological research (R package version 1.4.8). Evanston, IL: Northwestern University. Retrieved from http://CRAN.R-project.org/package=psych
Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, 145–154. doi: 10.1007/s11336-008-9102-z
Santiago, P. (2014). Substance use disorders. In S. J. Cozza, M. N. Goldenberg, & R. J. Ursano (Eds.), Care of military service members, veterans, and their families (Vol. 2, pp. 119–139). Arlington, VA: American Psychiatric Publishing.


Stenner, A. J., Burdick, H., Sanford, E. E., & Burdick, D. S. (2007). The Lexile Framework for Reading: Technical report. Durham, NC: MetaMetrics. Retrieved from https://www.lexile.com/research/9/
Substance Abuse and Mental Health Services Administration. (2014). Results from the 2013 National Survey on Drug Use and Health: Summary of national findings (NSDUH Series H-48, HHS Publication No. (SMA) 14-4863). Retrieved from http://www.samhsa.gov/data/sites/default/files/NSDUHresultsPDFWHTML2013/Web/NSDUHresults2013.pdf
US Department of Health and Human Services, Behavioral Health Coordinating Committee, Prescription Drug Abuse Subcommittee. (2013). Addressing prescription drug abuse in the United States: Current activities and future opportunities. Retrieved from http://www.cdc.gov/homeandrecreationalsafety/overdose/hhs_rx_abuse.html
US Department of Health and Human Services, National Institute on Drug Abuse. (2014). Research report series: Prescription drug abuse (Third revision; NIH Publication No. 15-4881). Retrieved from https://d14rmgtrwzf5a.cloudfront.net/sites/default/files/prescriptiondrugrrs_11_14.pdf
US Department of Justice, National Drug Intelligence Center. (2011). The economic impact of illicit drug use on American society (Product No. 2011-Q0317-002). Retrieved from http://www.justice.gov/archive/ndic/pubs44/44731/44731p.pdf
Volkow, N. D. (2014, May). America’s addiction to opioids: Heroin and prescription drug abuse. Testimony presented at the Senate Caucus on International Narcotics Control, Washington, DC. Retrieved from http://www.drugabuse.gov/about-nida/legislative-activities/testimony-to-congress/2014/americas-addiction-to-opioids-heroin-prescription-drug-abuse
Williams, R. T., Wilson, C. S., Heinemann, A. W., Lazowski, L. E., Fann, J. R., Bombardier, C. H., & University of Washington PRISMS Investigators. (2014). Identifying depression severity risk factors in persons with traumatic spinal cord injury. Rehabilitation Psychology, 59, 50–56. doi: 10.1037/a0034904
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s alpha, Revelle’s beta, and McDonald’s omega H: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70, 123–133.

Received January 9, 2015
Revision received December 9, 2015
Accepted December 14, 2015
Published online October 7, 2016

Linda E. Lazowski
The SASSI Institute
201 Camelot Ln
Springville, IN 47462
USA
Tel. +1 (800) 726-0526
Fax +1 (800) 546-7995
E-mail research@sassi.com



Multistudy Report

Perceived Mutual Understanding (PMU): Development and Initial Testing of a German Short Scale for Perceptual Team Cognition

Michael J. Burtscher1 and Jeannette Oostlander2

1 Department of Psychology, University of Zurich, Switzerland

2 Institute for Educational Evaluation, Zurich, Switzerland

Abstract: Team cognition plays an important role in predicting team processes and outcomes. Thus far, research has focused on structured cognition while paying little attention to perceptual cognition. The lack of research on perceptual team cognition can be attributed to the absence of an appropriate measure. To address this gap, we introduce the construct of perceived mutual understanding (PMU) as a type of perceptual team cognition and describe the development of a respective measure – the PMU-scale. Based on three samples from different team settings (NTotal = 566), our findings show that the scale has good psychometric properties – both at the individual as well as at the team level. Item parameters were improved during a multistage process. Exploratory as well as confirmatory factor analyses indicate that PMU is a one-dimensional construct. The scale demonstrates sufficient internal reliability. Correlational analyses provide initial proof of construct validity. Finally, common indicators for inter-rater reliability and inter-rater agreement suggest that treating PMU as a team-level construct is justified. The PMU-scale represents a convenient and versatile measure that will potentially foster empirical research on perceptual team cognition and thereby contribute to the advancement of team cognition research in general.

Keywords: team, teamwork, team cognition, perceptual approach, scale

Due to complex tasks that exceed individual capacities, organizations depend highly on the performance of teams (Salas, Cooke, & Rosen, 2008). Consequently, researchers have sought to determine factors that are associated with improved team performance (Mathieu, Maynard, Rapp, & Gilson, 2008). Team cognition – team members’ cognitive representations of tasks, roles, and teammates (Smith-Jentsch, 2009) – has been proposed to be one of these factors. From the outset, research on team cognition has advanced in multiple directions. Different theoretical constructs have been developed to describe facets of team cognition – team mental models (e.g., Mohammed, Ferzandi, & Hamilton, 2010) and transactive memory systems (e.g., Hollingshead, Gupta, Yoon, & Brandon, 2012) being the most common. However, other facets of team cognition have received limited attention, most notably, perceptual team cognition. Perceptual team cognition represents a distinct facet of team cognition that captures the extent to which team members perceive their knowledge to be similar (Mohammed et al., 2010; Rentsch & Mot, 2012). In our view, the lack of empirical studies on perceptual team cognition can be attributed to a certain deficit of construct refinement and, mainly, the absence of an appropriate measure. The few existing measures are either

impractical due to their large number of items (Johnson et al., 2007) or too specific due to their focus on a particular type of content (Ellwart, Konradt, & Rack, 2014). In order to address these issues, let us begin by introducing the construct of perceived mutual understanding (PMU) as a specific type of perceptual team cognition. PMU refers to the extent to which team members believe that there is a mutual understanding regarding key aspects of their collaboration within their team. Specifically, PMU comprises team members’ perceptions regarding a mutual understanding of team and task characteristics. By introducing PMU, we follow Mohammed et al.’s (2010) suggestion, who concluded that “there is room to measure perceptions of sharing” (p. 904). In the current study, we describe the development and initial validation of the Perceived Mutual Understanding scale (PMU-scale). It is ultimately our goal to provide a convenient and versatile scale to measure perceptual team cognition.

Team Cognition – An Overview

To help readers better understand the purpose of this new scale, we briefly outline the field of team cognition research. Within taxonomies of team-related



constructs, team cognition can be classified as a cognitive emergent state (Mathieu et al., 2008) – that is, a dynamic property of a team that varies as a function of other team-related constructs (Marks, Mathieu, & Zaccaro, 2001). In this context, team cognition is regarded as a configural type of team construct that is inferred from the cognitions of the individual members (Mathieu, Heffner, Goodwin, Cannon-Bowers, & Salas, 2005). DeChurch and Mesmer-Magnus (2010a) proposed classifying team cognition according to three aspects: nature of emergence, content of cognition, and form of cognition. Firstly, nature of emergence refers to the relationships between the cognitions of the individual members. These relationships can either be conceptualized as compositional emergence, which refers to the similarity of individual cognitions, or as compilational emergence, which refers to the complementarity of individual cognitions (DeChurch & MesmerMagnus, 2010a; Kozlowski & Klein, 2000). Secondly, content of cognition describes the type of knowledge that comprises team cognition. Researchers primarily distinguish between two types of content: task-related and team-related content (Mathieu, Heffner, Goodwin, Salas, & Cannon-Bowers, 2000). Thirdly, regarding the form of cognition, researchers have mainly distinguished the structured approach from the perceptual approach (Rentsch & Mot, 2012). The structured approach focuses on the patterns of individual cognitions held by team members and the similarity of these patterns within a team (DeChurch & Mesmer-Magnus, 2010a). For example, each member holds a mental model regarding the chronological order of the team’s tasks. The structured approach is hereby concerned with the degree of similarity between the members’ models of the chronological order. In contrast, the perceptual approach focuses on “beliefs, expectations, and perceptions that become ‘shared’ among individuals” (Rentsch & Mot, 2012, p. 146). For example, while working together, teams develop a shared perception of each member’s expertise (Ellwart et al., 2014). As a result, each member should share the same assessment of the other team members’ levels of expertise. In terms of empirical studies, research on team cognition has favored the structured approach, whereas the perceptual approach has received limited attention (Mohammed et al., 2010).

members believe that their individual cognitions are similar (i.e., perceptual cognition). In other words: We argue that it matters whether or not team members believe that everyone on the team thinks the same way. Our argument draws on the concept of mutual knowledge (Cramton, 2001; Krauss & Fussell, 1990). Mutual knowledge has been defined as “knowledge that the communicating parties both share and know they share” (Krauss & Fussell, 1990, p. 112). Mutual knowledge does not only include the information itself, but also the awareness that others possess the same information (Cramton, 2001). As a consequence, mutual knowledge can be distinguished from common knowledge. Common knowledge only implies that different persons actually possess the same information; it does not indicate whether or not these persons know that they share this information. Common knowledge is similar to structured cognition as both constructs are concerned with objective similarities between individuals. By contrast, mutual knowledge and PMU go beyond objective similarities. Please note that although PMU and mutual knowledge are similar in this regard, they are still different constructs. PMU refers to subjective beliefs about similarities between individuals, whereas mutual knowledge refers to actual knowledge about these similarities. Following this line of argument, we propose that – similar to the additional information contained in mutual knowledge compared to common knowledge – PMU has explanatory potential above and beyond structured cognition. Structured cognition concerns itself with objective similarities in the structure of individual cognitions within a team. Measurement techniques to assess structured cognition, such as card sorting and concept mapping, aim at quantifying to which extent team members’ cognitions are actually similar (e.g., Mohammed, Klimoski, & Rentsch, 2000). As mentioned earlier, we argue that not only these objective similarities matter but also members’ beliefs about the similarities within their team. Beliefs fall into the domain of perceptual cognition, which concerns itself with the extent to which team members believe that their cognitions are similar. PMU represents a type of perceptual cognition that refers to the extent to which team members believe that there is a mutual understanding regarding key aspects of their collaboration within their team. It is important to note that PMU and structured cognition are distinct constructs and need not necessarily correspond with each other: Team members might believe that everyone on the team thinks the same way about how their task should be performed, while their individual cognitions regarding this task are actually very different. For example, members of a surgical team might believe that they all envision the same sequence of tasks, when they start performing a specific operation (i.e., high perceived mutual

Differences Between Perceptual and Structured Cognition We believe, however, there is more to the matter in question. We argue that it is not only the objective similarity of individual cognitions within a team that matters (i.e., structured cognition), but also the degree to which Ó 2016 Hogrefe Publishing


understanding). Each member, however, might actually have a different sequence of tasks in mind (i.e., low objective similarity). We will further illustrate the distinctness and explanatory potential of PMU by comparing this construct to three established team cognition constructs, namely team mental models (Cannon-Bowers, Salas, & Converse, 1993; Mohammed et al., 2010), transactive memory systems (Ren & Argote, 2011; Wegner, 1987), and cross-understanding (Huber & Lewis, 2010).

Perceived Mutual Understanding as a Distinct Construct

Team mental models have been described as team members’ shared and organized understanding of relevant knowledge (Cannon-Bowers et al., 1993; Klimoski & Mohammed, 1994). Sharedness usually refers to the degree to which members hold similar mental representations of key characteristics of their team and the common task (Cannon-Bowers et al., 1993; Mohammed et al., 2010). A basic assumption underlying this line of research is that similar mental models allow for more efficient ways to organize task execution, because team members can anticipate each other’s needs and actions (i.e., implicit coordination; Rico, Sánchez-Manzanares, Gil, & Gibson, 2008). A classic example of the implied mechanism is the no-look pass in sports teams (Cannon-Bowers & Salas, 2001): One player passes the ball to a teammate without looking. A successful no-look pass requires that the person passing the ball can anticipate her teammate’s movements. This can be achieved by means of holding similar mental models. We propose, however, that similar mental models do not necessarily predict whether a player will actually perform a no-look pass. If this player is doubtful about her teammate’s movements – in other words, if she believes that there is no mutual understanding within her team regarding this move – she will be reluctant to perform a no-look pass. Consequently, we believe that not only shared understanding pertains to improved coordination and performance but also the perception of shared understanding, that is, PMU. A transactive memory system can generally be described as a shared knowledge system that people in relationships develop for encoding, storing, and retrieving information about different substantive domains (Wegner, 1987). Objective knowledge about who knows what represents a key component of a transactive memory system.1 Thus, a transactive memory system refers to a team’s knowledge structure (e.g., Hollingshead et al., 2012). By contrast,

PMU does not refer to the actual knowledge structure but includes team members’ perceptions of the overall state of a team’s knowledge structure, that is, perceptions of the extent to which a common knowledge structure exists. Again, both constructs do not necessarily correlate with each other. Team members might believe that everyone is aware of their teammates’ knowledge and expertise, whereas, in fact, they know very little about each other. Such a situation can easily lead to misunderstandings, for example, when one team member erroneously assumes that her teammate has specific knowledge about a certain aspect of their common task. Cross-understanding refers to the extent to which group members have an accurate understanding of one another’s mental models (Huber & Lewis, 2010). It is unaffected by actual similarities or differences between team members’ mental models. Cross-understanding can be operationalized by measuring “how well each member perceives that he or she understands what it is that each other member knows, believes, is sensitive to, and prefers” (Huber & Lewis, 2010, p. 18). In this respect, cross-understanding bears a close resemblance to PMU. A main difference between these constructs lies in the point of reference. Cross-understanding focuses on perceptions of similarities between individual members’ mental models (e.g., “How well do I understand the mental models of each of my teammates?”). By contrast, PMU aims to capture a more general kind of mutual understanding (e.g., “How well do we understand each other in this team?”). To this effect, cross-understanding refers to an attribute of an individual, whereas PMU refers to an attribute of the entire team.


Relationships With Other Team-Related Constructs

Team cognition is considered a dynamic construct that varies as a function of other team-related constructs. Team-related constructs can be classified into three categories: input variables (i.e., antecedents of teamwork), mediators (i.e., team processes and emergent states), and outcomes (i.e., consequences of teamwork; Mathieu et al., 2008). To further clarify the nomological network of PMU (e.g., Ziegler, Booth, & Bensch, 2013), we will describe the assumed relationships of PMU with variables from all three categories. Team diversity as an antecedent is thought to be negatively related to team cognition (Rentsch & Klimoski, 2001). Differences in age, gender, educational background, and tenure are indicators of different work-related

1 We acknowledge that transactive memory systems go beyond the presence of a shared understanding of who knows what (Ren & Argote, 2011). We focus on this aspect because it bears the closest resemblance to PMU.


experiences, which, in turn, shape people’s cognitive representations of their work environment: Team members with different characteristics have had different experiences, and thus their cognitive representations are likely to differ (Rentsch & Klimoski, 2001). We suggest a similar effect of team diversity on PMU. Team members’ salient characteristics such as age and gender serve as cues for interpreting team cognition. For example, team members might believe that due to their apparent demographic differences, there is little mutual understanding regarding key aspects of their collaboration within their team. Thus, demographic diversity should be negatively related to PMU. Team cognition as an emergent state is thought to be related to team processes (Cannon-Bowers & Salas, 2001). This relationship is assumed to be reciprocal: On the one hand, team cognition can serve as a structure that guides team members’ behaviors; on the other hand, continued behavioral interaction can affect team members’ cognitive representations (DeChurch & Mesmer-Magnus, 2010a). Accordingly, we expect PMU to be related to intra-team conflict – an important interpersonal process (Mathieu et al., 2008). Research defines two main types of conflict: Relationship conflict refers to interpersonal incompatibilities among team members; task conflict refers to disagreements among team members about the tasks being performed (Jehn, 1995). If there are hardly any conflicts among team members, those members will likely believe that there is a mutual understanding within their team. By contrast, if members have many conflicts with their teammates, they will likely believe that there is little mutual understanding. We therefore hypothesize that PMU is negatively related to task and relationship conflict. Finally, we propose that PMU is positively related to perceived team effectiveness. Team members who perceive that their team has great difficulties are less likely to believe in the existence of a mutual understanding. By contrast, team members who perceive their team to be effective are more likely to believe in the existence of a mutual understanding.

Materials and Methods

Development and testing of the PMU-scale occurred in four steps: 1) item generation, 2) pretest, 3) item analysis and factor structure, and 4) scale validation. In order to develop a scale suitable for applied team settings, we chose two different field samples: healthcare (in Step 2) and sports (in Step 3). For the construct validation of the scale in Step 4, we chose a third sample of student teams. In addition to the conventional criteria for scale development, we intended the PMU-scale to meet specific requirements.

Requirements for a Scale to Measure Perceptual Team Cognition One of the most important criteria is content validity. Research on team cognition has focused on two content categories: task-related and team-related content (Mathieu et al., 2000). The task-related content includes knowledge about performance requirements, goals, and subtasks, whereas the team-related content includes knowledge about interaction requirements and characteristics of other members (Mohammed et al., 2010). These types of content are not mutually exclusive; individuals can hold multiple mental models at the same time (Rentsch, Delise, & Hutchison, 2009). A measure of perceptual team cognition should thus cover both content categories. The second requirement deals with the conceptualization of team cognition. As mentioned above, team cognition is regarded as a configural team construct (Mathieu et al., 2005). A common strategy to conceptualize such teamlevel constructs is to aggregate individual members’ responses within a team. Different compositional models can be used to specify the relationship between the level of measurement (i.e., individual) and the level of analysis (i.e., team; Chan, 1998). For the PMU-scale, we used a referent-shift consensus model. In this model, the focus is on how an individual believes other team members perceive the construct in question (Chan, 1998). This conceptualization is in line with our definition of PMU as the extent to which team members believe that there is a mutual understanding regarding key aspects of their collaboration within their team. To justify aggregation, researchers use indices of inter-rater reliability and interrater agreement (LeBreton & Senter, 2008). Essentially, indices of inter-rater reliability and agreement are used to determine if a team-level construct exhibits (a) inter-team variability and (b) intra-team consensus.2 If PMU is to represent a meaningful attribute of a team, members of the same team should have similar PMU values and team means of PMU should vary significantly between teams. Inter-rater reliability refers to the relative consistency of responses among raters (Bliese, 2000; Kozlowski

2 As a detailed discussion of inter-rater reliability and agreement and of their importance for team research is beyond the scope of the current work, we refer to the respective literature for a more comprehensive description (e.g., Bliese, 2000; LeBreton & Senter, 2008).


& Hattrup, 1992). In the context of team research, inter-rater reliability is usually assessed by means of the intraclass correlation (ICC; Shrout & Fleiss, 1979). Specifically, researchers use the ICC(1), which is calculated from a one-way random effects ANOVA with team membership as the independent variable and the members’ ratings as the dependent variable (Bliese, 2000). ICC(1) values are often interpreted as an effect size estimate indicating the proportion of the total variance that can be explained by team membership (Bliese, 2000; LeBreton & Senter, 2008). Values of .25 or greater can be interpreted as large effects, suggesting a large influence of team membership on a variable (LeBreton & Senter, 2008). Inter-rater agreement refers to the absolute consensus among raters (Bliese, 2000; Kozlowski & Hattrup, 1992) and is usually assessed by means of the rwg(j) coefficient (James, Demaree, & Wolf, 1984, 1993). Values between .71 and .90 are considered strong agreement (LeBreton & Senter, 2008). The third requirement concerns two practical aspects: usability and applicability. Our scale should be convenient to use in applied settings; using it should not be time-consuming, and it should be easy to explain to participants. Moreover, the PMU-scale should be applicable to various types of teams instead of being restricted to a specific setting.
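As an illustration of how these two indices can be computed, the following base-R sketch implements the one-way ANOVA formulation of ICC(1) (Bliese, 2000) and the uniform-null version of rwg(j) (James et al., 1984). It is not the authors’ code; the data frame d, its columns team and pmu, the item matrix passed to rwg_j, and the function names are hypothetical placeholders.

```r
# Minimal sketch (assumed data layout, not the authors' code): d has one row per
# respondent, a grouping variable d$team, and a scale score d$pmu.

# ICC(1) from a one-way random-effects ANOVA (Bliese, 2000).
icc1 <- function(score, group) {
  fit <- summary(aov(score ~ as.factor(group)))[[1]]
  msb <- fit[1, "Mean Sq"]        # between-team mean square
  msw <- fit[2, "Mean Sq"]        # within-team mean square
  k   <- mean(table(group))       # approximate average team size
  (msb - msw) / (msb + (k - 1) * msw)
}
# icc1(d$pmu, d$team)

# rwg(j) for a j-item scale, uniform null distribution (James et al., 1984).
# items: matrix or data frame of one team's item responses; a: number of response options.
rwg_j <- function(items, a = 7) {
  j      <- ncol(items)
  s2     <- mean(apply(items, 2, var))  # mean observed item variance within the team
  sigma2 <- (a^2 - 1) / 12              # expected variance under the uniform null
  (j * (1 - s2 / sigma2)) / (j * (1 - s2 / sigma2) + s2 / sigma2)
}
```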

Step 1: Item Generation

The aim of the first step was to develop the items for the PMU-scale. Based on reviews of empirical studies, the first author identified studies that used questionnaires to assess team cognition. Common themes were identified using qualitative research methods. Moreover, relevant theoretical and conceptual articles were consulted to extract definitions and taxonomies of team cognition. Using this body of knowledge and our definition of PMU as a basis, the first author formulated an initial pool of items. The wording was chosen in such a manner that each item (a) fit either the task-related or the team-related content category, (b) reflected a team-level perception of team cognition, and (c) did not include a reference to a specific type of task or teamwork setting. Next, the items were discussed with the second author as well as with several researchers from the field with the intention of limiting the total number of items, as the scale should be practical to use in applied settings. In the first version of the PMU-scale, six items represented the task-related content and five items represented the team-related content. All items were formulated as declarative statements. They were introduced with the sentence “Please indicate your agreement with the following statements regarding yourself and your teammate(s).” Responses were made on a 7-point Likert scale ranging from 1 = strongly disagree to 7 = strongly agree.


Step 2: Pretest

The 11 items of the first version were further investigated. We used test development procedures from classical test theory (Ellis & Mead, 2002) to identify items that did not fit with common psychometric criteria.

Participants and Procedure
Participants were 60 physicians and nurses working at a large hospital. They worked in different team settings on a regular basis (e.g., in the operating room). Participants’ mean age was 34.87 years (SD = 7.18) and, on average, they had 5.19 years of professional experience (SD = 5.48). To guarantee confidentiality, we did not include participants’ gender.

Results
First, three items were eliminated from the item pool for content reasons based on the feedback of the participants, and the focus of a fourth item was sharpened to clarify its meaning. Second, four indices were considered for item analysis: item difficulty, skewness, kurtosis, and inter-item correlation. Item difficulty, skewness, and kurtosis were examined first: Four items reached a mean far beyond the middle of the scale and therefore exhibited a skewed distribution. These items were reformulated in order to increase item difficulty. Inter-item correlations were then examined; none of the items attracted attention because of a high intercorrelation (r > .90). To summarize, three items were removed due to their content, five items were reformulated according to the results of the item analysis, and three items remained unchanged. The resulting 8-item version of the scale was submitted to further analyses.
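A brief sketch of how such an item analysis might be run in R with the psych package (cited later in this article) is given below. The data frame items and its layout are hypothetical, and the code is illustrative rather than the authors’ original analysis.

```r
# Minimal sketch (not the authors' code): classical item analysis for Likert items
# stored in a hypothetical data frame `items` (one column per item, one row per respondent).
library(psych)

desc <- psych::describe(items)                      # item means (difficulty), SD, skew, kurtosis
desc[, c("mean", "sd", "skew", "kurtosis")]

r <- cor(items, use = "pairwise.complete.obs")      # inter-item correlations
which(abs(r) > .90 & upper.tri(r), arr.ind = TRUE)  # flag redundant item pairs (r > .90)
```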

Step 3: Item Analysis and Factor Structure

We reexamined the psychometric properties of the remaining eight items to test whether reformulating had led to an improvement. Moreover, we conducted an exploratory factor analysis (EFA; maximum likelihood) with oblique rotation (oblimin) to gain knowledge about the factorial structure of the PMU-scale.

Participants and Procedure
Participants were 315 amateur volleyball players (173 females). Their mean age was 28.31 years (SD = 8.42). Volleyball teams consist of six players plus up to six substitutes.
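The analyses described here could be run in R roughly as follows. This is an illustrative sketch using the psych package rather than the authors’ original script; items again stands for a hypothetical data frame of the eight item responses.

```r
# Minimal sketch (not the authors' code): suitability checks and a maximum-likelihood
# EFA with oblimin rotation for a hypothetical data frame `items`.
library(psych)

KMO(items)                                     # Kaiser-Meyer-Olkin measure of sampling adequacy
cortest.bartlett(cor(items), n = nrow(items))  # Bartlett's test of sphericity

fa.parallel(items, fm = "ml", fa = "fa")       # parallel analysis and scree plot
efa <- fa(items, nfactors = 1, fm = "ml", rotate = "oblimin")
print(efa$loadings, cutoff = 0)                # factor loadings
```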


Results
The Kaiser-Meyer-Olkin value of .89 suggested that the correlation matrix was well suited for a factor analysis. Bartlett’s test of sphericity was significant at p < .001. The results of the EFA supported a one-factorial solution (Appendix A). The first three factors had eigenvalues of 4.02, 0.87, and 0.69, respectively. Likewise, both the scree test (Cattell, 1966) and the parallel analysis (Horn, 1965) suggested a one-factorial solution. The first factor explained 50.22% of the variance. Item properties indicated that reformulation had improved most of the items’ difficulty. Our aim, however, was to construct a short scale. We therefore decided to remove the two items for which reformulation apparently did not lead to a substantial improvement (Appendix B): Item 2 was removed because it had both the lowest item-total correlation and the lowest factor loading; Item 8 had the lowest difficulty and the second-lowest factor loading. We conducted another EFA on the 6-item version (Table 1). Again, the results supported a one-factorial solution. The first three factors had eigenvalues of 3.27, 0.74, and 0.58, respectively. The first factor explained 54.44% of the variance. To confirm the one-factorial solution, a confirmatory factor analysis (CFA) was conducted using the lavaan package (Rosseel, 2012) of the software R (R Core Team, 2012). We used MLM estimation – a maximum likelihood estimation with robust standard errors – and a Satorra-Bentler scaled test statistic to account for the non-normality of the data. Results of the CFA provided further support for a one-factor structure. A respective model with all items loading on a single factor fit the data well (χ² = 14.52, df = 9, p = .11, RMSEA = .04, 90% CI [.00, .08], SRMR = .03, CFI = .99). In this model, all item loadings were significant at p < .001. The final scale consisting of six items (Table 1) had a Cronbach’s alpha of .83, indicating sufficient reliability (Nunnally & Bernstein, 2004). As the assumptions for alpha are often difficult to meet (e.g., same error variance for each item), we also calculated McDonald’s omega (Dunn, Baguley, & Brunsden, 2014).3 Omega for the 6-item version in Step 3 was .83, 95% CI [.80, .86], again indicating sufficient reliability.
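For illustration, a CFA of this kind can be specified in lavaan in a few lines. The sketch below follows the estimator and fit indices reported above but is not the authors’ code; pmu_items and the item names pmu1–pmu6 are hypothetical placeholders. Omega with a bootstrap confidence interval can be obtained, for example, with the MBESS package, which implements the approach recommended by Dunn et al. (2014).

```r
# Minimal sketch (not the authors' code): one-factor CFA with robust ML (MLM) in lavaan.
library(lavaan)

model <- 'PMU =~ pmu1 + pmu2 + pmu3 + pmu4 + pmu5 + pmu6'
fit   <- cfa(model, data = pmu_items, estimator = "MLM")

# Satorra-Bentler scaled chi-square and robust fit indices
fitMeasures(fit, c("chisq.scaled", "df", "pvalue.scaled",
                   "rmsea.scaled", "srmr", "cfi.scaled"))

# McDonald's omega with a bootstrap confidence interval (MBESS package)
library(MBESS)
ci.reliability(pmu_items, type = "omega", interval.type = "bca", B = 1000)
```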

Step 4: Scale Validation

The aim of the fourth step was to validate the PMU-scale. We tried to replicate the factorial structure using EFA and CFA. In addition, we tested the team-level properties of the scale using ICC(1) and rwg(j). Construct validity was tested by means of correlational analyses.

Participants and Procedure
Participants were 191 students (147 females, mean age = 23.52, SD = 4.09) from a Swiss university completing a course in experimental psychology. As part of the course, teams of three to five students planned and conducted an experiment during the timeframe of a semester.

Measures
Toward the end of the semester, participants completed the PMU-scale, scales for task and relationship conflict (Lehmann-Willenbrock, Grohmann, & Kauffeld, 2011), and a scale for perceived team effectiveness (Jung & Sosik, 2002). Task conflict was measured with four items (α = .90; ω = .90, 95% CI [.87, .92]). A sample item is “How much conflict about the work you do is there in your team?” Relationship conflict was measured with four items (α = .89; ω = .89, 95% CI [.85, .92]). A sample item is “How much are personality conflicts evident in your team?” Participants rated the four questions for each construct (from 1 = never/none to 6 = very often/very much). Team effectiveness was measured with five items (α = .91; ω = .91, 95% CI [.87, .93]). A sample item is “My team completes its task successfully.” Participants indicated their agreement on a 7-point Likert scale (from 1 = strongly disagree to 7 = strongly agree). We used Blau’s (1977) index to operationalize gender diversity, and the standard deviation to operationalize age diversity. Both represent common indices in diversity research (Harrison & Klein, 2007).

Results
EFA and CFA were conducted as described in Step 3. EFA supported a one-factorial solution. The first three factors had eigenvalues of 3.24, 0.83, and 0.67, respectively. The first factor explained 54.01% of the variance. CFA confirmed the one-factorial solution. A respective model with all items loading on a single factor fit the data well (χ² = 17.54, df = 9, p = .04, RMSEA = .07, 90% CI [.02, .11], SRMR = .04, CFI = .97). All item loadings were significant at p < .001. We compared this model to a model with two correlated latent variables (task-related and team-related items). A Satorra-Bentler scaled chi-square difference test revealed no significant difference in model fit (Δχ² = 0.68, df = 1, p = .41). Moreover, both factors in the two-factorial model were highly correlated (r = .81). As the more parsimonious model fit the data equally well, we interpreted these findings as further evidence in favor of a one-factorial solution. Again, reliability was sufficient (α = .82; ω = .82, 95% CI [.77, .86]). We assessed construct validity of the PMU-scale by relating it to several prototypical team-related constructs: age

3 We thank one of the anonymous reviewers for this suggestion.


Table 1. English translations, descriptive statistics, corrected item-total correlations, and factor loadings for the 6-item version from Step 4 (values for the 6-item version from Step 3 in parentheses)

Item 1 (Team). “Wir sind einer Meinung hinsichtlich unserer jeweiligen Kompetenzen.” / “We are in complete agreement regarding our personal capabilities.”
M = 5.19 (5.03), SD = 1.00 (1.14), skewness = −0.80 (−0.75), kurtosis = 0.86 (0.60), CITC = .72 (.63), factor loading = .83 (.70)

Item 2 (Team). “Wir sind gleicher Ansicht hinsichtlich unserer individuellen Stärken und Schwächen.” / “We have a similar understanding of our individual strengths and weaknesses.”
M = 5.05 (4.99), SD = 0.93 (1.15), skewness = −0.22 (−0.64), kurtosis = 0.06 (0.54), CITC = .66 (.63), factor loading = .77 (.71)

Item 3 (Task). “Wir sind uns vollkommen einig, wer von uns welche Aufgabe ausführen sollte.” / “We very much agree on who should carry out which task.”
M = 5.08 (5.15), SD = 1.16 (1.12), skewness = −0.53 (−0.43), kurtosis = 0.26 (−0.07), CITC = .47 (.54), factor loading = .50 (.59)

Item 4 (Task). “Wir haben ein gemeinsames Verständnis bezüglich des Zusammenhangs zwischen den einzelnen Aufgaben.” / “We have a similar understanding in regard to the interdependencies between the performances of the tasks.”
M = 5.35 (5.22), SD = 0.92 (1.05), skewness = −0.62 (−0.65), kurtosis = 0.01 (0.37), CITC = .50 (.59), factor loading = .53 (.65)

Item 5 (Team). “Wir stimmen in der gegenseitigen Beurteilung unserer jeweiligen Expertise vollkommen überein.” / “We are in total agreement over the assessment of each other’s expertise.”
M = 5.09 (4.58), SD = 0.90 (1.07), skewness = −0.41 (−0.33), kurtosis = 0.62 (0.94), CITC = .65 (.66), factor loading = .73 (.73)

Item 6 (Team). “Wir sind uns genau bewusst, über welche Fertigkeiten jedes Teammitglied verfügt.” / “We are very well aware of each other’s skills.”
M = 4.36 (5.35), SD = 1.32 (1.14), skewness = −0.44 (−0.65), kurtosis = 0.13 (0.33), CITC = .55 (.59), factor loading = .63 (.66)

Notes. CITC = corrected item-total correlation. N(Step 4) = 191; N(Step 3) = 315. SE(skewness) = 0.18 (Step 4) and 0.14 (Step 3); SE(kurtosis) = 0.35 (Step 4) and 0.27 (Step 3). Factor loadings are based on the exploratory factor analysis. English items were directly translated from the original German items; the English version has not been validated yet.

and gender diversity, task and relationship conflict, and team effectiveness. Our findings mainly supported our hypotheses (Table 2). As expected, PMU was positively related to perceived team effectiveness, and negatively related to task and relationship conflict. Although the correlations with age and gender diversity were not significant, the respective correlation coefficients were negative as expected and, in the case of age diversity, of a nontrivial size (r = −.25, p = .054, one-tailed). As diversity is a team-level construct, correlations involving diversity measures are at the team level (N = 44). To examine team-level properties of the scale, we calculated ICC(1) and rwg(j). We only considered teams from which we had at least two respondents because calculating the above-mentioned indices requires at least two ratings per team. This reduced the sample size to 186 individuals. Our analyses indicated that a considerable amount of variance in PMU can be attributed to team membership, ICC(1) = .34, p < .001. Furthermore, there was sufficient agreement among team members, rwg(j) = .84. Taken together, these findings suggest that aggregating PMU scores is justified and thus PMU can be treated as a team-level construct.
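To illustrate the team-level part of these analyses, the sketch below computes Blau’s (1977) index for gender diversity, the within-team standard deviation of age, and team-mean PMU scores, and then correlates them at the team level. It is an assumed workflow, not the authors’ code; the data frame d and its columns team, pmu, gender, and age are hypothetical.

```r
# Minimal sketch (not the authors' code): team-level diversity indices and correlations.
blau <- function(x) {              # Blau's (1977) index: 1 - sum of squared category proportions
  p <- table(x) / length(x)
  1 - sum(p^2)
}

team_level <- data.frame(
  pmu_mean   = tapply(d$pmu,    d$team, mean),  # referent-shift scores aggregated to team means
  gender_div = tapply(d$gender, d$team, blau),
  age_div    = tapply(d$age,    d$team, sd)
)

cor(team_level)                    # team-level correlations (N = number of teams)
```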

Discussion

The aim of the current study was to develop a convenient and versatile scale to measure perceptual team cognition.

To that end, we developed the 6-item PMU-scale in a multistage process. Our findings indicate that the PMU-scale has good psychometric properties – both at the individual as well as at the team level. The criteria applied for item generation in Step 1 ensured content validity. Item parameters were improved during the process (Steps 2 and 3). Factor analytical findings from Steps 3 and 4 suggest that the PMU-scale is one-dimensional. The scale showed sufficient reliability in two samples. Results from correlational analyses provided initial proof of construct validity. As hypothesized, PMU was positively related to perceived team effectiveness, and negatively related to both task and relationship conflict. PMU showed a substantial negative correlation with age diversity. This correlation, however, was only marginally significant. Given the limited sample size at the team level and the restricted variance in age diversity due to the student sample, this finding might still be interpreted as partial support of our hypothesis. Contrary to our hypothesis, PMU was unrelated to gender diversity. Again, this could be attributed to the characteristics of our sample. More than half of the teams were all female, and therefore teams varied little with regard to gender diversity. In addition to the common psychometric criteria, the PMU-scale meets the specific requirements we formulated to begin with. PMU is conceptualized as a team-level construct. We formulated the items accordingly and validated the scale using participants that were part of existing teams.



Table 2. Summary of means, standard deviations, and intercorrelations of the variables from Step 4

Variable                    M      SD     1        2      3       4      5
1) PMU¹                     5.02   0.75
2) Team effectiveness¹      5.58   0.91   .35***
3) Relationship conflict¹   1.78   0.93   −.46***  .18*
4) Task conflict¹           2.45   1.01   −.46***  .13    .64***
5) Age diversity²           2.42   2.39   −.25     .13    .11     .17
6) Gender diversity²        0.16   0.21   −.03     .10    .19     .07    .01

Notes. PMU = Perceived mutual understanding. ¹N = 191 individuals; ²N = 44 teams. *p < .05. **p < .01. ***p < .001 (two-tailed).

Results from Step 4 indicate that the PMU-scale meets the criteria for aggregation at the team-level: First, a significant amount of variance in PMU can be explained by team membership, and second, team members show sufficient agreement in their assessments of PMU. A further specific aim of our study was to develop a scale to use in various different team settings. This was achieved by formulating the items in such a way that they refer to teams in general rather than to a specific context. We interpret the application of our scale in three different settings (healthcare, sports, and university) as initial proof that PMU represents a meaningful construct for teams in general. Finally, by reducing the number of items to six, we ensured that the PMU-scale could be used in field settings.

Contributions Our study has implications for both theory and practice. Although both perceptual and structured cognition are considered important facets of team cognition (Rentsch & Mot, 2012), research on perceptual cognition has lagged behind (Mohammed et al., 2010). By introducing the PMU-scale, we hope to stimulate research on perceptual team cognition. In our view, both facets should be integrated to achieve a more complete understanding of the role of team cognition for team processes and outcomes. We submit that perceptual cognition has the potential to explain variance above and beyond the effects of structured cognition. On the one hand, teams may believe their team mental models to be similar (i.e., perceptual cognition) while they actually do not have similar mental models (i.e., structured cognition). In such a situation, teams will likely use implicit coordination. However, relying on implicit coordination in the absence of a similar mental model has been shown to have negative effects on team performance (Burtscher, Kolbe, Wacker, & Manser, 2011). On the other hand, if teams have structurally similar mental models of their common task, they can rely on implicit coordination mechanisms, which improves their performance (Rico et al., 2008). However, if these teams believe their


team mental models to be dissimilar, they are less likely to use implicit coordination and thus forfeit this potential benefit. In this context, the PMU-scale can help to improve our understanding of the complex interactions of team cognition and team coordination and their influence on team performance. Furthermore, a general measure offers the opportunity for systematic comparisons between team settings. Team cognition research has used a variety of different measurement techniques (Burtscher & Manser, 2012; DeChurch & Mesmer-Magnus, 2010a; Mohammed et al., 2000). Although a multi-method approach has several advantages, it also impedes the synthesis of empirical findings. Specifically, the way team cognition is measured affects the relationship between team cognition and team processes (DeChurch & Mesmer-Magnus, 2010b). With regard to perceptual cognition, the PMU-scale represents a potential remedy. Using the PMU-scale, researchers can compare teams from different settings, thereby exploring the role of context for the effects of team cognition.

In terms of practical implications, our scale can contribute to the further dissemination of team cognition in field settings. Measurement techniques such as concept mapping and card sorting are too time-consuming to be implemented on a regular basis. By contrast, the six items of the PMU-scale could be readily integrated into existing procedures such as employee surveys. For example, the PMU-scale can be used as a screening instrument to get an overview of a whole organization. Building on these results, more detailed analyses using different methodology can be applied in specific team settings.

Limitations and Avenues for Future Research

Our study is not without limitations. We intentionally constructed a general instrument for ease of use in various settings. This precludes investigating specific contents and their relationships. Based on our scale, researchers may want to develop scales aimed specifically at task-related content settings. In the same vein, recent research has


exemplified the significance of other types of content such as temporal team cognition (Mohammed & Nadkarni, 2014), which the PMU-scale does not account for. Still, as mentioned above, the PMU-scale could be used as a screening instrument to identify potential issues with regard to team cognition, which could then be scrutinized using methodology that is more content specific. In addition, we have not compared PMU with measures of structured team cognition to establish discriminant validity. We believe that the theoretical background section clearly demonstrates that both facets of team cognition are conceptually different. However, to establish discriminant validity and to prove PMU’s ability to explain additional variance, future studies need to combine the scale with measures of structured team cognition. This would also allow for investigating a potential interaction between both forms of cognition. Finally, the PMU-scale was developed in German, which obviously limits its applicability. A next step would be to translate the scale into other languages.

Conclusions

Despite the popularity of team cognition, perceptual team cognition has received limited attention in empirical studies (Mohammed et al., 2010), possibly due to the lack of an appropriate assessment method. The goal of the current study was to address this gap by developing a convenient and versatile measure for perceptual team cognition. To this end, we introduced the PMU-scale. We thereby intend to foster research on perceptual team cognition and contribute to the advancement of team cognition research in general.

Acknowledgments

We would like to acknowledge Sjir Uitdewilligen for his valuable comments on an earlier version of this manuscript and Etna Engeli, Christian Kron, Anne-Lise Schneider, Anna-Lena Köng, and Nadja Ott for their help in collecting the data.

References

Blau, P. M. (1977). Inequality and heterogeneity. New York, NY: Free Press. Bliese, P. D. (2000). Within-group agreement, non-independence, and reliability: Implications for data aggregation and analysis. In S. W. J. Kozlowski & K. J. Klein (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 349–381). San Francisco, CA: Jossey-Bass. Burtscher, M. J., Kolbe, M., Wacker, J., & Manser, T. (2011). Interactions of team mental models and monitoring behaviors
predict team performance in simulated anesthesia inductions. Journal of Experimental Psychology: Applied, 17, 257–269. doi: 10.1037/a0025148 Burtscher, M. J., & Manser, T. (2012). Team mental models and their potential to improve teamwork and safety: A review and implications for future research in healthcare. Safety Science, 50, 1344–1354. doi: 10.1016/j.ssci.2011.12.033 Cannon-Bowers, J. A., & Salas, E. (2001). Reflections on shared cognition. Journal of Organizational Behavior, 22, 195–202. doi: 10.1002/job.82 Cannon-Bowers, J. A., Salas, E., & Converse, S. (1993). Shared mental models in expert team decision making. In N. J. Castellan (Ed.), Individual and group decision making (pp. 221–246). Hillsdale, NJ: Erlbaum. Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245–276. doi: 10.1207/ s15327906mbr0102_10 Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246. doi: 10.1037/0021-9010.83.2.234 Cramton, C. D. (2001). The mutual knowledge problem and its consequences for dispersed collaboration. Organization Science, 12, 346–371. doi: 10.1287/orsc.12.3.346.10098 DeChurch, L. A., & Mesmer-Magnus, J. R. (2010a). The cognitive underpinnings of effective teamwork: A meta-analysis. Journal of Applied Psychology, 95, 32–53. doi: 10.1037/a0017328 DeChurch, L. A., & Mesmer-Magnus, J. R. (2010b). Measuring shared team mental models: A meta-analysis. Group Dynamics: Theory, Research, and Practice, 14, 1–14. doi: 10.1037/a0017455 Dunn, T. J., Baguley, T., & Brunsden, V. (2014). From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. British Journal of Psychology, 105, 399–412. doi: 10.1111/bjop.12046 Ellis, B. B., & Mead, A. D. (2002). Item analysis: Theory and practice using classical and modern test theory. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 324–343). Malden, MA: Blackwell. Ellwart, T., Konradt, U., & Rack, O. (2014). Team mental models of expertise location: Validation of a field survey measure. Small Group Research, 45, 119–153. doi: 10.1177/ 1046496414521303 Harrison, D. A., & Klein, K. (2007). What’s the difference? Diversity constructs as separation, variety, or disparity in organizations. The Academy of Management Review, 32, 1199–1228. doi: 10.2307/20159363 Hollingshead, A. B., Gupta, N., Yoon, K., & Brandon, D. P. (2012). Transactive memory theory and teams: Past, present, and future. In E. Salas, S. M. Fiore, & M. P. Letsky (Eds.), Theories of team cognition: Cross-disciplinary perspectives (pp. 421–455). New York, NY: Routledge. Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185. doi: 10.1007/ BF02289447 Huber, G. P., & Lewis, K. (2010). Cross-Understanding: Implications for group cognition and performance. Academy of Management Review, 35, 6–26. doi: 10.5465/AMR.2010. 45577787 James, L. E., Demaree, R. G., & Wolf, G. (1984). Estimating withingroup interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98. doi: 10.1037/00219010.69.1.85 James, L. E., Demaree, R. G., & Wolf, G. (1993). rwg: An assessment of within-group interrater agreement. Journal of Applied Psychology, 78, 306–309. doi: 10.1037/0021-9010.78.2.306



Jehn, K. A. (1995). A multimethod examination of the benefits and detriments of intragroup conflict. Administrative Science Quarterly, 40, 256–282. doi: 10.2307/2393638 Johnson, T. E., Lee, Y., Lee, M., O’Connor, D. L., Khalil, M. K., & Huang, X. (2007). Measuring sharedness of team-related knowledge: Design and validation of a shared mental model instrument. Human Resource Development International, 10, 437–454. doi: 10.1080/13678860701723802 Jung, D. I., & Sosik, J. J. (2002). Transformational leadership in work groups the role of empowerment, cohesiveness, and collective-efficacy on perceived group performance. Small Group Research, 33, 313–336. doi: 10.1177/ 10496402033003002 Klimoski, R., & Mohammed, S. (1994). Team mental model: Construct or metaphor? Journal of Management, 20, 403–437. doi: 10.1177/014920639402000206 Kozlowski, S. W. J., & Hattrup, K. (1992). A disagreement about within-group agreement: Disentangling issues of consistency versus consensus. Journal of Applied Psychology, 77, 161–167. doi: 10.1037/0021-9010.77.2.161 Kozlowski, S. W. J., & Klein, K. J. (2000). A multilevel approach to theory and research in organizations: Contextual, temporal, and emergent processes. In K. J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory, research, and methods in organizations: Foundations, extensions, and new directions (pp. 3–90). San Francisco, CA: Jossey-Bas. Krauss, R. M., & Fussell, S. R. (1990). Mutual knowledge and communicative effectiveness. In J. Galegher, R. E. Kraut, & C. Egido (Eds.), Intellectual teamwork: Social and technical bases of collaborative work (pp. 111–145). Hillsdale, NJ: Erlbaum. LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11, 815–852. doi: 10.1177/ 1094428106296642 Lehmann-Willenbrock, N., Grohmann, A., & Kauffeld, S. (2011). Task and relationship conflict at work: Construct validation of a German version of Jehn’s intragroup conflict scale. European Journal of Psychological Assessment, 27, 171–178. doi: 10.1027/1015-5759/a000064 Marks, M. A., Mathieu, J. E., & Zaccaro, S. J. (2001). A temporally based framework and taxonomy of team processes. The Academy of Management Review, 26, 356–376. doi: 10.5465/ amr.2001.4845785 Mathieu, J. E., Heffner, T. S., Goodwin, G. F., Cannon-Bowers, J. A., & Salas, E. (2005). Scaling the quality of teammates’ mental models: Equifinality and normative comparisons. Journal of Organizational Behavior, 26, 37–56. doi: 10.1002/job.296 Mathieu, J. E., Heffner, T. S., Goodwin, G. F., Salas, E., & CannonBowers, J. A. (2000). The influence of shared mental models on team process and performance. Journal of Applied Psychology, 85, 273–283. doi: 10.1037/0021-9010.85.2.273 Mathieu, J. E., Maynard, M. T., Rapp, T., & Gilson, L. (2008). Team effectiveness 1997–2007: A review of recent advancements and a glimpse into the future. Journal of Management, 34, 410–476. doi: 10.1177/0149206308316061 Mohammed, S., Ferzandi, L., & Hamilton, K. (2010). Metaphor no more: A 15-year review of the team mental model construct. Journal of Management, 36, 876–910. doi: 10.1177/ 0149206309356804 Mohammed, S., Klimoski, R., & Rentsch, J. R. (2000). The measurement of team mental models: We have no shared schema. Organizational Research Methods, 3, 123–165. doi: 10.1177/ 109442810032001 Mohammed, S., & Nadkarni, S. (2014). Are we all on the same temporal page? 
The moderating effects of temporal team cognition on the polychronicity diversity-team performance
relationship. Journal of Applied Psychology, 99, 404–422. doi: 10.1037/a0035640 Nunnally, J., & Bernstein, I. (2004). Psychometric theory. New York, NY: McGraw-Hill. R Core Team. (2012). R: A language and environment for statistical computing. Vienna, Austria. Retrieved from http://www. R-project.org Ren, Y., & Argote, L. (2011). Transactive memory systems 1985– 2010: An integrative framework of key dimensions, antecedents, and consequences. The Academy of Management Annals, 5, 189–229. doi: 10.1080/19416520.2011.590300 Rentsch, J. R., Delise, L. A., & Hutchison, S. (2009). Cognitive similarity configurations in teams: In search of the team MindMeld. In E. Salas, G.-F. Goodwin, & C. S. Burke (Eds.), Team effectiveness in complex organizations: Cross-disciplinary perspectives and approaches (pp. 241–266). New York, NY: Psychology Press. Rentsch, J. R., & Klimoski, R. J. (2001). Why do “great minds” think alike? Antecedents of team member schema agreement. Journal of Organizational Behavior, 22, 107–120. doi: 10.1002/ job.81 Rentsch, J. R., & Mot, I. R. (2012). Elaborating cognitions in teams: Cognitive similarity configurations. In E. Salas, S. M. Fiore, & M. P. Letsky (Eds.), Theories of team cognition: Cross-disciplinary perspectives (pp. 145–170). New York, NY: Routledge. Rico, R., Sánchez-Manzanares, M., Gil, F., & Gibson, C. (2008). Team implicit coordination processes: A team knowledgebased approach. Academy of Management Review, 33, 163–184. doi: 10.2307/20159381 Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48, 1–36. Salas, E., Cooke, N. J., & Rosen, M. A. (2008). On teams, teamwork, and team performance: Discoveries and developments. Human Factors: The Journal of the Human Factors and Ergonomics Society, 50, 540–547. doi: 10.1518/ 001872008x288457 Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420–428. doi: 10.1037/0033-2909.86.2.420 Smith-Jentsch, K. A. (2009). Measuring team-related cognition: The devil is in the details. In E. Salas, G.-F. Goodwin, & C. S. Burke (Eds.), Team effectiveness in complex organizations: Cross-disciplinary perspectives and approaches (pp. 491–508). New York, NY: Psychology Press. Wegner, D. M. (1987). Transactive memory: A contemporary analysis of the group mind. In B. Mullen & G. R. Goethals (Eds.), Theories of group behavior (Vol. 9, pp. 185–208). New York, NY: Springer. Ziegler, M., Booth, T., & Bensch, D. (2013). Getting entangled in the nomological net. European Journal of Psychological Assessment, 29, 157–161. doi: 10.1027/1015-5759/a000173


Received June 24, 2014
Revision received September 5, 2015
Accepted December 27, 2015
Published online October 7, 2016

Michael J. Burtscher
Department of Psychology
University of Zurich
Binzmuehlestrasse 14/13
8050 Zurich
Switzerland
Tel. +41 44 635-7273
E-mail m.burtscher@psychologie.uzh.ch


Appendix A

Table A1. Means, standard deviations, corrected item-total correlations, and factor loadings for the eight items from Step 3 (N = 315)

1) Wir sind einer Meinung hinsichtlich unserer jeweiligen Kompetenzen. (M = 5.03, SD = 1.14, CITC = .64, factor loading = .70)
2) Wir sind in der Regel einer Meinung darüber, wie unsere individuellen Aufgaben ausgeführt werden sollten. (M = 5.09, SD = 1.03, CITC = .54, factor loading = .59)
3) Wir sind gleicher Ansicht hinsichtlich unserer individuellen Stärken und Schwächen. (M = 4.99, SD = 1.15, CITC = .62, factor loading = .68)
4) Wir sind uns vollkommen einig, wer von uns welche Aufgabe ausführen sollte. (M = 5.15, SD = 1.12, CITC = .59, factor loading = .64)
5) Wir haben ein gemeinsames Verständnis bezüglich des Zusammenhangs zwischen den einzelnen Aufgaben. (M = 5.22, SD = 1.05, CITC = .62, factor loading = .68)
6) Wir stimmen in der gegenseitigen Beurteilung unserer jeweiligen Expertise vollkommen überein. (M = 4.58, SD = 1.07, CITC = .64, factor loading = .70)
7) Wir sind uns genau bewusst, über welche Fertigkeiten jedes Teammitglied verfügt. (M = 5.35, SD = 1.14, CITC = .57, factor loading = .63)
8) Wir verstehen auch die Aufgaben der anderen Teammitglieder. (M = 5.45, SD = 1.10, CITC = .58, factor loading = .62)

Notes. CITC = Corrected item-total correlation. Factor loadings based on exploratory factor analysis.
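For readers who wish to reproduce the statistics reported in Table A1, a minimal R sketch is given below. It assumes a data frame pmu_items whose eight columns contain the responses to the items listed above; the object name and the use of the psych package are illustrative assumptions, not the authors' documented analysis code.

# Item analysis for the eight candidate PMU items (illustrative sketch).
# 'pmu_items' is a hypothetical data frame: one column per item, one row per respondent.
library(psych)

item_stats <- alpha(pmu_items)$item.stats          # r.drop = corrected item-total correlation (CITC)
round(item_stats$r.drop, 2)

efa_one <- fa(pmu_items, nfactors = 1, fm = "pa")  # single-factor EFA, principal axis factoring
print(efa_one$loadings, cutoff = 0)                # loadings as in the last column of Table A1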

Appendix B

Table B1. Summary of intercorrelations for the 6-item version from Step 3 and Step 4

1) Wir sind einer Meinung hinsichtlich unserer jeweiligen Kompetenzen.
2) Wir sind gleicher Ansicht hinsichtlich unserer individuellen Stärken und Schwächen.
3) Wir sind uns vollkommen einig, wer von uns welche Aufgabe ausführen sollte.
4) Wir haben ein gemeinsames Verständnis bezüglich des Zusammenhangs zwischen den einzelnen Aufgaben.
5) Wir stimmen in der gegenseitigen Beurteilung unserer jeweiligen Expertise vollkommen überein.
6) Wir sind uns genau bewusst, über welche Fertigkeiten jedes Teammitglied verfügt.

          Item 1   Item 2   Item 3   Item 4   Item 5   Item 6
Item 1      –       .55      .40      .44      .50      .45
Item 2     .66       –       .37      .43      .48      .52
Item 3     .41      .35       –       .46      .44      .37
Item 4     .42      .32      .38       –       .53      .36
Item 5     .59      .56      .36      .50       –       .49
Item 6     .53      .51      .32      .30      .41       –

Notes. Intercorrelations for Step 3 (N = 315) are presented above the diagonal, and intercorrelations for Step 4 (N = 191) are presented below the diagonal. All coefficients are significant at p < .001.
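The intercorrelations and internal consistency of the 6-item version can be obtained with a few lines of R; pmu6_step3 and pmu6_step4 are hypothetical data frames holding the six items for the Step 3 and Step 4 samples, and the psych package is again used only for illustration.

library(psych)

# Inter-item correlations per sample (cf. Table B1).
round(cor(pmu6_step3, use = "pairwise.complete.obs"), 2)
round(cor(pmu6_step4, use = "pairwise.complete.obs"), 2)

# Internal consistency: coefficient alpha and McDonald's omega
# (Dunn, Baguley, & Brunsden, 2014, argue for reporting omega).
alpha(pmu6_step3)$total$raw_alpha
omega(pmu6_step3, nfactors = 1)$omega.tot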



Multistudy Report

Assessing Positive Orientation With the Implicit Association Test

Giulio Costantini,1 Marco Perugini,1 Francesco Dentale,2 Claudio Barbaranelli,2 Guido Alessandri,2 Michele Vecchione,2 and Gian Vittorio Caprara2

1 Department of Psychology, University of Milan-Bicocca, Milan, Italy
2 Psychology Department, “Sapienza” University of Rome, Italy

Abstract: Positive orientation (PO) is a basic predisposition that consists in a positive outlook toward oneself, one’s life, and one’s future, which is associated with many desirable outcomes connected to health and to the general quality of life. We performed a lexical study for identifying a set of markers of PO, developed an Implicit Association Test (the PO-IAT), and investigated its psychometric properties. The PO-IAT proved to be a reliable measure with a clear pattern of convergent validity, both with respect to self-report scales connected to PO and with respect to an indirect measure of self-esteem. A secondary aim of our studies was to validate a new brief adjective scale to assess PO, the POAS. Our results show that both the PO-IAT and the self-reported PO predict the frequency of depressive symptoms and self-perceived intelligence. Keywords: positive orientation, IAT, self-esteem, optimism, life satisfaction

Positive Orientation (PO) attests to a basic predisposition that consists in viewing oneself as worthy of regard, one’s life as worth living, and one’s future as promising. PO is conceived of as a high-order latent dimension that is responsible for the covariance of self-esteem, life satisfaction, and optimism (Alessandri, Caprara, & Tisak, 2012). In previous investigations, PO showed consistent trait-like properties and it predicted important life outcomes such as health, depression, psychological resilience, quality of friendship, positive and negative affect, and selfenhancement (Alessandri, Caprara, et al., 2012; Caprara, Alessandri, Colaiaco, & Zuffianò, 2013; Caprara et al., 2012). PO has been shown to mediate the impact of extraversion and neuroticism on subjective happiness (Lauriola & Iani, 2015). Among the Big Five factors, PO correlates especially with extraversion, agreeableness, and neuroticism, however the size of such correlations suggests that PO cannot be fully represented within the Big Five space (Caprara et al., 2012). Caprara and collaborators developed the positivity scale (P-scale), an 8-item inventory of PO, that has shown to be reliable and valid (Caprara et al., 2012). However, no indirect1 measure of PO is available in the literature. Indirect measures rely on simple behavioral tasks for

assessing a variety of constructs, are less prone than selfreports to socially-desirable responding and limited introspective ability, and are thought to reflect associations more than propositions (e.g., Back, Schmukle, & Egloff, 2009). For instance, by tapping into associations which people may be unaware of, indirect measures have fostered a deeper understanding of the mechanisms of self-esteem (e.g., Zeigler-Hill, 2006), which is a fundamental constituent of PO. PO might be as well characterized by associative processes that are important to understand its properties, but that cannot be assessed by means of selfreports only. Furthermore being positively oriented is likely to be perceived as socially desirable: self-serving biases may affect indirect measures of PO to a lower degree than self-reports. The Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998) is one of the most widely used, reliable, and valid indirect measures (Greenwald, Poehlman, Uhlmann, & Banaji, 2009) and has been used to evaluate aspects of self-concept (e.g., Back et al., 2009). The primary goal of our research was to develop an IAT to assess PO (PO-IAT) and to evaluate its properties. In the development of the IAT, the selection of the stimuli is very important and can influence the IAT effect (e.g., Bluemke & Friese, 2006).

We use the terms indirect and direct rather than implicit and explicit measures in line with the arguments and definitions of De Houwer and Moors (2010) and Perugini, Costantini, Richetin, and Zogmaister (2015).


In Study 1 we used a careful empirical approach to select the best markers of PO to use in the IAT. With the same markers, we assembled the Positive Orientation Adjective Scale (POAS), a new brief scale of PO. We hypothesized that both new measures of PO would converge with self-reports of positive orientation, life satisfaction, optimism, self-esteem, affect, and that additionally the PO-IAT would converge with an indirect measure of self-esteem (Greenwald & Farnham, 2000). We also hypothesized that the new measures would predict the frequency of occurrence of depressive symptoms (Radloff, 1977), since the lack of depressive symptoms is among of the most important known correlates of PO (Caprara et al., 2012; Heikamp et al., 2014). Following Perugini, Richetin, and Zogmaister (2010) we further examined the combined pattern of prediction of the direct and indirect measures of PO: Using multiple regression we tested for an additive pattern, in which both kind of measures explain unique portions of criterion variance. This pattern is especially important, since it indicates that disregarding one kind of measure implies sacrificing a portion of explained criterion variance. Finally, we used the PO-IAT to explore the relationship between PO and the self-perceived intelligence. Selfperceived intelligence reflects only partially the performance in intelligence tests and is also influenced by other variables, such as personality traits (Furnham & Buchanan, 2005): Since intelligence is a generally valued quality, we expect positively oriented individuals to perceive themselves as more intelligent. However, by definition selfperceived intelligence can only be assessed using direct measures. Therefore, the use of an indirect measure of PO allows to examine this relationship while also controlling for the potential confounding effect of social desirability that is likely to affect self-report measures. In fact, since the IAT is less affected by social desirability effects (Greenwald et al., 2009), if a relationship between selfperceived intelligence and the PO-IAT emerged, it could not be simply ascribed to response biases. In examining the relationship between PO and self-perceived intelligence, we also controlled for an additional potential confound, namely the cognitive component of taskswitching ability that seems to affect at least in part the IAT scores (Back, Schmukle, & Egloff, 2005).


Study 1: Development of Stimuli for PO-IAT and for POAS

Study 1 aimed at selecting a set of 12 items for assessing PO that had the following desirable features: (1) being balanced by polarity of PO (positivity vs. negativity) and (2) being balanced by facets of PO (self-esteem, life satisfaction, and optimism). Balancing items by polarity is important, since the IAT requires the same number of stimuli in each category (Greenwald et al., 1998). We decided to balance the IAT items also by facet to mirror the structure of PO that emerged originally from self-reports (Alessandri, Caprara, et al., 2012): This strategy led to valid indirect measures in previous studies (e.g., Back et al., 2009; Costantini et al., 2015). Although one could object that the structure of implicit PO may differ from that emerging from self-reports, it is important to consider that there is no instrument available that would allow a broad exploratory study of the structure of PO (i.e., the equivalent of a lexical study) using only indirect measures. Furthermore, since these three facets are crucial for the definition of PO (Alessandri, Caprara, et al., 2012), if we had disregarded the structure of PO we might have ended up assessing a construct different from PO.

Materials and Methods Participants One hundred ninety participants (53 males, 133 females, 4 did not indicate gender) were recruited by the authors in two Italian universities and were asked to fill in a battery of questionnaires on a voluntary basis. Their average age was 22.53 years (SD = 3.64). Measures Adjective/Nouns Checklist A set of 88 adjectives and 37 nouns was developed by all authors, who identified items that could qualify as markers of PO, life satisfaction, optimism, or self-esteem, relying on the content of the P-Scale and on their expert knowledge. At this initial stage, we aimed at obtaining a list that was as comprehensive as possible. Participants rated the extent to which each adjective applied to them and the extent to which they associated each noun to themselves, on a scale from 1 (= not at all) to 5 (= completely). Positivity Scale (P-scale; α = .82) PO was assessed with the 8-item Positivity Scale (Caprara et al., 2012). Participants provided their ratings on a scale ranging from 1 (= strongly disagree) to 5 (= strongly agree).


Life Satisfaction (SWLS; α = .82) Life satisfaction was assessed with the 5-item Satisfaction with Life Scale (Diener, Emmons, Larsen, & Griffin, 1985). Participants rated the extent to which they felt generally satisfied with life on a scale ranging from 1 (= strongly disagree) to 7 (= strongly agree). One item clearly overlapped in content with one included in the P-scale and was administered only within the P-scale.


Optimism (LOT; α = .87) Optimism was assessed with the 10-item Life Orientation Test (Scheier, Carver, & Bridges, 1994). Four items were fillers and were not included in the computation of the score. Participants provided their ratings on a scale ranging from 1 (= strongly disagree) to 5 (= strongly agree). Self-Esteem (RSES; α = .89) Self-esteem was assessed with the 10-item Rosenberg SelfEsteem Scale (Rosenberg, 1965). Ratings were provided on a 4-point scale ranging from 1 (= strongly disagree) to 4 (= strongly agree). Two items clearly overlapped in content with those included in the P-scale and were administered only within the P-scale. Procedure In one university, the measures were administered in this fixed sequence: Adjective/nouns checklist, P-scale, SWLS, LOT, and RSES. In the other university, the Positivity Scale was administered in classroom, while the other measures were administered online.

Results Missing values were replaced with the rounded mean of the variable.2 Males and females did not show significantly different scores in the P-scale, Mmales = 3.61 (SD = 0.68), Mfemales = 3.65 (SD = 0.61), t(184) = 0.36, p = .72; in the SWLS, Mmales = 4.30 (SD = 1.52), Mfemales = 4.57 (SD = 1.18), t(184) = 1.29, p = .20; in the LOT, Mmales = 3.57 (SD = 0.91), Mfemales = 3.32 (SD = 0.85), t(184) = 1.81, p = .07; or in the RSES, Mmales = 3.26 (SD = 0.58), M = 3.12 (SD = 0.58), t(184) = 1.47, p = .14. For identifying the best markers of PO, we adopted a two-step strategy: The first step consisted in preselecting a smaller subset of best markers according to their correlations with the P-Scale, the RSES, the SWLS, and the LOT, which were also balanced by facet and valence. In particular, among the initial 125 items, we considered 60 that correlated more than the median with the P-scale (Mdn r = .425) and that additionally correlated more than the median with at least one among the RSES (Mdn r = .364), SWLS (Mdn r = .372), and LOT (Mdn r = .388). We selected 30 among these items that were balanced by facets (10 for self-esteem, 10 for life satisfaction, and 10 for optimism) and by valance (15 for the positive pole and 15 for the negative pole). We preferred items with higher correlation with the P-Scale, but we also kept an eye on breadth of content: for instance, we excluded items that clearly overlapped in content (e.g., for this reason we kept “futuro cupo” – gloomy 2


future – but we dropped “futuro buio” – dark future – and other similar items). The second step consisted in factor-analyzing these 30 markers to identify the best possible stimuli to be used in the IAT. We performed an exploratory factor analysis (EFA) with principal axis factoring estimation: Five factors had eigenvalues higher than 1 (the first eigenvalues were 13.82, 2.17, 1.71, 1.47, 1.07, and 0.95) and parallel analysis indicated three factors (the first random eigenvalues were 1.83, 1.71, 1.62, and 1.54). One factor explained 44% of the common variance and three factors explained 55%. We performed an iterative sequence of factor analyses: at each step the most unsatisfactory item was dropped and the analysis was repeated, until we obtained a final list of 12 items that were balanced by facet and by valence. Table 1 (Study 1) reports the single factor solution and the three oblique factors solution. In the single factor solution, all items loaded clearly on the general factor and the three factors clearly reproduced the facets self-esteem, life satisfaction, and optimism. As expected, the correlations among the three factors were high and ranged from .55 to .70. Both the scores on the single factor and on the three oblique factors correlated with the P-scale and with the three facets of PO assessed with RSES, SWLS, and LOT. The three oblique factors correlated with the corresponding facet of PO more than they correlated with the other facets. Males and females did not differ significantly in their factor scores, both in the single factor solution, t(184) = 1.18, p = .24, and in the three-factors solution (all ps > .13).

Discussion

Starting from a large set of 125 items, we selected 12 items representative of PO that were balanced by facet and by valence, relying on a correlational analysis and on EFA. These items served primarily for developing the PO-IAT. As an ancillary benefit, with the same items we also developed a corresponding adjective-list self-report measure of PO, the POAS. Although using statistical techniques such as EFA is a strategy to reduce subjectivity in scale construction, one can argue that some degree of subjectivity is always involved, at least in the choice of the initial pool of items, or in the selection of the item to drop at each step of the iterative factor analysis if many items have a similarly poor performance. This degree of subjectivity can be controlled for by inspecting whether the results replicate in a new sample: One of the aims of Study 2 is therefore to examine whether the factor structure of the 12 POAS items closely mirrors the structure that emerged in Study 1. We also wish to stress that in the IAT literature, stimuli are often selected based on intuition, without specific empirical evidence of their adequacy to reflect the targeted dimension. Here, instead, we performed a dedicated study in which stimuli were selected based on empirical evidence. One could argue that this approach represents a relevant methodological improvement in the quality of the stimuli selected for an IAT and, consequently, of the corresponding indirect measure.

Since the proportion of missing values was very small (66 missing data points out of 29,450), using more sophisticated imputation methods (e.g., van Buuren & Groothuis-Oudshoorn, 2011) did not affect the results.


Table 1. Exploratory factor analysis of the PO items in the two studies and correlations of the factor scores with measures of PO and of its constituents

Factor loadings (one-factor solution: PO; three oblique factors: Self-esteem, Life-satisfaction, Optimism), for Study 1 and Study 2:

Item                           Study 1: PO  Self-est.  Life-sat.  Optimism   Study 2: PO  Self-est.  Life-sat.  Optimism
Alta stima (high esteem)           .57         .82        .03        .05         .73         .84        .09        .05
Bassa stima (low esteem)a          .71         .62        .17        .07         .76         .81        .11        .01
Sicuro (self-assured)              .59         .68       <.01        .07         .65         .84        .08        .03
Insicuro (insecure)a               .58         .72        .01        .03         .64         .79        .09        .07
Felice (happy)                     .67         .11        .71        .03         .64         .06        .92        .02
Infelice (unhappy)a                .73         .03        .88       <.01         .69         .12        .69        .05
Contento (cheerful)                .55         .05        .60        .09         .69         .02        .83        .03
Scontento (dissatisfied)a          .70         .03        .76        .03         .65         .09        .58        .11
Positivo (positive)                .78         .03        .08        .91         .75         .05        .06        .75
Negativo (negative)a               .79         .06        .10        .83         .76         .03        .06        .90
Ottimista (optimistic)             .77         .10        .02        .73         .76         .02        .09        .80
Pessimista (pessimistic)a          .79         .01        .02        .86         .72         .04        .01        .88

Correlation among oblique factors (Study 1 / Study 2):
Life-satisfaction with Self-esteem: .55 / .51
Optimism with Self-esteem: .61 / .62
Optimism with Life-satisfaction: .70 / .58

Correlations of the factor scores with scales connected to PO (same column order as above):
Positivity scale                   .71***      .63***     .71***     .60***      .76***      .68***     .66***     .65***
Rosenberg Self-Esteem Scale        .68***      .74***     .57***     .55***      .70***      .74***     .51***     .56***
Satisfaction With Life Scale       .60***      .50***     .65***     .48***      .61***      .50***     .61***     .48***
Life Orientation Test              .76***      .61***     .62***     .77***      .71***      .57***     .55***     .73***

Notes. Factor loadings larger than .20 are represented in bold. Oblique factors were rotated with method oblimin; the table reports the pattern matrices. a The item taps into the negative pole of positive orientation and is reverse-scored. ***p < .001.


Study 2: Validity of the New Indirect and Direct Measures of PO

We tested the validity of the new measures, the PO-IAT and the POAS, by inspecting their convergence with existing measures of PO, of its facets, and of positive and negative affectivity. We examined the criterion validity of the new measures with respect to the frequency of depressive symptoms. We also investigated whether the measures of PO would predict self-perceived intelligence. A Task-Switching Ability IAT (TSA-IAT; Back et al., 2005) was administered in order to rule out the possibility that the relationship between the PO-IAT and self-perceived intelligence could simply be ascribed to the task-switching ability that is related to IAT scores (Back et al., 2005).

Materials and Methods


Participants Two hundred fifty-four participants (110 males; mean age = 22.59, SD = 4.49) were recruited in two Italian universities, 125 were collected in the first university and 129 in the second. Two additional participants were excluded from the analyses because of random responding in the IAT, as revealed by the high proportion of errors (> 30%), ten more participants were excluded because they did not complete all the measures, and eleven more participants because they indicated that they were not Italian native speakers. The power for detecting the typical IAT-criterion correlation (which is r = .274 according to Greenwald et al., 2009) was .99 for the full sample (N = 254) and .89 for the second sample (N = 129).
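The reported power values can be checked with a short computation; the pwr package call below is an illustration and not the authors' code.

library(pwr)

# Power to detect the typical IAT-criterion correlation of r = .274
# (Greenwald et al., 2009) with a two-sided test at alpha = .05.
pwr.r.test(n = 254, r = .274, sig.level = .05)  # power ~ .99, full sample
pwr.r.test(n = 129, r = .274, sig.level = .05)  # power ~ .89, second sample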


No optional stopping rules have been applied to arrive at the given sample size (Asendorpf et al., 2013; Simmons, Nelson, & Simonsohn, 2011). Measures As in Study 1, we administered the P-scale (α = .82), SWLS (α = .78), LOT (α = .85), and RSES (α = .86). Additionally, we also administered the following measures. Positive Orientation IAT (PO-IAT) We developed an IAT for assessing positive orientation following the procedure by Greenwald and colleagues (1998). The target categories were Me versus Others and the attribute categories were Positivity versus Negativity. The stimuli for the attribute categories were those identified in Study 1 (see Table 1). The IAT scores were computed using the improved D6 algorithm (Greenwald, Nosek, & Banaji, 2003). Full details about the implementation of the IAT are reported in the Electronic Supplementary Material 1. Positive Orientation Adjective Scale (POAS; α = .92) Participants were instructed to indicate the extent to which each of the 12 markers of PO identified in Study 1 (see Table 1) described them on a scale, from 1 (= It does not describe me at all) to 5 (= It describes me completely). The POAS score was computed as the simple average of the items’ scores. Positive Affectivity (PA; α = .84) and Negative Affectivity (NA; α = .85) The Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988) was administered. Participants rated the extent to which each of 20 adjectives, 10 for the positive affect (PA) and 10 for the negative affect (NA), described their typical mood on a scale ranging from 1 (= very slightly or not at all) to 5 (= extremely). Depressive Symptoms (CES-D; α = .90) Participants rated the level of occurrence of 16 depressive symptoms during the previous week, using the Center for Epidemiologic Studies Depression Scale (Fava, 1983; Radloff, 1977), on a scale from 1 (= rarely or none of the time) to 4 (= most or all of the time [5–7 days]). The full CES-D scale also included four reverse-scored items that were not administered, since they overlapped in content with measures of self-esteem and positivity.
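A compact sketch of a D-score computation in the spirit of the improved scoring algorithm (Greenwald, Nosek, & Banaji, 2003) is shown below. It simplifies the published procedure (for instance, it pools all compatible and incompatible trials instead of scoring practice and test blocks separately), and the data frame trials, with columns id, block ("compatible" vs. "incompatible"), rt (in ms), and error (logical), is hypothetical.

# Simplified IAT D-score: drop latencies above 10,000 ms, replace error-trial
# latencies with the block mean of correct trials plus a 600-ms penalty, and
# divide the incompatible-compatible mean difference by the pooled SD.
d_score <- function(trials) {
  trials <- trials[trials$rt < 10000, ]
  sapply(split(trials, trials$id), function(d) {
    lat <- d$rt
    for (b in unique(d$block)) {
      in_block <- d$block == b
      lat[in_block & d$error] <- mean(d$rt[in_block & !d$error]) + 600
    }
    (mean(lat[d$block == "incompatible"]) -
        mean(lat[d$block == "compatible"])) / sd(lat)
  })
}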


Implicit Self-Esteem (SE-IAT) Implicit self-esteem was assessed using a standard Self-Esteem IAT (α = .76; Greenwald & Farnham, 2000). The structure of the task was as given in Greenwald and Farnham (2000). Task-Switching Ability IAT (TSA-IAT) The task-switching ability IAT (α = .71; Back et al., 2005) is an IAT in which stimuli and categories are selected to minimize the effect of individual differences in preexisting associations. Participants were asked to sort stimuli in two pairs of categories, that were Letter (a sample stimulus is “B”) versus Number (e.g., “5”), and Word (e.g., “pen”) versus Calculation (e.g., “7 4 = 3”). Since a clear association is assumed for all participants between categories Letter and Word and between categories Number and Calculation, the TSA-IAT score is thought to reflect almost exclusively the method-specific variance of the IAT connected to the task-switching ability. Full details on this task can be found in the original publication (Back et al., 2005). Self-Perceived Intelligence3 (α = .84) Participants evaluated their intelligence, using IQ typical scores, in 14 different domains, which were general, verbal, logical-mathematical, spatial, musical, kinesthetic, interpersonal, intrapersonal, naturalistic, creative, existential, spiritual, emotive, and practical. This measure has been adapted from Furnham and Buchanan (2005). An EFA with a principal axis factoring estimation revealed that the 14 dimensions of self-perceived intelligence were adequately explained by a single factor. The first five eigenvalues were 4.80, 1.50, 1.34, 1.11, and 0.88. Parallel analysis indicated that two factors explained more variance than those extracted from random data (the first random eigenvalues were 1.62, 1.47, and 1.35) and the MAP criterion (Velicer, 1976) indicated a one factor structure. The loadings on the first factor ranged between .38 and .68. Procedure Both samples took the measures in the following order: PO-IAT, POAS, P-scale, RSES, SWLS, LOT, PANAS, CES-D. The second sample completed additionally the SE-IAT, the TSA-IAT, and the self-perceived intelligence questionnaire.4 The measures were administered via computer, with the exception of the self-perceived intelligence measure which was administered via paper and pencil.

Nine participants did not complete the self-reports of intelligence and two had a substantial number of missing values (5 and 11), therefore the analyses involving these scores were performed on N = 118 participants. The second sample (N = 129) filled out another questionnaire, but results involving these questions are not discussed here. For promoting the reproducibility of our research, all the original materials and the full dataset are reported in the Electronic Supplementary Materials 2–10 (e.g., see Asendorpf et al., 2013).


Results

We performed an EFA with principal axis factoring estimation to inspect the dimensionality of the POAS. The first four eigenvalues were 6.46, 1.49, 1.26, and 0.63; the scree test therefore indicated that a 1-factor or a 3-factor solution was plausible: Both solutions are reported in Table 1 (Study 2). The Tucker congruence coefficient between the solutions that emerged in Study 1 and in Study 2 was ϕ = .99 for the single factor, and ranged between ϕ = .95 and ϕ = .99 for the three-factor solutions, indicating that the results of Study 2 clearly replicated those of Study 1 (Lorenzo-Seva & ten Berge, 2006). The Spearman-Brown adjusted split-half correlation of the IAT was .86, revealing the good internal consistency of the PO-IAT. The average PO-IAT score was .52 (SD = .38, min = -.81, max = 1.38), meaning that the participants were on average faster in associating themselves with Positivity than with Negativity. Males had a higher average PO-IAT score (MPO-IAT = .55, SD = .41) and POAS score (MPOAS = 3.75, SD = .66) than females (MPO-IAT = .49, SD = .36; MPOAS = 3.41, SD = .75). This difference was statistically significant only for the POAS, t(252) = 3.79, p < .001, but not for the PO-IAT, t(252) = 1.27, p = .20. The correlations among the measures administered are reported in Table 2.5 The PO-IAT and the POAS converged with each other and with the self-reports of PO, self-esteem, life satisfaction, optimism, and affect. Additionally, the PO-IAT correlated with implicit self-esteem. The criterion validity of the PO-IAT was supported by a negative correlation with the CES-D. In the prediction of the CES-D, both the PO-IAT and the self-report measures of PO showed incremental validity over each other: When the P-scale, the POAS, and the PO-IAT were entered as predictors of the CES-D in a multiple regression, they all explained unique portions of the variance of the CES-D (βP-Scale = -.15, p = .048, βPOAS = -.48, p < .001, βPO-IAT = -.11, p = .03, R2 = .406). The PO-IAT also showed a positive correlation with self-perceived intelligence (see Table 2) and it remained a significant predictor of self-perceived intelligence after controlling for the TSA-IAT in a multiple regression (βPO-IAT = .22, p = .025; βTSA-IAT = .07, p = .432, R2 = .044, p = .074).
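The congruence and incremental-validity analyses reported in this section can be reproduced along these lines in R; study2 (participant-level scores), load1, and load2 (the loading matrices from Studies 1 and 2) are hypothetical object names, and the psych package is assumed.

library(psych)

# Tucker congruence between the factor solutions of Study 1 and Study 2.
factor.congruence(load1, load2)

# Additive prediction pattern: do the P-scale, the POAS, and the PO-IAT each
# explain unique variance in depressive symptoms (CES-D)?
fit <- lm(scale(CESD) ~ scale(Pscale) + scale(POAS) + scale(POIAT), data = study2)
summary(fit)  # standardized coefficients correspond to the betas reported above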

Discussion We examined the psychometric properties of an IAT and of an adjective scale to assess positive orientation. Both the PO-IAT and the POAS showed high internal consistency and, as we hypothesized, both measures showed clear convergent validity with other measures of PO and of related 5


constructs. Additionally, both the PO-IAT and the selfreport measures of PO predicted the frequency of depressive symptoms and gave a unique contribution in terms of explained variance. This additive pattern of prediction (Perugini et al., 2010) confirms the importance of considering both direct and indirect measures in the assessment of PO. Interestingly the PO-IAT predicted self-perceived intelligence even after controlling for task-switching ability: This result suggests that the relationship between PO and self-perceived intelligence cannot be ascribed either to a social desirability component characterizing self-reports, or to the IAT’s method variance. Instead, it may reflect a self-serving bias of positively oriented individuals in the evaluation of their intelligence independently of the specific dimension considered, supporting the hypothesis that PO may induce or enhance a positive bias in self-evaluation (Caprara et al., 2013). Of course, we cannot exclude the opposite possibility, that a higher level of self-perceived intelligence induced a higher level of implicit PO.

General Conclusions

Relying on the results of a lexical study (Study 1), we developed two new measures of PO, an IAT and an adjective checklist (POAS), and tested their psychometric properties (Study 2). The new measures showed good reliability and convergent validity. Additionally, the PO-IAT and the POAS predicted the frequency of depressive symptoms independently, and the PO-IAT predicted self-perceived intelligence even after controlling for the IAT’s method variance. Together these results confirm the good psychometric properties of the new measures and the importance of considering both indirect and direct measures when investigating psychological constructs. We can anticipate several contexts in which the new measures can be useful. The PO-IAT is especially suitable whenever it is crucial to rule out the effects of socially desirable responding and of limited introspective ability. Additionally, the PO-IAT can be important for investigating whether implicit and explicit PO play different roles in crucial psychological mechanisms, as has been shown for self-esteem (Zeigler-Hill, 2006). The POAS can serve as a useful complement to the PO-IAT whenever it is important to consider a self-report measure that mirrors the IAT (e.g., Costantini et al., 2015; Payne, Burkley, & Stokes, 2008), and it can also complement the P-Scale whenever it is relevant to administer very short and easy-to-answer measures, such as in daily diary studies (e.g., Fleeson, 2001). Both the PO-IAT and the POAS have been developed relying on samples of college students; however, PO has been shown to also play a role in other contexts, such as in organizations (Alessandri, Vecchione, et al., 2012). An important task for future research is to extend the validation of the PO-IAT and of the POAS by examining their ability to predict criteria that are especially important in such contexts, such as job performance (Alessandri, Vecchione, et al., 2012; Ziegler, 2014). A limitation of this study is that an objective behavioral criterion was not available. The identification of specific behavioral correlates of PO would have gone beyond the scope of this work, but it is arguably an important issue for future research, together with the investigation of their predictability by means of indirect and direct measures. In conclusion, we have reported here a new indirect measure of Positive Orientation and provided empirical evidence of its good psychometric properties. This measure can represent a meaningful addition to the literature by tapping into associative processes related to one’s positive orientation about life which, as we have shown here as well as in previous research, goes beyond mere self-esteem.

An analysis of the homogeneity of the covariance matrices in the two subsamples is reported in the Electronic Supplementary Material 1.


Table 2. Pearson correlations among the explicit and implicit measures administered 1 1. PO-IAT

2

3

4

5

6

7

8

9

10

11

12

1

2. POAS

.20**

3. P-scale

.22***

.77***

1

4. RSES

.23***

.71***

.68***

5. SWLS

.24***

.61***

.73***

.49***

6. LOT

.17**

.71***

.63***

.55***

.52***

7. PA

.14*

.58***

.63***

.54***

.45***

.44***

1

8. NA

.16**

.63***

.52***

.47***

.42***

.47***

.18**

1 1 1 1 1

9. CESD

.24***

.62***

.54***

.52***

.48***

.50***

.31***

.68***

1

10. SE-IATa

.23**

.13

.15

.16

.11

.06

.01

.03

.01

1

11. TSA-IATa

.19*

.14

.08

.10

.04

.11

.12

.11

.12

.02

12. Self-perceived intelligenceb

.20*

.19*

.24**

.31***

.23*

.13

.35***

.21*

.14 < .01

1 .04

1

Notes. Correlations involving the PO-IAT and the POAS are reported in italics. PO-IAT = IAT measure of Positive Orientation; POAS = Positive Orientation Adjective Scale; P-scale = Positivity Scale; RSES = Rosenberg Self-Esteem Scale; SWLS = Satisfaction With Life Scale; LOT = Life Orientation Test; PA = Positive Affect Scale; NA = Negative Affect Scale; CES-D = Center for Epidemiologic Studies Depression Scale. aThe sample size for the correlations involving the SE-IAT and TSA-IAT is N = 129. bThe sample size for the correlations involving self-perceived intelligence is N = 118. *p < .05. **p < .01. ***p < .001.


Electronic Supplementary Material

The electronic supplementary material is available with the online version of the article at http://dx.doi.org/10.1027/1015-5759/a000362

ESM 1. Text (PDF). Implementation of the PO-IAT and homogeneity of the covariances in Study 2. Codebook and item translations.
ESM 2. Text (PDF). Instructions for using the materials.
ESM 3. Questionnaire (PDF). Paper and pencil questionnaires administered to the 2nd sample, including the self-perceived intelligence questionnaire and additional measures that are not discussed in the paper (see Footnote 4).
ESM 4. Script (exp). The script including the PO-IAT measure and the POAS.
ESM 5. Script (exp). The script including the following questionnaires: P-scale, RSES, SWLS, LOT, PANAS, and CES-D.
ESM 6. Script (exp). The script including the self-esteem IAT (SE-IAT).
ESM 7. Script (exp). The script including the task-switching ability IAT (TSA-IAT).
ESM 8. Dataset (csv). Study 1 dataset.
ESM 9. Dataset (csv). Study 2 dataset.
ESM 10. Script (R). The script for reproducing the analyses presented in the paper.


References


Alessandri, G., Caprara, G. V., & Tisak, J. (2012). The unique contribution of positive orientation to optimal functioning: Further explorations. European Psychologist, 17, 44–54. doi: 10.1027/1016-9040/a000070 Alessandri, G., Vecchione, M., Tisak, J., Deiana, G., Caria, S., & Caprara, G. V. (2012). The utility of positive orientation in predicting job performance and organisational citizenship behaviors. Applied Psychology, 61, 669–698. doi: 10.1111/ j.1464-0597.2012.00511.x


Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J. A., Fiedler, K., . . . Wicherts, J. M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27, 108–119. doi: 10.1002/per.1919 Back, M. D., Schmukle, S. C., & Egloff, B. (2005). Measuring taskswitching ability in the Implicit Association Test. Experimental Psychology, 52, 167–179. doi: 10.1027/1618-3169.52.3.167 Back, M. D., Schmukle, S. C., & Egloff, B. (2009). Predicting actual behavior from the explicit and implicit self-concept of personality. Journal of Personality and Social Psychology, 97, 533–548. doi: 10.1037/a0016229 Bluemke, M., & Friese, M. (2006). Do features of stimuli influence IAT effects? Journal of Experimental Social Psychology, 42, 163–176. doi: 10.1016/j.jesp.2005.03.004 Caprara, G. V., Alessandri, G., Colaiaco, F., & Zuffianò, A. (2013). Dispositional bases of self-serving positive evaluations. Personality and Individual Differences, 55, 864–867. doi: 10.1016/j.paid.2013.07.465 Caprara, G. V., Alessandri, G., Eisenberg, N., Kupfer, A., Steca, P., Caprara, M. G., . . . Abela, J. (2012). The positivity scale. Psychological Assessment, 24, 701–712. doi: 10.1037/ a0026681 Costantini, G., Richetin, J., Borsboom, D., Fried, E. I., Rhemtulla, M., & Perugini, M. (2015). Development of indirect measures of conscientiousness: Combining a facets approach and network analysis. European Journal of Personality, 29, 548–567. doi: 10.1002/per.2014 De Houwer, J., & Moors, A. (2010). Implicit measures: Similarities and differences. In B. Gawronski & B. K. Payne (Eds.), Handbook of implicit social cognition: Measurement, theory, and applications (pp. 176–193). New York, NY: Guilford Press. Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The satisfaction with life scale. Journal of Personality Assessment, 49, 71–75. doi: 10.1207/s15327752jpa4901_13 Fava, G. A. (1983). Assessing depressive symptoms across cultures: Italian validation of the CES-D self-rating scale. Journal of Clinical Psychology, 39, 249–251. doi: 10.1002/1097-4679 (198303)39:2<249::AID-JCLP2270390218>3.0.CO;2-Y Fleeson, W. (2001). Toward a structure- and process-integrated view of personality: Traits as density distributions of states. Journal of Personality and Social Psychology, 80, 1011–1027. doi: 10.1037/0022-3514.80.6.1011 Furnham, A., & Buchanan, T. (2005). Personality, gender and selfperceived intelligence. Personality and Individual Differences, 39, 543–555. doi: 10.1016/j.paid.2005.02.011 Greenwald, A. G., & Farnham, S. D. (2000). Using the implicit association test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79, 1022–1038. doi: 10.1037/0022-3514.79.6.1022 Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74, 1464–1480. doi: 10.1037/0022-3514.74.6.1464 Greenwald, A. G., Nosek, B. A., & Banaji, M. R. (2003). Understanding and using the Implicit Association Test: I. An improved scoring algorithm. Journal of Personality and Social Psychology, 85, 197–216. doi: 10.1037/0022-3514.85.2.197 Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity. Journal of Personality and Social Psychology, 97, 17–41. 
doi: 10.1037/a0015575 Heikamp, T., Alessandri, G., Laguna, M., Petrovic, V., Caprara, M. G., & Trommsdorff, G. (2014). Cross-cultural validation of the positivity-scale in five European countries. Personality and Individual Differences, 71, 140–145. doi: 10.1016/j.paid.2014. 07.012


Lauriola, M., & Iani, L. (2015). Does positivity mediate the relation of extraversion and neuroticism with subjective happiness? PLoS One, 10, e0121991. doi: 10.1371/journal.pone.0121991 Lorenzo-Seva, U., & ten Berge, J. M. F. (2006). Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology, 2, 57–64. doi: 10.1027/1614-2241.2.2.57 Payne, B. K., Burkley, M. A., & Stokes, M. B. (2008). Why do implicit and explicit attitude tests diverge? The role of structural fit. Journal of Personality and Social Psychology, 94, 16–31. doi: 10.1037/0022-3514.94.1.16 Perugini, M., Costantini, G., Richetin, J., & Zogmaister, C. (2015). Implicit association tests, then and now. In F. J. R. val de Vijver & T. M. Ortner (Eds.), Behavior based assessment: Going beyond self report in the personality, affective, motivation, and social domains (pp. 15–28). Göttingen, Germany: Hogrefe Publishing. Perugini, M., Richetin, J., & Zogmaister, C. (2010). Prediction of Behavior. In B. Gawronski & B. K. Payne (Eds.), Handbook of implicit social cognition (pp. 255–278). New York, NY: Guilford Press. Radloff, L. S. (1977). The CES-D Scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. doi: 10.1177/014662167700100306 Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press. Scheier, M. F., Carver, C. S., & Bridges, M. W. (1994). Distinguishing optimism from neuroticism (and trait anxiety, self-mastery, and self-esteem): A reevaluation of the Life Orientation Test. Journal of Personality and Social Psychology, 67, 1063–1078. doi: 10.1037/0022-3514.67.6.1063 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi: 10.1177/ 0956797611417632 van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45, 1–67. doi: 10.18637/jss.v045.i03 Velicer, W. F. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41, 321–327. doi: 10.1007/BF02293557 Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54, 1063–1070. doi: 10.1037/0022-3514.54.6.1063 Zeigler-Hill, V. (2006). Discrepancies between implicit and explicit self-esteem: implications for narcissism and self-esteem instability. Journal of Personality, 74, 119–144. doi: 10.1111/ j.1467-6494.2005.00371.x Ziegler, M. (2014). Stop and state your intentions! European Journal of Psychological Assessment, 30, 239–242. doi: 10.1027/1015-5759/a000228 Received May 15, 2015 Revision received December 4, 2015 Accepted December 27, 2015 Published online November 7, 2016 Giulio Costantini Department of Psychology University of Milan-Bicocca Piazza dell’Ateneo Nuovo 1 (U6) 20126 Milan Italy Tel. +39 02 64483867 E-mail giulio.costantini@unimib.it



Multistudy Report

Assessing Personality With Multi-Descriptor Items
More Harm Than Good?

Johannes Schult, Rebecca Schneider, and Jörn R. Sparfeldt

Department of Educational Sciences, Saarland University, Saarbrücken, Germany

Abstract: The need for efficient personality inventories has led to the wide use of short instruments. The corresponding items often contain multiple, potentially conflicting descriptors within one item. In Study 1 (N = 198 university students), the reliability and validity of the TIPI (Ten-Item Personality Inventory) was compared with the reliability and validity of a modified TIPI based on items that rephrased each two-descriptor item into two single-descriptor items. In Study 2 (N = 268 university students), we administered the BFI-10 (Big Five Inventory short version) and a similarly modified version of the BFI-10 without two-descriptor items. In both studies, reliability and construct validity values occasionally improved for separated multi-descriptor items. The inventories with multi-descriptor items showed shortcomings in some factors of the TIPI and the BFI-10. However, the other scales worked comparably well in the original and modified inventories. The limitations of short personality inventories with multi-descriptor items are discussed.

Keywords: personality assessment, BFI-10, TIPI, multi-descriptor items, convergent and divergent validity

The Big Five model provides a well-established framework for studying interindividual personality differences. The model's five broad trait domains – Neuroticism, Extraversion, Openness to Experience, Agreeableness, and Conscientiousness – have been confirmed repeatedly (Costa & McCrae, 1992; Marsh et al., 2010). Initially, Big Five instruments comprised a large set of items covering not only the five main factors, but also facet factors within each domain. For example, the Revised NEO Personality Inventory (NEO-PI-R; Costa & McCrae, 1992) consists of 240 items addressing six facets of each of the five main factors. Shorter instruments, for example the NEO Five-Factor Inventory (NEO-FFI, 60 items; Costa & McCrae, 1992) and the Big Five Inventory (BFI-44, 44 items; John, Donahue, & Kentle, 1991), showed satisfactory psychometric properties with regard to the assessment of the five main factors. More recently, even shorter personality scales were developed because of limited assessment time, especially in multipurpose large-scale studies (e.g., Hahn, Gottschling, & Spinath, 2012). The resulting inventories attempt to measure each personality factor with just three items (e.g., BFI-S; Hahn et al., 2012), two items (e.g., BFI-10; Rammstedt & John, 2007), or even one item (e.g., FIPI; Gosling, Rentfrow, & Swann, 2003). Conceptually, they have to keep the balance between semantic specificity and breadth. Compared to longer scales, such very short measures are less likely to elicit perceived item redundancy and to artificially inflate reliability due to very similarly worded items (Gogol et al., 2014). Despite their shortness, these personality scales revealed remarkable reliability (e.g., 18-month test-retest reliabilities of the BFI-S: .57 ≤ r ≤ .80) and validity evidence (e.g., convergent validities of the BFI-S and the NEO-PI-R: .50 ≤ r ≤ .70; divergent validities: |r| ≤ .43; Hahn et al., 2012).

A potential problem with many very short personality inventories is the combination of two descriptors (e.g., two adjectives) in one item. The impact of such double-barreled items on scale properties has not received sufficient attention. While answering such items, respondents have to somehow integrate the content of both (potentially conflicting) descriptors. Individuals often wish to answer favorably to one descriptor and unfavorably to the other (Likert, 1974). Resolving this conflict requires additional time, which might be spent on choosing between the two options or on integrating attitudes toward the two descriptors (Bassili & Scott, 1996). Established short personality inventories contain descriptor pairs within one item such as "disorganized, careless" (Conscientiousness item from the Ten-Item Personality Inventory; TIPI; Gosling et al., 2003, p. 525). These descriptor pairs were supposed "to enhance the bandwidth of the items by including in each item several descriptors selected to capture the breadth of the Big-Five dimensions" (Gosling et al., 2003, p. 508). The intended lack of (perfect) semantic overlap might result
in competing response tendencies regarding the first and second descriptor of such an item. For example, there are possibly (i) some disorganized persons who do not consider themselves similarly careless, or (ii) some careless persons who do not consider themselves similarly disorganized, or (iii) test takers for whom the meaning of one descriptor qualifies the meaning of the other descriptor. The BFI-10 is another widely used short personality inventory containing multi-descriptor items; regarding an item such as "relaxed, handles stress well" (Neuroticism; Rammstedt & John, 2007, p. 210), it is conceivable that some people who consider themselves relaxed do not handle stress similarly well. In these cases, the recorded responses do not reveal how the individual descriptors were evaluated. In practice, participants sometimes cross out one of the two descriptors or rate each part separately in the TIPI (Herzberg & Brähler, 2006, p. 143). Differential responses reflect the heterogeneity of the respondents' interpretations of ambiguous multi-descriptor items. When researchers evaluate the answers, they cannot be sure which descriptor the respondents focused on. Double-barreled items tend to show poor psychometric properties, cross-loadings, and multi-modality (Clark & Watson, 1995; Stafford & Canary, 2006; Ziegler, 2014). Therefore, some researchers recommended excluding multi-descriptor items to decrease the respondent's mental effort, increase internal consistency, clarify response options, and increase the understanding of given answers (e.g., Stafford, 2010; McCrae, Kurtz, Yamagata, & Terracciano, 2011). Despite these objections, double-barreled items are often used in established short personality inventories, presumably for economic reasons. It has yet to be determined empirically whether and to what extent multi-descriptor items influence the reliability and the construct validity of short personality scales.

Reliability and Validity

Considerations of reliability and validity are helpful when one has to decide whether or not to use double-barreled items. Reliability refers to the extent to which a measurement procedure yields identical results in consistent assessment conditions. Similar items can be added to obtain a more reliable test (Carmines & Zeller, 1979). Still, more items make a test longer and thus more costly. Short inventories avoid repetitive items while trying to reflect the breadth of a construct with a minimal number of items, often in the form of multi-descriptor items (Rammstedt & Beierlein, 2014). Validity is concerned with the meaningful interpretation and use of a test score with regard to a specific setting (e.g., Messick, 1995). Evidence of converging relationships indicates construct validity for a given set of indicators. Likewise, evidence of divergent relationships suggests that alternative constructs do not dilute the measurement at hand (Messick, 1995). Construct validity can be demonstrated within a multitrait-multimethod (MTMM) matrix that satisfies the following criteria: (a) validity values of the same construct are sufficiently large across different methods (e.g., Neuroticism in different instruments), (b) the correlations between measurements of different constructs are lower than the correlations between measurements of the same construct, and (c) measurements of the same construct need to have comparable relationships with measurements of other constructs (Campbell & Fiske, 1959). The first aspect indicates convergent validity; the second and third indicate divergent validity. With regard to findings concerning multi-descriptor items, reliability and construct validity might be restricted for inventories with double-barreled items due to their ambiguity. The ambiguity of the original multi-descriptor item wording might result in construct underrepresentation and lead to an insufficient assessment of personality traits, for example, when participants have the tendency to focus only on the first descriptor. The validity of multi-descriptor items could also suffer from construct-irrelevant variance (cf. Messick, 1995) associated with the mental integration of multiple descriptors. In both cases, the construct validity of the personality scales should improve when each multi-descriptor item is separated into two (or more) single-descriptor items.

After finding an unsatisfactory factorial structure for the TIPI in empirical data, Herzberg and Brähler (2006; Study 2) modified the items by putting each single adjective into a separate item. This procedure yielded 20 modified TIPI items. The exploratory factor analysis of responses from 349 participants showed a promising Five-Factor structure. Unfortunately, the study's initial item pool contained four additional items that were also included in the analysis. Furthermore, eight of the items (six of them were modified TIPI items) had large secondary and/or low primary loadings and were not included in the subsequent validity study (n = 451). Therefore, it is not clear whether the resulting confirmation of the Five-Factor structure was due to the lack of particularly ambiguous items. It also remains unclear how the factorial structure of Herzberg and Brähler's (2006) 16-adjective measure relates to the original TIPI scales because the different versions were administered to different samples. Therefore, further research investigating the effects of multi-descriptor items systematically is needed.

The Present Studies

In two separate studies (Study 1: TIPI, Study 2: BFI-10), we compared the psychometric properties (reliability, construct
validity) of two established original questionnaires with the properties of modified versions in which each two-descriptor item of the original instruments was separated into two one-descriptor items. Regarding convergent and divergent validity evidence, NEO-FFI scores (Costa & McCrae, 1992) were used as established markers of the Big Five personality traits in the analysis of construct validity (i.e., a MTMM matrix). Test anxiety ratings were included as additional criteria, because test anxiety is closely related to Neuroticism (conceptually and empirically; e.g., Chamorro-Premuzic, Ahmetoglu, & Furnham, 2008). Hence, the present study analyzed the following research hypotheses: (1) We inspected and compared the reliability coefficients of the original inventories (TIPI, BFI-10 – with multi-descriptor items) with the coefficients of the modified instruments with only single-descriptor items. Since increasing the number of items tends to increase reliability, we additionally explored the expected reliability for longer multi-descriptor scales. (2a) Next, we inspected the relationship between the modified questionnaires and the original versions in terms of convergent and divergent validity coefficients. Regarding construct validity with the NEO-FFI as a benchmark for the Big Five personality traits and test anxiety facets as additional criteria for Neuroticism, we expected that: (2b) convergent validity coefficients (monotrait-heteromethod) of the modified instruments without multi-descriptor items are higher than coefficients of the original inventories; and (2c) divergent validity coefficients (heterotrait-heteromethod) are lower for the modified scales (single-descriptor items only) than for the original inventories (including multi-descriptor items).

Study 1: TIPI

Materials and Methods

Sample and Procedure
A sample of N = 198 university students (75% women, age: M = 23.2 years, SD = 4.9 years, range from 18 to 45 years) from a medium-sized German university answered three personality questionnaires and a set of test anxiety items during an obligatory introductory lecture on educational assessment for sophomore teacher students.

Instruments
Personality was assessed with three tests in one session. In order to control for sequence effects, half of the students worked on a booklet that started with the original TIPI, followed by the NEO-FFI and the modified TIPI. The other half of the students worked on a booklet that began with the modified TIPI items, followed by the NEO-FFI and, finally, the original TIPI. Participants were randomly assigned to booklets. All items of the personality questionnaires shared the same response format, a 5-point Likert-type scale ranging from 1 (= disagree strongly) to 5 (= agree strongly). Test anxiety was administered directly after the personality questionnaires at the end of the booklets. There was up to 2% item nonresponse per item in all the questionnaires (median = 0.5%).
(1) NEO-FFI: The well-established German version (Borkenau & Ostendorf, 2008) contained 60 items (e.g., "Sometimes I feel completely worthless"), 12 items per Big Five personality factor.
(2) TIPI: The Ten-Item Personality Inventory (TIPI) consisted of five 2-item scales, one for each Big Five factor (Gosling et al., 2003; German translation by Muck, Hell, & Gosling, 2007). We changed the original 7-point response scale to the 5-point response scale described above in order to keep the response options constant across all questionnaires.
(3) Modified TIPI: In the German version of the TIPI, all ten items contained a combination of two descriptors (e.g., item 9: "I see myself as calm, emotionally stable"; Muck et al., 2007). Therefore, we generated two new items for each of these items, one for each adjective (e.g., "I see myself as calm" and "I see myself as emotionally stable"). Thus, the modified version consisted of 20 single-descriptor items. The first 10 items contained the respective first descriptors from the original questionnaire; items 11–20 contained the respective second descriptors.
(4) Test anxiety: The two test anxiety facets worry (e.g., "I worry about my results," 5 items) and emotionality (e.g., "I am nervous," 5 items) were assessed with the well-established German Test-Anxiety-Inventory (PAF; Hodapp, Rohrmann, & Ringeisen, 2011). Students responded to these 10 items on a 5-point scale ranging from 1 (= never) to 5 (= very often).

Analyses

To answer research question 1, we reported McDonald's ω (along with Cronbach's α as an additional, more conservative reliability estimate; see Rammstedt & Beierlein, 2014) for the subscales of the TIPI and the modified TIPI. McDonald's ω estimates were based on the minimum residual (ordinary least squares) solution of a 1-factor measurement model of the respective scale's items (McDonald's ω total, i.e., the proportion of common variance in the total item variance; Revelle, 2015; Revelle & Zinbarg, 2009). Values below .60 were deemed unacceptable, .60–.70 low but sufficient, .70–.80 acceptable, .80–.90 very good (DeVellis, 2012, p. 109; Murphy & Davidshofer, 2001, p. 142). The Spearman-Brown formula (cf. Carmines & Zeller, 1979, pp. 41–42) was used to estimate the expected reliability (ω) of the original scales if they had the same number of items as the modified scales (i.e., four per factor). For the MTMM analysis, we correlated the sum scores of each subscale of the TIPI, the modified TIPI, the NEO-FFI, and the test anxiety questionnaire. The lowest convergent validity coefficient should be higher in value than the highest divergent validity coefficient (research question 2a; Campbell & Fiske, 1959). The convergent validities of the original and the modified Big Five scales were compared statistically (research question 2b). The divergent validities were compared numerically (research question 2c; see Footnote 1). The analyses were performed using R (R Core Team, 2015) with the packages cocor (Diedenhofen, 2015), Hmisc (Harrell, 2015), and psych (Revelle, 2015). All significance tests used α = .05 (two-tailed).

Footnote 1: Statistical testing seemed inappropriate due to the large number of comparisons, the statistical power based on the present sample size, and the danger of alpha error accumulation.
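To make these analysis steps concrete, the following is a minimal R sketch, not the authors' original code. The data frame dat and all column names (tipi_c1, tipi_c2 for the original two-descriptor Conscientiousness items; mtipi_c1 to mtipi_c4 for the modified single-descriptor items; neo_c1 to neo_c12 for the NEO-FFI Conscientiousness items) are hypothetical, the placeholder reliability value is purely illustrative, and the cocor call assumes that package's interface for dependent, overlapping correlations.

library(psych)   # omega()
library(cocor)   # comparison of dependent correlations

# McDonald's omega (total) from a 1-factor minimum residual solution
om_modified <- omega(dat[, paste0("mtipi_c", 1:4)],
                     nfactors = 1, fm = "minres")$omega.tot

# Spearman-Brown prophecy: expected reliability of a scale with
# reliability `rel` after lengthening it by factor k
spearman_brown <- function(rel, k) k * rel / (1 + (k - 1) * rel)
om_original <- 0.69                 # illustrative omega of the 2-item original scale
spearman_brown(om_original, k = 2)  # expected omega for a 4-item version

# MTMM building block: correlations among manifest scale scores
scores <- data.frame(
  tipi_c  = rowMeans(dat[, paste0("tipi_c", 1:2)],  na.rm = TRUE),
  mtipi_c = rowMeans(dat[, paste0("mtipi_c", 1:4)], na.rm = TRUE),
  neo_c   = rowMeans(dat[, paste0("neo_c", 1:12)],  na.rm = TRUE)
)
r <- cor(scores, use = "pairwise.complete.obs")

# Comparing the convergent validities of the original and the modified
# scale: two dependent correlations that share the NEO-FFI criterion
cocor.dep.groups.overlap(r.jk = r["neo_c", "tipi_c"],
                         r.jh = r["neo_c", "mtipi_c"],
                         r.kh = r["tipi_c", "mtipi_c"],
                         n = nrow(scores))

The same correlation matrix, extended to all traits and methods, also permits the Campbell and Fiske check described above, namely that the smallest convergent coefficient should exceed the largest divergent coefficient.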

Results

Reliability coefficients of the TIPI were mostly low but sufficient (see Table 1); McDonald's ω values for the subscales Agreeableness (ω = .54) and Openness to Experience (ω = .36) were unacceptably low. Compared with the TIPI, the modified TIPI's omegas were numerically higher (.06 ≤ Δω ≤ .20), except for Conscientiousness (Δω = –.02). Openness to Experience still yielded an unacceptably low reliability. Four out of five times, the consistencies of the modified scales were below the Spearman-Brown-based reliability estimates for (hypothetical) extensions of the original scales: Neuroticism: ω = .76/.81 (modified TIPI/Spearman-Brown estimates), Extraversion: ω = .73/.78, Agreeableness: ω = .60/.70, Conscientiousness: ω = .68/.82. Only the modified Openness to Experience scale showed a numerically larger reliability than the Spearman-Brown estimate based on the original scale (ω = .56/.53). Convergent correlations of the TIPI and the modified TIPI were large but not perfect (.66 ≤ r ≤ .84). All divergent correlations were smaller (|r| ≤ .36; see Table 1). With regard to the NEO-FFI benchmark, the convergent validity coefficients of the TIPI and the modified TIPI scales were mostly of similar magnitude. There were no significant differences for four personality factors (Neuroticism: Δr < .01, p = .96, q < .01; Extraversion: Δr = .01, p = .76, q = .01; Openness to Experience: Δr = –.06, p = .21, q = .08; Agreeableness: Δr = .03, p = .49, q = .05). Only the TIPI's Conscientiousness scale (two items) showed a significantly lower validity coefficient than the modified scale (four items; Δr = .11, p = .002, q = .24). With regard to test anxiety and Neuroticism, there were no statistically significant differential validities (worry: Δr = .08, p = .09, q = .08; emotionality: Δr < .01, p = .93, q = .01). Regarding the NEO-FFI benchmark, there were 20 divergent validity coefficients for each TIPI version. Just 6 of those 20 divergent correlations were numerically smaller for the modified TIPI than for the original TIPI (–.05 ≤ Δr ≤ .09). The substantial correlation between Openness to Experience in the TIPI and Extraversion in the NEO-FFI (TIPI: r = .40; modified TIPI: r = .42) is especially noteworthy because it interferes with the interpretability of the TIPI's Openness to Experience scale.

Table 1. Study 1 (TIPI; n = 198): Correlations of manifest scores (below the diagonal) and scale reliability estimates (McDonald's ω in parentheses)
Notes. N = Neuroticism, E = Extraversion, O = Openness to Experience, A = Agreeableness, C = Conscientiousness, k = number of items per scale, a = scales of modified TIPI items; *p < .05. Convergent correlations are printed in bold.

Study 2: BFI-10

The findings from Study 1 suggest that scales combining multiple adjectives in a single item are not necessarily associated with reduced reliability and validity. Perhaps the issues with two-descriptor items are confined to a particular personality inventory. We therefore investigated another short personality questionnaire in a comparable sample. Specifically, the original BFI-10 (Rammstedt & John, 2007), which contained several two-descriptor items, was compared to a modified version with corresponding single-descriptor items.

Materials and Methods

Sample and Procedure
The setting of Study 2 was similar to that of Study 1. However, Study 2 was conducted in a different academic year with a different sample. The participants (N = 268 university students from the same university, 68% women, age: M = 23.7 years, SD = 5.5 years, range from 18 to 49 years) answered personality questionnaires during the first session of an obligatory introductory lecture on educational assessment for sophomore teacher students. The test anxiety items were administered in the same lecture setting six weeks later (n = 205).

Instruments
Due to the parallel design with Study 1, the NEO-FFI and the PAF subscales were administered again (see Study 1). The TIPI and modified TIPI of Study 1 were substituted by the BFI-10 and a correspondingly modified BFI-10 version with single-descriptor items. Again, half of the students worked on a booklet that started with the BFI-10, followed by the NEO-FFI and the modified BFI-10. The other half of


the students began with the modified items, followed by the NEO-FFI and, finally, the original BFI-10. Participants were randomly assigned to booklets. All personality items had to be answered on a 5-point scale ranging from disagree strongly (1) to agree strongly (5). There was less than 2% item nonresponse per item in all the questionnaires (median = 0.7%).
(1) BFI-10: The German questionnaire consisted of five 2-item scales, one for each Big Five factor (Rammstedt & John, 2007). The authors suggested using a third Agreeableness item to improve the reliability and validity. This optional additional Agreeableness item (". . . is considerate and kind to almost everyone" / ". . . bin rücksichtsvoll zu anderen, einfühlsam"; Rammstedt & John, 2007, p. 211) was also included in the present study because it could potentially increase differences between scales with single-descriptor and two-descriptor items, respectively.
(2) Modified BFI-10: In the German version of the BFI-10, there were eight items that included a combination of two descriptors (e.g., item 4: "I see myself as someone who is relaxed, handles stress well"; Rammstedt & John, 2007, pp. 210–211). Two new items were generated for each of these original items, one for each descriptor (e.g., "I see myself as someone who is relaxed" and "I see myself as someone who handles stress well"). Thus, the modified BFI-10 consisted of 16 modified items plus the three remaining single-descriptor BFI-10 items.

Analyses

The analyses followed the strategy outlined in Study 1. The TIPI and the modified TIPI were replaced by the BFI-10 and the modified BFI-10, respectively.

Results

Reliability coefficients of the BFI-10 ranged from unacceptably low (Agreeableness: ω = .47) to acceptable (Neuroticism: ω = .71; see Table 2). In contrast, the omegas of the modified BFI-10 were numerically higher (.08 ≤ Δω ≤ .19) and at least sufficient. The omegas of the modified scales were numerically larger than the Spearman-Brown-based reliability estimates for the hypothetically extended original scales for Openness to Experience (ω = .71/.61 [modified BFI-10/Spearman-Brown estimates]), Agreeableness (ω = .63/.59), and Conscientiousness (ω = .78/.69), but not for Neuroticism (ω = .79/.83) and Extraversion (ω = .75/.78). Convergent correlations of the same constructs of the BFI-10 and the modified BFI-10 were sufficiently large (.78 ≤ r ≤ .92). At the same time, divergent correlations were small (|r| ≤ .24). One should keep in mind that the scales for Openness to Experience, Agreeableness, and Conscientiousness each contained one identical item in both test versions (i.e., the three original single-descriptor BFI-10 items). The convergent construct validity coefficients of the BFI-10 and the modified BFI-10 scales with the NEO-FFI were mostly of similar magnitude. There was no significant differential validity evidence for four personality traits (Neuroticism: Δr = –.03, p = .25, q = –.06; Openness to Experience: Δr = .01, p = .56, q = .02; Agreeableness: Δr = –.03, p = .35, q = –.04; Conscientiousness: Δr = .02, p = .38, q = .04). The original Extraversion scale (two items) had a significantly smaller validity coefficient than the modified scale (four items; Δr = –.08, p = .01, q = –.14). Concerning the relationships with test anxiety, Neuroticism was more closely related to worry for the original BFI-10 than for the modified BFI-10 scale (Δr = .11, p = .01, q = .12; emotionality: Δr = .07, p = .09, q = .09). Regarding the NEO-FFI benchmark, there were 9 out of 20 divergent validity coefficients that were numerically smaller for the modified BFI-10 than for the original BFI-10 (–.09 ≤ Δr ≤ .14).

Table 2. Study 2 (BFI-10): Correlations of manifest scores (below the diagonal) and scale reliability estimates (McDonald's ω in parentheses)
Notes. N = Neuroticism, E = Extraversion, O = Openness to Experience, A = Agreeableness, C = Conscientiousness, Emo. = Emotionality, k = number of items per scale, a = scales of modified BFI-10 items (16 modified and 3 original); *p < .05. Convergent correlations are printed in bold.

Discussion

The present study was based on the question of whether separating widely used multi-descriptor items into pairs of one-descriptor items improves the internal consistency and the construct validity of very short personality inventories. Ideally, the response to a two-descriptor item should reflect the sum or some other integration of its two components. Our results, however, suggest that combining two descriptors in a single item can diminish reliability and validity. Reliability values tended to increase when multi-descriptor items were separated (Δωmax = .20; median [Δω] = .10). However, this increase in reliability rarely exceeded the reliability gain expected from a comparable test extension. So, for the most part, measurement precision was not impaired dramatically by double-barreled items in either study. An inspection of the modified scales that did exceed the Spearman-Brown-based reliability estimates (TIPI: Openness to Experience; BFI-10: Openness to Experience, Agreeableness, and Conscientiousness) suggests that the descriptors of these scales were semantically more homogeneous and therefore more consistently interrelated than the descriptor pairs of the remaining scales.

Comparing the validity coefficients of the corresponding subscales of the original instruments and the modified versions, large monotrait-heteromethod relations could be found. Still, the lack of a perfect alignment suggests that the double-barreled items do not just indicate the sum of the single descriptors. The modified scales assess slightly different, possibly broader personality traits. Interestingly, the scale modifications did not lead to a general decrease of divergent (heterotrait-heteromethod) correlations. Thus, the unique variance of the modified instruments that is not shared with the original versions could also have stemmed from trait-unspecific aspects. Regarding the relationship with the NEO-FFI scales, the Conscientiousness scale of the TIPI and the Extraversion scale of the BFI-10 showed lower convergent validity coefficients than the respective modified scales. These differences could be due to conflicting content in the original items. The remaining factors, on the other hand, did not show substantial validity losses for the scales with two-descriptor items. This pattern is in line with the mixed results concerning the relationship between Neuroticism and test anxiety. The findings from both studies do not raise significant general objections against multi-descriptor items in personality inventories. Most scales of the TIPI and the BFI-10 worked comparably well in the original and modified inventories. With this being said, there remain issues with specific subscales (Conscientiousness in the TIPI; Extraversion in the BFI-10) that are arguably related to the combination of ambiguous descriptors in the original inventories. The cognitive challenge of deciding on an integrative response might detract from the actual item content. Another source of bias lies in the way participants weight the descriptors within one particular item (Herzberg & Brähler, 2006; Ziegler, 2014). This weighting process might even be influenced by the trait one wants to measure. For example, a very conscientious person is likely to pay more attention to the whole item, whereas a less conscientious person might skip to the response after reading just the first descriptor. Further research could clarify this issue.

An issue beyond the relative merits of single-descriptor items is, of course, the overall reliability and validity of the short personality inventories at hand. The reliability estimates of specific scales (i.e., Agreeableness and Openness to Experience) were unacceptably low regardless of the number of descriptors per item. Internal consistencies of short scales that assess broad personality traits are often low and should not be used as the sole reliability criterion (cf. Rammstedt & Beierlein, 2014). In our study, the correlations between the original and the modified scales can be regarded as indicators of parallel test reliability. These values (TIPI: .66 ≤ r ≤ .84; BFI-10: .78 ≤ r ≤ .92) suggest that reliabilities range from sufficient to very good.


In terms of construct validity, the convergent correlations with the NEO-FFI were mostly acceptable. The lowest validity is associated with Openness to Experience in the TIPI (r = .42), suggesting a weaker assessment of this trait. Further support for this notion comes from the inspection of divergent validities: The substantial correlation between Openness to Experience in the TIPI and Extraversion in the NEO-FFI has been found previously (Herzberg & Brähler, 2006). Consequently, the scores of this particular scale should be used with caution. The remaining scales tended to show more favorable properties. Nevertheless, the Extraversion scales were the only ones with consistently at least satisfactory reliability and validity coefficients. On the other hand, economic considerations have to be taken into account as well. Especially large-scale studies require very short inventories. Here, the aim is to collect sufficient psychological data without overburdening participants with too many and/or redundant items. Furthermore, questionnaire length is negatively associated with response rates (Edwards, Roberts, Sandercock, & Frost, 2004). It remains to be seen whether the time spent on a two-descriptor item is on average shorter than the time spent on two (presumably less ambiguous) single-descriptor items (cf. Herzberg & Brähler, 2006). Panel studies are often based on large and representative samples. The survey data do not enable researchers to interpret the data of an individual. The aim is rather to measure differences between and within (sub-)groups; large sample sizes might compensate for lower reliability and validity values (Rammstedt & Beierlein, 2014). As our results show, pairs of descriptors should be chosen carefully when it comes to condensing the content of personality items, because descriptor combinations can result in confused participants and a suboptimal factor structure (Herzberg & Brähler, 2006; Ziegler, 2014). Whether multi-descriptor items increase item nonresponse and panel attrition in large-scale assessments is not yet clear. Regarding the samples, our study included solely university students, who tend to have higher-than-average intelligence, socio-economic status, and educational achievements. This may reduce the generalizability of our findings; still, the sample homogeneity minimizes the educational bias in the BFI-10 responses (cf. Rammstedt, Goldberg, & Borg, 2010). Future studies could investigate the impact of multi-descriptor items on the responses of educationally more heterogeneous participants. Summarizing the reliability and validity results of both studies, the short personality scales tend to work comparably well in the original and modified versions. Very short personality inventories that consist of items containing two descriptors are suitable when research economics necessitate personality assessments with as few items as possible. Although we did not find striking evidence against

the use of double-barreled items, we suggest using single-descriptor alternatives instead (e.g., the BFI-S; cf. Hahn et al., 2012), at least in general population studies where multi-descriptor items can be a cognitive challenge for some participants.


Acknowledgments This research was conducted with the support of the German funds “Bund-Länder-Programm für bessere Studienbedingungen und mehr Qualität in der Lehre (‘Qualitätspakt Lehre’)” [the joint program of the States and Federal Government for better study conditions and quality of teaching in higher education (“the Teaching Quality Pact”)] at Saarland University (funding code: 01PL11012). The authors developed the topic and content of this manuscript independent of this funding.

References Bassili, J. N., & Scott, B. S. (1996). Response latency as a signal to question problems in survey research. Public Opinion Quarterly, 60, 390–399. doi: 10.1086/297760 Borkenau, P., & Ostendorf, F. (2008). NEO-Fünf-Faktoren-Inventar [NEO-Five-Factor-Inventory] (2nd ed.). Göttingen, Germany: Hogrefe. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod-matrix. Psychological Bulletin, 56, 81–105. doi: 10.1037/h0046016 Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Thousand Oaks, CA: Sage. doi: 10.4135/ 9781412985642 Chamorro-Premuzic, T., Ahmetoglu, G., & Furnham, A. (2008). Little more than personality: Dispositional determinants of test anxiety. Learning and Individual Differences, 18, 258–263. doi: 10.1016/j.lindif.2007.09.002 Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309–319. doi: 10.1037/1040-3590.7.3.309 Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources. DeVellis, R. F. (2012). Scale development: Theory and applications (3rd ed.). Thousand Oaks, CA: Sage. Diedenhofen B. (2015). cocor: Comparing correlations. R package (Version 1.1–2) [Software]. Duesseldorf, Germany: University of Duesseldorf. http://CRAN.R-project.org/package=cocor Edwards, P., Roberts, I., Sandercock, P., & Frost, C. (2004). Follow-up by mail in clinical trials. Does questionnaire length matter? Controlled Clinical Trials, 25, 31–52. doi: 10.1016/ j.cct.2003.08.013 Gogol, K., Brunner, M., Goetz, T., Martin, R., Ugen, S., Keller, U., . . . Preckel, F. (2014). “My questionnaire is too long!” The assessments of motivational-affective constructs with three-item and single-item measures. Contemporary Educational Psychology, 39, 188–205. doi: 10.1016/j.cedpsych.2014.04.002 Gosling, S. D., Rentfrow, P. J., & Swann, W. B. (2003). A very brief measure of the Big Five personality domains. Journal of Research in Personality, 37, 504–528. doi: 10.1016/S00926566(03)00046-1



Hahn, E., Gottschling, J., & Spinath, F. M. (2012). Short measurements of personality: Validity and reliability of the GSOEP Big Five Inventory (BFI-S). Journal of Research in Personality, 46, 355–359. doi: 10.1016/j.jrp.2012.03.008 Harrell, F. E. Jr. (2015). Hmisc: Harrell miscellaneous. R package (Version 3.17–0) [Software]. Nashville, TN: Vanderbilt University. http://CRAN.R-project.org/package=Hmisc Herzberg, P. Y., & Brähler, E. (2006). Assessing the Big-Five personality domains via short forms: A cautionary note and a proposal. European Journal of Psychological Assessment, 22, 139–148. doi: 10.1027/1015-5759.22.3.139 Hodapp, V., Rohrmann, S., & Ringeisen, T. (2011). Prüfungsangstfragebogen (PAF) [Test-Anxiety Questionnaire]. Göttingen, Germany: Hogrefe. John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory – Versions 4a and 54. Berkeley, CA: University of California. Likert, R. (1974). A method of constructing an attitude scale. In G. M. Maranell (Ed.), Scaling: A sourcebook for behavioral scientists (pp. 233–243). Chicago, IL: Aldine. Marsh, H. W., Lüdtke, O., Muthén, B., Asparouhov, T., Morin, A. J., Trautwein, U., & Nagengast, B. (2010). A new look at the Big Five factor structure through exploratory structural equation modeling. Psychological Assessment, 22, 471–491. doi: 10.1037/a0019227 McCrae, R. R., Kurtz, J. E., Yamagata, S., & Terracciano, A. (2011). Internal consistency, retest reliability, and their implications for personality scale validity. Personality and Social Psychology Review, 15, 28–50. doi: 10.1177/1088868310366253 Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749. doi: 10.1037/0003-066X.50.9.741 Muck, P. M., Hell, B., & Gosling, S. D. (2007). Construct validation of a short five-factor model instrument. European Journal of Psychological Assessment, 23, 166–175. doi: 10.1027/10155759.23.3.166 Murphy, K. R., & Davidshofer, C. O. (2001). Psychological testing: Principles and applications (5th ed.). Upper Saddle River, NJ: Prentice Hall. R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org Rammstedt, B., & Beierlein, C. (2014). Can’t we make it any shorter? The limits of personality assessment and ways to

overcome them. Journal of Individual Differences, 35, 212–220. doi: 10.1027/1614-0001/a000141 Rammstedt, B., Goldberg, L. R., & Borg, I. (2010). The measurement equivalence of Big-Five factor markers for persons with different levels of education. Journal of Research in Personality, 44, 53–61. doi: 10.1016/j.jrp.2009.10.005 Rammstedt, B., & John, O. P. (2007). Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality, 41, 203–212. doi: 10.1016/j.jrp.2006.02.001 Revelle, W. (2015). psych: Procedures for personality and psychological research. R package (Version 1.5.8) [Software]. Evanston, IL: Northwestern University. http://CRAN.R-project. org/package=psych Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, 145–154. doi: 10.1007/s11336-008-9102-z Stafford, L. (2010). Measuring relationship maintenance behaviors: Critique and development of the revised Relationship Maintenance Behavior scale. Journal of Social and Personal Relationships, 28, 278–303. doi: 10.1177/0265407510378125 Stafford, L., & Canary, D. J. (2006). Equity and interdependence as predictors of relational maintenance strategies. Journal of Family Communication, 6, 227–254. doi: 10.1207/ s15327698jfc0604_1 Ziegler, M. (2014). Editorial: Comments on item selection procedures. European Journal of Psychological Assessment, 30, 1–2. doi: 10.1027/1015-5759/a000196


Received August 13, 2015
Revision received December 21, 2015
Accepted February 1, 2016
Published online November 7, 2016

Johannes Schult
Department of Educational Sciences
Campus A5 4
Saarland University
66123 Saarbrücken
Germany
Tel. +49 681 302 57482
E-mail jutze@jutze.com


Multistudy Report

Detecting Random Responses in a Personality Scale Using IRT-Based Person-Fit Indices

Tour Liu, Tian Lan, and Tao Xin

School of Psychology, Beijing Normal University, P.R. China

European Journal of Psychological Assessment (2019), 35(1), 126–136. DOI: 10.1027/1015-5759/a000369

Abstract: Random response is a very common aberrant response behavior in personality tests and may negatively affect the reliability, validity, or other analytical aspects of psychological assessment. Typically, researchers use a single person-fit index to identify random responses. This study recommends a three-step person-fit analysis procedure. Unlike the typical single person-fit methods, the three-step procedure identifies both global misfit and local misfit individuals using different person-fit indices. This procedure was able to identify more local misfit individuals than the single-index method, and a graphical method was used to visualize the particular items on which random response behaviors appeared. This method may be useful to researchers in that it will provide them with more information about response behaviors, allowing better evaluation of scale administration and development of more plausible explanations. Real data were used in this study instead of simulation data. In order to create real random responses, an experimental test administration was designed. Four different random response samples were produced using this experimental system.

Keywords: random response, person-fit analysis procedure, CUSUM-based indices, graphical method

In any study involving psychological tests, aberrant responses are a concern because they can lead to lower reliability and inaccuracy in the assessment of individuals (Clark, Gironda, & Young, 2003; Meijer & Nering, 1997). There are many types of aberrant response behaviors in educational tests. Examples include guessing, sleepiness, cheating, plodding, and alignment errors (Meijer, 1996). Some aberrant response behaviors that are common on educational tests, such as guessing, cheating, and copying, generally do not occur in personality tests. But other aberrant response behaviors, such as malingering, social desirability, sabotage, and random response, often do come up (Reise & Flannery, 1996). Many researchers have argued for investigating various types of aberrant response behaviors in personality tests. Ferrando and Lorenzo-Seva (2010) discussed acquiescent responding. Ferrando and Chico (2001) detected social desirability. Faking has also been measured in many ways (LaHuis & Copeland, 2009; Scherbaum, Sabet, Kern, & Agnello, 2013). Unlike the types of aberrant response behaviors mentioned above, random response is an irregular response behavior, so it is easy to simulate. Simulated random response data were usually generated in methodology studies that explored new methods for detecting aberrant responses or involved method comparisons (Conijn, Emons, & Sijtsma, 2014; Emons, 2008;

Reise, 1995). Random response was just an example of aberrant responses in these methodology studies. Some researchers have discussed random response in real personality tests (Berry et al., 1992; Clark, Gironda, & Young, 2003). Those empirical studies used particular information, for example, the information from subscales of the Minnesota Multiphasic Personality Inventory (MMPI-2), to identify random responses. Because random response is one of the most common aberrant behaviors in real personality tests, and because random response reduces the reliability and validity of the results of such tests, it is important to be able to identify random responding on an individual level. The present study discusses random response detection using a novel procedure based on item response theory (IRT) methods.

Person-Fit Analysis

Person-Fit Indices

Generally, researchers have detected aberrant responses for two reasons. First, excluding misfit individuals (i.e., those who give aberrant responses) is an effective way to enhance the quality of the research data. Second, analyzing
misfit respondents may help researchers learn more about test administration and respondents' psychological states. For these two purposes, many methods have been proposed to detect aberrant test-takers, including establishing bogus items, analyzing outliers, and using consistency indices (Meade & Craig, 2012). IRT-based person-fit analysis is another practical way to detect aberrant response individuals and to diagnose causes of aberrant responses (Emons, Sijtsma, & Meijer, 2005). In IRT-based person-fit analysis, an IRT model is treated as individuals' expected response model, because it fits most individuals well and conforms to scale construction theory. If an individual's response is inadequate as a measure of the person's trait level (person misfit), IRT-based person-fit indices will indicate that the individual is misfit. A wide variety of person-fit indices have been proposed for different IRT models (Karabatsos, 2003; Meijer & Sijtsma, 2001). One of the most widely used indices in psychological and educational tests is lz, which has been shown to be an effective index (Drasgow, Levine, & McLaughlin, 1987; Reise, 1990). A standardized version of the l0 index (the log-likelihood of a response pattern), lz provides the opportunity of evaluating a response pattern's fit to an IRT model that represents an expected pattern. The log-likelihood is written mathematically as follows:

l_0(\hat{\theta}_j) = \sum_{i=1}^{n} \left[ u_{ij} \ln P_i(\hat{\theta}_j) + (1 - u_{ij}) \ln Q_i(\hat{\theta}_j) \right], \quad (1)

where Q_i(\hat{\theta}_j) = 1 - P_i(\hat{\theta}_j), and the standardized index is

l_z = \frac{l_0 - E(l_0)}{[\mathrm{Var}(l_0)]^{0.5}}. \quad (2)

\hat{\theta}_j is person j's latent trait value estimated from an IRT model. The distribution of lz is purported to be asymptotically standard normal (Drasgow, Levine, & Williams, 1985), and a large negative value of lz indicates aberrant responses. Emons (2009) argued that lz is a sensitive index for global misfit individuals who respond aberrantly to most or all of the items on a test. lz has been used in some personality studies. Dodeen and Darabi (2009) found that lz values were related to motivation for mathematical learning. Schmitt, Chan, Sacco, McFarland, and Jennings (1999) also found that low lz indicated low motivation and conscientiousness. Another kind of person-fit index, the CUSUM-based (cumulative sum based) indices, can provide more information about local misfit (aberrantly responding to a subset of test items) than the lz index (Meijer, 2002). Originally, van Krimpen-Stoop and Meijer (2000, 2001) proposed the CUSUM-based index CUSUMvm for adaptive testing. This index is a statistical process control technique, a method to view changes in the performance of a statistic over time.
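As an illustration of how lz can be obtained in practice, here is a minimal R sketch (not the authors' code). It assumes dichotomous responses and that the model-implied probabilities P_i(theta_hat) are already available from a fitted IRT model; the expressions for E(l0) and Var(l0) are the standard ones for dichotomous items (Drasgow, Levine, & Williams, 1985). The function name and example values are made up.

lz_statistic <- function(u, p) {
  # u: observed 0/1 responses; p: model-implied probabilities P_i(theta_hat)
  q  <- 1 - p
  l0 <- sum(u * log(p) + (1 - u) * log(q))    # Equation (1)
  e0 <- sum(p * log(p) + q * log(q))          # E(l0)
  v0 <- sum(p * q * (log(p / q))^2)           # Var(l0)
  (l0 - e0) / sqrt(v0)                        # Equation (2)
}

# Made-up example: answers that look random relative to the model-implied
# probabilities produce a clearly negative lz
p <- c(.90, .85, .75, .60, .40, .30, .20, .15)
u <- c(0, 1, 0, 1, 1, 0, 1, 0)
lz_statistic(u, p)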


CUSUMvm contains an upper statistic, C^+, and a lower statistic, C^-. At the start of a test involving k items, these statistics are initialized as C_0^+ = C_0^- = 0 and accumulated as follows:

C_i^+ = \max\left(0,\, C_{i-1}^+ + T_i\right) \quad \text{and} \quad C_i^- = \min\left(0,\, C_{i-1}^- + T_i\right), \quad (3)

where i = 1, 2, 3, . . ., k and T_i = X_i - P_i(\hat{\theta}_j). Respectively, C^+ and C^- describe positive aberrance and negative aberrance of the response patterns. Tendeiro and Meijer (2012) proposed a variant of CUSUMvm. They used two logarithms of likelihood ratios to illustrate the overperformance and the underperformance of individuals. The logarithms of likelihood ratios can be defined as follows:

\gamma_i^U = \log \frac{(P_i^U)^{x_i} (1 - P_i^U)^{1 - x_i}}{P_i^{x_i} (1 - P_i)^{1 - x_i}}, \qquad \gamma_i^L = \log \frac{P_i^{x_i} (1 - P_i)^{1 - x_i}}{(P_i^L)^{x_i} (1 - P_i^L)^{1 - x_i}}, \quad (4)

where P_i^U and P_i^L describe, respectively, the response probabilities of overperformance and underperformance of individuals. Hence, \gamma_i^U is more likely to be positive in case of a random response, while \gamma_i^L is more likely to be negative. P_i^U and P_i^L can be defined by quadratic functions or by reasonable values (Armstrong & Shi, 2009; Tendeiro & Meijer, 2012). For example, in Tendeiro and Meijer's (2012) study, when an individual responds randomly to an m-choice item, both P_i^U and P_i^L are equal to 1/m. They substituted \gamma_i^U and \gamma_i^L for T_i to produce two new versions of C^+ and C^-, namely C^U and C^L. They labeled this new variant CUSUMrr, and they found that CUSUMrr performed better than some of the more commonly used indices for detecting random response patterns. A larger absolute value of C^U (C^L) indicates more aberrant response behavior, and the same holds for C^+ (C^-). For random response behaviors, overperformance and underperformance may occur simultaneously. Armstrong and Shi (2009) proposed a two-side control statistic to capture these two-side aberrant responses. The two-side control statistics of CUSUMvm and CUSUMrr can be described as follows:

\mathrm{CUSUM}_{vm} = C^+_{\max} - C^-_{\min}, \qquad \mathrm{CUSUM}_{rr} = C^U_{\max} - C^L_{\min}. \quad (5)

Obviously, these two-side statistics combine the positive effect and the negative effect of aberrant responses, and a high value indicates that aberrant responses exist. Unlike the lz index, which identifies aberrant responses by
focusing solely on the final value of the statistic, the CUSUM-based technique provides information about what occurred during the test item by item. Consequently, CUSUM-based indices can be used to detect a local misfit.
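The CUSUM logic translates directly into code. Below is a minimal R sketch (not the authors' code) of the CUSUMvm statistic from Equations (3) and (5); CUSUMrr follows the same scheme with \gamma_i^U and \gamma_i^L in place of T_i. The function name and the illustrative data are made up, and the probabilities are assumed to come from a previously fitted IRT model.

cusum_vm <- function(x, p) {
  # x: observed 0/1 responses; p: model-implied probabilities P_i(theta_hat)
  t_i <- x - p                            # residuals T_i = X_i - P_i(theta_hat)
  c_plus <- c_minus <- numeric(length(x))
  up <- lo <- 0                           # C_0^+ = C_0^- = 0
  for (i in seq_along(x)) {
    up <- max(0, up + t_i[i])             # Equation (3), upper statistic
    lo <- min(0, lo + t_i[i])             # Equation (3), lower statistic
    c_plus[i]  <- up
    c_minus[i] <- lo
  }
  list(C_plus = c_plus, C_minus = c_minus,
       CUSUM_vm = max(c_plus) - min(c_minus))   # Equation (5)
}

# Illustration: normal responding on items 1-10, random responding afterwards;
# plotting C+ and C- item by item shows where the local misfit starts
set.seed(1)
p <- rep(0.7, 20)
x <- c(rbinom(10, 1, 0.7), rbinom(10, 1, 0.5))
res <- cusum_vm(x, p)
plot(res$C_plus, type = "b", xlab = "Item", ylab = "CUSUM",
     ylim = range(c(res$C_plus, res$C_minus)))
lines(res$C_minus, type = "b")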

In recent decades, IRT-based person-fit indices have been increasingly used in psychological and educational tests (Lee, Stark, & Chernyshenko, 2014; St-Onge, Valois, Abdous, & Germain, 2011). Typically, those studies used a single person-fit index or method to detect random responses. However, different person-fit methods have

special utilities. If multiple methods can be used together, separate sets of advantages may combine. Ferrando and Anguiano-Carrasco (2013) used the parameters estimated from a structure equation model to do IRT-based personfit analysis, and found that this two-stage method could improve the effectiveness of detection. lz is a good global index, while CUSUM-based indices perform better than lz in local misfit detection and can be graphed. Therefore, on the basis of the analyses shown below, we recommend a new three-step person-fit analysis procedure which uses lz and CUSUM-based indices and includes graphical methods. This procedure can detect both global and local misfits and describe the random response processes. For the purpose of evaluating the effectiveness of a person-fit method, majority of many such studies used simulated data to explore the performance of person-fit methods under various conditions. It is therefore unclear how the person-fit indices might perform with real (i.e., non-simulated) data. Some empirical studies proposed various methods to identify misfit individuals and their types of misfit (Conijn et al., 2014; Ferrando, 2012; Reise & Waller, 1993). However, those studies were post hoc analyses, because the true person-fit statuses were unknown in the real data. Researchers had to use additional evidence or logical reasoning to confirm the detection results. An alternative approach is to collect responses from individuals under conditions that would tend to insure the presence of random responses. Therefore, the present study used real personality data produced from an experimental test administration. By using experimental data the true states of person-fit were known as simulation data, and the response behaviors were similar to reality. Only random responses were taken into account. Because random response is one of the most common aberrant response behaviors in personality tests, especially group tests, it is easier to promote a real random response rather than a genuine case of faking, acquiescence, extreme response, or others in a personality test. Although only random response behaviors have been discussed, the person-fit analysis procedure recommended in this study can be generalized into the detection of other aberrant response behaviors as well. In sum, an experimental test administration was designed to collect random response samples: one group of global misfit individuals and three groups of local misfit individuals. In addition, a standard test administration was conducted to collect a normal response sample. The threestep procedure was proposed to detect random response individuals. In this way, the present study intended to evaluate how well the person-fit indices performed on detecting such random response instances; how the graphical method could be used to describe the local misfit individuals; and how to identify more local misfit individuals.

European Journal of Psychological Assessment (2019), 35(1), 126–136

Ó 2016 Hogrefe Publishing

Analysis of Misfit Person Not only can researchers exclude individuals who give aberrant responses, but they can also derive further information about test administration and respondents’ psychological states through analysis of misfits. Analyzing misfits may allow researchers to figure out what type of misfit each individual is. Different types of misfits may come from many sources, including the test administration procedures, the individuals’ states of mind during tests, the purpose of tests, the format of tests, and so on. Many researchers have proposed various procedures to investigate the types and the sources of person misfit. Ferrando and Lorenzo-Seva (2016) used regression to identify sources of person misfit. Belov (2011) proposed a variable match index to detect copying behaviors. Conrad et al. (2010) found that misfit response pattern could screen for atypical suicide risk. Graphical methods are good ways to assess individual’s misfits (Ferrando, 2012; Nering & Meijer, 1998). Ferrando (2015) introduced a graphical procedure that used the expected person response curve (EPRC), the observed responses and the observed person response curve (OPRC), to evaluate both global and local misfits. However, the aberrant response patterns can sometimes offer valuable information. For example, if a person who had been identified as a misfit endorsed most of the items from a conscientiousness scale, he or she could be a high social desirability person. For another example, if a misfit person chose the midpoint on 90% of items from a multidimensional scale, it might be indicative of a particular response style. Tracing the individuals’ response patterns item by item may offer some explanations of why a person is a misfit. Tendeiro and Meijer (2012) demonstrated that CUSUM-based indices were suitable for graphing the whole response process for each individual in a given test. Because of the documented applicability of CUSUM-based indices, we used them in the present study.

Present Study



Materials and Methods

Participants
Five groups of individuals participated in this study. There was one normal response group (1,770 participants) and four random response groups. The four random response groups included one global misfit group and three local misfit groups. The global misfit group had 188 participants. The local misfit groups represented three types of local misfit: misfit at the beginning of a test (193 participants), misfit in the middle of a test (201 participants), and misfit at the end of a test (189 participants). All participants were high school students from two provinces in P.R. China. Their mean age was about 17.

Materials
Reassembled personality questionnaires were used in this study. These personality questionnaires contained two parts. Part 1 was the Extraversion (E) subscale (21 items) of the Eysenck Personality Questionnaire for Chinese Adults (EPQ-CA), which was the target scale (TS). Part 2 comprised another 80 extraversion items from five other personality scales; those items made up the nontarget scale (NTS). The TS and NTS were assembled into four forms of questionnaires (Figure 1). In questionnaire form 1, the NTS was placed before the TS. In questionnaire forms 2, 3, and 4, the NTS was inserted into the TS and separated the TS into two sections. The first section had 14 items, and the second section had 7 items. Take questionnaire form 2 for instance: the first section of the TS included the second and the third set of 7 items from the E scale, and the second section of the TS was the first set of 7 items from the E scale. By analogy, the second section in form 3 and form 4 was the second set of 7 items and the third set of 7 items from the E scale, respectively (Figure 1). The TS was used for the person-fit analysis, and the NTS was not. The purpose of the reassembly was to precisely produce four types of misfit groups (one global misfit group and three types of local misfit groups) and to ensure that individuals would respond randomly to the common items from the TS. Specifically, an experimental manipulation (an interference) served as a trigger for random response behaviors. Individuals responded normally before this trigger and responded randomly after it. For each individual, the trigger point was the item to which he or she was responding when the interference came up. However, these trigger items could rarely be identified, because the response speeds of individuals were inconsistent (it was not clear how much of the questionnaire each individual had completed by the time of the interference). The reassembled questionnaires could ensure that the interference was

Figure 1. Four forms of questionnaire assembly. The TS, NTS stand for target scale and nontarget scale.

initiated at the time when all individuals were responding to the NTS items, because the NTS was long enough (80 items). Therefore, the trigger item must have been one of the NTS items, and the TS items after this trigger point would have involved random response behaviors.

Experimental Test Administration

The focus of this study was random responding, and the challenge was how to collect random response data. Many factors can lead to random responses, such as lack of motivation, insufficient ability, and lack of time. A personality test is a typical-performance test, and one possible way to promote random response behaviors in a personality test is to have respondents answer without motivation and under time pressure. Here, an attempt was made to demotivate individuals during the test by manipulating the administration procedure. In step 1, an administrator handed out the questionnaires, and the individuals began responding to them. In step 2, the administrator left the classroom, and a confederate (a teacher of the respondents) suddenly interrupted and announced that a course quiz would begin as soon as the individuals finished the questionnaires. In step 3, the confederate repeatedly emphasized, until the test ended, that the questionnaires were not important and had no impact on the students' course scores. In steps 2 and 3, the teacher's instructions were along these lines: "This questionnaire is unimportant, and by the way you have to finish fast or else you'll lose time on your real test. Hurry, hurry!!" It was surmised that the confederate's sudden interruption and repeated emphasis would lead the respondents to answer the remaining test items randomly. The course quiz was a high-stakes exam, whereas the questionnaire was a low-stakes one, and the sooner the individuals finished the questionnaire, the earlier they could start the quiz, which gave them an incentive to finish the questionnaire as quickly as possible. Consequently, if this experimental scenario worked, genuine random response data could be collected.



The four random response groups were all subjected to this experimental manipulation, and two methods were used to confirm whether the manipulation had influenced the individuals' response behaviors. First, the individuals were debriefed after the test about whether they had responded randomly as a consequence of the confederate's interruption. The results showed that 165 of 188 (87.8%), 154 of 193 (79.8%), 150 of 201 (74.6%), and 132 of 189 (69.8%) individuals admitted that they had responded randomly to the questionnaire items after the confederate's interruption. The true number of random responders might be even greater, because some individuals may have claimed during debriefing that they had not given random answers when in fact they had. Second, if the manipulation worked, the person-fit analysis would be expected to identify more random response patterns in the manipulated groups than in a control group whose teacher did not interrupt; these results are reported below. In summary, these two methods confirmed that the experimental test administration worked. The true purpose of the manipulation was explained to all participants after the test, and the results of the personality assessment were not shared with the participants or their teachers. The data collection procedure followed the appropriate ethical standards and guidelines and had little effect on the participants.

Datasets for Analysis

The four random response groups finished the four forms of questionnaires under the manipulation. Owing to the manipulation and the new questionnaire structure, it was possible to collect four random response samples: AI responded randomly to all items (188 participants), FI responded randomly to the first 7 items (193 participants), SI responded randomly to the second 7 items (201 participants), and TI responded randomly to the third 7 items (189 participants). The normal response group, which finished the E-scale items under standard administration, served as a common sample (1,770 participants). Next, the common sample and the four random response samples were combined into four new datasets: the AI dataset (1,958 participants), the FI dataset (1,963 participants), the SI dataset (1,971 participants), and the TI dataset (1,959 participants). These four datasets, together with the common sample dataset (C dataset), were used in the data analysis.

IRT Model

The three person-fit indices in the present study were IRT-based, so an IRT model had to be chosen before the person-fit analysis. The two-parameter logistic model (2PLM), rather than the one-parameter (1PLM) or three-parameter (3PLM) logistic model, was fitted in R (version 3.1.2). The 2PLM was used here for two reasons. First, a personality test is a typical-performance test, so no guessing behavior is expected in the response process, which makes the guessing parameter of the 3PLM unnecessary. Second, it is hard to conceive of a set of personality items that are all equally discriminating, so the item-specific discrimination parameters of the 2PLM are needed, in contrast to the 1PLM. Moreover, many personality studies have also recommended the 2PLM (Ferrando, 2001, 2003; Reise & Henson, 2003; Reise & Waller, 1990; Waller, Tellegen, McDonald, & Lykken, 1996).

Person-Fit Analysis Procedure

Person-fit analysis was performed in three steps. In step 1, person-fit indices were calculated to identify individuals who showed random responses. In step 2, a graphical method based on the CUSUM-based indices was used to describe the global and local misfit individuals and to identify the items to which the local misfit individuals had responded randomly. In step 3, the person-fit indices were recalculated based on the items identified in step 2 as those to which an individual had responded randomly. All of these steps were implemented in Microsoft Excel. By means of this procedure, researchers can not only identify the global and local misfit individuals but also visualize the random response behaviors item by item. In addition, this procedure may help researchers to understand the possible causes of random response behaviors. To identify the random response patterns, appropriate cutoff points for the lz index and the two CUSUM-based indices were necessary. A cutoff of 1.65 (Type I error = 0.05 for the standard normal distribution) was used for the lz index; note that the distribution of lz differs from the assumed standard normal distribution when the test is not long and the true ability is unknown. The study described here used real data, and lz served as a screening device for the person-fit analysis procedure described below. The cutoff points of CUSUMrr and CUSUMvm reported in the present paper were estimated from 10,000 simulated individuals based on the item estimates obtained from the empirical data described below; each person-fit statistic was computed for each simulated individual, and the appropriate 5% quantile was taken from the resulting empirical distributions. These two typical ways of setting cutoff points can lead to different false positive rates in real data, and it is difficult to compare the power of alternative person-fit statistics when their false positive rates vary. Therefore, the normal response group of 1,770 was also used to determine cutoff points that produce a common 5% false positive rate for all misfit indices. In addition, the absolute values of CUSUMrr and CUSUMvm were used to graph the CUSUM curves.
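To illustrate the first part of this procedure, the following R sketch (a minimal illustration under stated assumptions, not the authors' original code) fits the 2PLM with the mirt package and computes lz from the estimated parameters; resp is assumed to be the 0/1 response matrix for the 21 E-scale items, and normal_id the row indices of the normal response group used to fix the false positive rate at 5%.

library(mirt)

mod   <- mirt(resp, 1, itemtype = "2PL")                    # two-parameter logistic model
theta <- fscores(mod)[, 1]                                  # trait (extraversion) estimates
ipar  <- coef(mod, IRTpars = TRUE, simplify = TRUE)$items   # discrimination (a) and difficulty (b)

# lz: standardized log-likelihood person-fit statistic (Drasgow, Levine, & Williams, 1985)
lz_index <- function(u, th, a, b) {
  p  <- 1 / (1 + exp(-a * (th - b)))                        # 2PLM response probabilities
  l0 <- sum(u * log(p) + (1 - u) * log(1 - p))              # observed log-likelihood
  e  <- sum(p * log(p) + (1 - p) * log(1 - p))              # expected log-likelihood
  v  <- sum(p * (1 - p) * log(p / (1 - p))^2)               # variance of the log-likelihood
  (l0 - e) / sqrt(v)
}
lz <- sapply(seq_len(nrow(resp)),
             function(i) lz_index(resp[i, ], theta[i], ipar[, "a"], ipar[, "b"]))

# Empirical cutoff producing a 5% false positive rate in the normal response group,
# as an alternative to the normal-theory cutoff described in the text
cut_lz  <- quantile(lz[normal_id], probs = 0.05)
flagged <- which(lz < cut_lz)                               # potentially random responders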

Results

IRT Analysis

Unidimensionality was examined using the C dataset before the IRT analysis. Table 1 shows the results of an exploratory factor analysis (EFA) of the E scale, in which the tetrachoric correlation matrix was used instead of the Pearson correlation matrix. Table 1 presents the three largest eigenvalues. Only the first eigenvalue was larger than 1, and the ratio of the first eigenvalue (6.289) to the second eigenvalue (0.862) was over 7. In addition, the first factor accounted for 78.3% of the variance. Therefore, the assumption of unidimensionality underlying the E scale was well founded. Next, all five datasets were analyzed using the 2PLM. The item analysis showed that the item parameters estimated from the five datasets were very close (Figure 2), which means that the random response patterns had little effect on parameter estimation in these datasets. Thus, the person-fit indices could be calculated without considering item parameter differences. In addition, the discrimination values ranged from 0.369 to 2.558 and the difficulty values ranged from −1.979 to 1.628. These results were similar to those of many previous reports that used personality tests based on IRT models (e.g., Fraley, Waller, & Brennan, 2000; Reise & Waller, 1990).
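As a minimal illustration of this unidimensionality check (not the authors' code; the exact EFA settings are not reported in the article), assuming the dichotomous C-dataset responses are stored in resp_c and the psych package is available:

library(psych)

tet <- tetrachoric(resp_c)$rho                 # tetrachoric rather than Pearson correlations
eig <- eigen(tet)$values                       # eigenvalues used to judge unidimensionality
head(eig, 3)                                   # the three largest eigenvalues (cf. Table 1)
eig[1] / eig[2]                                # ratio of first to second eigenvalue
fa(tet, nfactors = 1, n.obs = nrow(resp_c), fm = "minres")   # one-factor EFA on that matrix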

Table 1. Exploratory factor analysis results of E scale

Factor   Eigenvalue   Difference   Proportion   Cumulative
1        6.289        5.427        0.783        0.783
2        0.862        0.100        0.107        0.891
3        0.762        0.152        0.095        0.986

Detecting Random Responses

Random response patterns in the four datasets were detected by using lz, CUSUMrr, and CUSUMvm. Table 2 gives the rates of false positives (identifying "normal" individuals as "random") and true positives (identifying "random" individuals correctly). The true positive and false positive rates in column 5 of Table 2 were obtained when the cutoff points were determined in the typical ways; the true positive rates in column 6 were obtained when the false positive rates were fixed at 5% for the normal response group of 1,770. As the false positive rates rose, more random responses were detected by all three indices. It was assumed that the common sample was a relatively pure sample with few or no random responses and that the four random samples were contaminated samples containing random responses. Take the results of the lz index, for example (Table 2, column 5): the rate of false positives was low (below 1%), and the substantially higher rate of true positives in the manipulated datasets (31.38% for the AI dataset and around 10% for the FI, SI, and TI datasets) indicated that the manipulation (the experimental test administration) had the intended effect. The fifth and sixth columns of Table 2 show the performance of the three indices under the different conditions. For the AI dataset, the lz statistic performed better than the two CUSUM-based statistics. However, the true positive rates decreased from 31.38% (54.26%) to around 10% (22%) in the fifth (sixth) column when only some of the E-scale items were answered randomly. For the FI, SI, and TI datasets, it is hard to evaluate which index performed better.

Graphical Method for Random Responses

Person-fit statistics can only be used to identify random responders, but the process of random responding may also provide further information regarding the administration procedure. CUSUM-based indices can show random responses item by item. We graphed the 10 most aberrant individuals identified by the two CUSUM-based indices using their CUSUMrr and CUSUMvm patterns (Figures 3 and 4). For both CUSUMvm and CUSUMrr, high values indicated that the performance on an item was above or below an individual's true extraversion level. When random behaviors occurred, the CUSUMvm and CUSUMrr curves increased, so the tendency of the curves provided further information regarding the response behaviors. For example, the values increased across the whole scale in the AI dataset (Figures 3 and 4). In the FI dataset, CUSUMrr and CUSUMvm increased at the beginning, and CUSUMrr then decreased after the eighth item. Moreover, in the TI dataset, CUSUMrr showed a more rapidly increasing tendency than CUSUMvm on the last seven items, whereas CUSUMvm showed a steeper tendency than CUSUMrr on the second 7 items under the SI condition. These results supported the assumption that the manipulation would create both global and local misfit. In general, the graphical method using CUSUM-based indices illustrated the specific items on which local misfit had taken place. Local misfit was probably the result of individuals' states of mind, testing time limits, lack of motivation, inappropriate instructions, or some combination of these factors. Therefore, identifying the local misfit and inferring its probable causes may help researchers evaluate the actual administration procedure.
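As a sketch of how such item-by-item curves can be produced (an illustration of the general CUSUM logic of van Krimpen-Stoop & Meijer, 2001, not the exact CUSUMrr and CUSUMvm definitions used in this article), assuming the fitted 2PLM objects theta, ipar, resp, and flagged from the earlier sketch:

# Model-implied probabilities for every person (rows) and item (columns)
p <- t(sapply(theta, function(th) 1 / (1 + exp(-ipar[, "a"] * (th - ipar[, "b"])))))

# Residual-based CUSUM curve for one response vector u and its probabilities pv
cusum_curve <- function(u, pv) {
  t_j <- (u - pv) / length(u)                                 # scaled item residuals
  up  <- numeric(length(u)); lo <- numeric(length(u))
  for (j in seq_along(u)) {
    up[j] <- max(0, (if (j == 1) 0 else up[j - 1]) + t_j[j])  # runs of unexpectedly high responses
    lo[j] <- min(0, (if (j == 1) 0 else lo[j - 1]) + t_j[j])  # runs of unexpectedly low responses
  }
  pmax(up, abs(lo))                                           # absolute CUSUM value per item
}

i <- flagged[1]                                               # e.g., the first flagged respondent
plot(cusum_curve(resp[i, ], p[i, ]), type = "b",
     xlab = "Item", ylab = "CUSUM", main = "Item-by-item misfit for one respondent")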


Figure 2. (A) Discrimination and (B) difficulty parameters estimated from five datasets.

Table 2. False positive rates and true positive rates for four datasets

Index     Dataset      Sample size   No. of random respondents   True positive (false positive), typical cutoffs   True positive (false positive), 5% false positive rate
lz        AI dataset   1,958         188                          31.38% (0.00%)                                     54.26% (5%)
lz        FI dataset   1,963         193                          10.36% (0.68%)                                     22.80% (5%)
lz        SI dataset   1,971         201                           9.95% (0.45%)                                     20.40% (5%)
lz        TI dataset   1,959         189                          10.05% (0.17%)                                     24.34% (5%)
CUSUMrr   AI dataset   1,958         188                          12.23% (0.56%)                                     46.81% (5%)
CUSUMrr   FI dataset   1,963         193                           5.07% (1.64%)                                     11.82% (5%)
CUSUMrr   SI dataset   1,971         201                           7.46% (1.19%)                                     14.93% (5%)
CUSUMrr   TI dataset   1,959         189                          10.58% (0.85%)                                     29.63% (5%)
CUSUMvm   AI dataset   1,958         188                           9.04% (1.81%)                                     12.23% (5%)
CUSUMvm   FI dataset   1,963         193                          11.92% (1.64%)                                     15.03% (5%)
CUSUMvm   SI dataset   1,971         201                           6.47% (1.92%)                                      9.45% (5%)
CUSUMvm   TI dataset   1,959         189                           8.47% (1.92%)                                     11.64% (5%)


Figure 3. Graph for the CUSUMvm statistics showing 10 aberrant respondents from four datasets.

Identifying Local Misfit

The graphical method indicated that the CUSUM-based curves matched the experimental design, and in real data analysis this method was capable of locating local misfit items. However, if individuals randomly responded to only some of the scale items, person-fit indices based on all items might not be sensitive enough to that circumstance. Consequently, carrying out person-fit analyses again based on those items located by the graphical method may improve the detection rates. The two CUSUM-based indices were therefore recalculated based on the first 7 items from the FI dataset, the second 7 items from the SI dataset, and the third 7 items from the TI dataset. The global index lz was abandoned in this step because it depends on the likelihood of the individuals' responses, which can be heavily influenced by shortening the test and can be insensitive to local misfit (Emons, 2009). The results of the local misfit detection (Table 3) showed that more random response patterns (24.87%, 12.94%, and 12.70%) could be identified by CUSUMvm when false positive rates varied. CUSUMrr could identify more random response individuals (40.93%, 29.85%, and 40.21%) when false positive rates were fixed at 0.05. Furthermore, these detection rates were much higher than those in Table 2, where all scale items rather than 7 of the 21 were used. These results indicated that the location information revealed by the graphical method in step 2 could help improve local misfit detection.

Figure 4. Graph for the CUSUMrr statistics showing 10 aberrant respondents from four datasets.

Table 3. False positive rates and true positive rates for four datasets in local misfit analysis

            CUSUMvm                                      CUSUMrr
Dataset     Typical cutoffs     5% false positive rate   Typical cutoffs     5% false positive rate
FI dataset  24.87% (3.28%)      26.94% (5%)              16.06% (1.02%)      40.93% (5%)
SI dataset  12.94% (1.58%)      23.88% (5%)               0.00% (0.11%)      29.85% (5%)
TI dataset  12.70% (2.15%)      17.99% (5%)               5.82% (0.06%)      40.21% (5%)

Note. Cells show true positive rates with false positive rates in parentheses.

Discussion

The present study recommended a three-step person-fit analysis procedure, and the results showed that this procedure was useful for random response detection. Step 1 of this procedure is a typical person-fit analysis using a single index. However, there are two more steps in this procedure than in a typical person-fit analysis: step 2 is a graphical method and step 3 is a local misfit detection procedure. These further steps add two advantages. One is that, from the graphical method, one can immediately visualize where the random responses appeared, as described in step 2. The visualized curves allow more inferences to be drawn than from a simple index. For instance, if some individuals are sleepy, they may respond randomly at the beginning of a test, as the FI group did. If individuals tire during the testing session, they may respond randomly to the last several items, as the TI group did. The different locations of misfit patterns can help researchers reevaluate the test administration and suggest possible explanations regarding individuals' states (e.g., sleepiness, lack of motivation, tiredness, or something else). Given the location information from the misfit patterns, the person-fit analysis can be carried out again based on the particular subset of items on which the random responses occurred, as described in step 3.

This is the second advantage: more local misfit individuals can be identified than in step 1. Many person-fit studies have used simulation data instead of real data because it is difficult to discern truly random response patterns in real situations. However, simulations idealize reality, and random responding in a real psychological test is likely to be more complicated. Using an experimental manipulation to produce real random responses is therefore a feasible alternative. The debriefing results and the person-fit detection results supported the conclusion that the manipulation procedure was effective, notwithstanding the lower detection rates (Table 2) compared with a simulation study (Armstrong & Shi, 2009).


Three IRT-based person-fit indices were discussed in this study. The first, lz, performed better for global misfit detection. The second approach, the CUSUM-based indices, was appropriate for local misfit detection. However, CUSUMrr, which was proposed specifically for random responses, was found to be less sensitive in this study than in the study by Tendeiro and Meijer (2012). Their study examined an ability test. Ability tests are maximum performance tests, in which random response behavior primarily reflects random guessing, which means that low-ability rather than high-ability individuals are the ones who tend to respond randomly. Therefore, CUSUMrr may not be sensitive enough in personality tests compared with ability tests. In future studies, other types of person-fit indices, such as the optimal appropriateness index (Levine & Drasgow, 1988), the lz* index (Snijders, 2001), and caution indices (Tatsuoka, 1984), could be introduced into this three-step procedure. How to make good use of different person-fit indices in diverse types of tests under various practical conditions remains to be explored. When random responses occurred, the three-step procedure was useful, and the graphical method was practical for visualizing the random response process. Through the person-fit analysis procedure described above, it can be inferred that the random responses resulted from a lack of motivation. In practice, however, psychological tests may be affected by a variety of aberrant response behaviors. The three-step person-fit procedure proposed here may also be of use for detecting other types of aberrant response behaviors, but its performance in such cases remains unclear. Moreover, identifying aberrant responders is only the first step; developing new methods to clarify the types and causes of aberrant responses is the longer-term goal.

References

Armstrong, R. D., & Shi, M. (2009). A parametric cumulative sum statistic for person fit. Applied Psychological Measurement, 33, 391–410. Belov, D. I. (2011). Detection of answer copying based on the structure of a High-Stakes Test. Applied Psychological Measurement, 35, 495–517. Berry, D. T. R., Wetter, M. W., Baer, R. A., Larsen, L., Clark, C., & Monroe, K. (1992). MMPI-2 random responding indices: Validation using a self-report methodology. Psychological Assessment, 4, 340–345. Clark, M. E., Gironda, R. J., & Young, R. W. (2003). Detection of back random responding: Effectiveness of MMPI-2 and personality assessment inventory validity indices. Psychological Assessment, 15, 223–234. Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122–136. Conrad, K. J., Bezruczko, N., Chan, Y., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with

person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92–100. Dodeen, H., & Darabi, M. (2009). Person-fit: Relationship with four personality tests in mathematics. Research Papers in Education, 24, 115–123. Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59–79. Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67–86. Emons, W. H. M. (2008). Nonparametric person-fit analysis of Polytomous Item Scores. Applied Psychological Measurement, 32, 224–247. Emons, W. H. M. (2009). Detection and diagnosis of person misfit from patterns of summed Polytomous Item Scores. Applied Psychological Measurement, 33, 599–619. Emons, W. H. M., Sijtsma, K., & Meijer, R. R. (2005). Global, local, and graphical person-fit analysis using person-response functions. Psychological Methods, 10, 101–119. Ferrando, P. J. (2001). The measurement of neuroticism using MMQ, MPI, EPI and EPQ items: A psychometric analysis based on item response theory. Personality and Individual Differences, 30, 641–656. Ferrando, P. J. (2003). The accuracy of the E, N and P trait estimates: An empirical study using the EPQ-R. Personality and Individual Differences, 34, 665–679. Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718–722. Ferrando, P. J. (2015). Assessing person fit in typical-response measures. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 128–155). New York, NY: Routledge/Taylor & Francis Group. Ferrando, P. J., & Anguiano-Carrasco, C. (2013). A structural model-based optimal person-fit procedure for identifying faking. Educational and Psychological Measurement, 73, 173–190. Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997–1012. Ferrando, P. J., & Lorenzo-Seva, U. (2010). Acquiescence as a source of bias and model and person misfit: A theoretical and empirical analysis. British Journal of Mathematical and Statistical Psychology, 63, 427–448. Ferrando, P. J., & Lorenzo-Seva, U. (2016). A comprehensive regression-based approach for identifying sources of person misfit in typical-response measures. Educational and Psychological Measurement, 76, 470–486. Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78, 350–365. Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277–298. LaHuis, D. M., & Copeland, D. (2009). Investigating faking using a multilevel logistic regression approach to measuring person fit. Organizational Research Methods, 12, 296–319. Lee, P., Stark, S., & Chernyshenko, O. S. (2014). Detecting aberrant responding on unidimensional pairwise preference

tests: An application of lz based on the Zinnes-Griggs ideal point IRT model. Applied Psychological Measurement, 38, 391–403. Levine, M. V., & Drasgow, F. (1988). Optimal appropriateness measurement. Psychometrika, 53, 161–176. Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data. Psychological Methods, 17, 437–455. Meijer, R. R. (1996). Person fit research: An introduction. Applied Measurement in Education, 9, 3–8. Meijer, R. R. (2002). Outlier detection in high-stakes certification testing. Journal of Educational Measurement, 39, 219–233. Meijer, R. R., & Nering, M. L. (1997). Trait level estimation for nonfitting response vectors. Applied Psychological Measurement, 21, 321–336. Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107–135. Nering, M. L., & Meijer, R. R. (1998). A comparison of the Person response function and the lz person-fit statistic. Applied Psychological Measurement, 22, 53–69. Reise, S. P. (1990). A comparison of item- and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14, 127–137. Reise, S. P. (1995). Scoring method and the detection of person misfit in a personality assessment context. Applied Psychological Measurement, 19, 213–229. Reise, S. P., & Flannery, W. P. (1996). Assessing Person-Fit on measures of typical performance. Applied Measurement in Education, 9, 9–26. Reise, S. P., & Henson, J. M. (2003). A discussion of modern versus traditional psychometrics as applied to personality assessment scales. Journal of Personality Assessment, 81, 93–103. Reise, S. P., & Waller, N. G. (1990). Fitting the two-parameter model to personality data. Applied Psychological Measurement, 14, 45–58. Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143. Scherbaum, C. A., Sabet, J., Kern, M. J., & Agnello, P. (2013). Examining faking on personality inventories using unfolding item response theory models. Journal of Personality Assessment, 95, 207–216.


Schmitt, N., Chan, D., Sacco, J. M., McFarl, L. A., & Jennings, D. (1999). Correlates of person fit and effect of person fit on test validity. Applied Psychological Measurement, 23, 41–53. Snijders, T. A. B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66, 331–342. St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Accuracy of person-fit statistics: A Monte Carlo study of the influence of aberrance rates. Applied Psychological Measurement, 35, 419–432. Tatsuoka, K. K. (1984). Caution indices based on Item Response Theory. Psychometrika, 49, 95–110. Tendeiro, J. N., & Meijer, R. R. (2012). A CUSUM to detect person misfit: A discussion and some alternatives for existing procedures. Applied Psychological Measurement, 36, 420–442. van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2000). Detection of person misfit in adaptive testing using statistical process control techniques. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 201–219). Boston, MA: Kluwer-Nijhoff. van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2001). CUSUM-based person-fit statistics for adaptive testing. Journal of Educational and Behavioral Statistics, 26, 199–217. Waller, N. G., Tellegen, A., McDonald, R. P., & Lykken, D. T. (1996). Exploring nonlinear models in personality assessment: Development and preliminary validation of a negative emotionality scale. Journal of Personality, 64, 545–576.

Received September 4, 2015
Revision received February 9, 2016
Accepted February 16, 2016
Published online November 7, 2016

Tao Xin
School of Psychology
Beijing Normal University
100875 Beijing
P.R. China
Tel. +86 135 5294-6058
E-mail mikebonita@163.com



Multistudy Report

A Need for Cognition Scale for Children and Adolescents: Structural Analysis and Measurement Invariance

Ulrich Keller,1 Anja Strobel,2 Rachel Wollschläger,3 Samuel Greiff,4 Romain Martin,1 Mari-Pauliina Vainikainen,5 and Franzis Preckel3

1 LUCET, University of Luxembourg, Luxembourg
2 Department of Psychology, Technische Universität Chemnitz, Germany
3 Department of Psychology, University of Trier, Germany
4 ECCS, University of Luxembourg, Luxembourg
5 Centre for Educational Assessment, University of Helsinki, Finland

Abstract: Need for Cognition (NFC) signifies "the tendency for an individual to engage in and enjoy thinking" (Cacioppo & Petty, 1982, p. 116). Up to now, no scale of sufficient psychometric quality existed to assess NFC in children. Using data from three independent, diverse cross-sectional samples from Germany, Luxembourg, and Finland, we examined the psychometric properties of a new NFC scale intended to fill in this gap. In all samples, across grade levels ranging from 1 to 9, confirmatory factor analysis confirmed the hypothesized nested factor structure based on Mussel's (2013) Intellect model, with one general factor Think influencing all items and two specific factors Seek and Conquer each influencing a subset of items. At least partial scalar measurement invariance with regard to grade level and sex could be demonstrated. The scale exhibited good psychometric properties and showed convergent and discriminant validity with an established NFC scale and other noncognitive traits such as academic self-concept and interests. It incrementally predicted mostly statistically significant but relatively small portions of academic achievement variance over and above academic self-concept and interest. Implications for research on the development of NFC and its role as an investment trait in intellectual development are discussed.

Keywords: need for cognition, measurement invariance, investment traits

The construct Need for Cognition (NFC) emerged from research on the Elaboration Likelihood Model of Persuasion (ELM; cf. Petty, Rucker, Bizer, & Cacioppo, 2004). It signifies the “tendency for an individual to engage in and enjoy thinking” (Cacioppo & Petty, 1982, p. 116) and has since been “discovered” by many researchers outside of social psychology who have shown that it plays a role in many aspects of cognitive behavior. In an extensive review of the literature, Cacioppo, Petty, Feinstein, and Jarvis (1996) gathered evidence showing the influence of NFC “in fields ranging from social, personality, developmental, and cognitive psychology to behavioral medicine, education, journalism, marketing, and law” (Cacioppo et al., 1996, p. 198). In educational settings, NFC has been shown to be correlated with constructs such as academic interest (Feist, 2012), academic self-concept (Dickhäuser & Reinhard, 2010), conscientiousness, and effort (Fleischhauer et al., 2010). However, very little is known about antecedents and development of NFC.

We believe that research in this area has been hampered primarily by the lack of an assessment instrument for NFC that is suitable for young children (as well as other age groups), and that is of satisfactory psychometric quality. In the following, we will briefly sketch out what is known about the relation between NFC and cognitive measures, discuss existing NFC scales, and present findings on a new scale by Preckel and Strobel (2011) suitable for the assessment of NFC in young children.


Need for Cognition, Intelligence, and Educational Attainment: What's Behind the Correlations?

A number of studies consistently found weak to moderate correlations between NFC and measures of fluid as well as crystallized intelligence (Cacioppo et al., 1996; Fleischhauer et al., 2010; Furnham & Thorne, 2013;
Hill et al., 2013; von Stumm & Ackerman, 2013). Furthermore, multiple studies found that NFC was weakly to moderately correlated with college students’ performance (Cacioppo & Petty, 1982; Richardson, Abraham, & Bond, 2012; Tolentino, Curry, & Leak, 1990). Regarding children and adolescents (aged 10 and older), correlations of comparable magnitude with school grades were found (Ginet & Py, 2000; Preckel, 2014). What functional relations explain the observed correlations between cognitive measures and NFC? Ackerman (1996, 2014), building on the work of Cattell (1987), posited that intellectual investment traits play a crucial role in the transition from intelligence as process (fluid intelligence) to intelligence as knowledge (crystallized intelligence), by motivating individuals to seek out learning opportunities and/or structure experiences in a cognitively stimulating fashion, but also through reciprocal influences between investment traits and fluid and crystallized intelligence. In an extensive review, von Stumm and Ackerman (2013) identified a large number of investment trait constructs in publications reaching back until 1938. They posit a hierarchy of investment traits, with NFC and Typical Intellectual Engagement (TIE; Goff & Ackerman, 1992) at the apex as the most general constructs. Indeed, NFC and TIE have been found to be not only conceptually, but also empirically closely related (Mussel, 2010, 2013; Woo, Harms, & Kuncel, 2007). While preliminary evidence for investment theories exists (e.g., Ziegler, Danay, Heene, Asendorpf, & Bühner, 2012), conclusive longitudinal studies are lacking, as von Stumm and Ackerman (2013) point out. Even less is known about the development of NFC in childhood and possible positive or negative influences apart from intelligence. Cacioppo et al. (1996, p. 246) speculated that NFC should result as children experience the importance of cognitive skills for coping with problems, and listed characteristics of educational settings they considered as beneficial and detrimental for the development of NFC. We are aware of no empirical work investigating these ideas.

Assessment and Dimensionality of Need for Cognition (NFC)

Most published research on NFC uses the original 34-item scale introduced by Cacioppo and Petty (1982), its 18-item short form (Cacioppo, Petty, & Kao, 1984), or a variant thereof (e.g., Bless, Waenke, Bohner, Fellhauer, & Schwarz, 1994). Earlier research into the factor structure of NFC mostly supported a single-factor model (cf. Cacioppo et al., 1996), usually on the basis of exploratory factor analysis. More recent research with confirmatory methods

tends to settle on more complex models, for instance with two additional method factors for positively versus negatively worded items (Bors, Vigneau, & Lalande, 2006; Preckel, 2014). A prerequisite for tackling the open questions regarding the development of NFC and its relationship with other constructs is a measure already suitable for younger children. The self-report scales mostly used to date are not well suited for this endeavor as they employ items that exceed the language competencies of the target demographic group (e.g., “I would prefer a task that is intellectual, difficult, and important to one that is somewhat important but does not require much thought”). For children aged 10 years and older, Ginet and Py (2000) introduced a 20-item scale in French. Preckel (2014) presented a German scale for children of the same age based on the Ginet/Py scale and the 16-item German version of the Cacioppo/Petty scale by Bless et al. (1994). Still lacking, however, is a scale suitable for younger children. While Kokis (2002) introduced a 9-item scale for children from Grade 5 that Toplak, West, and Stanovich (2014) used (with modification) in a sample of children from Grade 2, this scale falls short of established psychometric standards. Its internal consistency (Cronbach’s α) is given as .55, its split-half reliability is reported as .63 in the former and .67 in the latter study. Preckel and Strobel (2011) introduced a 14-item scale designed to fill in this gap. Its development drew on Mussel’s theoretical framework incorporating many of the personality traits associated with intellectual achievement (Mussel, 2013). This Intellect framework comprises two dimensions: process and operation. Process refers to two distinct motivational orientations: Seek (approaching intellectually challenging situations) and Conquer (expending effort to master these challenges). Operation describes preferences regarding the cognitive activities Think, Learn, and Create, which are grounded in fluid intelligence, crystallized intelligence, and creativity, respectively. The combinations of the process and operation facets allow for the integration of a variety of noncognitive intellectual traits and provide a unifying theoretical basis. Mussel tentatively placed NFC in the “cell” defined by the Think operation and the Seek process. However, this does not seem to represent this trait comprehensively with the definition of NFC as “an individual’s tendency to engage in and enjoy effortful cognitive endeavors” (Cacioppo et al., 1984, p. 306; emphasis added) and its role within the Elaboration Likelihood Model, which would suggest a closer proximity to the Conquer process. Accordingly, the Preckel/Strobel scale includes items pertaining both to seeking and to mastering cognitively challenging situations. The scale consists of 14 short items of low linguistic


complexity and does not include negatively worded items (see Table 2 for the English translation; the original German version as well as an adapted German version for Luxembourg and a Finnish version are listed in Table A1 in the Electronic Supplementary Material, ESM 1).

Study Aims

In the present study, we intended to examine the psychometric properties of the Preckel/Strobel NFC scale in order to establish its applicability for studies scrutinizing the development of NFC, particularly in early childhood. To this end, we pursued three research aims:
1. We investigated the factor structure of the scale. Its construction was based on Mussel's (2013) Intellect framework in which each indicator is influenced by two dimensions (i.e., operation and process). We hypothesized that a bifactor or nested factor model (cf. Brunner, Nagy, & Wilhelm, 2012) with one general NFC factor (i.e., operation THINK in the Intellect framework), and two specific group factors nested under the general NFC factor (i.e., processes Seek and Conquer in the Intellect framework) would exhibit a better fit than a single-factor model or a correlated factor model with three factors Think, Seek, and Conquer. The nested factor approach allows for the inclusion of indicators that reflect a broader spectrum of behavior than a unidimensional measurement model could capture, leading to higher content validity (Reise, Moore, & Haviland, 2010). At the same time, the instrument can still be interpreted as a single scale provided the general factor is strong enough as assessed by the explained common variance (ECV; Brouwer, Meijer, & Zevalkink, 2013; Reise, Scheines, Widaman, & Haviland, 2013). We expected to find a strong general factor with at least 60% of attributable reliable variance, in line with the tentative benchmark recommended by Reise et al. (2013).
2. We investigated measurement invariance regarding grade levels and sex by employing multigroup confirmatory factor analysis in each country separately. In order for it to be viable for studying the development of NFC in childhood, measurement invariance of the scale over different age groups has to be ensured. Furthermore, in order to warrant the generalizability of findings, measurement invariance with regard to sex is also essential.
3. We investigated reliability and validity of the scale using scale statistics for internal consistency and


correlations with conceptually related and relatively unrelated measures to establish convergent and discriminant validity. We expected moderate to high correlations with conceptually related constructs such as academic interest, academic self-concept, conscientiousness, and effort. On the other hand, we expected low correlations with constructs that can be assumed to be conceptually distinct such as class climate, attitude toward classmates, and social integration. Furthermore, we expected NFC to incrementally predict measures of academic performance over and above academic self-concept and interest.

Materials and Methods

Participants

The analyses were conducted separately using three samples from Finland (FI), Germany (GE), and Luxembourg (LU). All samples are cross-sectional. Table 1 summarizes the sample characteristics. For the FI sample, students from Grades 3, 6, and 9 completed the NFC scale in Finnish at the end of an assessment battery that comprised a range of cognitive and noncognitive measures.1 In the GE sample, the NFC scale was administered to the norm sample for a children's intelligence test (Baudson & Preckel, 2013) in Grades 1 through 4. The LU sample comprised an entire cohort each of 7th and 9th graders tested in the context of the Luxembourgish school monitoring program (Martin, Ugen, & Fischbach, 2015; due to Luxembourg's small population, a cohort consists only of approximately 6,200 students). Students could choose between German and French as a test language; for the following analyses, only data from students who chose German were retained (81% of cases in Grade 7 and 74% in Grade 9) in order to avoid further complicating the analysis. The NFC scale was administered in the context of the annual standardized tests, as part of a larger questionnaire.

Measures

Need for Cognition (NFC)

NFC was measured using a 14-item scale (Preckel & Strobel, 2011; items are listed in Table 2). The scale was developed in German and was used in its original form in

1 The data were collected in the context of a large panel study (Vainikainen, 2014) and have been used in other publications (Krkovic, Greiff, Kupiainen, Vainikainen, & Hautamäki, 2014; Wüstenberg, Stadler, Hautamäki, & Greiff, 2014). The present study, however, pursues unique research questions.


Table 1. Sample characteristics of the German, Finnish, and Luxembourgish samples

Country            N       Grade   (%) Female   Mean age   SD age
Germany (GE)       441     1       47           6.44       0.50
                   446     2       50           7.45       0.54
                   406     3       52           8.48       0.58
                   936     4       48           9.58       0.59
Finland (FI)       1,689   3       50           9.23       0.44
                   1,648   6       51           12.24      0.45
                   1,253   9       52           15.24      0.47
Luxembourg (LU)    4,256   7       51           12.98      0.56
                   4,709   9       49           15.37      0.88

the GE sample. For the LU sample, two somewhat colloquial German words unfamiliar to nonnative German speakers were replaced with more common alternatives (two items were altered; see Table A1 in the Electronic Supplementary Material). For the FI sample, the scale was translated into Finnish collaboratively by two native Finnish speakers also proficient in German. The answer formats varied across samples. In order to accommodate the young age of some students in the GE sample, a 3-point scale was employed with sad, neutral, and happy "smiley faces" to indicate disagreement, indifference, and agreement, respectively. In the FI sample, students responded on a 7-point scale ("Not true at all" to "Very true") with smiley faces marking the extreme and middle categories; the LU sample used a 4-point scale ("Not true" to "True"). The mode of administration was computer-based in the FI sample and in the 9th grade LU sample. In the GE sample and the 7th grade LU sample, paper-pencil administration was used. In the GE sample, the items were read out aloud by the test administrators for Grade 1 and Grade 2 students.

Measures for Validation

Since the NFC scale was administered in the context of larger studies in all three samples, further measures were available for validity assessment of the NFC scale: academic interest, academic self-concept, effort (only GE and FI), conscientiousness (only LU), social integration (GE), attitude toward classmates (FI), and class climate (LU). Sources, reliabilities, descriptive statistics, as well as sample items for these measures are given in Table 7 in the Results section. Note that scales with the same label (e.g., "academic interest") can differ considerably across samples with respect to item number and wording. Academic performance was assessed using parent-reported school grades in Mathematics and German in the GE sample, school grades from students' report cards in Mathematics and Finnish in the FI sample, and standardized test scores in Mathematics and German reading comprehension in the LU sample. A subsample of 443 randomly selected students from the highest academic track in the Luxembourgish Grade 9 sample also completed a different 19-item NFC scale for adolescents by Preckel (2014) using a 4-point scale from "Not true" to "True".

Data Analysis

Because of the different response formats and administration languages, all analyses were conducted separately in the three samples. After calculating item and scale statistics, we investigated three models by confirmatory factor analysis using Mplus version 7.2 (Muthén & Muthén, 2012): (A) a single-factor model, (B) a correlated factor model with three factors Think, Seek, and Conquer, and (C) a nested factor model with three uncorrelated latent variables: a general factor Think influencing all items, and two specific factors Seek and Conquer influencing only the items intended to measure these facets. The German and Luxembourgish questionnaires used only three and four response categories, respectively. In addition, the responses to a number of items exhibited marked ceiling effects in these two samples. Under these circumstances, a categorical least squares estimator is recommended (Lubke & Muthén, 2004; Rhemtulla, Brosseau-Liard, & Savalei, 2012). We thus used the WLSMV estimator for the GE and LU samples. The prevalence of missing data was low in the GE and LU samples, with a maximum of 1% missingness for all items. In the FI sample, technical difficulties during the computer-based administration resulted in significant data loss, especially regarding the last seven items of the scale, with a maximum of 10% missingness. However, this type of missingness can reasonably be assumed to be completely at random and thus ignorable. Missing data were dealt with using full information maximum likelihood estimation (FIML) in the FI sample and pairwise deletion in the GE and LU samples, as FIML was not feasible with the WLSMV estimator (Asparouhov & Muthén, 2010). Since all data were obtained in a classroom setting, the nested structure of the data was taken into account using Mplus' ANALYSIS = COMPLEX setting, specifying class membership as the cluster variable. Model selection was based on the robust χ² statistic, the root mean square error of approximation (RMSEA), and the comparative fit index (CFI). In the FI sample, the standardized root mean square residual (SRMR) was taken into account additionally; it could not be computed using the WLSMV estimator in the other two samples. Rules of thumb for the interpretation of fit measures were taken from Schermelleh-Engel, Moosbrugger, and Müller (2003).
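The following lavaan sketch is given purely as an illustration of these measurement models (the authors estimated them in Mplus). The item names i1–i14, the data frame nfc, and the Seek/Conquer item assignments are placeholders rather than the published loading pattern, and the design-based correction for classroom clustering (ANALYSIS = COMPLEX) is omitted.

library(lavaan)

item_names <- paste0("i", 1:14)

sf_model <- paste("think =~", paste(item_names, collapse = " + "))   # (A) single-factor model

nf_model <- '
  think   =~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 + i10 + i11 + i12 + i13 + i14
  seek    =~ i4 + i6 + i9 + i11 + i12    # hypothetical item subset
  conquer =~ i3 + i5 + i8 + i13          # hypothetical item subset
'                                          # (C) nested factor model

fit_sf <- cfa(sf_model, data = nfc, estimator = "WLSMV", ordered = item_names)
fit_nf <- cfa(nf_model, data = nfc, estimator = "WLSMV", ordered = item_names,
              orthogonal = TRUE)           # general and specific factors kept uncorrelated

fitMeasures(fit_nf, c("chisq.scaled", "df", "cfi.scaled", "rmsea.scaled"))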



Table 2. NFC scale items and hypothesized factor loadings

Item no.   Item
1          I like thinking to find solutions to problems.
2          Thinking is fun for me.
3          When I don't understand something, I think it through until I've got it.
4          I like to work on problems that require a lot of thinking.
5          In school I want to understand everything exactly.
6          I like it when I get homework that I really have to chew over.
7          I like to do puzzles.
8          I always want to know things exactly.
9          I like problems where I have to think a lot.
10         I like learning new things.
11         I like situations in which I have to think a lot.
12         At school, when I get problems that require me to think, I'm glad.
13         I like situations where I can accomplish something by thinking.
14         I love thinking about things.

Note. In the hypothesized nested factor model, the general factor Think influences all 14 items, and the specific factors Seek and Conquer each influence a subset of the items (see the Data Analysis section).

To assess the strength of the general factor in the nested factor models, we computed the explained common variance (Reise et al., 2013; Sijtsma, 2009) based on the estimated model parameters. Measurement invariance regarding grade level and sex was investigated using multigroup confirmatory factor analysis (CFA) and stepwise analysis, progressing from a less restrictive model (configural invariance: equal pattern of zero and nonzero loadings) to more restrictive models (metric invariance: additionally, equal loadings; scalar invariance: additionally, equal intercepts or thresholds). We considered invariance as established when the added restrictions did not lead to a worse model fit. When estimating a nested factor model with the WLSMV estimator, loadings and thresholds for an item should only be fixed in tandem. Therefore, metric invariance (equal loadings) could not be assessed in the GE and LU samples. The χ² difference test for comparisons of nested models is known to be strongly influenced by sample size, and with samples as large as in the present study can be expected to be significant even in the presence of merely trivial differences (Meade, Johnson, & Braddy, 2008). To avoid this problem, we examined differences in the CFI (ΔCFI; Chen, 2007; Cheung & Rensvold, 2002; Meade et al., 2008).2 Convergent and discriminant validity was assessed by computing correlations of both NFC sum scores and factor scores with sum scores of external scales. Because we regarded the general Think factor as the indicator of an individual's overall level of NFC, we did not compute correlations with the two specific factors Seek and Conquer. To correct for lack of reliability in the external criteria and make correlations more comparable across external


constructs, we also show correlations that were disattenuated based on the external criteria's internal consistency. Furthermore, convergent validity with the NFC scale by Preckel (2014) was assessed by jointly fitting the measurement models of the two scales and estimating the latent correlation of the general NFC factors in Mplus. Finally, we investigated incremental validity by regressing measures of academic achievement on academic self-concept and interest, and then added NFC as a predictor to obtain the amount of additional explained variance. We then compared models including and excluding NFC using ANOVA to assess the statistical significance of the result.
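The multigroup comparison can be sketched in lavaan as follows (the original analyses were run in Mplus). The sketch reuses the hypothetical nf_model and data frame nfc from the Data Analysis sketch, assumes a grouping variable grade, and uses MLR with items treated as continuous, as in the FI sample; with WLSMV and ordered indicators, thresholds rather than intercepts would be constrained, as noted above.

library(lavaan)

configural <- cfa(nf_model, data = nfc, group = "grade", estimator = "MLR",
                  orthogonal = TRUE)                         # no equality constraints
metric     <- cfa(nf_model, data = nfc, group = "grade", estimator = "MLR",
                  orthogonal = TRUE, group.equal = "loadings")
scalar     <- cfa(nf_model, data = nfc, group = "grade", estimator = "MLR",
                  orthogonal = TRUE, group.equal = c("loadings", "intercepts"))

# Judge invariance by the change in CFI rather than the chi-square difference test
cfi <- sapply(list(configural = configural, metric = metric, scalar = scalar),
              fitMeasures, fit.measures = "cfi.robust")
diff(cfi)        # invariance retained if each drop stays below the .01 criterion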

Results

Item and Scale Statistics

Item statistics are shown in Tables B1 through B3 in the Electronic Supplementary Material. Table 3 shows the scale statistics. In order to make data resulting from the different answer formats comparable, raw item scores were recoded into POMP scores (percentage of maximum possible ranging from 0% to 100%; Cohen, Cohen, Aiken, & West, 1999). Internal consistency as measured by Cronbach's α was high to very high in all samples and grades.
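A small sketch of the POMP rescaling (Cohen et al., 1999) and of Cronbach's alpha (not the authors' code), assuming the raw item responses of one sample are stored in the data frame items_raw and were given on a 4-point scale:

pomp <- function(x, min_cat, max_cat) 100 * (x - min_cat) / (max_cat - min_cat)

items_pomp <- as.data.frame(lapply(items_raw, pomp, min_cat = 1, max_cat = 4))
scale_pomp <- rowMeans(items_pomp, na.rm = TRUE)    # person-level POMP scale score (0-100)

psych::alpha(items_raw)$total$raw_alpha             # internal consistency (Cronbach's alpha)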

Structural Analysis

The single-factor model (SF) exhibited an unacceptable fit in the FI and LU samples, while the fit in the GE sample

2 One recent publication (Sass, Schmitt, & Marsh, 2014) has cautioned against using the ΔCFI criterion in the case of WLSMV estimation, contradicting previous research (Elosua, 2011; Yu, 2002). We therefore report χ² difference tests as well.


Table 3. Scale statistics (POMP scores) for the NFC scale in the Finnish, German, and Luxembourgish samples

Sample            Grade   Mean    SD      Median   Alpha   Kurtosis   Skewness
Germany (GE)      G1      74.43   20.08   75.00    .85     0.72       0.79
                  G2      73.84   17.21   75.00    .80     0.02       0.47
                  G3      67.96   18.52   69.64    .83     0.22       0.45
                  G4      67.72   17.86   67.86    .84     0.42       0.52
Finland (FI)      G3      66.51   23.21   69.05    .94     0.15       0.59
                  G6      52.27   21.90   52.38    .95     0.33       0.18
                  G9      48.87   21.75   50.00    .95     0.14       0.22
Luxembourg (LU)   G7      60.57   20.03   59.52    .90     0.26       0.22
                  G9      51.54   20.75   52.38    .92     0.10       0.04

was acceptable when judged by the RMSEA, but not when judged by the CFI (with the exception of Grade 3; see Table 4). The correlated factor model (CF) showed an acceptable fit only in Grade 3 of the GE sample. The nested factor model (NF) yielded a good fit in the GE sample and an acceptable fit in the LU sample. In the FI sample, two residual correlations were set free (between items 1 and 4, and 2 and 14) to achieve acceptable fit. Explained common variance (ECV) in the NF models was high across all samples and grades, never falling below the recommended threshold of .60 for “practical” unidimensionality (Reise et al., 2013).
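The ECV can be computed from the standardized loadings of the nested factor model; a minimal sketch, assuming the lavaan object fit_nf from the Data Analysis sketch with the general factor labeled think:

std <- lavaan::standardizedSolution(fit_nf)
lam <- std[std$op == "=~", ]                        # all standardized factor loadings

ecv <- sum(lam$est.std[lam$lhs == "think"]^2) / sum(lam$est.std^2)
ecv                                                 # share of common variance due to the general factor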

Measurement Invariance

Grade Level

In the GE and LU samples (using WLSMV estimation), the more restrictive models with scalar invariance fit the data approximately as well as the configural invariance model (Table 5). The ΔCFI values were .003 and .001, respectively; the RMSEA even indicated a better fit for the more parsimonious scalar invariance models.3 In the FI sample (using MLR estimation), metric invariance could be established (ΔCFI = .004). In the case of scalar invariance, the critical ΔCFI value of .01 was exceeded slightly. We thus iteratively freed parameters until the ΔCFI fell below the critical value. This was the case after freeing three intercept parameters, each in only one grade level, leaving 19 of the 22 constrained parameters fixed in the partial scalar invariance model.

Sex

Measurement invariance regarding students' sex (Table 6) held in the GE sample as assessed by the χ² difference test, and in the FI sample applying the ΔCFI critical value of .01. In the LU sample, the approximate fit indices indicated a better fit for the more constrained scalar invariance model.4

3 The χ² difference test indicated a significantly worse fit of the scalar invariance models. Because a recent publication has cast doubt on using the ΔCFI criterion with the WLSMV estimator (Sass et al., 2014), we iteratively freed individual parameters guided by modification indices until the χ² difference test was no longer significant, leading to a partially invariant model. In the GE sample, this necessitated freeing 8 parameters in total (out of 99). In order to assess the practical significance of these changes, we estimated latent group means using both the full and the partial scalar invariance models, and found them to be unaffected (see Table C1 in the Electronic Supplementary Material). In the LU sample, 19 (out of 47) parameters had to be freed. Group means changed significantly in the partial scalar model as compared to the full scalar model.
4 The χ² difference test indicated a worse fit for the scalar invariance model. Seventeen out of 47 parameters had to be set free in order to achieve a partially invariant model yielding a nonsignificant χ² difference test.

The w2 difference test indicated a significantly worse fit of the scalar invariance models. Because a recent publication has cast doubt on using the ΔCFI criterion with the WLSMV estimator (Sass et al., 2014), we iteratively freed individual parameters guided by modification indices until the w2 difference test was no longer significant, leading to a partial invariant model. In the GE sample, this necessitated freeing 8 parameters in total (out of 99). In order to assess the practical significance of these changes, we estimated latent group means using both the full and the partial scalar invariance models, and found them to be unaffected (see Table C1 in the Electronic Supplementary Material). In the LU sample 19 (out of 47) parameters had to be freed. Group means changed significantly in the partial scalar model as compared to the full scalar model. The w2 difference test indicated a worse fit for the scalar invariance model. Seventeen out of 47 parameters had to be set free in order to achieve a partially invariant model yielding a nonsignificant w2 difference test.
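The decision logic behind the invariance comparisons reported in Tables 5 and 6 can be summarized in a few lines. The following is only an illustrative sketch of the ΔCFI rule (Cheung & Rensvold, 2002; Chen, 2007) with made-up fit values; it is not a reimplementation of the analyses reported here.

# Illustrative check of the ΔCFI criterion for nested invariance models
# (hypothetical CFI values; the study's actual estimates appear in Tables 5 and 6).
fits = {"configural": 0.975, "metric": 0.972, "scalar": 0.962}

def delta_cfi_ok(cfi_restricted: float, cfi_configural: float, cutoff: float = 0.01) -> bool:
    """Return True if the drop in CFI stays within the conventional cutoff."""
    return (cfi_configural - cfi_restricted) <= cutoff

for model in ("metric", "scalar"):
    verdict = "supported" if delta_cfi_ok(fits[model], fits["configural"]) else "rejected"
    print(f"{model} invariance: {verdict}")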


Table 4. Fit of three measurement models for the NFC data in the German, Finnish, and Luxembourgish samples

Model   Grade   χ²(df)           CFI     RMSEA (90% CI)     SRMR    ECV
Germany (GE)
SF      1       191.0 (77)*      .932    .058 (.048–.068)   –       –
SF      2       240.2 (77)*      .885    .069 (.059–.079)   –       –
SF      3       186.2 (77)*      .951    .059 (.048–.070)   –       –
SF      4       442.3 (77)*      .921    .071 (.065–.078)   –       –
CF      1       166.9 (74)*      .945    .053 (.043–.064)   –       –
CF      2       162.3 (74)*      .938    .052 (.041–.063)   –       –
CF      3       136.8 (74)*      .972    .046 (.034–.058)   –       –
CF      4       321.1 (74)*      .946    .060 (.053–.067)   –       –
NF      1       129.7 (66)*      .962    .047 (.035–.059)   –       .702
NF      2       103.3 (66)*      .974    .036 (.021–.048)   –       .612
NF      3       110.5 (66)*      .980    .041 (.027–.054)   –       .647
NF      4       165.0 (66)*      .979    .040 (.032–.048)   –       .705
Finland (FI)
SF      3       818.5 (77)*      .907    .076 (.071–.080)   .048    –
SF      6       1,408.6 (77)*    .874    .102 (.098–.107)   .057    –
SF      9       1,363.3 (77)*    .847    .115 (.110–.121)   .066    –
CF      3       486.9 (74)*      .948    .057 (.053–.062)   .039    –
CF      6       782.3 (74)*      .933    .076 (.071–.081)   .049    –
CF      9       817.0 (74)*      .912    .090 (.084–.095)   .060    –
NF0     3       330.7 (64)*      .967    .050 (.044–.055)   .027    .872
NF0     6       448.1 (64)*      .964    .060 (.055–.066)   .028    .833
NF0     9       439.0 (64)*      .955    .068 (.062–.075)   .033    .796
Luxembourg (LU)
SF      7       4,438.2 (77)*    .933    .115 (.112–.118)   –       –
SF      9       6,604.3 (77)*    .879    .134 (.131–.137)   –       –
CF      7       1,597.8 (74)*    .977    .070 (.067–.073)   –       –
CF      9       2,918.0 (74)*    .947    .090 (.088–.093)   –       –
NF      7       968.1 (66)*      .986    .057 (.054–.060)   –       .712
NF      9       1,441.4 (66)*    .974    .067 (.064–.070)   –       .740

Notes. SF = Single-factor model; CF = Correlated factor model; NF = Nested factor model; NF0 = Nested factor model with two residual correlations; ECV = Explained common variance. In the GE and LU samples, the SRMR could not be computed because WLSMV estimation was used. *p < .05.

19-item NFC scale with one NFC factor and two uncorrelated nested method factors (for positively and negatively worded items). The model yielded a good fit (χ² = 1,003, df = 456, p < .05, RMSEA = .052, CFI = .968, N = 443). The general factors of the two models, NFC and Think, were empirically virtually identical, with a latent correlation of r = .99.

Incremental Predictive Validity
We regressed measures of academic performance in Mathematics and a verbal school subject on academic self-concept and interest (the two variables that were available in all samples), and then added NFC to the model and tested for additional explained variance. As criterion variables in the regression models we used parent-reported school grades in Mathematics and German in the GE sample, which were available for Grades 3 and 4; school grades from students' report cards in Mathematics and Finnish in the FI sample (available for Grades 6 and 9); and standardized test scores in Mathematics and German reading comprehension in the LU sample. NFC explained statistically significant amounts of additional variance in the FI and LU samples (Table 8). In the GE sample, NFC only improved the prediction of German grades in Grade 4. The statistically significant increases in R² were small to moderate (range .2%–3.1%, median 1.6%).
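The F values reported in Table 8 correspond to the standard test for an increase in R² when a single predictor (here, NFC) is added to a regression model. The following generic sketch uses simulated data and the statsmodels library; the variable names and effect sizes are illustrative only and are not taken from the study's data.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
self_concept = rng.normal(size=n)
interest = rng.normal(size=n)
nfc = 0.5 * interest + rng.normal(size=n)                      # illustrative predictor
grade = 0.3 * self_concept + 0.2 * nfc + rng.normal(size=n)    # illustrative criterion

base = sm.OLS(grade, sm.add_constant(np.column_stack([self_concept, interest]))).fit()
full = sm.OLS(grade, sm.add_constant(np.column_stack([self_concept, interest, nfc]))).fit()

# F test for the R² increase with df = (1, n - k - 1), as reported in Table 8
delta_r2 = full.rsquared - base.rsquared
f_change = delta_r2 / ((1 - full.rsquared) / (n - full.df_model - 1))
print(round(delta_r2, 3), round(f_change, 1))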

Discussion
We examined the psychometric quality of a new NFC scale suitable for younger children and adolescents in three samples from Finland, Germany, and Luxembourg.



Table 5. Invariance test results of the nested factor measurement model with regard to school grades

                      Model results                                           Invariance (vs. configural)
Invariance model      χ²(df)            RMSEA (90% CI)     CFI     SRMR       Δχ²(df)        ΔCFI
Germany (GE)
Configural            494.6 (264)*      .040 (.034–.045)   .977    –          –              –
Scalar                620.9 (363)*      .036 (.031–.040)   .974    –          188.3 (99)*    .003
Partial scalar        550.8 (355)*      .031 (.026–.036)   .980    –          111.6 (91)     .003
Finland (FI)
Configural            1,224.4 (192)*    .059 (.056–.062)   .962    .029       –              –
Metric                1,371.2 (236)*    .056 (.053–.059)   .952    .037       130.3 (44)*    .004
Scalar                1,689.8 (258)*    .060 (.058–.063)   .947    .043       473.4 (66)*    .015
Partial scalar        1,540.1 (255)*    .057 (.055–.060)   .953    .040       302.0 (63)*    .009
Luxembourg (LU)
Configural            2,350.8 (132)*    .061 (.059–.063)   .981    –          –              –
Scalar                2,437.6 (179)*    .053 (.051–.055)   .980    –          499.5 (47)*    .001
Partial scalar        2,050.5 (160)*    .051 (.049–.053)   .984    –          41.1 (28)      .003

Notes. CI = Confidence interval. SRMR could not be computed in the German and Luxembourgish samples because WLSMV estimation was used. *p < .05.

Table 6. Invariance test results with regard to gender

                      Model results                                           Invariance (vs. configural)
Invariance model      χ²(df)            RMSEA (90% CI)     CFI     SRMR       Δχ²(df)        ΔCFI
Germany (GE)
Configural            408.4 (132)*      .043 (.039–.048)   .975    –          –              –
Scalar                400.8 (165)*      .036 (.031–.040)   .979    –          36.00 (33)     .004
Finland (FI)
Configural            1,072.4 (128)*    .057 (.054–.060)   .966    .026       –              –
Metric                1,158.1 (150)*    .054 (.051–.057)   .964    .030       76.11 (22)*    .002
Scalar                1,260.8 (161)*    .055 (.052–.058)   .960    .031       175.00 (33)*   .006
Luxembourg (LU)
Configural            2,674.5 (132)*    .066 (.064–.069)   .979    –          –              –
Scalar                2,449.1 (179)*    .054 (.052–.056)   .981    –          296.40 (47)*   .002
Partial scalar        2,353.4 (162)*    .056 (.054–.058)   .982    –          43.10 (30)     .001

Notes. CI = Confidence interval. SRMR could not be computed in the German and Luxembourgish samples because WLSMV estimation was used. *p < .05.

Up to now, for these age groups, no NFC scale with sufficient psychometric properties has been published. We investigated item and scale statistics as well as structural and criterion-related validity, and paid special attention to the investigation of measurement invariance across grade levels and sex. Results supported the hypothesized structural model. A nested factor model with one general factor Think influencing all manifest indicators and two specific factors

Seek and Conquer, each influencing a smaller number of items, exhibited the best fit across all samples and grade levels when compared with a single-factor model and a model with three correlated factors. In the nested factor model, interindividual differences in students' item-level responses are caused primarily by differences in the general factor Think. The specific factors Seek and Conquer reflect individual differences in approaching and mastering intellectual challenges, respectively, over and above



Table 7. Convergent (in bold print) and divergent validity: correlations of NFC sum and factor scores with selected other scale scores

[Condensed rendering; the original rotated table reports, per scale: Scale, r with NFC sum score, r with Think factor score, α, No. Items, n, Source, and Sample item.]

The external scales, administered in the German (GE; n = 1,942–2,151), Finnish (FI; n = 2,818*–4,213), and Luxembourgish (LU; n = 4,630+–8,954) samples, were academic self-concept, academic interest, effort, conscientiousness, social integration, attitudes towards classmates, and class climate, each measured with 3–6 items (α = .58–.88). Sample items: "I am good at school."; "I learn quickly in most school subjects."; "How good are you at the following school subjects? [averaged over five subjects]"; "I am interested in most school subjects."; "How interesting do you find the following school subjects? [averaged over five subjects]"; "I like to do mental arithmetic."; "I am diligent."; "I work hard to do well at school."; "I do my best at school."; "I usually get along very well with my classmates."; "I get along fine with my classmates."; "I feel good in my class." Sources: Baudson and Preckel (2011); Rauer and Schuck (2003, 2004); Little, Lopez, Oettingen, and Baltes (2001); Vainikainen (2014); Brunner et al. (2010); Marsh and O'Neill (1984); own developments, partly based on the PISA 2000 Luxembourgish questionnaire (MENFPS, 2001) and on Rauer and Schuck (2003). Correlations with the NFC sum score ranged from .24 to .62 (disattenuated: .26 to .76); correlations with the Think factor score ranged from .21 to .58 (disattenuated: .26 to .70); for class climate, both coefficients were .25 (.26).

Notes. r = Correlation coefficient; α = Cronbach's alpha. Correlations in parentheses were disattenuated to account for unreliability in the external criteria. *Items were not administered in grade 3. +Items were not administered in grade 7. All correlations were statistically significant (p < .001).
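The disattenuated coefficients in parentheses in Table 7 rest on the classical correction for attenuation. In its general form, an observed correlation r_{xy} between two measures with reliabilities r_{xx} and r_{yy} is corrected as

\hat{\rho}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}};

according to the table note, only the unreliability of the external criteria was corrected for here, so the denominator reduces to \sqrt{r_{yy}}, the criterion scale's Cronbach's α.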


Table 8. Incremental validity of NFC over and above academic self-concept and interest in predicting academic performance

Grade   Criterion      n       R² (w/o NFC)   R² (with NFC)   ΔR²    F (df = 1)
Germany (GE)
3       Mathematics    327     13.8           14.4            0.6    2.2
3       German         326     7.5            7.6             0.1    0.2
4       Mathematics    808     17.8           18.0            0.3    2.7
4       German         807     18.8           20.0            1.2    12.2*
Finland (FI)
6       Mathematics    1,533   4.3            6.6             2.3    38.1*
6       Finnish        1,492   5.6            7.4             1.8    29.3*
9       Mathematics    1,217   6.7            8.4             1.7    22.9*
9       Finnish        1,186   5.3            8.4             3.1    40.1*
Luxembourg (LU)
9       German         4,699   5.7            5.9             0.2    11.6
9       Mathematics    4,699   7.1            8.7             1.5    79.5

Notes. R² = Percentage of variance explained. *p < .05.

an individual's general Need for Cognition. Importantly, the high amount of explained common variance (ECV; Reise et al., 2013; Sijtsma, 2009) demonstrated that the general factor was strong. Considering also that correlations with external criteria were essentially the same for general factor scores and sum scores, we believe the instrument can be interpreted as a single scale in practical applications, even in the presence of limited multidimensionality (cf. Brouwer et al., 2013). We found tentative support for at least partial scalar measurement invariance with regard to grade level and sex. Only the scalar invariance test for grade level in the Finnish sample narrowly failed the ΔCFI criterion (Chen, 2007).5 Based on these results, it is reasonable to assume that the scale is suitable for research involving comparisons of latent means and of relationships between factors across groups, for which scalar invariance is a prerequisite (Sass, 2011), at least in the population the scale was originally developed for (German-speaking primary school children). At any rate, even if scalar measurement invariance had been conclusively established in all three samples, this would not absolve researchers using the scale from investigating invariance anew in their own samples and with regard to their variables of interest. In order to avoid the methodological problems associated with categorical variable analyses (see Footnote 3), it might be helpful to use a response format with five or more response categories, so that responses can be regarded as continuous rather than categorical (Rhemtulla et al., 2012).

Item statistics and internal consistencies were very good in all samples. Convergent validity with the 19-item NFC scale by Preckel (2014) could be clearly demonstrated for a Luxembourgish subsample of over 400 ninth-graders, with the general factors of the two measurement models correlating near-perfectly (r = .99). The pattern of correlations with conceptually related and unrelated measures mostly conformed to expectations, suggesting convergent and discriminant validity. In all samples, the highest correlations of NFC were found with interest, which is conceptually most closely related to NFC among the constructs examined (in fact, as an anonymous reviewer pointed out, it is also an investment trait). None of the correlations, however, was high enough to suggest that NFC is redundant. While we observed higher correlations with the conceptually distant constructs tapping social behavior than we anticipated, these can be tentatively explained using findings from previous studies linking NFC to social criteria. For instance, NFC has been found to be associated with moral courage (Kinnunen & Windmann, 2013), communicative skills (Volman, Noordzij, & Toni, 2012), and a reduced tendency for social loafing in cognitively demanding tasks (Smith, Kerr, Markus, & Stasson, 2001). Regarding the Big Five personality traits, Furnham and Thorne (2013) found correlations of r = .30 with Extraversion and r = .28 with Neuroticism. In the light of these findings, it does not seem surprising that students with a higher NFC should be better socially integrated and have a more favorable attitude toward their

5 While the ΔCFI criterion used to be recommended for both estimation methods used in this study (Elosua, 2011; Yu, 2002), one recent publication makes different recommendations (Sass et al., 2014). At the time of writing, this methodological issue is unresolved. We therefore adopted a pragmatic approach and reported supplementary evidence for partial invariance based on (ex post facto) modification indices to assess the degree of invariance (see Footnote 3), which we found to be small in the German sample and significant in the Luxembourgish sample.


classmates, and a more positive appreciation of the climate in their classroom. NFC scale scores were found to incrementally predict significant, small to moderate amounts of variance in academic performance over and above academic self-concept and interest, except in the German sample, where no significant increase in explained variance was found in Grade 3, and only a small increase was found for German school grades in Grade 4. The latter finding is not surprising in the light of investment theory (Ackerman, 1996, 2014): the link between investment traits such as NFC and cognitive performance is expected to become stronger with age as children with a high NFC encounter more complex problems and seek out more cognitively challenging situations. The overall small amounts of additional explained variance point to a need for more research on the role that investment traits play in young people's intellectual development. As a measure of NFC, the scale assesses "the core of investment" (von Stumm & Ackerman, 2013, p. 847), and as such we would certainly expect it to predict aspects of intellectual performance, but across-the-board cross-sectional prediction of academic performance is obviously not the proper criterion. Rather, we believe the scale to be useful in theoretically motivated longitudinal studies (e.g., Ziegler et al., 2012) and in applications with special student groups, such as gifted students (see Implications).


Implications
The publication of a valid and reliable measure of NFC suitable for young children starting from age 6 opens up avenues of research regarding this important construct that have so far been neglected. For instance, it is now possible to investigate the ideas sketched out by Cacioppo et al. (1984) regarding antecedents of NFC and beneficial as well as detrimental influences in educational contexts. Research into the possibility of fostering children's Need for Cognition appears desirable not only because of the correlations of NFC with diverse measures of cognitive achievement documented in the literature. Recent research seems to indicate that, in line with investment theory (Ackerman, 1996, 2014; von Stumm & Ackerman, 2013), NFC plays an important role in shaping young people's intellectual development over and above their cognitive abilities. For example, Meier, Vogl, and Preckel (2014) predicted attendance of gifted classes by 5th graders and found that, when controlling for cognitive abilities and school achievement, only NFC explained additional variance, whereas academic self-concept, interest in Mathematics, and goal orientations did not. This finding also points to practical applications of NFC, and of the scale presented here in particular, for instance in screening for gifted students who could benefit from special treatment in school (Gottfried & Gottfried, 2004).

Limitations
The three samples investigated were cross-sectional; longitudinal studies are needed in order to examine longitudinal measurement invariance, test-retest reliability, and change in NFC over time. However, since the cohorts of students in our samples were very close in age, it seems unlikely that cohort effects played a substantial role. In addition, the Luxembourgish data were collected in the context of a nationwide, longitudinal school monitoring program, in which NFC is being assessed in Grades 7 and 9 using the scale presented here, and in Grades 1 and 3 using a short version of the scale. Longitudinal data from this program will be used to replicate and extend the findings of the present study in future publications. The three samples investigated differed in many regards: nationality, native language, age of participants, administration language, answer format, and administration mode (computer-based vs. paper-and-pencil). Since these differences were not varied systematically, but resulted from circumstances and independent decisions in three separate studies, and were to a large degree confounded, their effects on the measurement of NFC could not be studied in isolation. However, the results still show that the factor structure of the Preckel/Strobel NFC scale is robust across a wide range of application scenarios.

Electronic Supplementary Material
The electronic supplementary material is available with the online version of the article at http://dx.doi.org/10.1027/1015-5759/a000370
ESM 1. Additional tables. The tables show NFC item wordings in German and Finnish, detailed item-level statistics, and latent group factor means for the nested factor model.

References Ackerman, P. L. (1996). A theory of adult intellectual development: Process, personality, interests, and knowledge. Intelligence, 22, 227–257. Ackerman, P. L. (2014). Adolescent and adult intellectual development. Current Directions in Psychological Science, 23, 246–251. doi: 10.1177/0963721414534960 Asparouhov, T., & Muthén, B. (2010). Weighted least squares estimation with missing data. MplusTechnical Appendix. Retrieved from http://www.statmodel.com/download/ GstrucMissingRevision.pdf Baudson, T. G., & Preckel, F. (2011). Validierung einer modifizierten Kurzfassung des Fragebogens zur Erfassung emotionaler und sozialer Schulerfahrungen für die Klassen 1 bis 3


[Validation of a modified short form of the Questionnaire for the Assessment of emotional and Social School Experiences for grades 1 to 3.]. Presented at the 11. Arbeitstagung der Fachgruppe Differentielle Psychologie, Persönlichkeitspsychologie und Psychologische Diagnostik (September 26 to 28, 2011), Saarbrücken. Baudson, T. G., & Preckel, F. (2013). Development and validation of the German Test for (Highly) Intelligent Kids – T(H)INK. European Journal of Psychological Assessment, 29, 171–181. doi: 10.1027/1015-5759/a000142 Bless, H., Waenke, M., Bohner, G., Fellhauer, R., & Schwarz, N. (1994). Need for Cognition: Eine Skala zur Erfassung von Engagement und Freude bei Denkaufgaben [Need for Coginition: A scale for the assessment of engagement and enjoyment during cognitive tasks]. Zeitschrift für Sozialpsychologie, 25, 147–154. Bors, D. A., Vigneau, F., & Lalande, F. (2006). Measuring the need for cognition: Item polarity, dimensionality, and the relation with ability. Personality and Individual Differences, 40, 819–828. doi: 10.1016/j.paid.2005.09.007 Brouwer, D., Meijer, R. R., & Zevalkink, J. (2013). On the factor structure of the Beck Depression Inventory–II: G is the key. Psychological Assessment, 25, 136–145. doi: 10.1037/ a0029228 Brunner, M., Keller, U., Dierendonck, C., Reichert, M., Ugen, S., Fischbach, A., & Martin, R. (2010). The structure of academic self-concepts revisited: The nested Marsh/Shavelson model. Journal of Educational Psychology, 102, 964–981. doi: 10.1037/ a0019644 Brunner, M., Nagy, G., & Wilhelm, O. (2012). A tutorial on hierarchically structured constructs: Hierarchically structured constructs. Journal of Personality, 80, 796–846. doi: 10.1111/ j.1467-6494.2011.00749.x Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116–131. Cacioppo, J. T., Petty, R. E., Feinstein, J. A., & Jarvis, W. B. G. (1996). Dispositional differences in cognitive motivation: The life and times of individuals varying in need for cognition. Psychological Bulletin, 119, 197–253. Cacioppo, J. T., Petty, R. E., & Kao, C. F. (1984). The efficient assessment of need for cognition. Journal of Personality Assessment, 48, 306–307. Cattell, R. B. (1987). Intelligence: Its structure, growth and action. New York, NY: Elsevier. Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling, 14, 464–504. doi: 10.1080/10705510701301834 Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-offit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233–255. Cohen, P., Cohen, J., Aiken, L. S., & West, S. G. (1999). The Problem of units and the circumstance for POMP. Multivariate Behavioral Research, 34, 315–346. doi: 10.1207/ S15327906MBR3403_2 Dickhäuser, O., & Reinhard, M.-A. (2010). How students build their performance expectancies: The importance of need for cognition. European Journal of Psychology of Education, 25, 399–409. doi: 10.1007/s10212-010-0027-4 Elosua, P. E. (2011). Assessing measurement equivalence in ordered-categorical data. Psicológica: Revista de Metodología Y Psicología Experimental, 32, 403–421. Feist, G. J. (2012). Predicting interest in and attitudes toward science from personality and need for cognition. Personality and Individual Differences, 52, 771–775. doi: 10.1016/ j.paid.2012.01.005

Fleischhauer, M., Enge, S., Brocke, B., Ullrich, J., Strobel, A., & Strobel, A. (2010). Same or different? Clarifying the relationship of need for cognition to personality and intelligence. Personality and Social Psychology Bulletin, 36, 82–96. doi: 10.1177/ 0146167209351886 Furnham, A., & Thorne, J. D. (2013). Need for cognition: Its dimensionality and personality and intelligence correlates. Journal of Individual Differences, 34, 230–240. doi: 10.1027/ 1614-0001/a000119 Ginet, A., & Py, J. (2000). Le besoin de cognition: une échelle française pour enfants et ses conséquences au plan sociocognitif [Need for Cognition: A French scale for children and its socio-cognitive effects]. L’année psychologique, 100, 585–627. doi: 10.3406/psy.2000.28665 Goff, M., & Ackerman, P. L. (1992). Personality-intelligence relations: assessment of typical intellectual engagement. Journal of Educational Psychology, 84, 537–552. Gottfried, A. E., & Gottfried, A. W. (2004). Toward the development of a conceptualization of gifted motivation. Gifted Child Quarterly, 48, 121–132. Hill, B. D., Foster, J. D., Elliott, E. M., Shelton, J. T., McCain, J., & Gouvier, W. D. (2013). Need for cognition is related to higher general intelligence, fluid intelligence, and crystallized intelligence, but not working memory. Journal of Research in Personality, 47, 22–25. doi: 10.1016/j.jrp.2012.11.001 Kinnunen, S. P., & Windmann, S. (2013). Dual-processing altruism. Frontiers in Psychology. doi: 10.3389/fpsyg.2013.00193. http:// blog.apastyle.org/apastyle/2015/05/how-to-cite-an-articlewith-an-article-number-instead-of-a-page-range.html Kokis, J. V. (2002). Individual differences in child’s reasoning. (Doctoral dissertation, University of Toronto). Retrieved from https://tspace.library.utoronto.ca/bitstream/1807/15218/1/ NQ63670.pdf Krkovic, K., Greiff, S., Kupiainen, S., Vainikainen, M.-P., & Hautamäki, J. (2014). Teacher evaluation of student ability: What roles do teacher gender, student gender, and their interaction play? Educational Research, 56, 244–257. doi: 10.1080/00131881.2014.898909 Little, T. D., Lopez, D. F., Oettingen, G., & Baltes, P. B. (2001). A comparative-longitudinal study of action-control beliefs and school performance: On the role of context. International Journal of Behavioral Development, 25, 237–245. doi: 10.1080/ 01650250042000258 Lubke, G. H., & Muthén, B. O. (2004). Applying multigroup confirmatory factor models for continuous outcomes to Likert scale data complicates meaningful group comparisons. Structural Equation Modeling, 11, 514–534. doi: 10.1207/ s15328007sem1104_2 Marsh, H. W., & O’Neill, R. (1984). Self Description Questionnaire III: The construct validity of multidimensional self-concept ratings by late adolescents. Journal of Educational Measurement, 21, 153–174. Martin, R., Ugen, S., & Fischbach, A. (Eds.). (2015). Épreuves Standardisées: Bildungsmonitoring für Luxemburg. Nationaler Bericht 2011 bis 2013 [Épreuves Standardisées: School monitoring for Luxembourg. National report 2011 to 2013]. Esch, Luxembourg: University of Luxembourg, LUCET. Retrieved from http://orbilu.uni.lu/handle/10993/21046 Meade, A. W., Johnson, E. C., & Braddy, P. W. (2008). Power and sensitivity of alternative fit indices in tests of measurement invariance. Journal of Applied Psychology, 93, 568–592. doi: 10.1037/0021-9010.93.3.568 Meier, E., Vogl, K., & Preckel, F. (2014). Motivational characteristics of students in gifted classes: The pivotal role of need for cognition. 
Learning and Individual Differences, 33, 39–46. doi: 10.1016/j.lindif.2014.04.006


Ministère de l’Éducation Nationale, de la Formation Professionnelle et des Sports (MENFRS). (2001). PISA 2000. Kompetenzen von Schülern im internationalen Vergleich: Nationaler Bericht Luxemburg [PISA 2000. Students’ competencies in international comparison: National report for Luxembourg]. Retrieved from http://www.script.men.lu/documentation/pdf/publi/pisa/ pisa-nat.pdf Mussel, P. (2010). Epistemic curiosity and related constructs: Lacking evidence of discriminant validity. Personality and Individual Differences, 49, 506–510. doi: 10.1016/j.paid. 2010.05.014 Mussel, P. (2013). Intellect: A theoretical framework for personality traits related to intellectual achievements. Journal of Personality and Social Psychology, 104, 885–906. doi: 10.1037/ a0031918 Muthén, L. K., & Muthén, B. O. (2012). Mplus User’s Guide (7th ed.) Los Angeles, CA: Muthén & Muthén. Retrieved from http:// www.statmodel.com/download/usersguide/Mplus%20user% 20guide%20Ver_7_r3_web.pdf Petty, R. E., Rucker, D. D., Bizer, G. Y., & Cacioppo, J. T. (2004). The elaboration likelihood model of persuasion. In J. S. Seiter & R. H. Gass (Eds.), Perspectives on persuasion, social influence, and compliance gaining (pp. 65–89). Boston, MA: Allyn and Bacon. Preckel, F. (2014). Assessing need for cognition in early adolescence: Validation of a German adaption of the Cacioppo/Petty scale. European Journal of Psychological Assessment, 30, 65–72. doi: 10.1027/1015-5759/a000170 Preckel, F., & Strobel, A. (2011). Grundschul-NFC: Eine Skala zur Erfassung von Need for Cognition bei Grundschulkindern [Elementary school NFC: A scale for the assessment of need for coginition in elementary school children]. Unpublished research instrument. Trier, Germany: University of Trier. Rauer, W., & Schuck, K.-D. (2003). Fragebogen zur Erfassung emotionaler und sozialer Schulerfahrungen von Grundschulkindern dritter und vierter Klassen–FEESS 3–4 [Questionnaire for the Assessment of emotional and social school experiences of third and fourth grade primary school children]. Göttingen, Germany: Hogrefe. Rauer, W., & Schuck, K.-D. (2004). Fragebogen zur Erfassung emotionaler und sozialer Schulerfahrungen von Grundschulkindern erster und zweiter Klassen–FEESS 1–2 [Questionnaire for the Assessment of emotional and social school experiences of first and second grade primary school children]. Göttingen, Germany: Hogrefe. Reise, S. P., Moore, T. M., & Haviland, M. G. (2010). Bifactor models and rotations: Exploring the extent to which multidimensional data yield univocal scale scores. Journal of Personality Assessment, 92, 544–559. doi: 10.1080/ 00223891.2010.496477 Reise, S. P., Scheines, R., Widaman, K. F., & Haviland, M. G. (2013). Multidimensionality and structural coefficient bias in structural equation modeling a bifactor perspective. Educational and Psychological Measurement, 73, 5–26. doi: 10.1177/0013164412449831 Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354–373. doi: 10.1037/a0029315 Richardson, M., Abraham, C., & Bond, R. (2012). Psychological correlates of university students’ academic performance: A systematic review and meta-analysis. Psychological Bulletin, 138, 353–387. doi: 10.1037/a0026838 Sass, D. A. (2011). Testing measurement invariance and comparing latent factor means within a confirmatory factor analysis


framework. Journal of Psychoeducational Assessment, 29, 347–363. doi: 10.1177/0734282911406661 Sass, D. A., Schmitt, T. A., & Marsh, H. W. (2014). Evaluating model fit with ordered categorical data within a measurement invariance framework: A comparison of estimators. Structural Equation Modeling, 21, 167–180. doi: 10.1080/10705511. 2014.882658 Schermelleh-Engel, K., Moosbrugger, H., & Müller, H. (2003). Evaluating the fit of structural equation models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research Online, 8, 23–74. Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107–120. Smith, B. N., Kerr, N. A., Markus, M. J., & Stasson, M. F. (2001). Individual differences in social loafing: Need for cognition as a motivator in collective performance. Group Dynamics: Theory, Research, & Practice, 5, 150–158. Tolentino, E., Curry, L., & Leak, G. (1990). Further validation of the short form of the Need for Cognition Scale. Psychological Reports, 66, 321–322. doi: 10.2466/PR0.66.1.321-322 Toplak, M. E., West, R. F., & Stanovich, K. E. (2014). Rational thinking and cognitive sophistication: Development, cognitive abilities, and thinking dispositions. Developmental Psychology, 50, 1037–1048. doi: 10.1037/a0034910 Vainikainen, M.-P. (2014). Finnish primary school pupils’ performance in learning to learn assessments: A longitudinal perspective on educational equity (Vol. 360). Helsinki, Finland: Picaset. Volman, I., Noordzij, M. L., & Toni, I. (2012). Sources of variability in human communicative skills. Frontiers in Human Neuroscience. doi: 10.3389/fnhum.2012.00310. http://blog. apastyle.org/apastyle/2015/05/how-to-cite-an-article-withan-article-number-instead-of-a-page-range.html von Stumm, S., & Ackerman, P. L. (2013). Investment and intellect: A review and meta-analysis. Psychological Bulletin, 139, 841–869. doi: 10.1037/a0030746 Woo, S. E., Harms, P. D., & Kuncel, N. R. (2007). Integrating personality and intelligence: Typical intellectual engagement and need for cognition. Personality and Individual Differences, 43, 1635–1639. doi: 10.1016/j.paid.2007.04.022 Wüstenberg, S., Stadler, M., Hautamäki, J., & Greiff, S. (2014). The Role of strategy knowledge for the application of strategies in complex problem solving tasks. Technology, Knowledge and Learning, 19, 127–146. doi: 10.1007/s10758-014-9222-8 Yu, C.-Y. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes. Los Angeles, CA University of California. Retrieved from http:// statmodel2.com/download/Yudissertation.pdf Ziegler, M., Danay, E., Heene, M., Asendorpf, J., & Bühner, M. (2012). Openness, fluid intelligence, and crystallized intelligence: Toward an integrative model. Journal of Research in Personality, 46, 173–183. doi: 10.1016/j.jrp.2012.01.002 Received August 6, 2015 Revision received February 12, 2016 Accepted February 16, 2016 Published online November 7, 2016 Ulrich Keller LUCET University of Luxembourg L-4365 Esch-sur-Alzette Luxembourg Tel. +352 466644 9278 E-mail ulrich.keller@uni.lu



Instructions to Authors The main purpose of the European Journal of Psychological Assessment is to present important articles, which provide seminal information on both theoretical and applied developments in this field. Articles reporting the construction of new measures or an advancement of an existing measure are given priority. The journal is directed to practitioners as well as to academicians: The conviction of its editors is that the discipline of psychological assessment should, necessarily and firmly, be attached to the roots of psychological science, while going deeply into all the consequences of its applied, practice-oriented development. Psychological assessment is experiencing a period of renewal and expansion, attracting more and more attention from both academic and applied psychology, as well as from political, corporate, and social organizations. The EJPA provides a meeting point for this movement, contributing to the scientific development of psychological assessment and to communication between professionals and researchers in Europe and worldwide. European Journal of Psychological Assessment publishes the following types of articles: Original Articles, Brief Reports, and Multistudy Reports. Manuscript submission: All manuscripts should in the first instance be submitted electronically at http://www.editorialmanager.com/ejpa. Detailed instructions to authors are provided at http://www.hogrefe.com/j/ejpa Copyright Agreement: By submitting an article, the author confirms and guarantees on behalf of him-/herself and any coauthors that the manuscript has not been submitted or published elsewhere, and that he or she holds all copyright in and titles to the submitted contribution, including any figures, photographs, line drawings, plans, maps, sketches, tables, and electronic supplementary material, and that the article and its contents do not infringe in any way on the rights of third parties. ESM will be published online as received from the author(s) without any conversion, testing, or reformatting. They will not be checked for typographical errors or functionality. The author indemnifies and holds harmless the publisher from any third-party claims. The author agrees, upon acceptance of the article for publication, to transfer to the publisher the exclusive right to reproduce and distribute the article and its contents, both physically and in nonphysical, electronic, or other form, in the journal to which it has


been submitted and in other independent publications, with no limitations on the number of copies or on the form or the extent of distribution. These rights are transferred for the duration of copyright as defined by international law. Furthermore, the author transfers to the publisher the following exclusive rights to the article and its contents: 1. The rights to produce advance copies, reprints, or offprints of the article, in full or in part, to undertake or allow translations into other languages, to distribute other forms or modified versions of the article, and to produce and distribute summaries or abstracts. 2. The rights to microfilm and microfiche editions or similar, to the use of the article and its contents in videotext, teletext, and similar systems, to recordings or reproduction using other media, digital or analog, including electronic, magnetic, and optical media, and in multimedia form, as well as for public broadcasting in radio, television, or other forms of broadcast. 3. The rights to store the article and its content in machinereadable or electronic form on all media (such as computer disks, compact disks, magnetic tape), to store the article and its contents in online databases belonging to the publisher or third parties for viewing or downloading by third parties, and to present or reproduce the article or its contents on visual display screens, monitors, and similar devices, either directly or via data transmission. 4. The rights to reproduce and distribute the article and its contents by all other means, including photomechanical and similar processes (such as photocopying or facsimile), and as part of so-called document delivery services. 5. The right to transfer any or all rights mentioned in this agreement, as well as rights retained by the relevant copyright clearing centers, including royalty rights to third parties.

Online Rights for Journal Articles: If you wish to post the article to your personal or institutional website or to archive it in an institutional or disciplinary repository, please use either a pre-print or a post-print of your manuscript in accordance with the publication release for your article and the document ‘‘Guidelines on sharing and use of articles in Hogrefe journals’’ on the journal’s web page at www.hogrefe.com/j/ejpa.

November 2016



EAPA

APPLICATION FORM EAPA membership includes a free subscription to the European Journal of Psychological Assessment. To apply for membership in the EAPA, please fill out this application form and return it together with your curriculum vitae to: David Gallardo-Pujol, PhD (EAPA Secretary General), Dept. of Clinical Psychology & Psychobiology, Campus Mundet, Pg. de la Vall d'Hebron, 171, 08035 Barcelona, Spain, E-mail david.gallardo@ub.edu.

Family name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . First name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Affiliation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . City

. . . . . . . . . . . . . . . .

Postcode . . . . . . . . . . . . . . . . . . . .

Country . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Phone

. . . . . . . . . . . . . . .

Fax . . . . . . . . . . . . . . . . . . . . . .

E-mail

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ANNUAL FEES ◆ EURO 75.00 (US $ 98.00) – Ordinary EAPA members ◆ EURO 50.00 (US $ 65.00) – PhD students ◆ EURO 10.00 (US $ 13.00) – Undergraduate student members

FORM OF PAYMENT ◆ Credit card VISA

Mastercard/Eurocard

IMPORTANT! 3-digit security code in signature field on reverse of card (VISA/Mastercard) or 4 digits on the front (AmEx)

American Express

Number Expiration date

/

CVV2/CVC2/CID#

Card holder’s name . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Signature . . . . . . . . . . . . . .

Date

. . . . . . . . . . . . . . . . . . . . .

◆ Cheque or postal order Send a cheque or postal order to the address given above Signature . . . . . . . . . . . . . .

Date

. . . . . . . . . . . . . . . . . . . . .


Rorschachiana Journal of the International Society for the Rorschach

Free online sample issue

Editor-in-Chief Lionel Chudzik Austin, TX, USA Advisory Editor Sadegh Nashat University of Geneva, Switzerland Associate Editors Hiroshi Kuroda, Ibaraki, Japan Justine McCarthy Woods, London, UK Gregory J. Meyer, Toledo, USA Fernando Silberstein, Rosario, Argentina

ISSN-Print 1192-5604 ISSN-Online 2151-206X ISSN-L 1192-5604 2 online issues and 1 print compendium per annum (= 1 volume)

Subscription rates (2019) Individuals US $132.00 / € 94.00 (print & online) Institutions From US $259.00 / € 221.00 (print only; pricing for online access can be found in the journals catalog at hgf.io/journals2019) Postage / Handling US $8.00 / € 6.00

www.hogrefe.com

About the Journal Rorschachiana is the scientific publication of the International Society for the Rorschach. Its aim is to publish scientific work in the field for (and by) an international audience. The journal is interested in advancing theory and clinical applications of the Rorschach and other projective techniques, and research work that can enhance and promote projective methods. Rorschachiana appears as a journal with 2 online issues per year and an annual print compendium. All papers published are subject to rigorous peer review to internationally accepted standards by external reviewers and the Society's Board of Assessors, working under the auspices of the experienced international editorial team.

Manuscript Submissions All manuscripts should be submitted online at www.editorialmanager.com/ror, where full instructions to authors are also available. Electronic Full Text The full text of the journal – current and past issues (from 1993 onward) – is available online at econtent.hogrefe.com/loi/ror. A free sample issue is also available there. Abstracting Services The journal is abstracted  /  indexed in PsycINFO and PSYNDEX, Scopus, EMCare, and Cinahl Information Systems.


The latest knowledge on how to tackle the complexities of hoarding disorder “If you wish to help those who suffer with the debilitating problem of hoarding, get this book and learn from these experienced scientist–practitioners.” Michael A. Tompkins, PhD, ABPP, Co-Director, San Francisco Bay Area Center for Cognitive Therapy; Assistant Clinical Professor, University of California at Berkeley

Gregory S. Chasson / Jedidiah Siev

Hoarding Disorder (Series: Advances in Psychotherapy – Evidence-Based Practice – Volume 40) 2019, viii + 76 pp. US $29.80 / € 24.95 ISBN 978-0-88937-407-2 Also available as eBook Hoarding disorder, classified as one of the obsessive-compulsive and related disorders in the DSM-5, presents particular challenges in therapeutic work, including treatment ambivalence and lack of insight of those affected. This evidence-based guide written by leading experts presents the latest knowledge on assessment and treatment of hoarding disorder. The reader gains a thorough grounding in the treatment of choice for hoarding – a specific form of CBT interweaved

www.hogrefe.com

with psychoeducational, motivational, and harm-reduction approaches to enhance treatment outcome. Rich anecdotes and clinical pearls illuminate the science, and the book also includes information for special client groups, such as older individuals and those who hoard animals. Printable handouts help busy practitioners. This book is essential reading for clinical psychologists, psychiatrists, psychotherapists, and practitioners who work with older populations, as well as students.


Psychological Assessment Science and Practice Editors Tuulia M. Ortner, PhD, Austria Itziar Alonso-Arbiol, PhD, Spain Anastasia Efklides, PhD, Greece Willibald Ruch, PhD, Switzerland Fons J.R. van de Vijver, PhD, The Netherlands

About the series
Each volume in the series Psychological Assessment – Science and Practice presents the state-of-the-art of assessment in a particular domain of psychology, with regard to theory, research, and practical applications. Editors and contributors are leading authorities in their respective fields. Each volume discusses, in a reader-friendly manner, critical issues and developments in assessment, as well as well-known and novel assessment tools. The series is an ideal educational resource for researchers, teachers, and students of assessment, as well as practitioners. Psychological Assessment – Science and Practice is edited with the support of the European Association of Psychological Assessment (EAPA).

Volume 1, 2015, vi + 234 pp. US $63.00 / € 44.95 ISBN 978-0-88937-437-9

www.hogrefe.com

Volume 2, 2016, vi + 346 pp. US $69.00 / € 49.95 ISBN 978-0-88937-452-2

Volume 3, 2016, vi + 336 pp. US $69.00 / € 49.95 ISBN 978-0-88937-449-2

