European Psychologist 2020


Volume 25 / Number 1 / 2020


European Psychologist Editor-in-Chief Peter Frensch Managing Editor Kristen Lavallee Associate Editors Alexandra Freund Katariina Salmela-Aro

Official Organ of the European Federation of Psychologists’ Associations (EFPA)

Assessing, diagnosing, and treating maltreated children – trauma-focused CBT “Written by an outstanding team of scholars, this will be a much-referenced book for students, practitioners, and researchers in a wide variety of settings.” Sherry Hamby, PhD, Editor of Psychology of Violence, Director of Life Paths Appalachian Research Center (LPARC), Founder and Co-Chair of ResilienceCon, Research Professor of Psychology at the University of the South, Monteagle, TN

Christine Wekerle / David A. Wolfe / Judith A. Cohen /  Daniel S. Bromberg / Laura Murray

Childhood Maltreatment (Series: Advances in Psychotherapy – Evidence-Based Practice - Volume 4) 2nd ed. 2019, viii + 100 pp. US $29.80 / € 24.95 ISBN 978-0-88937-418-8 Also available as eBook

The new edition of this popular, evidence-based guide compiles and reviews all the latest knowledge on assessment, diagnosis, and treatment of childhood maltreatment – including neglect and physical, sexual, psychological, or emotional abuse. Readers are led through this complex problem with clear descriptions of legal requirements for recognizing, reporting, and disclosing maltreatment as well as the best assessment and treatment methods. The focus is on the current gold standard approach – trauma-focused CBT. An appendix provides a sample workflow of a child protection case and a list of extensive resources, including webinars. This book is invaluable for those training or working as expert witnesses in childhood maltreatment and is also essential reading for child psychologists, child psychiatrists, forensic psychologists, pediatricians, family practitioners, social workers, public health nurses, and students.



Editor-in-Chief

Peter A. Frensch, Institute of Psychology, Humboldt-University of Berlin, Rudower Chaussee 18, 12489 Berlin, Germany, Tel. +49 30 2093 4922, Fax +49 30 2093 4910

Managing Editor

Kristen Lavallee

Founding Editor / Past Editor-in-Chief

Kurt Pawlik, Hamburg, Germany (Founding Editor) / Alexander Grob, Basel, Switzerland (Past Editor-in-Chief)

Associate Editors

Alexandra Freund, Institute of Psychology, University of Zurich, Binzmühlestrasse 14 / Box 26, 8050 Zurich, Switzerland, Tel. +41 44 635 7200
Katariina Salmela-Aro, University of Helsinki, P.O. Box 4, 00014 University of Helsinki, Finland, Tel. +358 50 415-5283

EFPA News and Views Editor

Eleni Karayianni, Department of Psychology, University of Cyprus, P.O. Box 20537, Nicosia, Cyprus, Tel. +357 2289 2022, Fax +357 2289 5075

Editorial Board

Louise Arseneault, UK Dermot Barnes-Holmes, Belgium Claudi Bockting, The Netherlands Gisela Böhm, Norway Mark G. Borg, Malta Serge Brédart, Belgium Catherine Bungener, France Torkil Clemmensen, Denmark Cesare Cornoldi, Italy István Czigler, Hungary Géry d’Ydewalle, Belgium Nicholas Emler, UK Iris Engelhard, The Netherlands Michael Eysenck, UK Rocio Fernandez-Ballesteros, Spain Magne Arve Flaten, Norway Marta Fulop, Hungary

Danute Gailiene, Lithuania John Gruzelier, UK Sami Gülgöz, Turkey Vera Hoorens, Belgium Paul Jimenez, Austria Remo Job, Italy Katja Kokko, Finland Anton Kühberger, Austria Todd Lubart, France Ingrid Lunt, UK Petr Macek, Czech Republic Mike Martin, Switzerland Lucia Mason, Italy Teresa McIntyre, USA Judi Mesman, The Netherlands Susana Padeliadu, Greece Ståle Pallesen, Norway

Georgia Panayiotou, Cyprus Sabina Pauen, Germany Marco Perugini, Italy Martin Pinquart, Germany José M. Prieto, Spain Jörg Rieskamp, Switzerland Sandro Rubichi, Italy Ingrid Schoon, UK Rainer Silbereisen, Germany Katya Stoycheva, Bulgaria Jan Strelau, Poland Tiia Tulviste, Estonia Jacques Vauclair, France Dieter Wolke, UK Rita Zukauskiene, Lithuania

The Editorial Board of the European Psychologist comprises scientists chosen by the Editor-in-Chief from recommendations sent by the member associations of EFPA and other related professional associations, as well as individual experts from particular fields. The associations contributing to the current editorial board are: Berufsverband Österreichischer Psychologen/innen; The Belgian Association for Psychological Sciences; Cyprus Psychologists’ Association; Unie Psychologickych Asociaci, Czech Republic; Dansk Psykologforening; Union of Estonian Psychologists; Finnish Psychological Association; Fédération Française des Psychologues et de Psychologie; Société Française de Psychologie; Berufsverband Deutscher Psychologinnen und Psychologen; Magyar Pszichológiai Társaság; Psychological Society of Ireland; Associazione Italiana di Psicologia; Lithuanian Psychological Association; Société Luxembourgeoise de Psychologie; Malta Chamber of Psychologists; Norsk Psykologforening; Österreichische Gesellschaft für Psychologie; Colegio Oficial de Psicologos; Swiss Psychological Society; Turkish Psychological Association; European Association for Research on Learning and Instruction; European Association of Experimental Social Psychology; European Association of Personality Psychology; European Association of Psychological Assessment; European Health Psychology Society.

Publisher

Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 0, Fax +49 551 999 50 425. North America: Hogrefe Publishing, 361 Newbury Street, 5th Floor, Boston, MA 02115, USA. Tel. (866) 823 4726, Fax (617) 354 6875


Regina Pinks-Freybott, Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 0, Fax +49 551 999 50 425


Hogrefe Publishing, Herbert-Quandt-Str. 4, 37081 Göttingen, Germany, Tel. +49 551 99950-956, Fax +49 551 99950-998


Marketing, Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany, Tel. +49 551 999 50 423, Fax +49 551 999 50 425


ISSN-L 1016-9040, ISSN-Print 1016-9040, ISSN-Online 1878-531X

Copyright Information

© 2020 Hogrefe Publishing. This journal as well as the individual contributions and illustrations contained within it are protected under international copyright law. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without prior written permission from the publisher. All rights, including translation rights, reserved.


Published in 4 issues per annual volume.

Subscription Prices

Calendar year subscriptions only. Rates for 2020: Institutions – from US $252.00/€194.00 (print only; pricing for online access can be found in the journals catalog); Individuals – US $125.00/€93.00 (print & online); Members of psychological organizations supporting EP US $68.00/€49.00 (all plus US $16.00/€12.00 postage & handling). Single copies – US $66.00/€52.00 (plus postage & handling).


Payment may be made by check, international money order, or credit card, to Hogrefe Publishing, Merkelstr. 3, 37085 Göttingen, Germany. US and Canadian subscriptions can also be ordered from Hogrefe Publishing, 361 Newbury Street, 5th Floor, Boston, MA, 02115, USA.

Electronic Full Text

The full text of European Psychologist is available online at

Abstracting Services

Abstracted/indexed in Current Contents/Social and Behavioral Sciences, Social Sciences Citation Index, ISI Alerting Services, Social SciSearch, PsycINFO, PSYNDEX, ERIH, and Scopus. 2018 Impact Factor 2.167, 5-year Impact Factor 4.704, Journal Citation Reports (Clarivate Analytics, 2019)

European Psychologist (2020), 25(1)


Contents

Original Articles and Reviews

Effectiveness of Programs for the Prevention of Child Sexual Abuse: A Comprehensive Review of Evaluation Studies
Amaia Del Campo and Marisalva Fávero

An Existential Threat Model of Conspiracy Theories
Jan-Willem van Prooijen

Religiosity’s Nomological Network and Temporal Change: Introducing an Extensive Country-Level Religiosity Index based on Gallup World Poll Data
Mohsen Joshanloo and Jochen E. Gebauer

Are Difficult-To-Study Populations too Difficult to Study in a Reliable Way? Lessons Learned From Meta-Analyses in Clinical Neuropsychology
Florian Lange

A Series of Meta-Analytic Tests of the Efficacy of Long-Term Psychoanalytic Psychotherapy
Christian Franz Josef Woll and Felix D. Schönbrodt

Correction to Nuñez & León, 2015

EFPA News and Views

Meeting Calendar



Original Articles and Reviews

Effectiveness of Programs for the Prevention of Child Sexual Abuse
A Comprehensive Review of Evaluation Studies

Amaia Del Campo1 and Marisalva Fávero2,3

1 Department of Evolutionary and Educational Psychology, University of Salamanca, Spain
2 Research Center for Justice and Governance, Law School, University of Minho, Braga, Portugal
3 Social and Behavioral Sciences Department, University Institute of Maia, Portugal


Abstract: During the last decades, several studies have been conducted on the effectiveness of sexual abuse prevention programs implemented in different countries. In this article, we present a review of 70 studies (1981–2017) evaluating prevention programs, conducted mostly in the United States and Canada, although with a considerable presence also in other countries, such as New Zealand and the United Kingdom. The results of these studies, in general, are very promising and encourage us to continue this type of intervention, almost unanimously confirming its effectiveness. Prevention programs encourage children and adolescents to report the abuse experienced and they may help to reduce the trauma of sexual abuse if there are victims among the participants. We also found that some evaluations have not considered the possible negative effects of this type of program in the event that it is applied inappropriately. Finally, we present some methodological considerations as a critical analysis of this type of evaluation.

Keywords: child sexual abuse, prevention, effectiveness, review

Studies conducted on the prevalence of child sexual abuse (CSA) have provided unsettling indicators of the true magnitude of this issue. Finkelhor (1994) estimated the international epidemiology of this type of violence by comparing 21 studies conducted in several countries (Germany, Australia, Austria, Belgium, Canada, Costa Rica, Denmark, Spain, United States, Finland, France, Greece, Great Britain, Ireland, The Netherlands, New Zealand, Norway, South Africa, Sweden, Switzerland, and the Dominican Republic), and found victimization rates of 7%–33% in women and 3%–19% in men, confirming the existence of CSA in high proportions and its status as an international problem.

As the existence of CSA began to be recognized and became the object of social concern, the first efforts to protect boys and girls from this risk were initiated, through prevention programs designed to teach children and teenagers to remain safe. In the USA, a pioneer country in the prevention of CSA, it was the feminist organizations that first reacted in times of economic prosperity, which allowed the development of social and professional interest in prevention (López, 1995). In 1971, the New York Radical Feminists sponsored a conference to discuss the feasibility of prevention services. At this conference, victims of rape and of CSA reported their experiences and offered their help for the development of preventive interventions, so that the first campaigns for the prevention of CSA began with strong support from victims. The revelations of these women and the wide coverage provided by the media sparked political and social concern about the scope of CSA and its possible effects. In all probability, because most complaints were filed by women and the problem was seen as equivalent to rape, the early funding of prevalence studies was provided by the National Center for the Prevention and Control of Rape. On the other hand, it was the collective of women known as Women Against Rape (WAR) who started to develop prevention techniques for children and adolescents, initially conducting programs that taught children and adolescents self-defense strategies (Berrick & Gilbert, 1991). The Child Assault Prevention (CAP) program, considered the first CSA prevention program specifically directed toward children, was created when a school requested a prevention program from the WAR association, after a child had been raped.

The theoretical model on which the first prevention programs were based was called empowerment and came from the radical feminist movement that explains CSA in the context of the imposition of male power. According to this theory, CSA occurs due to the asymmetry of knowledge and power between victim and aggressor. Most programs were based on the assumption that if the power of children and adolescents is increased, through receiving information on CSA and improving their sense of ownership and control over their bodies, they will be able to avoid their victimization. In one of the first reviews of preventive interventions, Tharinger et al. (1988), by analyzing the contents and theoretical orientation of 41 prevention programs, found that 61% were developed under the conceptual orientation of empowerment and nearly all the others, even if based on different theoretical models, included elements originating from the empowerment model, such as the concept of body ownership, developing assertiveness techniques, and so forth.

This conceptualization of CSA and the foundation of preventive action on the empowerment theory have been highly criticized by many authors (Berrick & Gilbert, 1991; Gilbert, Berrick, Le Prohn, & Nyman, 1989; Krivacska, 1990, 1992; López, 1995; Tharinger et al., 1988). The original conceptualization of this theory, developed to prevent the sexual assault of adult women and based on the criteria of freedom of choice, responsibility and competence in sexual relations, can hardly be applicable to the prevention of CSA. Firstly, when a child or adolescent is faced with an abusive situation, he or she lacks the capacity of choice when confronted with imposition from an adult. As pointed out by Glaser and Frosh (1997), children and adolescents, when faced with an abusive situation, cannot give informed consent. With regard to responsibility, the victim is never to blame for the abuse, nor for its possible consequences. Finally, one cannot demand skills from children and adolescents, as they normally do not receive adequate information about sexuality and sexual abuse.
Furthermore, it is necessary to consider the possible negative consequences for a child and/or adolescent who participated in this type of program, and who is assumed to have the skills to cope with the sexual abuse, if he or she cannot avoid the abusive situation (Krivacska, 1990, 1992).

Later on, the first studies conducted on CSA showed its specificities and effects, as well as its differences from the sexual assault of adult women. In this context, CSA began to be considered a form of child abuse, prompting associations for child protection, such as the National Center on Child Abuse and Neglect (NCCAN), to take the place of feminist associations and to begin funding research and preventive strategies.

The implementation of CSA prevention programs has shown spectacular growth, especially in North America. In the United States, a few years after the first preventive interventions in schools started, Plummer (1986) estimated that around a million children had already participated in several prevention programs. That same year, a national study conducted by the National Committee for the Prevention of Child Abuse revealed that over 25% of North American public schools performed some type of CSA prevention program on a regular basis (Daro, Duerr, & LeProhn, 1986). Later, a national study conducted by Finkelhor and Dziuba-Leatherman (1995), with a sample of 2,000 people interviewed by telephone, found that 67% of children and adolescents reported having participated in such programs at some point throughout their education and 49% of them reported having been exposed to this type of preventive intervention on at least two occasions. The social implementation of these programs in the United States has been such that some states even require this type of intervention to be integrated into the school curriculum. For example, in California, in 1984, the California General Assembly determined that all children and adolescents should participate in sexual victimization prevention programs on at least three occasions during the period of their schooling (O’Donohue, Geer, & Elliott, 1992).

However, in recent years, criticism has begun to emerge toward the widespread application of this type of intervention without evaluation, or with evaluation of questionable scientific rigor (Bolen, 2003; Finkelhor, 2007; Finkelhor, 2009; Walsh, Zwi, Woolfenden, & Shlonsky, 2018). This criticism led to the start of systematic evaluations of prevention programs, whose results have been of great use both to improve the existing programs and materials, and for the development of new preventive strategies. Some systematic reviews of these studies and some meta-analyses have been carried out (Davis & Gidycz, 2000; Topping & Baron, 2009; Walsh et al., 2018). Both the systematic review of 22 studies of program evaluation carried out by Topping and Baron (2009) and the results of the meta-analyses carried out by Davis and Gidycz (2000) with 27 studies and Walsh et al. (2018) with 24 studies showed in all cases that the children who participated in these programs improved their knowledge and skills of self-protection.
However, we still do not know whether these improvements in knowledge and skills bring about changes in the real life of students.

Some studies have also evaluated the effectiveness of sexual abuse prevention programs addressed to parents and educators. The few studies that have assessed the effects of preventive interventions with parents have, in general, obtained fairly optimistic results (Berrick, 1988; Binder & McNiel, 1987; Burgess & Wurtele, 1998; Kolko, Moser, Litz, & Hughes, 1987; MacIntyre & Carr, 1999; Porch & Petretic-Jackson, 1986; Treacy & Fisher, 1993; Wurtele, Moreno, & Kenny, 2008). All studies found that, after participating in prevention programs, parents significantly improved their knowledge of sexual abuse and their ability to deal with these issues with a child. Some studies have also confirmed that parents who have adequate training can be as effective as educators in the transmission of knowledge and skills about sexual abuse (Wurtele, Gillispie, Currier, & Franklin, 1992; Wurtele, Kast, & Melzer, 1992).

Similarly, studies conducted with teachers show the positive impact of prevention programs on their knowledge, attitudes, and behaviors (Allsopp & Prosen, 1988; Binder & McNiel, 1987; Kleemeier, Webb, Hazzard, & Pohl, 1988; Kolko et al., 1987; MacIntyre & Carr, 1999; Randolph & Gold, 1994; Rheingold, Zajac, & Patton, 2012). Teachers, in addition to acquiring knowledge and skills in these programs, improve their opinions in favor of prevention and learn to transmit information about sexual abuse to their students. In addition, they significantly improve their ability to identify possible indicators of sexual abuse and to suggest appropriate interventions in these cases.

Without underestimating the importance of programs aimed at parents, educators, and other professionals, which we have already defended in other publications (Del Campo & López, 2006), in this study we will focus on prevention programs aimed at minors, which have been proven to be the most effective in preventing this risk, especially when parents and teachers are involved.

This article presents the most extensive review to date, covering 70 evaluation studies of sexual abuse prevention programs aimed at minors, from 1981, the year in which the first study was published, until 2017. Through this review, we intend to evaluate the real effectiveness of prevention programs and analyze the evolution of these studies. More specifically, we intend to analyze whether, in addition to the improvements in knowledge and skills found in other reviews, changes have also taken place in the real life of minors.


Method

Search Strategy

A systematic literature search was conducted to identify journal articles focused on CSA prevention programs. To identify studies for this review, we first conducted searches of the following online databases: ACADEMIC SEARCH, PsycINFO, PsycARTICLES, PubMed, ERIC, PROQUEST, Sociological Abstracts, MEDLINE, Dissertation Abstracts International, and GoogleScholar. The search terms, applied to abstracts, were: child sexual abuse, prevention, program, effectiveness, efficacy, and evaluation. Second, we manually reviewed issues of relevant peer-reviewed journals (i.e., Aggression and Violent Behavior, Child Abuse and Neglect, Child Welfare, Education and Treatment of Children, Journal of Adolescent Health, Journal of Child Sexual Abuse, Journal of Interpersonal Violence, Psychological Reports and Psychology of Violence, Sexual Abuse: Journal of Research and Treatment).

Study Selection Criteria

The inclusion criteria were: (a) type of publication: academic journals; (b) publication dates from 1980 to the present; (c) studies examining the effectiveness of primary prevention strategies for child sexual abuse; (d) studies evaluating a school-based prevention program; and (e) studies with children or adolescents (3–18 years). Literature reviews of such articles were included as well.

The exclusion criteria were: (a) studies on other types of abuse (physical, emotional, neglect, etc.); (b) studies on prevention programs aimed at parents, educators, or other professionals working with children; and (c) studies with children with disabilities or learning difficulties.

Results of the Search and Coding of Data

In the review, a total of 2,377 publications were found: 2,330 from the electronic search and 47 articles from the manual search in specific journals. Of these, 344 were duplicate publications and 1,963 were excluded because they did not meet the criteria. Finally, 70 studies (see Appendix) met all the inclusion criteria. The main characteristics of the studies were coded according to the following criteria: authors and year, type of sample, number of subjects, program applied, design used, evaluation carried out, and results obtained.

Results

This section presents the main results obtained in the program evaluation studies. An increase in knowledge about sexual abuse, the acquisition of coping skills, changes taking place in real life, and the possible negative effects of prevention programs were all taken into account. Finally, some of the methodological limitations found in the studies are pointed out, as they must be considered in the interpretation of the results and in future evaluation studies of sexual abuse prevention programs. The Appendix summarizes some basic characteristics of the body of research reviewed that evaluated the effectiveness of the prevention programs applied.
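As a quick consistency check on the screening counts reported under Results of the Search and Coding of Data, the arithmetic can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `screening_flow` and the variable names are hypothetical and are not part of the original review; the four input counts are the ones reported in the text.

```python
# Counts as reported in "Results of the Search and Coding of Data".
electronic_hits = 2330   # records from the electronic database search
manual_hits = 47         # records from the manual journal search
duplicates = 344         # duplicate publications
excluded = 1963          # records that did not meet the selection criteria


def screening_flow(electronic, manual, duplicates, excluded):
    """Return (identified, after_deduplication, included) for a simple screening flow.

    Hypothetical helper for illustration; it only adds and subtracts the counts.
    """
    identified = electronic + manual
    after_deduplication = identified - duplicates
    included = after_deduplication - excluded
    return identified, after_deduplication, included


identified, unique_records, included = screening_flow(
    electronic_hits, manual_hits, duplicates, excluded
)
print(identified, unique_records, included)  # 2377 2033 70
```

Under these assumptions the reported figures are internally consistent: 2,330 + 47 = 2,377 records identified, 2,033 remaining after removing duplicates, and 70 studies included after exclusions.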

Increasing Knowledge

Evaluation studies of these programs have normally used questionnaires as instruments to measure the increase in knowledge produced in the children and adolescents after the implementation of the program. The recent interest in the study of this issue and the lack of standardized measurement instruments basically led to the development of a specific questionnaire for each study, especially during the first years of research. With younger children, open questions are usually employed, such as “What is sexual abuse,” “What would you do if someone touched you inappropriately,” and so forth, whereas, with children and adolescents attending primary or secondary education, multiple-choice questions (e.g., Questionnaire for older children: Binder & McNiel, 1987) or tests with dichotomous responses (e.g., Personal Safety Questionnaire [PSQ]: Saslawsky & Wurtele, 1986) are used. This heterogeneity of instruments, together with the diversity of programs, severely hampers the comparison of results between different studies (Finkelhor, 2009).

Most studies reviewed, which used a pretest/posttest evaluation design, showed that children and adolescents displayed a significant increase in knowledge after participating in the programs. Likewise, most studies using control groups suggest that children and adolescents who participate in these programs display significantly better knowledge compared to groups in which intervention did not take place. We have found only two exceptions among the studies reviewed, which presented unfavorable results (Leake, 1986a; Miltenberger & Thiesse-Duffy, 1988).

One of the main current debates focuses on whether preschool students can indeed learn the concepts taught in the prevention programs. Although some early studies with preschool-aged children suggested they were unable to learn concepts related to CSA (Borkin & Frank, 1986; Gilbert et al., 1989), further research conducted with greater methodological rigor showed that children of preschool age significantly improved their knowledge after preventive interventions, especially when behavioral techniques were used, such as modeling, repetition, and reinforcement (Brown, 2017; Harvey, Forehand, Brown, & Holmes, 1988; Hill & Jason, 1987; Kraizer, Witte, & Fryer, 1989; Nemerofsky, Carran, & Rosenberg, 1994; Prange & Atkinson, 1988; Tremblay, 1998; Tutty, 1997; Wurtele, 1990; Wurtele, Gillispie, et al., 1992; Wurtele, Kast, et al., 1989; Wurtele & Owens, 1997).

The first studies that examined the effect of the gender variable in the acquisition of knowledge showed contradictory results. While Sigurdson, Strang, and Doig (1987) and Hazzard, Kleemeier, and Webb (1990) found that girls learned more, the results of Garbarino (1987) and Ostbloom, Richardson, and Galey (1987) showed higher scores in boys. However, subsequent studies found no significant differences in the improvement of knowledge according to the gender of the participants (Briggs & Hawkins, 1994; Del Campo & López, 2006; MacIntyre & Carr, 1999; Peraino, 1990; Taal & Edelaar, 1997; Tutty, 1997; Wurtele & Owens, 1997).

Concerning the maintenance of knowledge after the end of the program, early studies did not include long-term evaluation, whereas more recent studies emphasize the importance of evaluating knowledge retention at several points in time (Barth & Derezotes, 1990; Berrick & Gilbert, 1991; Briggs & Hawkins, 1994; Hazzard, Webb, Kleemeier, Angert, & Pohl, 1991; MacIntyre & Carr, 1999; Oldfield, Hays, & Megel, 1996; Peraino, 1990; Ratto & Bogat, 1990; Taal & Edelaar, 1997; Tutty, 1992, 2000; Wurtele, 1990; Wurtele, Gillispie, et al., 1992; Wurtele, Kast, et al., 1992). The results of most studies reveal that the knowledge acquired by children and adolescents is retained for several weeks, months, or even a year after the implementation of the program. However, some researchers found that, in follow-up studies, even if the overall increase in acquired knowledge is maintained, there is some loss, especially in younger children (Borkin & Frank, 1986; Miltenberger & Thiesse-Duffy, 1988; Ostbloom et al., 1987; Plummer, 1984; Tremblay, 1998).

European Psychologist (2020), 25(1), 1–15

Acquisition of Skills

Some studies have also sought to evaluate the success of certain preventive interventions in training skills to avoid and cope with sexual abuse. The measurement instruments most commonly used in the evaluation of skill acquisition and maintenance, similarly to studies on knowledge, are questionnaires (e.g., Children’s Safety and Skills Questionnaire; MacIntyre & Carr, 1999; The Children’s Knowledge of Abuse Questionnaire-Revised [CKAQ-R]; Tutty, 1997; What I Know About Touching; Hazzard et al., 1990). Another method to evaluate the acquired skills is the structured interview, conducted after the narration of one or several stories of sexual abuse (e.g., “What-If” Situations Test [WIST]: Saslawsky & Wurtele, 1986), after the representation of potentially abusive situations with dolls (Downer, 1984), or after viewing videos that show images of CSA (e.g., What Would You Do?: Hazzard et al., 1991). A third approach to measuring the acquired skills is the use of role-playing techniques, through which it is possible to simulate situations of appropriate and inappropriate contact and ask the children to explain, through role playing, what would be the correct way to proceed (Harbeck, Peterson, & Starr, 1992; Leake, 1986b; Lutter & Weisman, 1985; Ostbloom et al., 1987; Plummer, 1984; Ratto & Bogat, 1990; Wurtele & Miller-Perrin, 1987). This type of evaluation requires children to apply the acquired knowledge in solving new problems, without limiting them, as in previous cases, to “reciting” the procedure rules learned. However, unfortunately, it is not possible to determine to what extent the responses children present or simulate in the hypothetical situations of CSA relate to the ways to proceed in real life.

Finally, trying to approach, as much as possible, the evaluation of the skills acquired and the adeptness in their use in real life, some authors have opted for evaluating the reactions of children in pseudo-real situations in which one member of the research team, unknown to the participants, approaches each one of them after the application of the program (Fryer, Kraizer, & Miyoshi, 1987; Miltenberger & Thiesse-Duffy, 1988). Some authors have raised relevant criticism of these evaluation methods. For example, Conte (1987) points out that, with this kind of procedure, one runs the risk of desensitizing children to real situations. In addition, these evaluations only assess the children’s responses to demands from strangers, omitting a fundamental part of assessing the effectiveness of prevention programs: the evaluation in abusive situations committed by acquaintances or family members.

Practically all studies conducted confirm that after children and adolescents participate in the programs, there are important improvements in their skills of self-protection from sexual abuse.

Changes in Real Life

The complexity of studying the effects of these programs on the real lives of children has meant that few studies have assessed this issue (Finkelhor, 2009; Scholes, Jones, Stieler-Hunt, Rolfe, & Pozzebon, 2012). There is research that reports important percentages of children and adolescents confirming that the program helped them greatly in real life, although, unfortunately, most follow-up studies do not specify the number of successes and failures of children after the intervention. Some of the most significant data refer to the increase in revelations after preventive interventions (Del Campo & López, 2006; Finkelhor, 2009). These results, which could initially be interpreted as a failure of the programs to reduce the incidence of sexual abuse cases, on the contrary, reveal their effectiveness in achieving one of their priority goals: increasing the likelihood that children who participate in them will disclose abusive situations. For example, one of the first studies that provided evidence of this effect (Beland, 1986) found that schools that applied prevention programs had, 1 year later, doubled the percentage of sexual abuse revelations, compared to schools that had not applied any program.

The utility of these programs for other aspects of the real lives of children and adolescents was evaluated by a very ambitious study conducted by Finkelhor and Dziuba-Leatherman (1995). These authors conducted telephone interviews with a sample of 2,000 boys and girls between the ages of 10 and 16 years, two thirds of whom reported having participated in a program for the prevention of sexual abuse, 49% of them on two or more occasions. Concerning the practical application of the contents covered by the programs, 40% of the participants stated that the knowledge and skills they learned had helped them in specific real-life situations and 25% revealed they had used the information from the programs to help a friend.

© 2019 Hogrefe Publishing

Costs of Prevention Programs

Studies exploring the possible negative effects derived from the application of this type of program are normally based on changes in behavior observed by parents and educators (e.g., sleeping problems, eating problems, aversion to physical contact), as well as the emergence of negative emotional responses, such as fears and anxieties. Overall, studies confirm that the negative emotional and behavioral consequences produced by preventive interventions are minimal (Binder & McNiel, 1987; Conte, Rosen, & Saperstein, 1985; Del Campo & López, 2006; Downer, 1984; Hazzard et al., 1991; MacIntyre & Carr, 1999; Miltenberger & Thiesse-Duffy, 1988; Ostbloom et al., 1987; Ratto & Bogat, 1990; Tutty, 1997; Wolfe, McPherson, Blount, & Wolfe, 1986; Wurtele, 1990; Wurtele & Miller-Perrin, 1987; Wurtele et al., 1989; Wurtele & Owens, 1997). Nonetheless, some studies point to the need to consider the children and adolescents who, though few in number, manifest certain negative reactions after the intervention. For example, Pohl and Hazzard (1990), through reports from parents and educators, found that just under 5% of parents observed some behavioral problems in their children, such as sleeping problems or fear of men. In turn, teachers observed that, although negative reactions were generally scarce among students, the program seemed to bring back bad memories in some children. It is also important to take into account that some prevention programs might have negative effects on the relationship between children and adults. On the one hand, the skills acquired to reject abusive situations may be generalized to inappropriate contexts. In this regard, reports following some interventions reveal the difficulty some children have in distinguishing between obedience to their parents and educators and the right to refuse unreasonable requests from them or from other acquaintances or strangers (O’Donohue & Geer, 1992). On the other hand, several authors warn of the risk that some children may generalize the rules they learned for dealing with possible abusive contact to all types of physical contact, which can lead to manifestations of fear or disgust toward affectionate contact from family members or adult friends (Daro, 1991; Hazzard et al., 1990; Finkelhor & Strapko, 1992; Krivacska, 1992; López, 1995).

European Psychologist (2020), 25(1), 1–15


A. Del Campo & M. Fávero, Effectiveness of Prevention Programs

Limitations of the Studies

We would like to point out some of the recurring methodological problems we have found in the various studies reviewed and that, in our view, should be considered when interpreting the results. For example, of the 70 studies analyzed, fewer than half used designs that included control groups (Barron & Topping, 2013; Blumberg, Chadwick, Fogarty, Speth, & Chadwick, 1991; Çeçen-Eroğul & Kaf Hasirci, 2013; Conte et al., 1985; Del Campo & López, 2006; Downer, 1984; Fryer et al., 1987; Harvey et al., 1988; Hazzard et al., 1991; Kolko et al., 1987; Kraizer et al., 1989; MacIntyre & Carr, 1999; Nemerofsky et al., 1994; Nibert, Cooper, Fitch, & Ford, 1988; Oldfield et al., 1996; Poche, Yoder, & Miltenberger, 1988; Prange & Atkinson, 1988; Pulido et al., 2015; Ratto & Bogat, 1990; Saslawsky & Wurtele, 1986; Swan, Press, & Brigs, 1985; Taal & Edelaar, 1997; Tremblay, 1998; Tutty, 1992, 1997; Wolfe et al., 1986; Woods & Dean, 1986; Wurtele, 1990; Wurtele, Saslawsky, Miller, Marrs, & Britcher, 1986; Wurtele, Currier, Gillispie, & Franklin, 1991; Wurtele, Gillispie, et al., 1992; Wurtele & Owens, 1997). We considered the use of control groups to be especially important, since some studies have demonstrated that pretest evaluation sensitizes children and adolescents, who show improved knowledge about CSA at posttest even in the absence of intervention (Hazzard et al., 1991; MacIntyre & Carr, 1999; Tutty, 1992; Wurtele & Owens, 1997). Therefore, given the omission of comparison control groups in most studies, it is possible to conclude that few studies have been able to obtain definitive results verifying that the prevention programs are responsible for the change in knowledge and skills of the participants.

Other studies lack an evaluation prior to the prevention programs, preventing the establishment of a baseline of knowledge that would allow the extent of the intervention’s effectiveness to be calculated (Beland, 1986; Borkin & Frank, 1986; Garbarino, 1987; Leake, 1986a, 1986b; Liddell, Young, & Yamagishi, 1988; Nelson, 1985; Oldfield et al., 1996; Poche et al., 1988; Pohl & Hazzard, 1990; Wolfe et al., 1986; Wurtele et al., 1986). Many of the instruments used can be considered excessively brief, as several studies based their analyses on results obtained from questionnaires consisting of 15 items or fewer (Beland, 1986; Berrick & Gilbert, 1991; Briggs & Hawkins, 1994; Christian, Dwyer, Schumm, & Coulson, 1988; Garbarino, 1987; Harvey et al., 1988; Kolko et al., 1987; Kraizer et al., 1989; Liddell et al., 1988; Miltenberger & Thiesse-Duffy, 1988; Ostbloom et al., 1987; Poche, Brouwer, & Swearingen, 1981; Prange & Atkinson, 1988; Ray & Dietzel, 1984; Wolfe et al., 1986; Woods & Dean, 1986). Taking into account the variety and complexity of the concepts taught in the prevention programs, as pointed out by Tutty (1992), it is questionable whether such brief evaluation measures can adequately assess whether the children understood the ideas that were conveyed. For example, in a study by Saslawsky and Wurtele (1986), the difference found between the experimental group and the control group, although significant, was only 2 points on a 13-point scale. Similarly, Wolfe et al. (1986) found that the significant improvement of the experimental group compared to the control group was no more than one point on a seven-point scale. Another important limitation concerns the fact that only eight of the questionnaires used by the different studies had their psychometric properties analyzed (PSQ and WIST: Saslawsky & Wurtele, 1986; PSQ: Sigurdson et al., 1987; Children’s Knowledge Questionnaire: Binder & McNiel, 1987; Knowledge and Opinions Questionnaire: Kolko et al., 1987; CKAQ: Tutty, 1992; Touch Questionnaire: Taal & Edelaar, 1997; Children’s Knowledge Questionnaire: Del Campo & López, 2006).

Moreover, except for 32 studies that had considerable samples (Barron & Topping, 2013; Barth & Derezotes, 1990; Beland, 1986; Berrick & Gilbert, 1991; Blumberg et al., 1991; Brown, 2017; Daro et al., 1986; Del Campo & López, 2006; Finkelhor, Asdigian, & Dziuba-Leaterman, 1995; Garbarino, 1987; Hazzard et al., 1990, 1991; Kolko et al., 1987; Kraizer et al., 1989; Liddell et al., 1988; MacIntyre & Carr, 1999; Nelson, 1985; Nemerofsky et al., 1994; Oldfield et al., 1996; Ostbloom et al., 1987; Pohl & Hazzard, 1990; Pulido et al., 2015; Ray & Dietzel, 1984; Sigurdson et al., 1987; Spungen, Jensen, Finkelstein, & Satinsky, 1989; Taal & Edelaar, 1997; Tremblay, 1998; Tutty, 1992; Tutty, 1997; Wolfe et al., 1986; Woods & Dean, 1986; Wurtele & Owens, 1997), the size of the samples used is generally very small. For example, Harbeck et al. (1992) relied on a sample of 20 participants with a wide range of ages, from 4 to 16 years, which hardly allows generalization to groups with such disparate ages. The samples are especially small when studies are conducted with preschool children. For example, Poche et al. (1981) conducted a training of behavioral techniques with only three children. Hill and Jason (1987) included 23 children in their study, Ratto and Bogat (1990) included 39 school children, and Wurtele (1990) included 24 children and adolescents. In the study by Christian et al. (1988), the loss of sample size was such (initial N = 103, posttest N = 11) that it prevents the interpretation of the results.

Finally, we found that most studies either do not assess the long-term stability of knowledge, or they conducted follow-up evaluations very soon after the intervention. Only 31 of the 70 reviewed studies evaluated the long-term permanence of knowledge and skills acquired during the program. In addition, only 17 studies performed follow-up studies exceeding 2 months (Barth & Derezotes, 1990; Berrick & Gilbert, 1991; Briggs & Hawkins, 1994; Del Campo & López, 2006; Finkelhor et al., 1995; Fryer et al., 1987; Hazzard et al., 1991; MacIntyre & Carr, 1999; Oldfield et al., 1996; Plummer, 1984; Poche et al., 1981; Ratto & Bogat, 1990; Ray & Dietzel, 1984; Saslawsky & Wurtele, 1986; Tremblay, 1998; Tutty, 1992; Wurtele, Gillispie, et al., 1992).

Discussion

Despite the wide variety of existing programs (regarding the method of application and materials used) and the multitude of evaluation measures used in the studies, the results of almost all the studies suggest the effectiveness of this type of intervention in the conveyance of knowledge regarding sexual abuse. Similarly, research conducted on the negative effects of prevention programs concludes that training in prevention, including the unintended effects it may produce, is always more beneficial than the absence of intervention and that there is no cost significant enough to justify not recommending these programs (Del Campo & López, 2006). Nevertheless, we agree with Roberts and Miltenberger (1999) that future research should include the study of factors that may contribute to the development of negative effects in some children and adolescents. This information would serve to produce the required changes in the programs, in order to reduce or eliminate possible undesired consequences. We believe that one way to mitigate these effects is to first work on the positive aspects of affective bonds, physical contact, and sexuality. Although no long-term research has been conducted to accurately confirm that CSA prevention programs actually reduce the frequency of this risk, there are other reasons for which the application of these interventions is recommended. On the one hand, CSA prevention programs encourage children and adolescents to disclose the abuse and, on the other hand, they may help to reduce the trauma of sexual abuse when it occurs. It is known that one of the most traumatic effects, according to many victims, is the feeling of loneliness, since the victims are unaware that abuse also happens to other boys and girls.

A challenge for the future would be to conduct continuous, global, and community programs, by coordinating the various professionals involved in CSA, which would allow for the objective evaluation of the long-term effects of these interventions. Based on the results obtained in the review study, we present some suggestions for the implementation of future CSA prevention programs and for the development of new preventive interventions.

It is essential that interventions within the school setting be implemented at all levels, with children of all ages. The review of studies that evaluate these programs allows us to verify their effectiveness at all school levels, from preschool to high school. Similarly, concerns that preventive interventions may be inadequate for primary school children have been overcome by various studies confirming the positive impact of these programs on younger children (Nemerofsky et al., 1994; Peraino, 1990; Ratto & Bogat, 1990; Tremblay, 1998; Tutty, 1997; Wurtele & Owens, 1997). Secondly, the children in these programs should not only be regarded as possible victims, but also as potential perpetrators, especially in the case of older children. The vast majority of prevention programs reviewed focus on students as victims and rarely intervene with them as potential perpetrators in the present and/or future. Studies confirm that the effectiveness of prevention programs is not strongly associated with duration (Finkelhor & Dziuba-Leaterman, 1995). Therefore, longer interventions are not necessarily more effective, but it is important that the programs have a continuity that allows children to receive this type of information on several occasions, amplifying their level of knowledge over the years. We also highlight the need to include CSA prevention programs in broader interventions focusing on sexual education or health education. Firstly, given that risk behaviors are closely related, the promotion of general abilities and the development of skills serve to avoid different risks. Secondly, it should be noted that programs implemented at the margin of, or even in the absence of, sexual education may have a negative effect on children’s sexuality. In order to avoid this, children must first receive ample information about normal sexual behavior, feelings, affection, and so forth, which conveys a positive view of sexuality.

The role of parents in prevention programs is essential. Several studies confirm that the participation of parents and educators in these programs increases their effectiveness, as well as the likelihood of disclosure in the case of abuse (Rheingold et al., 2012; Wurtele et al., 2008). As a final remark, we may conclude that prevention in school settings is necessary, but insufficient. Child sexual abuse is a social problem that requires a community approach that minimizes individual, family, social, and cultural factors associated with the risk of sexual abuse.

References

Allsopp, A., & Prosen, S. (1988). Teacher reactions to a child sexual abuse training program. Elementary School Guidance and Counseling, 22, 299–305.

Barron, I., & Topping, K. (2013). Exploratory evaluation of a school-based child sexual abuse prevention program. Journal of Child Sexual Abuse, 22, 931–948. 10538712.2013.841788

Barth, R., & Derezotes, D. (1990). Preventing adolescent abuse: Effective intervention and techniques. Lexington, MA: Lexington Books.

Beland, K. (1986). Prevention of child sexual victimisation: A school-based statewide prevention model. Seattle, WA: Committee for Children.

Berrick, J., & Gilbert, N. (1991). With the best of intentions: The child sexual abuse prevention movement. New York, NY: Guilford Press.

Berrick, J. (1988). Parental involvement in child abuse prevention training: What do they learn? Child Abuse and Neglect, 12, 543–553.

Binder, R., & McNiel, D. (1987). Evaluation of a school-based sexual abuse prevention program: Cognitive and emotional effects. Child Abuse and Neglect, 11, 497–506. https://doi.org/10.1016/0145-2134(87)90075-5

Blumberg, E., Chadwick, M., Fogarty, L., Speth, T., & Chadwick, D. (1991). The touch discrimination component of sexual abuse prevention training: Unanticipated positive consequences. Journal of Interpersonal Violence, 6, 12–28. https://doi.org/10.1177/088626091006001002

Bolen, R. (2003). Child sexual abuse: Prevention or promotion. Social Work, 48, 174–185.

Borkin, J., & Frank, L. (1986). Sexual abuse prevention for preschoolers: A pilot program. Child Welfare, 65, 75–82.

Briggs, F., & Hawkins, M. (1994). Follow-up data on the effectiveness of New Zealand’s national school based child protection program. Child Abuse and Neglect, 18, 635–643. https://doi.org/10.1016/0145-2134(94)90013-2

Brown, D. (2017). Evaluation of Safer, Smarter Kids: Child sexual abuse prevention curriculum for kindergartners. Child and Adolescent Social Work Journal, 34, 213–222. https://doi.org/10.1007/s10560-016-0458-0

Burgess, E., & Wurtele, S. (1998). Enhancing parent–child communication about sexual abuse: A pilot study. Child Abuse & Neglect, 22, 1167–1175. (98)00094-5

Çeçen-Eroğul, A., & Kaf Hasirci, O. (2013). The effectiveness of psycho-educational school-based child sexual abuse prevention training program on Turkish elementary students. Educational Sciences: Theory and Practice, 13, 725–729.

Christian, R., Dwyer, S., Schumm, W., & Coulson, W. (1988). Prevention of sexual abuse for preschoolers: Evaluation of a pilot program. Psychological Reports, 62, 387–396. https://doi.org/10.2466/pr0.1988.62.2.387

Conte, J. (1987). Ethical issues in evaluation of prevention programs. Child Abuse and Neglect, 11, 171–172. https://doi.org/10.1016/0145-2134(87)90054-8

Conte, J., Rosen, C., & Saperstein, L. (1985). An analysis of programs to prevent the sexual victimization of children. Journal of Primary Prevention, 6, 141–155.

Conte, J. R., Rosen, C., Saperstein, L., & Shermack, R. (1985). An evaluation of a program to prevent the sexual victimization of young children. Child Abuse & Neglect, 9, 319–328. https://doi.org/10.1016/0145-2134(85)90027-4

Daro, D. (1991). Prevention programs. In C. Hollin & K. Howells (Eds.), Clinical approaches to sex offenders and their victims (pp. 285–306). New York, NY: Wiley.

Daro, D., Duerr, J., & LeProhn, N. (1986). Child assault prevention instruction: What works with preschoolers. Chicago, IL: National Committee for the Prevention of Child Abuse.

Davis, M., & Gidycz, C. (2000). Child sexual abuse prevention programs: A meta-analysis. Journal of Clinical Child Psychology, 29, 257–265.

Del Campo, A., & López, F. (2006). Evaluación de un programa de prevención de abusos sexuales a menores en educación primaria [Evaluation of school-based child sexual abuse prevention program]. Psicothema, 18, 1–8.

Downer, A. (1984). An evaluation of Talking About Touching: A personal safety curriculum. Seattle, WA: Committee for Children.

Finkelhor, D. (1994). The international epidemiology of child sexual abuse. Child Abuse and Neglect, 18, 409–417. 10.1016/j.chiabu.2008.07.007

Finkelhor, D. (2007). Prevention of sexual abuse through educational programs directed toward children. Pediatrics, 120, 640–645.

Finkelhor, D. (2009). The prevention of childhood sexual abuse. The Future of Children, 19, 169–194. https://doi.org/10.1353/foc.0.0035

Finkelhor, D., Asdigian, N., & Dziuba-Leaterman, J. (1995). Victimization prevention programs for children: A follow-up. American Journal of Public Health, 85, 1684–1689. https://doi.org/10.2105/AJPH.85.12.1684

Finkelhor, D., & Dziuba-Leaterman, J. (1995). Victimization prevention programs: A national survey of children’s exposure and reactions. Child Abuse and Neglect, 19, 129–139. https://doi.org/10.1016/0145-2134(94)00111-7

Finkelhor, D., & Strapko, N. (1992). Sexual abuse prevention education: A review of evaluation studies. In D. J. Willis, E. W. Holden, & M. Rosenberg (Eds.), Prevention of child maltreatment: Developmental perspectives (pp. 150–167). New York, NY: Wiley.

Fryer, G., Kraizer, S., & Miyoshi, T. (1987). Measuring actual reduction of risk to child abuse: A new approach. Child Abuse and Neglect, 11, 173–179.

Garbarino, J. (1987). Children’s response to a sexual abuse prevention program. Child Abuse and Neglect, 11, 143–148.

Gilbert, N., Berrick, J., Le Prohn, N., & Nyman, N. (1989). Protecting young children from sexual abuse: Does school training work? Lexington, MA: Lexington Books.

Glaser, D., & Frosh, S. (1997). Abuso sexual de niños [Child sexual abuse]. Buenos Aires, Argentina: Paidós.

Harbeck, C., Peterson, L., & Starr, L. (1992). Previously abused child victims’ response to a sexual abuse prevention program: A matter of measures. Behavior Therapy, 23, 375–387. https://

Harvey, P., Forehand, R., Brown, C., & Holmes, T. (1988). The prevention of sexual abuse: Examination of the effectiveness of a program with kindergarten-age children. Behavior Therapy, 19, 429–435.

Hazzard, A., Kleemeier, C., & Webb, C. (1990). Teacher versus expert presentations of sexual abuse prevention programs. Journal of Interpersonal Violence, 5, 23–36. https://doi.org/10.1177/088626090005001002

Hazzard, A., Webb, C., Kleemeier, C., Angert, L., & Pohl, J. (1991). Child sexual abuse prevention: Evaluation and one-year follow-up. Child Abuse and Neglect, 15, 123–138. https://doi.org/10.1016/0145-2134(91)90097-W

Hill, J., & Jason, L. (1987). An evaluation of school-based child sexual abuse primary prevention program. Psychotherapy Bulletin, 22, 36–38.

Jacobs, J., Hashima, Y., & Kenning, M. (1995). Children’s perceptions of the risk of sexual abuse. Child Abuse and Neglect, 19, 1443–1457.

Kenning, M., Gallmeier, T., Jackson, T., & Plemons, S. (1987, July). Evaluation of child sexual abuse preventing program: A summary of two studies. Paper presented at the National Conference on Family Violence, Durham, NH.

Kleemeier, C., Webb, C., Hazzard, A., & Pohl, J. (1988). Child sexual abuse prevention: Evaluation of a teacher training model. Child Abuse and Neglect, 12, 555–561.

Kolko, D. J., Moser, J., Litz, J., & Hughes, J. (1987). Promoting awareness and prevention of child sexual victimization using the Red Flag/Green Flag program: An evaluation with follow-up. Journal of Family Violence, 2, 11–35. BF00976368

Kraizer, S., Witte, S., & Fryer, G. (1989). Child sexual abuse prevention programs: What makes them effective in protecting children? Children Today, 18, 23–27.

Krivacska, J. (1990). Designing child sexual abuse prevention programs: Current approaches and a proposal for the prevention, reduction, and identification of sexual misuse. Springfield, IL: Thomas Publisher.

Krivacska, J. (1992). Child sexual abuse prevention programs: The prevention of childhood sexuality? Journal of Child Sexual Abuse, 1, 83–112.

Leake, H. (1986a). A study to compare the effectiveness of two primary prevention programs in teaching first grade students’ children to recognize and avoid child sexual abuse and assault. San Joaquin County, CA: Sexual Assault Center of San Joaquin County.

Leake, H. (1986b). A study to determine the effectiveness of the Child Assault Prevention program in teaching first grade students to recognize and avoid child sexual abuse and assault. San Joaquin County, CA: Sexual Assault Center of San Joaquin County.

Liddell, T., Young, B., & Yamagishi, M. (1988). Implementation and evaluation of a preschool sexual abuse prevention resource. Seattle, WA: Department of Human Resources.

López, F. (1995). Prevención de los abusos sexuales de menores y educación sexual [Child sexual abuse prevention and sex education]. Salamanca, Spain: Amarú.

Lutter, Y., & Weisman, A. (1985). Sexual victimization prevention project: Campfire girls. Final report to the National Institute of Mental Health.

MacIntyre, D., & Carr, A. (1999). Evaluation of the effectiveness of the Stay Safe Primary Prevention Program for child sexual abuse. Child Abuse and Neglect, 23, 1307–1325. https://doi.org/10.1016/S0145-2134(99)00092-7

Miltenberger, R., & Thiesse-Duffy, E. (1988). Evaluation of home-based programs for teaching personal safety skills to children. Journal of Applied Behavior Analysis, 21, 81–87. https://doi.org/10.1901/jaba.1988.21-81

Nelson, D. (1985). An evaluation of the student outcomes and instructional characteristics of the “You’re in Charge” Program. Salt Lake City, UT: Utah State Office of Education.

Nemerofsky, A., Carran, D., & Rosenberg, L. (1994). Age variation in performance among preschool children in a sexual prevention program. Journal of Child Sexual Abuse, 3, 85–99. https://doi.org/10.1300/J070v03n01_06

Nibert, D., Cooper, S., Fitch, L., & Ford, J. (1988). Parents’ observations of the effect of a sexual abuse prevention program on preschool children. Child Welfare, 68, 539–546.

O’Donohue, W., & Geer, J. (Eds.). (1992). The sexual abuse of children: Theory and research (Vols. 1–2). Hillsdale, NJ: Erlbaum.

O’Donohue, W. T., Geer, J. H., & Elliott, A. (1992). The primary prevention of child sexual abuse. In W. O’Donohue & J. Geer (Eds.), The sexual abuse of children (Vol. 2, Clinical issues, pp. 477–518). Hillsdale, NJ: Erlbaum.

Oldfield, D., Hays, B., & Megel, E. (1996). Evaluation of effectiveness of project trust: An elementary school-based victimization prevention strategy. Child Abuse and Neglect, 20, 821–832.

Ostbloom, N., Richardson, B., & Galey, M. (1987). Sexual abuse prevention projects. Des Moines, IA: National Committee for the Prevention of Child Abuse, Iowa Chapter.

Peraino, J. (1990). Evaluation of a preschool antivictimization prevention program. Journal of Interpersonal Violence, 5, 520–528.

Plummer, C. (1984). Preventing sexual abuse: What in-school programs teach children. Kalamazoo, MI: Prevention Training Associates.

Plummer, C. (1986). Prevention education in perspective. In M. Nelson & K. Clark (Eds.), The educator’s guide to preventing child sexual abuse (pp. 1–5). Santa Cruz, CA: Network.

Poche, C., Brouwer, R., & Swearingen, M. (1981). Teaching self-protection to young children. Journal of Applied Behavior Analysis, 14, 169–176.

Poche, C., Yoder, P., & Miltenberger, R. (1988). Teaching self-protection to young children using television techniques. Journal of Applied Behavior Analysis, 21, 253–261. https://doi.org/10.1901/jaba.1988.21-253

Pohl, J., & Hazzard, A. (1990). Reactions of children, parents and teachers to child sexual abuse prevention programs. Education, 110, 337–344.

Porch, T., & Petretic-Jackson, P. (1986, August). Child sexual assault prevention: Evaluating parent education workshops. Report presented at the American Psychological Association convention, Washington, DC.

Prange, J., & Atkinson, P. (1988). Sexual abuse prevention for preschoolers: Curriculum evaluation. Anchorage, AK: Alaska Pacific University.

Pulido, M., Dauber, S., Tully, B., Hamilton, P., Smith, M., & Freeman, K. (2015). Knowledge gains following a child sexual abuse prevention program among urban students: A cluster-randomized evaluation. American Journal of Public Health, 105, 1344–1350.

Randolph, M., & Gold, C. (1994). Child sexual abuse prevention: Evaluation of a teacher training program. School Psychology Review, 23, 485–495.

Ratto, R., & Bogat, G. (1990). An evaluation of a preschool curriculum to educate children in the prevention of sexual abuse. Journal of Community Psychology, 18, 289–297. https://<289::AIDJCOP2290180310>3.0.CO;2-1

Ray, J. (1984). Evaluation of the Child Sex Abuse Prevention Project. Available from the Rape Crisis Network, N. 1226 Howard, Spokane, WA 99201.

Ray, J., & Dietzel, M. (1984). Teaching child sexual abuse prevention. Washington, DC: Spokane.

Rheingold, A., Zajac, K., & Patton, M. (2012). Feasibility and acceptability of a child sexual abuse prevention program for childcare professionals: Comparison of a web-based and in-person training. Journal of Child Sexual Abuse, 21, 422–436.

Roberts, J., & Miltenberger, R. (1999). Emerging issues in the research on child sexual abuse prevention. Education and Treatment of Children, 22, 84–102.

Saslawsky, D., & Wurtele, S. (1986). Educating children about sexual abuse: Implications for pediatric intervention and possible prevention. Journal of Pediatric Psychology, 11, 235–245.

Scholes, L., Jones, C., Stieler-Hunt, C., Rolfe, B., & Pozzebon, K. (2012). The teachers’ role in child sexual abuse prevention programs: Implications for teacher education. Australian Journal of Teacher Education, 37, 104–131. https://doi.org/10.14221/ajte.2012v37n11.5

Sigurdson, E., Strang, M., & Doig, T. (1987). What do children know about preventing sexual assault? How can their awareness be increased? Canadian Journal of Psychiatry, 32, 551–557.


Spungen, C., Jensen, S., Finkelstein, N., & Satinsky, F. (1989). Child personal safety: Model program for prevention of child sexual abuse. Social Work, 34, 127–131.

Swan, H., Press, A., & Brigs, S. (1985). Child sexual abuse prevention: Does it work? Child Welfare, 64, 395–405.

Taal, M., & Edelaar, M. (1997). Positive and negative effects of a child sexual abuse prevention program. Child Abuse and Neglect, 21, 399–410.

Tharinger, D., Krivacska, J., Laye-McDonough, M., Jamison, L., Vincent, G., & Hedlund, A. (1988). Prevention of child sexual abuse: An analysis of issues, educational programs, and research findings. School Psychology Review, 17, 614–634.

Topping, K., & Barron, I. (2009). School-based child sexual abuse prevention programs: A review of effectiveness. Review of Educational Research, 79, 431–463. 0034654308325582

Treacy, E., & Fisher, C. (1993). Foster parenting the sexually abused: A family life education program. Journal of Child Sexual Abuse, 2, 47–63.

Tremblay, C. (1998). Évaluation de l’implication des parents dans la prévention des abus sexuels auprès des enfants [Assessment of parental involvement in the child sexual abuse prevention]. (Unpublished doctoral thesis). Universidad de Montreal, Montreal, Canada.

Tutty, L. (1992). The ability of elementary school children to learn child sexual abuse prevention concepts. Child Abuse and Neglect, 16, 369–384. 90046-T

Tutty, L. (1994). Developmental issues in young children’s learning of sexual abuse prevention concepts. Child Abuse and Neglect, 18, 179–192.

Tutty, L. (1997). Child sexual abuse prevention programs: Evaluating who you tell. Child Abuse and Neglect, 21, 869–881.

Tutty, L. (2000). What children learn from sexual abuse prevention programs: Difficult concepts and developmental issues. Research on Social Work Practice, 10, 275–300.

Wall, H. (1983). Child assault/abuse prevention project pilot program evaluation. Concord, CA: Mount Diablo Unified School District.

Walsh, K., Zwi, K., Woolfenden, S., & Shlonsky, A. (2018). School-based programs for prevention of child sexual abuse: A Cochrane systematic review and meta-analysis. Research on Social Work Practice, 28, 33–55. 1049731515619705

Wolfe, D., McPherson, T., Blount, R., & Wolfe, V. (1986). Evaluation of a brief intervention for educating school children in awareness of physical and sexual abuse. Child Abuse and Neglect, 10, 85–92.

Wood, L., & Rodwell, M. (1997). Focus groups with children: A resource for sexual abuse prevention program evaluation. Child Abuse and Neglect, 21, 1205–1216. S0145-2134(97)00095-1

Woods, S., & Dean, K. (1986). Sexual abuse prevention: Educational strategies. SIECUS Reports, 15, 8–9.

Wurtele, S. (1990). Teaching personal safety skills to four-year-old children: A behavioural approach. Behavior Therapy, 21, 25–32.

Wurtele, S., Currier, L., Gillispie, E., & Franklin, C. (1991). The efficacy of a parent-implemented program for teaching preschoolers personal safety skills. Behavior Therapy, 22, 69–83.

Wurtele, S., Gillispie, E., Currier, L., & Franklin, C. (1992). A comparison of teachers vs. parents as instructors of a personal safety program for preschoolers. Child Abuse and Neglect, 16, 127–137.

Wurtele, S., Kast, L., & Melzer, A. (1992). Sexual abuse prevention education for young children: A comparison of teachers and parents as instructors. Child Abuse and Neglect, 16, 865–876.

Wurtele, S., Kast, L., Miller-Perrin, C., & Kondrick, P. (1989). Comparison of programs for teaching personal safety skills to preschoolers. Journal of Consulting and Clinical Psychology, 57, 505–511.

Wurtele, S., & Miller-Perrin, C. (1987). An evaluation of side-effects associated with participation in a child sexual abuse prevention program. Journal of School Health, 57, 228–231.

Wurtele, S., Moreno, T., & Kenny, M. (2008). Evaluation of a sexual abuse prevention workshop for parents of young children. Journal of Child & Adolescent Trauma, 1, 331–340. https://doi.org/10.1080/19361520802505768

Wurtele, S., & Owens, J. (1997). Teaching personal safety skills to young children: An investigation of age and gender across five studies. Child Abuse and Neglect, 21, 805–814. https://doi.org/10.1016/S0145-2134(97)00040-9

Wurtele, S., Saslawsky, D., Miller, C., Marrs, S., & Britcher, J. (1986). Teaching personal safety skills for potential prevention of sexual abuse: A comparison of treatments. Journal of Consulting and Clinical Psychology, 54, 688–692. https://doi.org/10.1037/0022-006X.54.5.688

European Psychologist (2020), 25(1), 1–15

History
Received October 16, 2018
Revision received February 6, 2019
Accepted March 25, 2019
Published online August 28, 2019

Marisalva Fávero
Social and Behavioral Sciences Department
University Institute of Maia
Av. Carlos Oliveira Campos
4475-690 Castêlo da Maia
Portugal

Amaia Del Campo, PhD, is an Associate Professor of psychology at the Department of Evolutionary and Educational Psychology, University of Salamanca, Spain. Her research focuses on child and juvenile abuse, juvenile antisocial behavior, child sexual abuse, and prevention programs.

Marisalva Fávero, PhD, is an Associate Professor of psychology and coordinates the Sexuality Observatory at the University Institute of Maia, Portugal. Her research interests include victimology, sexual violence from the victim’s perspective, predictors and correlates of intimate partner violence victimization and perpetration, and leaving processes in relationships characterized by intimate partner violence.

© 2019 Hogrefe Publishing


Primary school (9–10 years)

Primary school (8–12 years)

Daro et al. (1986)

Downer (1984)

Del Campo and López (2006)

Primary school (6–9 years)

Briggs and Hawkins (1994)

Primary school and day care center (4–10 years)

Preschool (3–5 years)

Borkin and Frank (1986)

Conte, Rosen, Saperstein, and Shermack (1985)

Primary school

Blumberg et al. (1991)


Primary school (5–12 years)

Binder and McNiel (1987)

Christian et al. (1988)

Primary school (1st and 3rd years)

Berrick and Gilbert (1991)


Primary school (2nd and 3rd years)

Beland (1986)

Elementary school (4th grade)

Adolescent grade 6–7–8

Barron and Topping (2013)

Çeçen-Eroğul and Kaf Hasirci (2013)

High school (15–18 years)

Brown (2017)



Barth and Derezotes (1990)

















Control group

Pretest/posttest Follow up

(not available)

Control group

Pretest/posttest Posttest (1 week)


Pretest/posttest Control group


Pretest/posttest Follow up: posttest (1 year)

Posttest Follow up: posttest (4–6 weeks)

Control group

Pretest/posttest Posttest (in the instant)

Pretest/posttest Follow up: posttest (2–4 weeks)

Pretest/posttest Follow up: posttest (6 months)


Pretest/posttest Control group

Pretest/posttest Follow up: post (4 months)


2 Control groups

Prevention Program of Child Sexual Pretest/posttest Abuse Follow up (8 months)

CAP, TAT, Children’s Self-Help, Intervention and Education, Child Abuse Prevention, Youth Safety Awareness Project and SAFE TAT

Cook County Sheriff’s Office Program

The Sexual Abuse Prevention Program (SAPP) Video: No More Secrets

Good Touch/Bad Touch program (Childhelp, 2011)


Keeping Ourselves Safe Program

Bubbylonian Encounter (theater with dolls)

Role Play Program Multimedia Program

Child Assault Prevention (CAP) Program

Talking About Touching (TAT) Prevention Program 8 prevention curricula

Tweenees program

Six prevention curricula


Table A1. Studies about the effectiveness of programs for the prevention of child sexual abuse


Teachers’ Questionnaire

Children’s Knowledge Questionnaire (35 items) Parents’ Questionnaire

Questionnaire Interview

Test telling stories

Questionnaire (0–22 points)

Questionnaire (12 items)


Knowledge (questionnaire-based knowledge)

Test 5 Questions

Interview with one question: “What would you do if someone tries to touch you in a way that does not make you feel good?” Questionnaire (8 items)


Questionnaires: Children, Parents, and Teachers

Test (0–14 points)

Knowledge/skills questionnaire, systematic coding of disclosures, and video interaction analysis of lessons. Test (0–15 points)

Questionnaire (0–25 points)


A. Del Campo & M. Fávero, Effectiveness of Prevention Programs 11


Primary school and high school (10–16 years)

Primary school

Primary school (2nd, 4th, and 6th years) Preschool

Finkelhor et al. (1995)

Fryer et al. (1987)

Garbarino (1987)

Child sexual abuse victims (4–16 years)

Primary school

Primary school (3rd and 4th years)

Primary school (3rd and 4th years)


Primary school (2nd and 6th years)

Primary school

Primary school (3rd and 4th year)

Preschool and Primary school (3–10 years)

Primary school (1st, 2nd, and 3rd years)

Primary school (1st year)


Harbeck et al. (1992)

Harvey et al. (1988)

Hazzard et al. (1990)

Hazzard et al. (1991)

Hill and Jason (1987)

Jacobs, Hashima, and Kenning (1995)

Kenning, Gallmeier, Jackson, and Plemons (1987)

Kolko et al. (1987)

Kraizer et al. (1989)

Leake (1986a)

Leake (1986b)

Liddell et al. (1988)

Gilbert et al. (1989)



Table A1. (Continued)

183





72 44












Follow up

Pretest/posttest Control group

(not available)

Pretest/posttest Posttest (2 months)

Pretest/posttest Control group

Control group

Pretest/posttest Follow up (1 year)

Comparison between groups


CAP Program and No More Secrets

CAP Program

Posttest 2 experimental groups

Posttest only Posttest (1 week)

Posttest Posttest (1 week)

Control group

Questionnaire (0–13 points)

Role Playing (0–50 points)

Role Playing (0–50 points)

Test (1–14 points)

Questionnaire (1–11 points)


Vignettes: Situational Risk

Questionnaire Video



Questionnaire: What I Know About Touching Video: “What do you do?” Questionnaires: What I Know About Touching and State-Trait Anxiety Inventory for Children (STATIC) and Parent Questionnaire

Questionnaire (1–5 points)

Control group

Pretest/posttest Follow up (7 weeks)

Knowledge Questionnaire, Behavioral Role-Plays, Personal Safety Questionnaire (PSQ), and Fear Assessment Thermometer Scale Vignettes Interview

Fill a story

Role playing and Questionnaires: Harter Perceived Competence Scale and Children Need to Knowledge-Attitude Test Questionnaire (10 items)

Interview by telephone


Pretest/posttest Posttest (2–4 weeks)

Pretest/posttest Posttest (6 weeks)


Control group

Pretest/posttest Follow up (6 months)

Posttest Follow up: posttest (8–24 months)


Safe Child Personal Safety Training Pretest/posttest Posttest (1 week)

Red Flag/Green Flag, Better Safe than Sorry II (Video)

CAP Program TAT

TAT (lessons 1st, 2nd)

(not available)

Adaptation from Feeling Yes, Feeling No Feeling Yes, Feeling No Program, Spiderman comic, and Power Pack comic

Good Touch Bad Touch

Touch Continuum

7 programs

Spiderman comic book

Children Need to Know Personal Safety Training Program

Several programs



Primary school and Preschool


Ostbloom et al. (1987)

Peraino (1990)

Primary school (3rd and 4th years)

Primary school

Oldfield et al. (1996)

Pohl and Hazzard (1990)


Nibert et al. (1988)

Preschool and Primary school


Nemerofsky et al. (1994)

Poche et al. (1988)

Primary school (5th and 6th years)

Nelson (1985)

Preschool (3–5 years)

Preschool and Primary school (4–7 years)

Miltenberger and Thiesse-Duffy (1988)

Poche et al. (1981)

Primary school (7–10 years)

Primary school (5th year)


Lutter and Weisman (1985)

MacIntyre and Carr (1999)

Plummer (1984)



Table A1. (Continued)















Follow up (2 months)

Pretest/posttest Posttest (1 week)

Control group

Pretest/posttest Follow up: posttest II (3 months)

(not available)


Feeling Yes, Feeling No Program, Spiderman Comic and Video


No title: Modeling and Reinforcement Training

Education for the Prevention of Sexual Abuse (EPSA)

Antivictimization Program for Preschoolers

Happy Bear

Video: Touch Teaching Reaching Using Students and Teachers (TRUST)

CAP Program

Children’s Primary Prevention Training Program (CPPTP)


Posttest Control group

Follow up (12 weeks)

Pretest/posttest Posttest (1 week)

Posttest III (8 months)

Pretest/posttest (1 day) Posttest II (2 months)

Pretest/posttest Follow up (6.5 weeks)

Posttest II (1 day)

Pretest/posttest Posttest (instant)

Control group

Follow up (3 months)

Posttest Posttest (2 days)

Control group

Pretest/posttest Posttest (1 week)

Control group

Pretest/posttest Posttest (1 week)

Control group

You’re in Charge Program and Video Posttest 2 experimental groups

Book: Red Flag/Green Flag

The Stay Safe Primary Prevention Programme

Children’s Awareness Training


Interview

Evaluation of verbal and motor responses

Behavioral assessment before Simulation: abusive situation with unknown person (0–6 points)

Questionnaire (23 items)


Role Playing Test (5 items)

CKAQ, Revised Children’s Manifest Anxiety Scale (RCMA) State-Trait Anxiety Inventory for Children (STATIC), and Maltreatment Disclosure Report Form (MDRF)

Vignettes Analysis of verbal responses

“What-if” Situations Test (WIST)

Questionnaire (0–20 points)

Children’s Safety and Skills Questionnaire, Battle Culture-Free Self-Esteem Inventory, Peabody Picture Vocabulary Test, Rutter Teachers’ Scale, and Children’s Program Evaluation Questionnaire Role Playing, Discrimination drawings and role play: abusive situation with unknown person

Test and Role Playing



Primary school (2nd and 3rd years)



Primary school

Primary school (3rd)

Preschool and primary school (5–7 and 10–12 years)

Primary school

Preschool and Primary school (3–12 years) Primary school (8–11 years)

Primary school (8–12 years)


Primary school (1st, 3rd, and 6th)

Pulido et al. (2015)

Prange and Atkinson (1988)

Ratto and Bogat (1990)

Ray (1984)

Ray and Dietzel (1984)

Saslawsky and Wurtele (1986)

Sigurdson et al. (1987)

Spungen et al. (1989)

Taal and Edelaar (1997)

Tremblay (1998)

Tutty (1992)

Primary school (1st, 3rd, and 6th)

Preschool and Primary school

Primary school

Tutty (1994)

Tutty (1997)

Wall (1983)

Swan et al. (1985)



Table A1. (Continued)

















CAP Program

Who Do You Tell Program

Touching Program

Touching Program

Programme CARE

Right to Security (Amsterdam Prevention Council for Sexual Violence)

Bubbylonian Encounter (game)

Child Personal Safety Program

Feeling Yes, Feeling No

Touch (Video) Group Discussion (15 min)


My Very Own Book About Me

Grossmont College Child Sexual Abuse Prevention Program

Safe Touch

Safe Touches


Case studies

Pretest/posttest Control group


Control group

Follow up (5 months)

Pretest/posttest Posttest (2 weeks)

Control group

Pretest/posttest Follow up (4 months)

Control group


Questionnaire: CKAQ-R

Questionnaire: CKAQ

Questionnaire: CKAQ


Control in Sexual Conflicts, The Choice of Safety Strategy Questionnaire, Feasibility Questionnaire, Touch Questionnaire, School Questionnaire, and Dutch Social Anxiety Scale for Children Questionnaire 24 items

Pretest/posttest Posttest (1 week) Follow up (6 weeks)

Video Questionnaire

(not available)

Video Questionnaire

PSQ and WIST Interview

Questionnaire (12 items)

Questionnaire, scholar book, and film

Questionnaire Role Playing

Questionnaire (0–13 points)

CKAQ of 33 items


Pretest/posttest Control group

No specific design


Control group

Pretest/posttest Follow up (3 months)

Follow up (6 months)

Pretest/posttest Posttest (instant)


Control group

Pretest/posttest Follow up (3 months)

Control group

Pretest/posttest Posttest (2 weeks)

Control group

Pretest/posttest Posttest (1 week)




Primary school (9–12 years)

Primary school

Primary school (3rd and 5th)

Preschool (4 years)

Preschool (3–5 years)





Preschool (3–5 years)

Primary school (1st, 5th, and 6th)


Wolfe et al. (1986)

Wood and Rodwell (1997)

Woods and Dean (1986)

Wurtele (1990)

Wurtele et al. (1991)

Wurtele, Gillispie, et al. (1992)

Wurtele, Kast, et al. (1992)

Wurtele, Kast, Miller-Perrin, and Kondrick (1989)

Wurtele and Miller-Perrin (1987)

Wurtele and Owens (1997)

Wurtele et al. (1986)

Table A1. (Continued)

71











N Posttest (3–5 days) Control group


Touch (Video) and BST



BST, Feelings Program and Attention Control Program


Follow up (2 months)

Teacher version

Control group

Posttest I Instant Posttest II (3 months)

Control group

Pretest/posttest Posttest (2 days)

Pretest/posttest Posttest (1 week)

Follow up (1 month)

Pretest/posttest Posttest (2 days)

Control group

Follow up (5 months)

Pretest/posttest Posttest (2 days)

Control group

Pretest/posttest Posttest (2 days)

Control group

Follow up (1–2 months)

Pretest/posttest Posttest (1 week)

Control group

Follow up (1 month)

Pretest/posttest Posttest (2 days)

Pretest/posttest Control group

BST Program: Parent version


Behavioral Skills Training (BST) Program

Spiderman Comic and Talking About Touching

Child Sexual Abuse Prevention Play Posttest

You’re In Charge



PSQ, Fear Assessment Thermometer Scale, Program Enjoyment Rating Scale, Parents’ Perceptions Questionnaire, and Teachers’ Perceptions Questionnaire Fear Assessment Thermometer Scale, Parent Perception Questionnaire, and Eyberg Child Behavior Questionnaire PSQ and WIST

PSQ, WIST, Parents’ Perceptions Questionnaire, and Teachers’ Perceptions Questionnaire

PSQ, WIST, Parents’ Perceptions Questionnaire, and Teachers’ Perceptions Questionnaire, Private Parts Rating Scale

PSQ, WIST, Background Information Questionnaire, and Parents’ Perceptions Questionnaire

Personal Safety Questionnaire PSQ, WIST, Parents’ Perceptions Questionnaire, and Teachers’ Perceptions Questionnaire

Focus groups with parents, teachers, and children. Content analysis Vignettes and Questionnaire (0–15 points)

Questionnaire (0–7 points)



Original Articles and Reviews

An Existential Threat Model of Conspiracy Theories

Jan-Willem van Prooijen
Department of Experimental and Applied Psychology, VU Amsterdam/The NSCR, Amsterdam, The Netherlands

European Psychologist (2020), 25(1), 16–25

Abstract: People endorse conspiracy theories particularly when they experience existential threat, that is, feelings of anxiety or uncertainty, often because of distressing societal events. At the same time, such feelings also often lead people to support groups frequently implicated in conspiracy theories (e.g., the government). The present contribution aims to resolve this paradox by proposing an Existential Threat Model of Conspiracy Theories, which stipulates under what conditions existential threat does versus does not stimulate conspiracy theories. The model specifically illuminates that feelings of existential threat increase epistemic sense-making processes, which in turn stimulate conspiracy theories only when antagonistic outgroups are salient. Moreover, once formed, conspiracy theories do not reduce feelings of existential threat; instead, conspiracy theories can themselves be a source of existential threat, stimulating further conspiracy theorizing and contributing to a generalized conspiracist mindset. In the discussion, I consider implications of the model and illuminate how one may base interventions on the model to break this cyclical process and reduce conspiracy beliefs.

Keywords: conspiracy theories, existential threat, epistemic sense-making processes, antagonistic outgroups

The Internet and social media are full of conspiracy theories, including climate change conspiracy theories, antivaccine conspiracy theories, flat-earth conspiracy theories, and many others. Conspiracy theories are explanatory beliefs assuming that a group of actors meets in secret to attain some evil goal (Van Prooijen, 2018). While some conspiracy theories turn out to be true (e.g., Watergate), surprisingly large numbers of citizens believe rather implausible conspiracy theories (Oliver & Wood, 2014). Furthermore, conspiracy theories are not exclusive to our modern digital age. In previous decades, many citizens also believed conspiracy theories, such as JFK conspiracy theories, anti-communist conspiracy theories (e.g., McCarthyism), and anti-Semitic conspiracy theories (e.g., during WWII). Conspiracy theories were common in medieval times (e.g., witch hunts; Jewish conspiracy theories), and are common among members of traditional societies, who for instance often believe that enemy tribe members secretly commit sorcery to harm them (Van Prooijen & Douglas, 2018; West & Sanders, 2003). A tendency to be suspicious that others form secret and hostile conspiracies may be an inborn feature of human psychology (Van Prooijen & Van Vugt, 2018).

One pertinent finding in empirical research is that people endorse conspiracy theories particularly when they experience existential threat. I define existential threat here as feelings of anxiety or uncertainty, often because of distressing events that call one’s values, one’s way of life, or even one’s existence into question. As such, existential threat is a composite term for a broad spectrum of everyday anxieties and insecurities that people feel when they, or the people around them, experience harm or expect to suffer losses. Conspiracy theories indeed surge particularly following distressing societal events that elicit existential threat among many citizens, such as terrorist strikes, revolutions, fires, floods, economic crises, wars, and rapid societal change (e.g., Pipes, 1997; Van Prooijen & Douglas, 2017).

At the same time, existential threat does not lead to conspiracy theories all the time, or among all citizens. For instance, the 9/11 terrorist strikes inspired many conspiracy theories (e.g., the 9/11 truth movement), but at the same time, George W. Bush enjoyed historically high public approval ratings in the months after this event. Consistently, empirical research suggests that threats to control can increase belief in conspiracy theories about the government (Van Prooijen & Acker, 2015), yet at the same time, threats to control may increase people’s support for that same government (Kay, Gaucher, Napier, Callan, & Laurin, 2008).

The present contribution seeks to resolve this paradox by developing a theoretical model that illuminates when and how existential threat increases belief in conspiracy theories. In the scientific study of conspiracy theories, there is a paucity of theoretical models to integrate previous findings and enable novel predictions (Van Prooijen & Douglas, 2018). Here, I propose an existential threat model of conspiracy theories, displayed graphically in Figure 1. This model articulates that existential threat is at the root of conspiracy theories by increasing people’s motivation to make sense of their social and physical environment. These sense-making processes, however, only lead to conspiracy theories when an antagonistic outgroup is salient. Put differently, conspiracy theories emerge if a despised outgroup is salient when people try to make sense of the world following distressing events. This outgroup may be high in power (e.g., politicians, CEOs) or low in power (e.g., ethnic minority groups); what matters is that perceivers mentally construe the suspected conspirators as an entitative outgroup that is not to be trusted, and different from “us” (e.g., regular citizens; employees). In the following, I introduce the model in more detail.

Figure 1. An existential threat model of conspiracy theories. [Diagram labels: Existential Threat; Sense-making processes; Conspiracy theories; Antagonistic outgroup.]

An Existential Threat Model of Conspiracy Theories

The three core factors in the model to predict conspiracy beliefs (existential threat, sense-making processes, and an antagonistic outgroup) closely correspond to the assertion that conspiracy beliefs are rooted in three types of motives. Specifically, Douglas, Sutton, and Cichocka (2017) proposed that people believe conspiracy theories for existential, epistemic, or social motives. The Existential Threat Model of Conspiracy Theories expands on this perspective by proposing that these motives are not independent, but influence each other in a specific causal order. Feelings of existential threat increase epistemic sense-making processes, subsequently leading to conspiracy theories; moreover, social motives moderate these effects by determining if these feelings make people more suspicious of the covert actions of a despised outgroup (cf. scapegoating). Sometimes the antagonistic outgroup can also be a source of existential threat itself, such as in the case of ideological conflict (e.g., Democrats vs. Republicans; Uscinski & Parent, 2014) or violent intergroup conflict (Pipes, 1997). Even in these situations, however, the three motives underlying conspiracy theories are not independent, but instead interrelated in a specific causal order.

Furthermore, the Existential Threat Model expands the Adaptive Conspiracism Hypothesis, which illuminates the distal, evolutionary origins and functions of the human tendency to believe conspiracy theories (Van Prooijen & Van Vugt, 2018). The Adaptive Conspiracism Hypothesis proposes that ancient hunter-gatherers evolved an adaptive tendency to be suspicious of hostile coalitions or outgroups, to protect against the frequent and realistic dangers of lethal intergroup conflict in an ancestral environment. In this evolutionary process, both antagonistic outgroups and socio-ecological threat cues that increase the likelihood of intergroup conflict (e.g., floods, fires) are important antecedents of the human tendency to believe conspiracy theories. The Adaptive Conspiracism Hypothesis does not specify the proximate psychological processes through which these factors interact to increase conspiracy theories; the present model seeks to fill this void.

In the following, I discuss the evidence for the various components of the model by (a) focusing on the effects of existential threat on sense-making processes and conspiracy theories, and (b) illuminating the moderating role of antagonistic outgroups in these processes. Furthermore, I propose that once formed, conspiracy theories do not soothe feelings of existential threat, but instead often exacerbate such feelings, and may contribute to a general mindset that explains distressing events in the world through conspiracy theories (i.e., conspiracy mentality).

Existential Threat, Sense-Making, and Conspiracy Theories

The core of the model is that feelings of existential threat increase mental sense-making processes, which subsequently stimulate belief in conspiracy theories. This idea originates from the assumption that existential threat elicits a vigilant reaction in organisms to pay careful attention to the immediate physical or social environment. These sense-making processes are part of an inborn threat-management system that enables organisms to cope with existential threats in a functional manner. By quickly identifying the nature of the threat, people are able to take appropriate action in time, thus effectively protecting themselves and kin from harm (Neuberg, Kenrick, & Schaller, 2011).

Sense-making processes can be defined as cognitive attempts to establish straightforward, meaningful, and causal relationships between stimuli. Several psychological theories propose that people have a fundamental need to recognize these expected relationships, as this enhances the extent to which people experience their environment as predictable. For instance, the Meaning Making Model articulates that existential threats stimulate a fluid compensation process in which people seek to reestablish a sense of meaning by identifying clear and coherent relationships between stimuli (Heine, Proulx, & Vohs, 2006).

Conspiracy theories satisfy these sense-making motivations by providing perceivers with the idea that they understand the root causes of feelings of existential threat. For instance, conspiracy theories offer perceivers a straightforward and meaningful narrative to understand the complex dynamics often involved in societal crisis situations, by attributing such events entirely to the actions of an all-evil conspiracy (Abalakina-Paap, Stephan, Craig, & Gregory, 1999; Hofstadter, 1966). Furthermore, conspiracy theories offer coherent relationships that are at the root of meaning making. Specifically, any belief needs to contain a number of critical ingredients before qualifying as a conspiracy theory, and two of these ingredients are patterns and agency (Van Prooijen, 2018; Van Prooijen & Van Vugt, 2018).
Patterns means that conspiracy theories always assume specific causal relationships between physical stimuli, events, and actors. For instance, one may perceive causal links between a disease epidemic, the quality of tap water, and assumed hostile intentions of governmental officials, laying the foundations for a conspiracy theory of how governmental officials poisoned the water supply to cause the epidemic. Agency means that conspiracy theories always make assumptions of intentionality or purpose. If one believes a technological malfunction caused a plane crash, this is in and of itself not a conspiracy theory. But if one additionally believes that a group of actors deliberately tampered with the engine to cause the malfunction, it is a conspiracy theory. By perceiving patterns and agency, conspiracy theories offer perceivers an explanatory framework to make sense of the world when experiencing feelings of existential threat.

Four predictions follow from this line of reasoning. Specifically, (1) existential threat activates epistemic sense-making processes; (2) existential threat predicts increased belief in conspiracy theories; (3) sense-making processes predict increased belief in conspiracy theories; and (4) sense-making processes mediate the effects of existential threat on conspiracy theories. Below, I review the evidence for each prediction.

Existential Threat and Sense-Making Processes

The core idea that existential threat activates sense-making processes is consistent with a range of well-established theories and findings across psychology. For instance, Park (2010) found that stressful life events (e.g., illness; disaster) stimulate a coping process by which people make sense of such events not only through specific appraisals, but also by searching for global meaning through, for instance, spirituality, justice, and religion. These sense-making processes may contribute to mental health and resilience in the face of adversity. Wiseman and Watt (2006) focused on paranormal beliefs and superstition as sense-making processes, and noted that such beliefs make perceivers experience an uncertain future as more predictable. Finally, sense-making is part of the human predicament of coping with existential challenges such as the certainty of death and the unpredictability of the future (Greenberg, Koole, & Pyszczynski, 2004).

The effects of existential threat on sense-making can also be observed in political attitudes and choices. For instance, feelings of uncertainty stimulate a preference for rigid and radical leaders, who offer simple (and therefore understandable) solutions for complex problems (Hogg, Meehan, & Farquharson, 2010). Relatedly, existential threat has stimulated extremist political movements across the 20th century (Midlarsky, 2011), and promotes political extremism among regular citizens (Van Prooijen & Krouwel, 2019).
These insights suggest that existential threat promotes political views that offer perceivers epistemic clarity by reducing a complex reality into a coherent set of assumptions about the world (see also Burke, Kosloff, & Landau, 2013).

Various specific empirical findings are relevant for the current purposes by revealing effects of existential threat on the specific sense-making processes that are involved in conspiracy theories. Notably, existential threat increases the extent to which people perceive patterns in random stimuli. For instance, Whitson and Galinsky (2008) found that threats to control increase illusory pattern perception, as reflected not only in conspiracy theories but also in seeing images in noisy pictures, seeing illusory correlations in random stock market information, and increased superstition. Likewise, threats to control make people rely more strongly on horoscopes, to the extent that these horoscopes help them better understand themselves or others (Wang, Whitson, & Menon, 2012). Furthermore, manipulations of attitudinal ambivalence (an unpleasant experience related to uncertainty) shape illusory pattern perception in a snowy pictures task (Van Harreveld, Rutjens, Schneider, Nohlen, & Keskinis, 2014).

Likewise, existential threat predicts an increased tendency to detect agency, that is, to perceive events as caused by intentional or purposeful agents. Agency detection is at the root of not only conspiracy theories but also many religious beliefs, by assuming the existence of agentic, moralizing gods. Feelings of uncertainty and fear increase people’s belief in such agentic gods (Kay et al., 2008). Moreover, feelings of awe reduce people’s tolerance for uncertainty, which in turn increases agency detection (Valdesolo & Graham, 2014). Taken together, the evidence reveals that existential threat increases people’s tendency to endorse simplified models of reality, to perceive causal relations between stimuli that are not necessarily related in reality (pattern perception), and to perceive events as caused by purposeful agents (agency detection).

Existential Threat and Conspiracy Theories

One of the core propositions of the model is that impactful and anxiety-provoking societal events stimulate belief in conspiracy theories. These events can be incidental (e.g., a terrorist strike) or more continuous (e.g., an economic crisis), and moreover they can be real (e.g., climate change) or exist merely in the eyes of the perceiver (e.g., the belief that vaccines damage people’s health, for instance by causing autism). The feelings of anxiety and uncertainty that emerge due to such events often stimulate belief in conspiracy theories. As with other cognitions and beliefs, once formed, such conspiracy theories may subsequently become a stable and integral part of a perceiver’s understanding of the world due to the epistemic processes of “seizing” and “freezing” (Kruglanski & Webster, 1996), even when the initial feelings of anxiety and uncertainty have long dissipated.
For instance, historical events such as the JFK assassination and the 9/11 terrorist strikes have stimulated widespread conspiracy theories; but even though these events took place decades ago, large groups of citizens currently still endorse these theories with high confidence, and transmit them to others (Van Prooijen & Douglas, 2017).

Empirical research supports a causal effect of existential threat on belief in conspiracy theories. One stream of research investigated the influence of consequential versus inconsequential threatening societal events. Scenario studies revealed that people believe conspiracy theories more strongly if the assassination of a president leads to a war than if it does not lead to a war (LeBoeuf & Norton, 2012). Furthermore, studies examined people’s responses to scenarios where an African opposition leader died in a car crash, or miraculously survived the car crash. Participants believed more strongly that a conspiracy sabotaged the car if the opposition leader died in the crash than if he survived it (Van Prooijen & Van Dijk, 2014). In sum, threatening and consequential societal events lead to stronger conspiracy belief than relatively inconsequential societal events.

Studies that experimentally manipulate the emotions underlying existential threat also support an effect on conspiracy beliefs. Various studies found that inducing a lack of control increases belief in conspiracy theories (Van Prooijen & Acker, 2015; Whitson & Galinsky, 2008), and leads people to ascribe exaggerated influence to their enemies (Sullivan, Landau, & Rothschild, 2010). Furthermore, inducing emotions that reflect uncertainty about the world increases belief in conspiracy theories (Whitson, Galinsky, & Kay, 2015; see also Van Prooijen & Jostmann, 2013), and people believe conspiracy theories more strongly following an experimentally induced threat to the status quo (Jolley, Douglas, & Sutton, 2018). Finally, attitudinal ambivalence increases feelings of anxiety and uncertainty, which in turn predict belief in conspiracy theories (Van Harreveld et al., 2014).

A range of correlational findings is consistent with these experimental findings, revealing relationships between belief in conspiracy theories and feelings of powerlessness (Abalakina-Paap et al., 1999), negative emotions (Grzesiak-Feldman, 2013; Van Prooijen & Acker, 2015), death-related anxiety (Newheiser, Farias, & Tausch, 2011), and perceived system identity threat, that is, the belief that society’s core values are changing (Federico, Williams, & Vitriol, 2018). Furthermore, conspiracy beliefs predict political attitudes commonly associated with feelings of existential threat, including political extremism (Van Prooijen, Krouwel, & Pollet, 2015) and populism (Silva, Vegetti, & Littvay, 2017). Furthermore, deprived life circumstances in general are associated with increased belief in conspiracy theories.
For instance, low education reliably predicts increased belief in conspiracy theories, which is mediated not only by decreased analytic thinking skills but also by feelings of powerlessness (Van Prooijen, 2017). Furthermore, conspiracy theories are more common among marginalized minority group members than among majority group members in a society, due to a tendency to blame their groups’ actual problems (e.g., poverty, reduced opportunities) on discrimination (Crocker, Luhtanen, Broadnax, & Blaine, 1999). Minority members even believe more strongly in conspiracy theories that are unrelated to their deprived life circumstances (e.g., belief in the cover-up of evidence for the existence of UFOs), due to a general belief that the societal system is rigged (Van Prooijen, Staman, & Krouwel, 2018). In sum, empirical evidence reveals that distressing societal events, feelings of anxiety and uncertainty, and deprived life circumstances reliably predict conspiracy beliefs.


Sense-Making Processes and Conspiracy Theories

The essence of sense-making is subjective attempts to understand reality by perceiving causal relationships, meaning, and purpose (e.g., Greenberg et al., 2004; Heine et al., 2006; Park, 2010). Conspiracy theories contribute to such sense-making by offering explanations of why distressing events occurred, through a set of explicit assumptions of patterns and agency. In doing so, conspiracy theories often make a complex reality more understandable by assuming that an all-evil group (i.e., the conspiracy) is solely responsible for any harm that has occurred (Hofstadter, 1966). This simplifying property of conspiracy theories contains a paradox, as many conspiracy theories are based on a relatively complex list of assumptions (e.g., 9/11 truth conspiracy theories). Empirical evidence reveals, however, that analytic thinking reduces conspiracy beliefs; intuitive thinking instead predicts increased conspiracy belief (Swami, Voracek, Stieger, Tran, & Furnham, 2014; see also Ståhl & Van Prooijen, 2018). Likewise, conspiracy beliefs are positively related to a tendency to perceive simple solutions for complex problems (Van Prooijen, 2017; Van Prooijen et al., 2015), and to other manifestations of people’s sense-making efforts including paranormal beliefs, superstition, belief in pseudoscience, and spirituality (e.g., Darwin, Neave, & Holmes, 2011; Newheiser et al., 2011). Various studies investigated the underlying process, namely that pattern perception and agency detection predict conspiracy beliefs. Van Prooijen, Douglas, and De Inocencio (2018) found that perceiving patterns in random coin toss outcomes and in chaotic abstract paintings, as well as a general tendency to believe that world events do not occur through coincidence, are related to conspiracy beliefs.
Furthermore, Wagner-Egger, Delouvée, Gauvrit, and Dieguez (2018) found that conspiracy beliefs are related to teleological thinking, defined as “the attribution of purpose and a final cause to natural events and entities” (p. R867). Finally, various studies found relationships between conspiracy beliefs and agency detection indicators, including anthropomorphism and a tendency to perceive agency in moving geometric figures (Douglas, Sutton, Callan, Dawtry, & Harvey, 2016; Imhoff & Bruder, 2014). A limitation of this part of the model is a relative paucity of causal evidence. One recent study directly tested the proposed causal effect in an experimental design, however (Van der Wal, Sutton, Lange, & Braga, 2018; Study 4). This study specifically manipulated the core features of pattern perception by varying whether harmful events (e.g., a mayor’s illness) co-occurred with similar recent events, and whether the described events were causally interconnected. Results revealed that perceiving clusters of similar events elicited stronger conspiracy theories than perceiving events in isolation; moreover, perceiving causal connections between events independently stimulated


stronger conspiracy theories than not perceiving causal connections between events. This study shows that the core elements of pattern perception – notably perceiving co-occurrences that are no coincidence – causally shape people’s belief in conspiracy theories.

Sense-Making as Mediator

A final proposition of this part of the model is that sense-making mediates the effects of existential threat on belief in conspiracy theories. Although the empirical evidence is somewhat indirect at this point, various studies offer evidence that is consistent with this mediating process. Correlational studies reveal that the relationship between political attitudes associated with existential threat (i.e., political extremism) and conspiracy theories is mediated by a belief in simple solutions for complex problems (Van Prooijen et al., 2015). Likewise, the relationship of conspiracy beliefs with low education levels – which may reflect deprived life circumstances – is mediated by an increased tendency to detect agency where none exists (Douglas et al., 2016) and by a tendency to perceive simple solutions for complex problems (Van Prooijen, 2017). One study manipulated whether or not participants read about a conspiracy theory of the NSA surveillance program (Van Prooijen et al., 2018; Study 5). As will be argued later, a conspiracy theory can be a source of existential threat in itself, generating belief in other conspiracy theories. Results indeed revealed that as compared with the control condition, exposure to an NSA conspiracy theory increased belief in conceptually unrelated conspiracy theories (e.g., about Ebola being made by humans). Of importance, this relationship was mediated by an increased tendency among participants to see patterns in world events.
Finally, in one study, participants read how an African political activist died of food poisoning, and the study manipulated perspective-taking to vary participants’ emotional involvement in the event (Van Prooijen & Van Dijk, 2014; Study 5). Participants who felt emotionally involved believed conspiracy theories more strongly (i.e., beliefs that the activist was poisoned deliberately), and this effect was mediated by participants’ sense-making motivation. These findings are consistent with the notion that sense-making processes mediate the link between existential threat and conspiracy beliefs.

Antagonistic Outgroups

The Existential Threat Model predicts that the processes articulated above only stimulate conspiracy beliefs in combination with one additional critical ingredient: A salient antagonistic outgroup that promotes conspiratorial suspicions during sense-making processes (Van Prooijen & Van Vugt, 2018). Without a salient antagonistic outgroup,
the sense-making processes following feelings of existential threat may lead people to find meaning in belief systems such as religiosity, spirituality, political ideology, and support for the status quo (e.g., Hogg et al., 2010; Kay et al., 2008; Park, 2010; Van Prooijen & Krouwel, 2019). When an antagonistic outgroup is salient, however, these sense-making processes are likely to translate into beliefs that accuse members of this outgroup of secretly conspiring. For instance, the 9/11 terrorist strikes elicited conspiracy theories mostly among Democrats, who were more likely to blame the event on an inside job of the Republican administration that was in office at the time (Oliver & Wood, 2014; see also Uscinski & Parent, 2014). Likewise, information about climate change elicits conspiracy theories mostly among Republicans, who often interpret this information as a hoax by Democratic scientists and policy-makers (Van der Linden, 2015). Of course, groups are subjective social-psychological constructions, and sometimes one may wonder to what extent people consider the actors involved in common conspiracy theories as an “outgroup.” For instance, citizens often endorse conspiracy theories about the government of their own country. Citizens are likely to differ, however, in whether or not they mentally construe their nation’s government as part of their ingroup, or instead as a powerful outgroup. Findings reveal, for instance, that particularly citizens who feel alienated from their government endorse conspiracy theories (Abalakina-Paap et al., 1999; Goertzel, 1994). Relatedly, populist movements typically construe their nation’s government as part of “the corrupt elites” who oppose “the noble people,” and therefore these movements often endorse strong conspiracy theories (Müller, 2016). This suggests that governmental conspiracy theories emerge particularly among citizens who mentally construe their own government as an antagonistic outgroup within society.
Previous research and theorizing suggest that conspiracy beliefs are associated with two complementary types of social motives, which are to uphold a strong ingroup identity, and to protect a valued ingroup against a hostile outgroup (Douglas et al., 2017; Van Prooijen & Douglas, 2018; Van Prooijen & Van Lange, 2014). Both these social motives are functional in the context of intergroup conflict, however, and increase the salience of antagonistic outgroups. For instance, collective narcissism is a tendency to perceive an ingroup as superior, reflecting a strong ingroup identity. Perceiving an ingroup as superior implies perceiving outgroups as inferior, however, leading people to more easily perceive salient outgroups as antagonistic. Consistently, collective narcissism predicts belief in conspiracy theories about different nations, minority groups, and competing political parties (Cichocka, Marchlewska, Golec de Zavala, & Olechowski, 2016; Golec de Zavala &


Federico, 2018). Other individual difference variables that predict hostile intergroup perceptions – notably authoritarianism and social dominance orientation – also are associated with belief in conspiracy theories (Abalakina-Paap et al., 1999; Imhoff & Bruder, 2014; Swami, 2012). It should be noted that, sometimes, an antagonistic outgroup can be a direct source of existential threat. For instance, in a war an enemy group directly threatens the existence of one’s ingroup, and indeed, conspiracy theories about the enemy are common in wartime (Pipes, 1997). Furthermore, during an election campaign opposing political parties directly threaten one’s core values, stimulating conspiracy theories (Golec de Zavala & Federico, 2018). Conspiracy theories are particularly common among members of political parties that lose an election, yielding conspiracy theories that for instance accuse the winning party of foul play (Uscinski & Parent, 2014). In such cases, the links between existential threat, antagonistic outgroups, and conspiracy theories can be relatively straightforward, as these variables all involve a specific threat caused by a specific outgroup. In many cases, however, distressing events are not explicitly linked with a specific outgroup (e.g., an economic crisis, a natural disaster). In such cases, the model stipulates that the salience of an antagonistic outgroup acts as a moderator of the relationship between existential threat and conspiracy theories. Various studies support this hypothesized moderating process. One experiment among Indonesian citizens found that conspiracy theories – about how Western countries organized terrorist strikes in Indonesia – were stronger following information describing the West as threatening as opposed to non-threatening to Muslims. 
This effect only occurred among participants whose Muslim identity was made salient, facilitating the extent to which participants indeed construed the West as an antagonistic outgroup (Mashuri & Zaduqisti, 2015). Relatedly, feelings of uncertainty about the self predict conspiracy theories, but only among participants who experience inclusion in a social group (Van Prooijen, 2016). Furthermore, a distressing societal event (i.e., the death of an African politician) increases conspiracy theories only among people who emotionally and cognitively align themselves with the victimized group (i.e., the citizens of the deceased politician’s country; Van Prooijen & Van Dijk, 2014). While the above evidence pertains to a relatively indirect indicator of intergroup conflict – that is, a strong ingroup identity – other studies more directly varied the salience of antagonistic outgroups. Notably, Van Prooijen and Jostmann (2013) first manipulated uncertainty, after which they provided information that a salient target group (e.g., a foreign government) was either moral or immoral. Their results suggested that uncertainty increased conspiracy theories about the target group only when it was immoral.


Another study focused on the need for cognitive closure, that is, the extent to which people are tolerant of uncertainty. This measure predicted conspiracy theories of societal crises only when explanations blaming the event on the covert activities of antagonistic outgroups were made salient (Marchlewska, Cichocka, & Kossowska, 2018). In sum, social cues that increase the salience of antagonistic outgroups enhance the likelihood that the sense-making processes caused by feelings of existential threat produce conspiracy theories.


Conspiracy Theories as a Source of Existential Threat

Being the result of basic sense-making processes, conspiracy beliefs are a form of coping with existential threat. Does this imply that conspiracy theories help perceivers to reduce fear and uncertainty? In some situations, believing that powerful enemies caused negative events may be less frightening than believing that events happened randomly (Sullivan et al., 2010). Quite often, however, conspiracy theories only exacerbate feelings of uncertainty and fear (Douglas et al., 2017). Put differently, believing in the existence of powerful, evil, and secret conspiracies can cause feelings of existential threat, triggering belief in more conspiracy theories. These observations are consistent with the Adaptive Conspiracism Hypothesis (Van Prooijen & Van Vugt, 2018), which proposes that in the evolution of our species conspiracy theories have been adaptive not to reduce fear, but rather, to instill fear and anger in perceivers. In an ancestral environment where people regularly faced the realistic danger of hostile coalitions colluding in secret, it would be dysfunctional to respond to a suspected conspiracy with indifference or even reassurance. Instead, the functional (and often life-saving) response would be either fear-based (e.g., protect against the conspiracy by migrating to a safer environment) or anger-based (e.g., protect against the conspiracy by committing a pre-emptive strike). Empirical evidence is consistent with the notion that a conspiracy theory can be a source of existential threat in itself. For instance, anti-vaccines conspiracy theories increase a fear-based motivation to protect against the suspected harm, leading to lowered vaccination intentions (Jolley & Douglas, 2014). Furthermore, conspiracy theories have been associated with hostility (e.g., Abalakina-Paap et al., 1999), and contribute to the violent tendencies of extremist fringe groups (Bartlett & Miller, 2010). Conspiracy theories hence elicit fear and anger in perceivers, suggesting that they perpetuate and exacerbate feelings of existential threat. Conspiracy theories may therefore lead to further conspiracy theorizing. Empirical evidence indeed reveals that the single best predictor of belief in one conspiracy theory is belief in a different conspiracy theory (e.g., Abalakina-Paap et al., 1999; Goertzel, 1994). Due to the cyclical feedback loop described by the model, conspiracy theories may contribute to a generalized conspiracy mentality, that is, a mindset that habitually perceives conspiracies as responsible for major events in the world (Imhoff & Bruder, 2014).

Discussion and Conclusion

The scientific study of conspiracy theories has emerged in recent years, yet the field lacks solid theoretical models that integrate previous empirical findings and allow for novel predictions (Van Prooijen & Douglas, 2018). The model presented here addresses the questions of how feelings of existential threat increase conspiracy theories, and why such feelings do not predict conspiracy theories in all situations. Furthermore, the model extends previous theorizing by specifying that the existential, epistemic, and social motives underlying conspiracy theories (Douglas et al., 2017) are not independent, but instead are all part of one specific causal process. Finally, the model extends the Adaptive Conspiracism Hypothesis (Van Prooijen & Van Vugt, 2018) by articulating how antagonistic outgroups and existential threats interact to produce conspiracy theories. Empirical research thus far supports the model articulated here. Yet, more experimental and longitudinal research needs to test all the hypothesized causal chains in the model. Moreover, future research may specify important nuances to the model that, based on the current state of the literature, are as yet impossible to establish with confidence. For instance, the model only addresses actual beliefs in conspiracy theories, and no other forms of conspiracy endorsement (e.g., strategic spreading of conspiracy theories for political gain). Furthermore, people can consider many events threatening, some imminently dangerous (e.g., a natural disaster), some spread out over a longer time (e.g., an economic crisis), and some perhaps shocking but not necessarily detrimental to one’s own well-being (e.g., the unexpected death of a celebrity). While all of these events have been part of conspiracy theories, at present there is no hard evidence establishing that they influence conspiracy theories through identical processes.
The processes articulated here have substantial implications for society, and enable policy-makers to predict the likelihood and shape of conspiracy theories after threatening societal events. For this purpose, it is important to recognize the sometimes subtle and complex history, power


dynamics, and sentiments between subgroups in society. For instance, distressing societal events are likely to stimulate governmental conspiracy theories among citizens who do not feel represented by that government (cf. Abalakina-Paap et al., 1999; Uscinski & Parent, 2014). Similarly, distressing events may stimulate conspiracy theories about minority groups (e.g., Muslims) among politically right-wing citizens, and about powerful companies among politically left-wing citizens (cf. Van Prooijen et al., 2015). Moreover, opposing political groups may accuse each other of conspiring (Oliver & Wood, 2014), and ethnic minority group members may believe in conspiracies that involve members of the dominant majority group in society (Crocker et al., 1999; Goertzel, 1994; Van Prooijen et al., 2018). What matters is which specific societal groups citizens perceive as salient antagonistic outgroups when experiencing feelings of existential threat. Furthermore, the model supposes that belief in conspiracy theories is a cyclical and mutually reinforcing process: Once formed, one conspiracy theory fuels further feelings of existential threat, stimulating more conspiracy theories (Abalakina-Paap et al., 1999; Goertzel, 1994). Yet, the model also offers clues on how to design interventions to break this cycle and reduce conspiracy theories. Indeed, interventions may target each of the four variables in the model. For instance, one may try to mitigate feelings of existential threat among citizens. Research indeed suggests that while lacking control increases conspiracy beliefs, increasing feelings of control reduces conspiracy beliefs (Van Prooijen & Acker, 2015). Likewise, one may target the shape of the sense-making processes underlying conspiracy theories.
While conspiracy theories are rooted in a desire to understand and simplify complex events (Hofstadter, 1966), evidence suggests that providing people with good education, and good analytic thinking skills, decreases their tendency to simplify reality and therefore their belief in conspiracy theories (Douglas et al., 2016; Swami et al., 2014; Van Prooijen, 2017). Furthermore, it has been speculated that well-known interventions to reduce intergroup conflict – such as stimulating contact between subgroups in society (e.g., a politician getting out of parliament to talk with angry citizens) – may mitigate conspiracy theories by reducing psychological tensions between societal subgroups (Van Prooijen & Douglas, 2018). Finally, one may try to directly change conspiracy beliefs, and thus break the cycle described in the model. Research suggests that rationally refuting specific conspiracy theories, or ridiculing them, can reduce belief in them (Orosz et al., 2016). One should note that actually implementing such interventions is likely to run into a range of practical problems not captured by the model. For instance, some groups of citizens may easily perceive a governmental campaign to



reduce conspiracy theories as part of a cover-up, and such a campaign might therefore backfire. Furthermore, while interventions can be effective among relatively moderate citizens – who believe conspiracy theories but are also open to being persuaded otherwise – these interventions may not be particularly effective among citizens who are deeply invested in the idea that the world is run by evil conspiracies (and who for instance are active on conspiracist websites). These practical issues notwithstanding, the model articulated here provides a starting point for policy-makers to develop interventions that are grounded in empirical research. To conclude, belief in conspiracy theories originates from feelings of existential threat, which stimulate sense-making processes. The salience of antagonistic outgroups moderates these effects, explaining under what circumstances the sense-making processes following feelings of existential threat do and do not lead to conspiracy beliefs. These insights may not only resolve the paradox that feelings of existential threat sometimes stimulate support for the government (Kay et al., 2008), but may also explain why conspiracy theories are prevalent across times and cultures, as the variables central to conspiracy theorizing have been inherent to the human condition for millennia (Van Prooijen & Van Vugt, 2018). The model presented here may hence provide a solid theoretical basis to facilitate the empirical study of conspiracy theories.

References Abalakina-Paap, M., Stephan, W., Craig, T., & Gregory, W. L. (1999). Beliefs in conspiracies. Political Psychology, 20, 637– 647. Bartlett, J., & Miller, C. (2010). The power of unreason: Conspiracy theories, extremism and counter-terrorism. London, UK: Demos. Burke, B. L., Kosloff, S., & Landau, M. J. (2013). Death goes to the polls: A meta-analysis of mortality salience effects on political attitudes. Political Psychology, 34, 183–200. 10.1111/pops.12005 Cichocka, A., Marchlewska, M., Golec de Zavala, A., & Olechowski, M. (2016). “They will not control us”: In-group positivity and belief in intergroup conspiracies. British Journal of Psychology, 107, 556–576. Crocker, J., Luhtanen, R., Broadnax, S., & Blaine, B. E. (1999). Belief in U.S. government conspiracies against blacks among black and white college students: Powerlessness or system blame? Personality and Social Psychology Bulletin, 25, 941– 953. Darwin, H., Neave, N., & Holmes, J. (2011). Belief in conspiracy theories: The role of paranormal belief, paranoid ideation and schizotypy. Personality and Individual Differences, 50, 1289– 1293. Douglas, K. M., Sutton, R. M., Callan, M. J., Dawtry, R. J., & Harvey, A. J. (2016). Someone is pulling the strings: Hypersensitive agency detection and belief in conspiracy theories. Thinking and Reasoning, 22, 57–77. 13546783.2015.1051586



Douglas, K. M., Sutton, R. M., & Cichocka, A. (2017). The psychology of conspiracy theories. Current Directions in Psychological Science, 26, 538–542. Federico, C. M., Williams, A. L., & Vitriol, J. A. (2018). The role of system identity threat in conspiracy theory endorsement. European Journal of Social Psychology, 48, 927–938. Goertzel, T. (1994). Belief in conspiracy theories. Political Psychology, 15, 733–744. Golec de Zavala, A., & Federico, C. M. (2018). Collective narcissism and the growth of conspiracy thinking over the course of the 2016 United States presidential election: A longitudinal analysis. European Journal of Social Psychology, 48, 1011–1018. Greenberg, J., Koole, S., & Pyszczynski, T. (2004). Handbook of experimental existential psychology. New York, NY: Guilford Press. Grzesiak-Feldman, M. (2013). The effect of high-anxiety situations on conspiracy thinking. Current Psychology, 32, 100–118. Heine, S. J., Proulx, T., & Vohs, K. D. (2006). The meaning maintenance model: On the coherence of social motivations. Personality and Social Psychology Review, 10, 88–110. Hofstadter, R. (1966). The paranoid style in American politics. In R. Hofstadter (Ed.), The paranoid style in American politics and other essays (pp. 3–40). New York, NY: Knopf. Hogg, M. A., Meehan, C., & Farquharson, J. (2010). The solace of radicalism: Self-uncertainty and group identification in the face of threat. Journal of Experimental Social Psychology, 46, 1061–1066. Imhoff, R., & Bruder, M. (2014). Speaking (un-)truth to power: Conspiracy mentality as a generalized political attitude. European Journal of Personality, 28, 25–43. 10.1002/per.1930 Jolley, D., & Douglas, K. (2014). The effects of anti-vaccine conspiracy theories on vaccination intentions. PLoS One, 9, e89177. Jolley, D., Douglas, K. M., & Sutton, R. M. (2018). Blaming a few bad apples to save a threatened barrel: The system-justifying function of conspiracy theories.
Political Psychology, 39, 465– 478. Kay, A. C., Gaucher, D., Napier, J. L., Callan, M. J., & Laurin, K. (2008). God and the government: Testing a compensatory control mechanism for the support of external systems. Journal of Personality and Social Psychology, 95, 18–35. 10.1037/0022-3514.95.1.18 Kruglanski, A. W., & Webster, D. M. (1996). Motivated closing of the mind: “Seizing” and “Freezing”. Psychological Review, 103, 263–283. LeBoeuf, R. A., & Norton, M. I. (2012). Consequence-cause matching: Looking to the consequences of events to infer their causes. Journal of Consumer Research, 39, 128–141. https:// Marchlewska, M., Cichocka, A., & Kossowska, M. (2018). Addicted to answers: Need for cognitive closure and the endorsement of conspiracy theories. European Journal of Social Psychology, 48, 109–117. Mashuri, A., & Zaduqisti, E. (2015). The effect of intergroup threat and social identity salience on the belief in conspiracy theories over terrorism in Indonesia: Collective angst as a mediator. International Journal of Psychological Research, 8, 24–35. Midlarsky, M. L. (2011). Origins of political extremism. Cambridge, UK: Cambridge University Press. Müller, J.-W. (2016). What is populism? Philadelphia, PA: University of Pennsylvania Press.


Neuberg, S. L., Kenrick, D. T., & Schaller, M. (2011). Human threat management systems: Self-protection and disease avoidance. Neuroscience and Biobehavioral Reviews, 35, 1042–1051. Newheiser, A.-K., Farias, M., & Tausch, N. (2011). The functional nature of conspiracy beliefs: Examining the underpinnings of belief in the Da Vinci Code conspiracy. Personality and Individual Differences, 51, 1007–1011. j.paid.2011.08.011 Oliver, J. E., & Wood, T. J. (2014). Conspiracy theories and the paranoid style(s) of mass opinion. American Journal of Political Science, 58, 952–966. Orosz, G., Krekó, P., Paskuj, B., Tóth-Király, I., Böthe, B., & Roland-Lévy, C. (2016). Changing conspiracy beliefs through rationality and ridiculing. Frontiers in Psychology, 7, 1525. 10.3389/fpsyg.2016.01525 Park, C. L. (2010). Making sense of the meaning literature: An integrative review of meaning making and its effects on adjustment to stressful life events. Psychological Bulletin, 136, 257–301. Pipes, D. (1997). Conspiracy: How the paranoid style flourishes and where it comes from. New York, NY: Simon & Schuster. Silva, B. C., Vegetti, F., & Littvay, L. (2017). The elite is up to something: Exploring the relationship between populism and belief in conspiracy theories. Swiss Political Science Review, 23, 423–443. Ståhl, T., & Van Prooijen, J.-W. (2018). Epistemic rationality: Skepticism toward unfounded beliefs requires sufficient cognitive ability and motivation to be rational. Personality and Individual Differences, 122, 155–163. j.paid.2017.10.026 Sullivan, D., Landau, M. J., & Rothschild, Z. K. (2010). An existential function of enemyship: Evidence that people attribute influence to personal and political enemies to compensate for threats to control. Journal of Personality and Social Psychology, 98, 434–449. Swami, V. (2012). Social psychological origins of conspiracy theories: The case of the Jewish conspiracy theory in Malaysia. Frontiers in Psychology, 3, 1–9.
2012.00280 Swami, V., Voracek, M., Stieger, S., Tran, U. S., & Furnham, A. (2014). Analytic thinking reduces belief in conspiracy theories. Cognition, 133, 572–585. 2014.08.006 Uscinski, J. E., & Parent, J. M. (2014). American conspiracy theories. New York, NY: Oxford University Press. Valdesolo, P., & Graham, J. (2014). Awe, uncertainty, and agency detection. Psychological Science, 25, 170–178. 10.1177/0956797613501884 Van der Linden, S. (2015). The conspiracy-effect: Exposure to conspiracy theories (about global warming) decreases pro-social behavior and science acceptance. Personality and Individual Differences, 87, 171–173. Van der Wal, R., Sutton, R. M., Lange, J., & Braga, J. (2018). Suspicious binds: Conspiracy thinking and tenuous perceptions of causal connections between co-occurring and spuriously correlated events. European Journal of Social Psychology, 48, 970–989. Van Harreveld, F., Rutjens, B. T., Schneider, I. K., Nohlen, H. U., & Keskinis, K. (2014). In doubt and disorderly: Ambivalence promotes compensatory perceptions of order. Journal of Experimental Psychology: General, 143, 1666–1676. https:// Van Prooijen, J.-W. (2016). Sometimes inclusion breeds suspicion: Self-uncertainty and belongingness predict belief in conspiracy theories. European Journal of Social Psychology, 46, 267–279.


Van Prooijen, J.-W. (2017). Why education predicts decreased belief in conspiracy theories. Applied Cognitive Psychology, 31, 50–58. Van Prooijen, J.-W. (2018). The psychology of conspiracy theories. Oxon, UK: Routledge. Van Prooijen, J.-W., & Acker, M. (2015). The influence of control on belief in conspiracy theories: Conceptual and applied extensions. Applied Cognitive Psychology, 29, 753–761. https://doi.org/10.1002/acp.3161 Van Prooijen, J.-W., & Douglas, K. M. (2017). Conspiracy theories as part of history: The role of societal crisis situations. Memory Studies, 10, 323–333. 1750698017701615 Van Prooijen, J.-W., & Douglas, K. M. (2018). Belief in conspiracy theories: Basic principles of an emerging research domain. European Journal of Social Psychology, 48, 897–908. Van Prooijen, J.-W., Douglas, K., & De Inocencio, C. (2018). Connecting the dots: Illusory pattern perception predicts beliefs in conspiracies and the supernatural. European Journal of Social Psychology, 48, 320–335. ejsp.2331 Van Prooijen, J.-W., & Jostmann, N. B. (2013). Belief in conspiracy theories: The influence of uncertainty and perceived morality. European Journal of Social Psychology, 43, 109–115. Van Prooijen, J.-W., & Krouwel, A. P. M. (2019). Psychological features of extreme political ideologies. Current Directions in Psychological Science, 25, 159–163. 0963721418817755 Van Prooijen, J.-W., Krouwel, A. P. M., & Pollet, T. (2015). Political extremism predicts belief in conspiracy theories. Social Psychological and Personality Science, 6, 570–578. 10.1177/1948550614567356 Van Prooijen, J.-W., Staman, J., & Krouwel, A. P. M. (2018). Increased conspiracy beliefs among ethnic and Muslim minorities. Applied Cognitive Psychology, 32, 661–667. 10.1002/acp.3442 Van Prooijen, J.-W., & Van Dijk, E. (2014). When consequence size predicts belief in conspiracy theories: The moderating role of perspective taking. Journal of Experimental Social Psychology, 55, 63–73.
Van Prooijen, J.-W., & Van Lange, P. A. M. (2014). The social dimension of belief in conspiracy theories. In J.-W. van Prooijen & P. A. M. van Lange (Eds.), Power, politics, and paranoia: Why people are suspicious of their leaders (pp. 237–253). Cambridge, UK: Cambridge University Press. Van Prooijen, J.-W., & Van Vugt, M. (2018). Conspiracy theories: Evolved functions and psychological mechanisms. Perspectives on Psychological Science, 13, 770–788. 10.1177/1745691618774270

© 2019 Hogrefe Publishing


History
Received November 8, 2018
Revision received February 22, 2019
Accepted May 16, 2019
Published online December 6, 2019

ORCID
Jan Willem van Prooijen

Jan-Willem van Prooijen
Department of Experimental and Applied Psychology
VU Amsterdam
Van der Boechorststraat 7
1081BT Amsterdam
The Netherlands

Jan-Willem van Prooijen received his PhD from Leiden University in 2002. He is currently an Associate Professor of Social Psychology at VU Amsterdam, and a Senior Researcher at the NSCR. His main research interests are conspiracy theories, political extremism, and unethical behavior.

European Psychologist (2020), 25(1), 16–25

Original Articles and Reviews

Religiosity’s Nomological Network and Temporal Change
Introducing an Extensive Country-Level Religiosity Index based on Gallup World Poll Data

Mohsen Joshanloo¹ and Jochen E. Gebauer²,³

¹ Department of Psychology, Keimyung University, Dalseo-Gu, Daegu, South Korea
² Department of Psychology, University of Mannheim, Germany
³ Department of Psychology, University of Copenhagen, Denmark

Abstract: Countries differ in their religiosity and these differences have been found to moderate numerous psychological effects. The burgeoning research in this area creates a demand for a country-level religiosity index that is comparable across a large number of countries. Here, we offer such an index, which covers 166 countries and rests on representative data from 1,619,300 participants of the Gallup World Poll. Moreover, we validate the novel index, use it to examine temporal change in worldwide religiosity over the last decade, and present a comprehensive analysis of country-level religiosity’s nomological network. The main results are as follows. First, the index was found to be a valid index of global religiosity. Second, country-level religiosity modestly increased between 2006 and 2011 and modestly decreased between 2011 and 2017 – demonstrating a curvilinear pattern. Finally, nomological network analysis revealed three things: it buttressed past evidence that religious countries are economically less developed; it clarified inconsistencies in the literature on the health status of inhabitants from religious countries, suggesting that their psychological and physical health tends to be particularly good once economic development is accounted for; and finally, it shed initial light on the associations between country-level religiosity and various psychological dimensions of culture (i.e., Hofstede’s cultural dimensions and country-level Big Five traits). These associations revealed that religious countries are primarily characterized by high levels of communion (i.e., collectivism and agreeableness). We are optimistic that the newly presented country-level religiosity index can satisfy the fast-growing demand for an accurate and comprehensive global religiosity index.

Keywords: religiosity, culture, country-level religiosity index, Gallup

Culture has profound effects on human thought, feeling, and behavior. The focal cultural dimension in the social sciences across recent decades was the individualism–collectivism dimension (Oyserman, Coon, & Kemmelmeier, 2002; Triandis, 1995). More recently, however, psychologists have focused their attention on other cultural dimensions, one of them being country-level religiosity (Diener, Tay, & Myers, 2011; Sedikides & Gebauer, 2010). A focus on country-level religiosity is warranted on at least two counts. Firstly, few other cultural dimensions encompass a similarly broad range. For example, there are many countries where almost all inhabitants are religious (e.g., most African countries), but there are also countries where the majority of inhabitants are not religious (e.g., most Scandinavian countries; Gebauer et al., 2014; Joshanloo, 2019). Secondly, religiosity comes with a particularly large set of values, norms, and habits, which profoundly influence virtually all aspects of culture (Cohen & Hill, 2007; Gebauer, Sedikides, & Schrade, 2017). To illustrate, higher income predicts higher well-being (Diener, Ng, Harter, & Arora, 2010), but this classic effect is diminished in religious countries, where religious anti-wealth norms prevail (e.g., “It is easier for a camel to go through the eye of a needle than for someone who is rich to enter the kingdom of God”, Mark 10: 25) (Gebauer, Nehrlich, Sedikides, & Neberich, 2013). Along similar lines, positive affective experiences predict higher life satisfaction (Cohn, Fredrickson, Brown, Mikels, & Conway, 2009), but that effect is diminished in religious countries (Joshanloo, 2019), where religious anti-pleasure norms are in place (e.g., “Frustration is better than laughter, because a sad face is good for the heart. The heart of the wise is in the house of mourning, but the heart of fools is in the house of pleasure”, Ecclesiastes 7: 3–4).

In order to understand the influence of the religious cultural climate on the human psyche, it is essential to possess knowledge about a country’s degree of religiosity, or – in more technical terms – to possess a valid and globally comparable religiosity index. The need for such an index is particularly pressing at present, because religiosity has become a mainstream topic in psychology (Saroglou, 2014; Sedikides, 2010). Resting on representative data from 1,619,300 individuals and spanning 166 countries, the present research provides the most updated and inclusive index of country-level religiosity. Moreover, we use this index to gauge the change of global religiosity over the past decade and to provide a thorough analysis of country-level religiosity’s nomological network.

Existent Country-Level Religiosity Indices

A few country-level religiosity indices already exist in the social science literature. All these indices have one thing in common: they estimate the religiosity of a country by averaging the religiosity of individual participants in that country. Put differently, they aggregate individual-level responses to the country level – a procedure that has been commonplace in sociology for decades (Books & Prysby, 1988; Stipak & Hensler, 1982). The difference between existent country-level religiosity indices, then, is the nature of the individual-level data.

Perhaps the most widely used index in the social sciences has been compiled by sociologist Phil Zuckerman (2007). Zuckerman’s index describes the percentage of atheists per country. The index makes use of individual-level data from multiple sources. Notably, though, the different sources report very different results for the same country. Consequently, for most countries, Zuckerman reports percent-ranges, rather than single percentages, and those ranges frequently possess wide margins (e.g., Denmark: 43%–80%, Finland: 28%–60%, Russia: 24%–48%).

More recent country-level religiosity indices rely on data from a single source. Some of those indices do not rely on representative samples per country (Gebauer et al., 2014; Gebauer, Paulhus, & Neberich, 2013). Other indices do rely on representative samples, but sample size per country is relatively small and/or the small number of countries in the index leaves room for improvement and/or the distribution of countries is biased toward western, industrialized countries (e.g., Fincher & Thornhill, 2012). Moreover, existent indices rely on data that have been assessed over the course of several years. It is possible, however, that country-level religiosity changes so rapidly that it is not reasonable to aggregate the data across several years. Finally, existent country-level religiosity indices have been related to only a few other country-level variables. Hence, we know little about the nomological network of national religiosity.

Temporal Change in Country-Level Religiosity

It is no exaggeration to state that psychologists have ignored temporal change in country-level religiosity. Psychology’s stance is hard to understand given that country-level religiosity is a consequential moderator of individual-level effects (Gebauer, Nehrlich, et al., 2013; Joshanloo, 2019). For example, individual-level religiosity is associated with psychological health benefits, but those benefits appear to be restricted to religious countries (Gebauer, Sedikides, & Neberich, 2012; Gebauer, Sedikides, Schönbrodt, et al., 2017). Findings such as these suggest that psychologists need to study the temporal change in country-level religiosity in order to estimate how individual-level effects of religiosity will change in more secularized/religious futures. If, for example, country-level religiosity further declines in Western European countries (Edgell, 2012; Stark & Iannaccone, 1994), the psychological health benefits of religiosity may vanish altogether in those countries.

In contrast to the practices of psychologists, sociologists have long sought to understand temporal change in country-level religiosity. Dating back to Max Weber (1930), secularization theory (Gorski & Altınordu, 2008; Norris & Inglehart, 2011) has been the predominant theory over much of sociology’s history. Stark and Iannaccone (1994, p. 230) pointedly described the theory and its prevalence: “For years everyone has agreed that many nations in Europe are extremely secularized – that few attend church services, that belief is on the wane, and that the power and presence of religion in public life has faded to a shadow of past glories. . . There also has been nearly universal agreement that Europe’s secularization represented the future of all societies – that the spread of science and modernity doomed religion.”

However, over the last 25 years or so, market theory (Sherkat & Ellison, 1999; Warner, 1993) has begun to replace secularization theory as the predominant theory of temporal change in country-level religiosity. Edgell (2012, p. 249) succinctly summarizes the theory: “Market theory, in all its major variants, explicitly challenges secularization theory’s core argument that religion is a poor fit in the modern world. Market theorists argue that modernity creates the conditions that foster religious privatization, pluralism, and voluntarism, causing religion to thrive – and, ironically, to retain much of its public significance.”

How has country-level religiosity changed worldwide over the last decade? The data used in the present study offer a unique opportunity to test this question. Secularization theory predicts a steady decline, whereas market theory predicts little change. Both theories are broad, macro-level theories. Yet, the last decade has seen a number of very specific historical events that may have influenced country-level religiosity over and above the general influences predicted by secularization theory and market theory. The rise of Islamist terrorism, for example, may have damaged the reputation of religions in general. Alternatively, it may have led religious people to stand by their religiosity even more firmly in order to profess solidarity with peaceful and wrongly accused religions. The latter effect may be most visible among Muslims, because Muslims may feel a particularly strong need to display solidarity due to external pressures and tensions. Regardless of reasons, it is timely and important to examine temporal change in worldwide religiosity over the last decade. That temporal change is not only an interesting topic of inquiry in itself; it is also relevant for our country-level religiosity index. If the change was too vast, it would be inappropriate to base our country-level religiosity index on data from a whole decade.

Nomological Network of Country-Level Religiosity

Cronbach and Meehl (1955) considered it essential to understand the nomological network of any construct. For individual-level constructs, psychologists have unanimously endorsed that view. For country-level constructs, however, nomological network analyses are rare (if they exist at all). This is not to say that country-level religiosity has never been related to other country-level constructs. However, those analyses have been rather limited and unsystematic. A brief review that describes what we know so far about country-level correlates of country-level religiosity follows.

To begin with, there is firm evidence that religiosity is higher in economically less developed countries (Barro & McCleary, 2003; Oishi & Diener, 2014). Compared to secular countries, religious countries possess a lower gross domestic product (GDP) per capita (Rees, 2009), more income inequality (Gini coefficient; Barber, 2011), a larger proportion of agricultural employment (Barber, 2013), less tertiary education (i.e., fewer postsecondary programs; Barber, 2013), and lower scores on the Human Development Index (HDI; Gaskins, Golder, & Siegel, 2013). These associations are consistent with secularization theory, according to which the upsurge of scientific knowledge (e.g., higher education) causes declines in religiosity, because religious explanations of how the “world works” are increasingly perceived as unlikely. The associations, however, are also consistent with need-based theories of religion (Sedikides, 2010). According to those theories, religiosity becomes less important in economically developed countries, because economic development satisfies many psychological needs for which human beings had to rely on religiosity before (e.g., control – Kay, Gaucher, McGregor, & Nash, 2010; psychological security – Diener et al., 2011).

There is also evidence that people in religious countries are psychologically and physically less healthy than people in secular countries. Compared to inhabitants of secular countries, those of religious countries were found to evaluate their life more negatively and report more negative feelings (Diener et al., 2011). Notably, though, controlling for country-level economic development diminished (or even reversed) those associations (Deaton & Stone, 2013). Inhabitants of religious countries were also found to commit suicide more frequently, but controlling for country-level economic development reversed that effect (Stack, 1983). Regarding physical health, religious countries were found to possess higher rates of infant and maternal mortality, and life expectancy in religious countries was found to be lower than in secular countries (Reeve, 2009). Again, controlling for country-level economic development diminished those associations (and in one case, they disappeared altogether).

Of note, there are exceptions to the just-described pattern of results. For example, Diener et al. (2011) found no association between country-level religiosity and country-level positive feelings, and Gebauer, Sedikides, Schönbrodt, et al. (2017) found a positive association between religiosity and self-esteem at the country level. Similarly, two investigations found a negative association between country-level religiosity and country-level suicide rates even when no covariates were included (Neeleman, Halpern, Leon, & Lewis, 1997; Oishi & Diener, 2014). Together, the literature on country-level religiosity and psychological and physical health is less consistent than the literature on country-level religiosity and economic development. Thus, the former literature is in particular need of additional evidence. For that reason, we included many health variables in our nomological network analysis.

Finally, little is known about the associations between country-level religiosity and traditional cultural variables, such as Hofstede’s four dimensions of culture – that is, individualism, masculinity, power distance, and uncertainty avoidance (Hofstede, Hofstede, & Minkov, 2010). Exceptions include two recent studies that found religious countries to be less individualistic than secular ones (Gebauer, Sedikides, Schönbrodt, et al., 2017; Oishi & Diener, 2014). To our knowledge, it has not yet been tested whether that association persists when country-level economic development is controlled for. Hofstede’s four dimensions capture many psychological differences between countries, but these dimensions certainly do not capture all those differences. Rentfrow, Gosling, and their colleagues have shown that geographical units also differ from each other along the Big Five personality dimensions – that is, agreeableness, conscientiousness, extraversion, openness to experience, and neuroticism (Rentfrow, 2010; Rentfrow, Gosling, & Potter, 2008). Yet, the associations between country-level religiosity and country-level Big Five personality have not yet been studied, even though there is much research on those associations at the individual level (Gebauer et al., 2014; Saroglou, 2010). The present research sought to close these gaps in the literature.

Methods

Participants
The present research capitalizes on 2005–2017 data from the Gallup World Poll (GWP). The GWP continually surveys residents in 167 countries and uses randomly selected, nationally representative samples. Typically, Gallup annually surveys 1,000 individuals over 15 years of age in each country. Unfortunately, in some countries data have not been collected in some years, and/or data collection started later than 2005. The present research relies on all available GWP data, consisting of 1,862,900 participants across 166 countries (age: M = 40.97, SD = 17.47; sex: 53.4% women).¹ Electronic Supplementary Material (ESM 1) includes the countries’ names, gender ratios, average ages, and national sample sizes.

Measures

Religiosity Index
The GWP measures religiosity with the following item: “Is religion an important part of your daily life?” Responses are coded as “yes,” “no,” “don’t know,” and “refused.” Of all participants, 1,619,300 answered this question. The country-level religiosity index was calculated as the percentage of people per country who responded with “yes” (i.e., stated that religion is an important part of their daily lives).²

The External Measure of Religiosity
The World Values Survey (Inglehart et al., 2014) includes multiple religiosity items. We z-standardized and averaged four of them to yield an individual-level measure that fits common definitions of general religiosity (Gebauer & Maio, 2012; Hill & Hood, 1999): (1) religious self-categorization (“Independently of whether you go to church or not, would you say you are a religious person”), (2) belief in God (“Which, if any, of the following do you believe in?: God”), (3) church attendance (“Apart from weddings, funerals and baptisms, about how often do you attend religious services these days?”), and (4) habits of prayer (“Do you take some moments of prayer, meditation or contemplation or something like that?”). In a second step, we averaged participants’ general religiosity within each country. We expected a high correlation between this measure and our new index of country-level religiosity.

Prosperity
The nine sub-indices of the Legatum Prosperity Index (Legatum Institute, 2017) assess (1) economic quality, (2) business environment, (3) governance, (4) personal freedom, (5) social capital, (6) safety and security, (7) education, (8) health, and (9) natural environment. Each sub-index consists of dozens of objective and subjective variables. Space does not permit us to provide more details, but detailed information can be obtained from http://www.prosperity.com. We used the economic quality sub-index as a general measure of economic progress and national wealth in our analyses.³

Health Security and Ecological Stress
Barber (2013) used country-level pathogen prevalence as an estimate of health security. We included country-level pathogen prevalence for the same reason in our analyses, utilizing Fincher and Thornhill’s (2012) index of nonzoonotic parasite prevalence (i.e., the prevalence of disease-causing parasites, which are transmittable between human beings). Along Barber’s (2013) line of reasoning, climatic harshness (i.e., very hot summers and/or very cold winters) can influence health security and is associated with pathogen prevalence (Van de Vliert, 2013). Thus, we included these variables in our analyses, utilizing Van de Vliert’s (2013) indices of cold demand (an estimate of hotness in the summer) and heat demand (an estimate of coldness in the winter).
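The two aggregation steps described above for the religiosity index and the external measure can be sketched in a few lines of Python. This is a minimal illustration, not the authors’ code: the function names and the input layout are hypothetical, and it assumes that all four response codes (“yes,” “no,” “don’t know,” “refused”) count toward the denominator of the percentage.

```python
from collections import defaultdict
from statistics import mean, pstdev


def religiosity_index(responses):
    """Percentage of "yes" answers per country.

    responses: iterable of (country, answer) pairs.
    Assumption: every recorded answer counts toward the denominator.
    """
    counts = defaultdict(lambda: [0, 0])  # country -> [yes, total]
    for country, answer in responses:
        counts[country][1] += 1
        if answer == "yes":
            counts[country][0] += 1
    return {c: 100.0 * yes / total for c, (yes, total) in counts.items()}


def zscores(values):
    """z-standardize one item across all participants."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]


def composite_by_country(rows):
    """Average z-standardized items per person, then per country.

    rows: list of (country, [item1, item2, item3, item4]).
    """
    columns = list(zip(*(items for _, items in rows)))
    z_columns = [zscores(col) for col in columns]
    per_person = [mean(z) for z in zip(*z_columns)]
    by_country = defaultdict(list)
    for (country, _), score in zip(rows, per_person):
        by_country[country].append(score)
    return {c: mean(scores) for c, scores in by_country.items()}
```

Standardizing each item before averaging puts items with different response scales on a common metric, so that no single item dominates the composite.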

¹ The GWP includes one additional country (Oman), but the religiosity measure has never been administered in this country. We, thus, excluded that country.

² Diener et al. (2011) used the same procedure to estimate country-level religiosity on the basis of an early version of the GWP dataset. Diener et al.’s version includes data across 4 years only (2005–2009). It also includes only about a quarter of the participants of the current GWP. Additionally, it includes about 10% fewer countries. Finally, it includes relatively more Western countries. Joshanloo (2019) used a similar version of the country-level religiosity index described in the present article to test a hypothesis about life satisfaction.

³ It is noteworthy that some of the prosperity variables rely on the GWP items that we also use here. For example, some of the affect items that we use in our analyses (i.e., enjoyment, smile, and sadness) have been used to construct the health sub-index. However, given that we use these variables in separate analyses, the overlap does not affect our analyses.


European Psychologist (2020), 25(1), 26–40


Depression and Anxiety
The 2015 data from the Global Burden of Disease Study (Institute for Health Metrics and Evaluation, 2015) provide indices of disability-adjusted life years due to depression and anxiety, including years of life lost from premature death and years lived with less than full health.

Suicide Rate
We used the World Health Organization age-standardized suicide rates per country for the year 2015 (http://www. …).

Hedonic Well-Being
Eight aspects of hedonic well-being are measured by items from the GWP: (1) present life satisfaction, (2) future life satisfaction, (3) enjoyment, (4) laughter, (5) worry, (6) sadness, (7) stress, and (8) anger. ESM 2 includes those items. We averaged each item within each country to construct country-level scores.

Purpose in Life
Purpose in life is measured in the GWP with the following item: “Do you feel your life has an important purpose or meaning?” From 2008 to 2011, the item was used in a small number of countries only, and it was completely excluded from the GWP after 2011. Therefore, this variable is available for a smaller number of countries.

Eudaimonic Well-Being
The national eudaimonic well-being index (Joshanloo, 2018), based on data from 2005–2017, was used to measure optimal functioning. This index is composed of seven GWP items measuring learning experience, social support, respect, efficacy beliefs, sense of freedom, and prosociality.

Human Development Index
The 2015 HDI was obtained from http://hdr.undp.org/en/data.

Hofstede’s Dimensions of National Culture
We measured Hofstede’s four cultural dimensions (individualism, masculinity, power distance, and uncertainty avoidance) with country-level scores from Hofstede et al. (2010).

Big Five Personality Traits
Countries differ in their inhabitants’ average agreeableness, conscientiousness, extraversion, openness to experience, and neuroticism. We used country-level scores based on the Big Five Inventory (John, Donahue, & Kentle, 1991), obtained from Gebauer et al. (2015).

Self-Esteem
Countries differ in their inhabitants’ average self-esteem. Gebauer et al. (2015) reported a country-level self-esteem score based on a version of the Single-Item Self-Esteem Scale (Robins, Hendin, & Trzesniewski, 2001), which is used here.

Results and Discussion

Descriptive Statistics
As a reminder, the GWP assesses religiosity with the following question: “Is religion an important part of your daily life?” 71.4% of the participants responded with “yes,” 26.9% responded with “no,” and only 1.3% responded with “don’t know” (0.4% refused to answer the question). Table 1 includes the national averages (i.e., the country-level religiosity index) and Figure 1 is a graphical display of those averages. Mean country-level religiosity was 73.6% (min = 15.8%, max = 98.9%, SD = 24.5%).

Relationship With the External Measure of Religiosity

Our country-level religiosity index rests on a single item only. At the individual level, single items are often described as unreliable (although single-item measures of religiosity do not appear to suffer from low reliability; Gebauer, Paulhus, et al., 2013). Importantly, aggregation to the country level should average out any possible unreliability of single-item measures. To empirically corroborate this logic and to test whether our country-level religiosity index captures global religiosity adequately, we examined the association between our index and the external measure of global religiosity from the World Values Survey. As expected, the association was near-perfect, r(95) = .894, p < .001. To examine the robustness of this association, we split the GWP data of each country into two random halves and calculated two independent country-level religiosity indices, each one based on one half of the GWP data. The associations between those two GWP-based religiosity indices and the external measure of global religiosity were r(95) = .893, p < .001, and r(95) = .895, p < .001 (the two GWP-based indices were also near-perfectly intercorrelated, r(166) = .999, p < .001). This suggests that our country-level religiosity index is a robust index of religiosity at the country level.
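The split-half robustness check can be illustrated with a short Python sketch. It is only an approximation of the procedure under stated assumptions: the function names are hypothetical, each country is assumed to have at least two answers, answers are coded 1 for “yes” and 0 otherwise, and the correlation is an ordinary Pearson coefficient computed over countries.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_indices(answers_by_country, seed=0):
    """Split each country's answers into two random halves.

    answers_by_country: dict mapping country -> list of 0/1 (1 = "yes").
    Returns two dicts with the percentage of "yes" answers in each half.
    Assumption: every country has at least two answers.
    """
    rng = random.Random(seed)
    half_a, half_b = {}, {}
    for country, answers in answers_by_country.items():
        shuffled = list(answers)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a[country] = 100.0 * sum(shuffled[:mid]) / mid
        half_b[country] = 100.0 * sum(shuffled[mid:]) / (len(shuffled) - mid)
    return half_a, half_b
```

With the two half-based indices in hand, `pearson_r` can be applied to the country scores of the two halves, or to each half and an external country-level measure, after aligning the countries in the same order.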

Temporal Change in Country-Level Religiosity We conducted a latent growth curve analysis to examine the global, temporal change in country-level religiosity Ó 2019 Hogrefe Publishing

M. Joshanloo & J. E. Gebauer, Country-Level Religiosity


Table 1. National religiosity scores Country










1. Somaliland

98.9 35. Morocco

94.7 69. Bolivia

88.2 103. Turkmenistan

73.3 137. Canada

2. Sri Lanka

98.6 36. Nepal

94.5 70. Botswana

87.5 104. Cyprus

72.8 138. Lithuania


3. Indonesia

98.4 37. Mauritius

94.5 71. Benin

87.5 105. Moldova

72.7 139. Switzerland


4. Bangladesh

98.2 38. Jordan

94.4 72. El Salvador

87.3 106. Armenia

72.1 140. Hungary


5. Malawi

98.1 39. Burundi

94.4 73. Angola

87.0 107. Greece

71.3 141. Germany


6. Egypt

97.8 40. Swaziland

94.4 74. Guyana

86.8 108. Bosnia Herz

71.1 142. Bulgaria


7. Comoros

97.8 41. Kenya

94.1 75. Colombia

86.8 109. Kyrgyzstan

70.1 143. Spain

8. Ethiopia

97.5 42. Ghana

94.1 76. Cambodia

86.7 110. Nagorno Karabakh 70.0 144. Slovenia


39.0 37.5

9. Yemen

97.5 43. Cameroon

94.0 77. Togo

86.5 111. Chile

68.0 145. Latvia


10. Afghanistan

97.4 44. Pakistan

94.0 78. Ecuador

85.7 112. Poland

66.9 146. Luxembourg


11. Niger

97.4 45. Palestine

93.9 79. Trinidad and Tobago 85.2 113. Italy

65.2 147. Belarus


12. Senegal

97.2 46. UAE

93.7 80. Nicaragua

65.1 148. Iceland


13. Bhutan

97.2 47. Rwanda

93.7 81. Lebanon

85.2 115. USA

65.0 149. Vietnam


14. Mauritania

97.0 48. Congo Kinshasa 93.7 82. South Africa

85.1 116. Northern Cyprus

64.1 150. Cuba


15. Somalia

97.0 49. Liberia

93.5 83. Iraq

85.0 117. Croatia

63.7 151. Belgium


16. Sierra Leone

96.7 50. Algeria

92.9 84. Honduras

84.6 118. Argentina

63.4 152. Russia


92.7 85. Malta

84.0 119. Mexico

63.3 153. Australia

32.3 32.1

17. Cent. Af. Republic 96.7 51. Chad

85.2 114. Belize

18. Libya

96.6 52. Madagascar

92.5 86. Peru

83.3 120. Singapore

62.5 154. New Zealand

19. Djibouti

96.3 53. Tunisia

92.3 87. Puerto Rico

83.2 121. Portugal

61.9 155. Finland


20. Saudi Arabia

96.2 54. South Sudan

91.6 88. Panama

83.0 122. Montenegro

58.1 156. Netherlands


21. Guinea

96.1 55. Sudan

91.4 89. Jamaica

82.7 123. Uzbekistan

58.0 157. United Kingdom 29.9

22. Tanzania

96.0 56. Burkina Faso

91.2 90. Syria

82.6 124. Serbia

54.9 158. France

23. Thailand

95.7 57. Kuwait

91.2 91. India

82.5 125. Ireland

54.6 159. Czech Republic 25.5

24. Gambia

95.7 58. Namibia

90.8 92. Iran

82.5 126. Albania

52.0 160. Japan


25. Laos

95.7 59. Congo Brazzaville 90.5 93. Costa Rica

82.3 127. Slovakia

47.7 161. Hong Kong


26. Myanmar

95.3 60. Ivory Coast

81.6 128. Austria

46.4 162. Norway


90.4 94. Romania


27. Mali

95.3 61. Malaysia

90.0 95. Kosovo

81.4 129. Taiwan

46.0 163. Estonia


28. Bahrain

95.1 62. Mozambique

90.0 96. Haiti

80.6 130. South Korea

45.2 164. Denmark


29. Nigeria

95.1 63. Paraguay

89.8 97. Georgia

80.5 131. Israel

45.0 165. Sweden


30. Lesotho

95.0 64. Brazil

89.4 98. Turkey

79.7 132. Kazakhstan

44.8 166. China


31. Qatar

94.9 65. Gabon

89.2 99. Suriname

78.6 133. Mongolia


32. Uganda

94.9 66. Zimbabwe

88.8 100. Tajikistan

78.2 134. Ukraine


33. Zambia

94.8 67. Guatemala

88.4 101. Venezuela

78.1 135. Uruguay


34. Philippines

94.8 68. Dominican Rep.

88.4 102. Macedonia

76.0 136. Azerbaijan

42.0 Total

between 2006 and 2017.4 Latent growth curve modeling defines two higher-order latent variables, one capturing initial status (intercept) and the other one capturing rate of change (slope). The mean and variance estimates for the rate of change factor were of particular interest. The mean estimate indicates the magnitude and direction of average national change over time, whereas the variance estimate indicates whether there are significant differences between countries regarding the rate of change (Duncan & Duncan, 2004; Preacher, 2008). Our latent growth curve model used country-level religiosity per year as manifest indicators. We used Mplus 8.0 (Muthén & Muthén, 2017) 4


for data analysis and estimated our models with Robust Maximum Likelihood (MLR) estimation and handled missing data via Full Information Maximum Likelihood (FIML). ESM 3 includes the annual religiosity scores for each country. A linear growth curve model (with slope factor loadings of 0–11 for the 12 years) yielded the following fit indices: w2(73) = 232.957, p < .001; RMSEA = 0.115, CFI = 0.947. A quadratic growth curve model (with centered time codes: 5.5, 4.4, 3.5, 2.5, 1.5, 0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5) provided superior fit: w2(69) = 177.793, p < .001; RMSEA = 0.097, CFI = 0.964. We chose to base our

⁴ We excluded the year 2005 from these analyses because religiosity data from that year were available for only 26 countries.


Figure 1. Religiosity across 166 countries.

analyses on the quadratic model because it fit better than the linear model, Δχ²(4) = 56.183, p < .001. Consequently, we will not discuss the linear model any further. The level of national religiosity at the midpoint of the study (between the years 2011 and 2012) was estimated to be 74.040%. The midpoint levels of religiosity differed significantly between nations (variance = 597.004, p < .001). During the 12-year period, the linear trajectory was not significant (M = −0.028, p = .683),⁵ yet the quadratic trajectory was significant (M = −0.038, p = .003). These results suggest that the global level of religiosity has followed a quadratic trajectory over the past 12 years; Figure 2 displays that trajectory. The rates of linear change (variance = 0.581, p < .001) and quadratic change (variance = 0.015, p < .001) varied significantly between nations.

In a follow-up analysis, we conducted a latent class growth curve analysis to assess heterogeneity in the trajectories of religiosity across identifiable groups of countries. For this purpose, the quadratic growth curve model was extended to incorporate categorical latent classes (Muthén & Muthén, 2000). We tested models with one, two, three, and four classes. To choose the optimal number of classes, we relied on the Bayesian Information Criterion (BIC), the sample-size adjusted BIC (SSABIC), entropy, the Lo-Mendell-Rubin adjusted likelihood ratio test (LMR-LRT), and the bootstrapped LRT (BLRT). ESM 4 includes those results. Given that entropies and BLRTs did not differ much across the four models, we relied on BIC, SSABIC, and LMR-LRT. The three- and four-class models had better BIC/SSABIC values than the two-class model. However, the LMR-LRT for the four-class model did not show a significant improvement over the three-class model. Therefore, we selected the three-class model as the optimal model. Figure 3 and ESM 5 show the country classification.

The linear and quadratic slopes were not significant in Groups 1 and 2 (brown and purple groups in Figure 3). In Group 3 (green), the linear component of the slope was not significant (M = 0.088, p = .167), but the quadratic component was (M = −0.053, p = .005). This pattern suggests a quadratic trajectory in the 101 countries of Group 3, which seems to account for the global quadratic trend given the relatively large size of this group. Figure 4 shows this quadratic trajectory. The average intercepts were 65.417% (Group 1), 35.796% (Group 2), and 91.037% (Group 3), indicating that what distinguishes Group 3 from the other two groups is its higher average level of religiosity. In sum, these results suggest that the quadratic global trend is largely accounted for by a rather small quadratic trajectory in highly religious nations.
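The information criteria used to compare the class solutions can be sketched as follows. This is a minimal illustration of the standard BIC and sample-size adjusted BIC formulas, not the Mplus output: the log-likelihoods and parameter counts are hypothetical placeholders (the actual values are reported in ESM 4), and the only figure taken from the text is the number of countries, 166, which is assumed here to be the sample size of the model.

```python
import math

N_COUNTRIES = 166  # number of countries named in the text (assumed sample size)


def bic(log_likelihood: float, n_params: int, n: int) -> float:
    """Bayesian Information Criterion: -2*logL + k*ln(n). Lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n)


def ssabic(log_likelihood: float, n_params: int, n: int) -> float:
    """Sample-size adjusted BIC: same penalty, with n replaced by (n + 2) / 24."""
    return -2.0 * log_likelihood + n_params * math.log((n + 2) / 24.0)


# HYPOTHETICAL (log-likelihood, free parameters) for one to four classes.
hypothetical_models = {
    1: (-5200.0, 21),
    2: (-5100.0, 25),
    3: (-5040.0, 29),
    4: (-5025.0, 33),
}

for n_classes, (log_lik, n_params) in hypothetical_models.items():
    print(
        n_classes,
        round(bic(log_lik, n_params, N_COUNTRIES), 1),
        round(ssabic(log_lik, n_params, N_COUNTRIES), 1),
    )
```

Because the adjusted sample size is much smaller than n, SSABIC penalizes additional classes less heavily than BIC, which is one reason the two criteria are usually inspected together with a likelihood ratio test.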

⁵ The mean of the linear trajectory represents the estimate of the instantaneous rate of change at the midpoint of the study.



Figure 2. Sample and estimated means for national religiosity in the whole sample (2006–2017).
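Two quick consistency checks on the whole-sample quadratic model can be sketched in a few lines. The first rebuilds the model-implied means from the reported midpoint intercept (74.040) and the linear and quadratic means (magnitudes 0.028 and 0.038). Both are taken to be negative here, an assumption that the estimated levels reported in the Concluding Remarks (73.039% in 2006, 74.044% in 2011, 72.725% in 2017) imply. The second recomputes RMSEA from χ², df, and N, assuming N = 166 countries and the variant of the formula that divides by df × N. The function names are illustrative only.

```python
import math

MIDPOINT_YEAR = 2011.5   # time is centered between 2011 and 2012
INTERCEPT = 74.040       # estimated religiosity (%) at the midpoint
LINEAR = -0.028          # mean linear slope (negative sign is an assumption)
QUADRATIC = -0.038       # mean quadratic slope (negative sign is an assumption)


def implied_mean(year: float) -> float:
    """Model-implied religiosity (%) for a calendar year."""
    t = year - MIDPOINT_YEAR  # centered time code, -5.5 ... 5.5
    return INTERCEPT + LINEAR * t + QUADRATIC * t ** 2


def instantaneous_slope(year: float) -> float:
    """First derivative of the trajectory; equals LINEAR at the midpoint."""
    t = year - MIDPOINT_YEAR
    return LINEAR + 2.0 * QUADRATIC * t


def rmsea(chi_square: float, df: int, n: int) -> float:
    """RMSEA = sqrt(max(chi2 - df, 0) / (df * n)); the N-based variant is assumed."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * n))


# Year at which the implied trajectory peaks (vertex of the parabola).
peak_year = MIDPOINT_YEAR - LINEAR / (2.0 * QUADRATIC)

for year in (2006, 2011, 2017):
    print(year, round(implied_mean(year), 2))
print("peak year:", round(peak_year, 2))
print("RMSEA linear:", round(rmsea(232.957, 73, 166), 3))
print("RMSEA quadratic:", round(rmsea(177.793, 69, 166), 3))
```

With the rounded coefficients, the implied means fall within about 0.01 percentage points of the reported 2006, 2011, and 2017 estimates, and the recomputed RMSEA values round to the reported 0.115 and 0.097.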

Figure 3. Country classification based on latent class growth analyses.

Trends for Christianity and Islam From 2008 to 2017

To measure the rate of change for major religious affiliations, we ran five separate analyses, one for each of the five major religious denominations. For Christianity, we first recoded the religiosity responses of non-Christian participants to “0.” In essence, the newly calculated item is conceptually equivalent to the following question: “Is Christianity an important part of your daily life?” Next, we averaged the responses to that Christianity item across all participants in each country. This procedure was repeated for Islam, Buddhism, Hinduism, and secularism/no religion. In many countries, very few respondents reported Buddhism, Hinduism, or secularism/no religion. As a result, the analyses involving those categories did not converge. Thus, we report results only for Christianity and Islam. We examined linear models only, because quadratic models did not converge either. We excluded the years 2005–2007 from those analyses because of insufficient religion-specific data for those years.
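The recoding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the respondent records, country labels, and the function name are hypothetical, and the religiosity item is assumed to be coded 1 (yes) or 0 (no).

```python
from collections import defaultdict

# HYPOTHETICAL respondent records: (country, affiliation, religiosity),
# where religiosity is 1 if religion is an important part of daily life, else 0.
respondents = [
    ("A", "Christianity", 1),
    ("A", "Christianity", 0),
    ("A", "Islam", 1),
    ("A", "None", 0),
    ("B", "Islam", 1),
    ("B", "Islam", 1),
    ("B", "Christianity", 1),
    ("B", "Buddhism", 0),
]


def affiliation_specific_index(records, target):
    """Percentage per country for whom `target` is an important part of daily life.

    Responses from participants with any other affiliation are recoded to 0
    before averaging across ALL participants in the country.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for country, affiliation, religiosity in records:
        recoded = religiosity if affiliation == target else 0
        totals[country] += recoded
        counts[country] += 1
    return {country: 100.0 * totals[country] / counts[country] for country in counts}


print(affiliation_specific_index(respondents, "Christianity"))
print(affiliation_specific_index(respondents, "Islam"))
```

Because the denominator is every participant in the country, the resulting score reflects both how common an affiliation is and how religious its members are, which is why affiliations with few respondents yield scores near zero in many countries.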



Figure 4. Estimated means for national religiosity in Group 3 (2006–2017). ESM 10 presents the sample means.

Our religion-specific latent growth curve models used religion-specific, country-level religiosity in each of the 10 years as manifest indicators. A linear growth curve model of Christianity (with slope factor loadings of 0–9) yielded the following fit indices: χ²(50) = 197.192, p < .001; RMSEA = 0.135, CFI = 0.950. Based on the modification indices, three covariances were specified to improve model fit: between 2011 and 2012, 2012 and 2013, and 2011 and 2013. The modified model provided an acceptable fit: χ²(47) = 115.567, p < .001; RMSEA = 0.095, CFI = 0.977. The initial level of global Christianity was estimated to be 41.753%. The initial levels of Christianity differed significantly between nations (variance = 1,135.562, p < .001). During the 10-year period, the rate of change for Christianity was negative, very small, and significant (M = −0.130, p = .036; see Figure 5). The rates of change varied significantly across nations (variance = 0.431, p < .001). These results suggest a very small downward trend of global Christianity over the last decade.

For Islam, a linear growth curve model yielded acceptable fit indices: χ²(50) = 106.336, p < .001; RMSEA = 0.083, CFI = 0.967. The initial level of Islamic religion was estimated to be 25.484%. The initial levels of Islamic religion differed significantly between nations (variance = 1,319.498, p < .001). During the 10-year period, the rate of change for Islam was very small and nonsignificant (estimated to be 0.003, p = .969). The rates of change varied significantly across nations (variance = 0.463, p = .002). These results suggest that the level of Islamic religion has not changed over the past decade.

In all, we found nuanced and interesting patterns of temporal change in religiosity over the last decade. Those changes were not very large. This pattern of results suggests that it is appropriate to base our country-level religiosity index on all GWP data (i.e., data from 2005–2017). Another way to probe for that suitability is to calculate 12 country-level religiosity indices, one for each year, and to examine their interrelation. The interrelations between those 12 indices were virtually perfect, mean r = .97 (.94 ≤ r ≤ .99). Hence, it appears justifiable to base our country-level religiosity index on all available GWP data (i.e., on data between 2005 and 2017).
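The size of these linear trends is easier to judge once they are converted into model-implied levels. The sketch below does that arithmetic with the reported initial levels and slopes, assuming slope factor loadings of 0–9 for 2008–2017 and a negative sign for the Christianity slope (the text describes it as negative). The function name is illustrative only.

```python
FIRST_YEAR = 2008  # slope factor loading 0


def implied_level(initial: float, slope: float, year: int) -> float:
    """Model-implied level (%) under a linear growth curve."""
    return initial + slope * (year - FIRST_YEAR)


# Reported estimates: initial level in 2008 and mean yearly rate of change.
christianity_2017 = implied_level(41.753, -0.130, 2017)
islam_2017 = implied_level(25.484, 0.003, 2017)

print(round(christianity_2017, 3))  # close to the 40.582% given in the Concluding Remarks
print(round(islam_2017, 3))         # close to the 25.508% given in the Concluding Remarks
```

Over nine yearly steps, the Christianity slope amounts to a decline of roughly 1.2 percentage points, whereas the Islam slope amounts to a change of less than 0.03 percentage points.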

Nomological Network of Country-Level Religiosity

Table 2 includes the correlations between our country-level religiosity index and 36 other country-level variables.⁶ To begin with, we probed for the well-documented association between higher country-level religiosity and lower economic development. We replicated that association.

⁶ We ran the same analyses in the two split samples and found highly similar results (see ESM 6 and 7). This shows that all results are robust and replicable.



Figure 5. Global change in Christianity between 2008 and 2017.

Compared to secular countries, religious countries evidenced (1) lower scores on the HDI, (2) lower economic quality, (3) a worse business environment, (4) worse political governance, (5) less safety and security, (6) less freedom, (7) less social capital, (8) worse performance in the quality of the natural environment and preservation efforts, and (9) worse education.

Next, we examined the association between country-level religiosity and psychological and physical health. As a reminder, the results on that issue have been somewhat mixed in prior research (see Introduction). For the most part, the prior results indicated that higher country-level religiosity was associated with worse psychological and physical health. Our non-adjusted correlation results were in line with those prior results. Specifically, religious countries evidenced (1) lower life satisfaction, (2) less enjoyment, (3) more worry, (4) more sadness, (5) more anger, (6) a worse health status,⁷ and (7) lower health security (higher pathogen prevalence) and more heat demand (but less cold demand). On the other hand, country-level religiosity was also unrelated to some health aspects and even positively related to a few others. Specifically, country-level religiosity was not associated with (1) future life satisfaction, (2) laughter, (3) stress, (4) eudaimonic well-being, and (5) suicide rates. Moreover, people in religious countries evidenced lower rates of disability due to (1) depression and (2) anxiety, and reported (3) higher self-esteem and (4) much more purpose in life. Thus, in line with past research, the health status of religious countries was generally lower than that of secular countries. Also in line with previous research, however, those results were less consistent than the results on economic development.

Previous research has consistently shown that most negative associations between higher country-level religiosity and worse health outcomes are largely due to the worse economic development in religious countries (see Introduction). Put differently, there is little evidence that country-level religiosity per se (i.e., once economic development is controlled) is detrimental for psychological and physical health. Thus, statistically controlling for economic development should curb the negative associations between higher country-level religiosity and worse health outcomes. This, too, was the case with our data. Controlling for economic quality⁸ rendered nonsignificant four of the seven associations between country-level religiosity and health that were previously significant (i.e., life satisfaction, worry, sadness, and health). In addition, one previously negative association turned positive (i.e., enjoyment). Only three associations remained positive and significant, indicating worse health for religious nations (i.e., anger, pathogen prevalence, and heat demand), but the size of those associations was reduced. Plus, whereas the unadjusted association between religiosity and stress was nonsignificant, it became positive and significant once economic quality was controlled

⁷ The prosperity health sub-index measures basic physical and mental health, health infrastructure, and preventative care.
⁸ Conceptually very similar results emerged when controlling for the HDI instead of economic quality. ESM 8 displays those results.


Table 2. Correlations with country-level religiosity
Variable
Economic Development
HDI .715*** 158
.661*** 148
.414*** 147 –
.642*** 148
.629*** 148
.630*** 148
.564*** 148
Social capital .339*** 148
148 148
.507*** 148
.754*** 148
.489*** 148
Psychological and Physical Health
Life satisfaction .579*** 166
.383*** 166
.614*** 148
Pathogen prevalence .626*** 160 .380*** 148
Heat demand .517*** 160 .397*** 148
Cold demand .706*** 160 .616*** 148
Future life satisfaction
.381*** 148
Eudaimonic well-being
.326*** 148
.395*** 161
148 148
148 148
.292*** 161
Purpose in life .664*** 132 .529*** 123
Psychological Dimensions of Culture
Power distance
Uncertainty avoidance
Note. rp = partial correlation controlling for economic quality. *p < .05; **p < .01; ***p < .001.

for. Finally, controlling for economic quality turned significant four of the five associations that were previously nonsignificant (i.e., more future life satisfaction, laughter, and eudaimonic well-being, and lower suicide rates).⁹ In all, country-level religiosity per se does not appear to be associated with poorer country-level psychological and physical health. To the contrary, results after controlling for economic quality suggest that country-level religiosity is associated with certain health benefits. A possible explanation for the latter may be the following: Inhabitants of religious countries experience a greater sense of purpose in life (Oishi & Diener, 2014), and that greater sense of purpose may explain the better psychological and physical health in religious countries. Indeed, Oishi and Diener (2014) found initial evidence consistent with this prediction. Specifically, they found that country-level religiosity was associated with lower suicide rates and that this association was partly explained (i.e., mediated) by a greater sense of purpose in religious countries. A set of mediation analyses (see ESM 9) was partially consistent with the just-described explanatory role of purpose in life (for analogous individual-level results, see Diener et al., 2011).

Finally, we inspected the associations between country-level religiosity and psychological dimensions of culture (i.e., Hofstede’s cultural dimensions and country-level Big Five traits). Country-level religiosity evidenced particularly strong associations with two psychological dimensions of culture (and those associations remained largely unaffected by controlling for economic quality): higher collectivism (i.e., lower individualism) and higher agreeableness (Table 2). Collectivism and agreeableness can both be understood as indicators of the even more fundamental dimension of communion (Abele & Wojciszke, 2014; Gebauer, Leary, & Neberich, 2012). Thus, our findings resonate with extensive research which shows that communion probably is the most central abstract value across all world religions (Gebauer, Paulhus, et al., 2013; Gebauer, Sedikides, & Schrade, 2017).
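The partial correlations (rp) reported in Table 2 adjust each association for economic quality. A minimal sketch of the standard first-order partial correlation formula is given below. The three input correlations are hypothetical values chosen for illustration, not figures from Table 2, and the source does not state how the partial correlations were computed in practice.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y after controlling for z.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# HYPOTHETICAL zero-order correlations:
# x = country-level religiosity, y = a health outcome, z = economic quality.
r_religiosity_health = -0.50
r_religiosity_economy = -0.60
r_health_economy = 0.70

rp = partial_correlation(r_religiosity_health, r_religiosity_economy, r_health_economy)
print(round(rp, 3))  # about -0.14: most of the raw association is shared with z
```

In this illustration, a sizeable negative zero-order correlation shrinks to a weak one once the control variable is partialled out, which is the pattern the text describes for several health indicators.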

Concluding Remarks

Countries differ widely in their degree of religiosity, and those country-level differences qualify numerous psychological effects that have been deemed universal (e.g., psychological health benefits of religiosity – Gebauer, Sedikides, et al., 2012; psychological health benefits of income – Gebauer, Nehrlich, et al., 2013; life satisfaction benefits derived from affective experience – Joshanloo, 2019).

⁹ Notably, though, controlling for economic quality decreased all four previously positive associations between country-level religiosity and health. Specifically, in two cases the associations were rendered nonsignificant (i.e., depression and self-esteem) and in one case the association changed sign (i.e., anxiety). Yet, the association between country-level religiosity and meaning in life remained strong and positive even after economic quality was controlled for.


Hence, there is a great demand for a valid index of country-level religiosity that is available for a large number of countries. The present study provided such an index. The index is based on representative data from 1,619,300 individuals and spans 166 countries worldwide. We made use of that index to examine temporal change in worldwide religiosity over the last decade and to gain a more complete understanding of country-level religiosity’s nomological network.

We found strong evidence that our country-level religiosity index is a valid and robust index of global religiosity. More precisely, its correlation with an external four-item index of global religiosity was near-perfect.¹⁰ Moreover, we randomly divided the full GWP sample into two independent subsamples and used those subsamples to compute two additional country-level religiosity indices. Those two entirely independent country-level religiosity indices, too, correlated near-perfectly with the external country-level index of global religiosity. That finding speaks to the robustness of our index. The two independent country-level religiosity indices also correlated near-perfectly with each other, a finding that attests to the validity and reliability of our index.

In addition, we estimated the worldwide temporal change in religiosity between 2006 and 2017. We found a quadratic trajectory (Figure 2). For the year 2006, we estimated that 73.039% of the world population considered religiosity an important part of their daily lives. From 2006 to 2011, we found an increase in worldwide religiosity levels. By 2011, we estimated the highest level of religiosity worldwide: 74.044% of the world population considered religiosity an important part of their daily lives. Finally, from 2011 to 2017, we found a decrease in worldwide religiosity. By 2017, the worldwide level of religiosity was (descriptively) lower than in 2006: 72.725% of the world population considered religiosity an important part of their daily lives. Additional analyses identified a subset of 101 countries which drove the just-described quadratic trajectory (Figure 3). Finally, we estimated the worldwide temporal changes in Christianity and Islam between 2008 and 2017. We found a small linear decline in Christianity from 41.753% in 2008 to 40.582% in 2017 and no significant change in Islam within the same period of time (25.484% in 2008 and 25.508% in 2017). The small temporal change over the last decade buttressed our decision to base our country-level religiosity index on the cumulative data from 2005 to 2017. A detailed analysis of why we observed the just-described pattern of temporal change is beyond the scope of the present work, but it certainly is an interesting and timely topic for future research. We acknowledge that the temporal patterns we discovered in the present analyses may be subject to unexpected changes in the near future, and at present, it does not seem possible to predict the future of worldwide religiosity.

Moreover, the present research includes the most complete analysis of country-level religiosity’s nomological network ever conducted, involving 36 external variables. First, in replication of much previous research, country-level religiosity was associated with lower economic development. It is noteworthy that previous research typically capitalized on a single indicator of economic development, whereas we used several complementary indicators and found highly convergent results. Second, we examined the association between country-level religiosity and a large array of country-level health indicators (psychological health, physical health, and health security). The results were less consistent than in the case of economic development. However, on the whole it seems fair to conclude that country-level religiosity was mostly negatively related to health, but that negative relation was driven almost entirely by the poor economic conditions in most religious countries. In fact, after accounting for country-level differences in economic conditions, religious countries were by and large healthier than non-religious countries. Mediation analyses were largely consistent with Oishi and Diener’s (2014) proposal that the health benefits of country-level religiosity are partly due to higher levels of purpose in life in religious countries. Finally, little has been known about the associations between country-level religiosity and psychological dimensions of culture (i.e., Hofstede’s cultural dimensions and country-level Big Five traits). We found that religious countries primarily differed from their non-religious counterparts on dimensions that belong to the fundamental communion dimension (i.e., collectivism and agreeableness; Abele & Wojciszke, 2014). This finding squares with the high importance that all world religions place on communal values and norms (Gebauer, Paulhus, et al., 2013; Gebauer, Sedikides, & Schrade, 2017). Overall, the nomological network analysis conducted here buttressed previous research, increased the confidence in our country-level religiosity index, and expanded our understanding of country-level religiosity.

In conclusion, the present research introduced and validated the most extensive country-level religiosity index to date, examined its worldwide temporal trajectory over the last decade, and clarified its nomological network. We (optimistically) hope that our country-level religiosity index will be helpful for the large and fast-growing community of scholars interested in the powerful role of country-level religiosity for human thought, feelings, and behavior.

¹⁰ The existence of that external index does not render our index superfluous, because the external index spans only 97 countries.



Electronic Supplementary Materials

The electronic supplementary materials are available with the online version of the article at 10.1027/1016-9040/a000382

ESM 1. Countries’ names, number of participants, gender ratios, and average ages.
ESM 2. Gallup World Poll items used in the present research.
ESM 3. National religiosity scores across years.
ESM 4. Results of latent class growth analyses.
ESM 5. Country classification based on latent class growth curve analyses.
ESM 6. Correlations with religiosity at the national level (half sample 1).
ESM 7. Correlations with religiosity at the national level (half sample 2).
ESM 8. Correlations with religiosity at the national level (whole sample) controlling for HDI.
ESM 9. Mediation analysis at the national level.
ESM 10. Sample means for national religiosity in the third group.

References

Abele, A. E., & Wojciszke, B. (2014). Communal and agentic content in social cognition: A dual perspective model. In J. M. Olson & M. P. Zanna (Eds.), Advances in experimental social psychology (Vol. 50, pp. 195–255). Waltham, MA: Academic Press.
Barber, N. (2011). A cross-national test of the uncertainty hypothesis of religious belief. Cross-Cultural Research, 45, 318–333.
Barber, N. (2013). Country religiosity declines as material security increases. Cross-Cultural Research, 47, 42–50. https://doi.org/10.1177/1069397112463328
Barro, R. J., & McCleary, R. M. (2003). Religion and economic growth across countries. American Sociological Review, 68, 760–781.
Books, J., & Prysby, C. (1988). Studying contextual effects on political behavior. American Politics Research, 16, 211–238.
Cohen, A. B., & Hill, P. C. (2007). Religion as culture: Religious individualism and collectivism among American Catholics, Jews, and Protestants. Journal of Personality, 75, 709–742.
Cohn, M. A., Fredrickson, B. L., Brown, S. L., Mikels, J. A., & Conway, A. M. (2009). Happiness unpacked: Positive emotions increase life satisfaction by building resilience. Emotion, 9, 361.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Deaton, A., & Stone, A. A. (2013). Two happiness puzzles. American Economic Review, 103, 591–597. aer.103.3.591


Diener, E., Ng, W., Harter, J., & Arora, R. (2010). Wealth and happiness across the world: Material prosperity predicts life evaluation, whereas psychosocial prosperity predicts positive feeling. Journal of Personality and Social Psychology, 99, 52–61.
Diener, E., Tay, L., & Myers, D. G. (2011). The religion paradox: If religion makes people happy, why are so many dropping out? Journal of Personality and Social Psychology, 101, 1278–1290.
Duncan, T. E., & Duncan, S. C. (2004). An introduction to latent growth curve modeling. Behavior Therapy, 35, 333–363. https://
Edgell, P. (2012). A cultural sociology of religion: New directions. Annual Review of Sociology, 38, 247–265. https://doi.org/10.1146/annurev-soc-071811-145424
Fincher, C. L., & Thornhill, R. (2012). Parasite-stress promotes ingroup assortative sociality: The cases of strong family ties and heightened religiosity. Behavioral and Brain Sciences, 35, 61–79.
Gaskins, B., Golder, M., & Siegel, D. A. (2013). Religious participation, social conservatism, and human development. The Journal of Politics, 75, 1125–1141. S0022381613000765
Gebauer, J. E., Bleidorn, W., Gosling, S. D., Rentfrow, P. J., Lamb, M. E., & Potter, J. (2014). Cross-cultural variations in Big Five relationships with religiosity: A sociocultural motives perspective. Journal of Personality and Social Psychology, 107, 1064–1091.
Gebauer, J. E., Leary, M. R., & Neberich, W. (2012). Big Two personality and Big Three mate preferences: Similarity attracts, but country-level mate preferences crucially matter. Personality and Social Psychology Bulletin, 38, 1579–1593. https://doi.org/10.1177/0146167212456300
Gebauer, J. E., & Maio, G. R. (2012). The need to belong can motivate belief in God. Journal of Personality, 80, 465–501.
Gebauer, J. E., Nehrlich, A. D., Sedikides, C., & Neberich, W. (2013). The psychological benefits of income are contingent on individual-level and culture-level religiosity. Social Psychological and Personality Science, 4, 569–578. https://doi.org/10.1177/1948550612469819
Gebauer, J. E., Paulhus, D. L., & Neberich, W. (2013). Big Two personality and religiosity across cultures: Communals as religious conformists and agentics as religious contrarians. Social Psychological and Personality Science, 4, 21–30. https://
Gebauer, J. E., Sedikides, C., & Neberich, W. (2012). Religiosity, social self-esteem, and psychological adjustment: On the cross-cultural specificity of the psychological benefits of religiosity. Psychological Science, 23, 158–160. https://doi.org/10.1177/0956797611427045
Gebauer, J. E., Sedikides, C., & Schrade, A. (2017). Christian self-enhancement. Journal of Personality and Social Psychology, 113, 786–809.
Gebauer, J. E., Sedikides, C., Schönbrodt, F. D., Bleidorn, W., Rentfrow, P. J., Potter, J., & Gosling, S. D. (2017). The religiosity as social value hypothesis: A multi-method replication and extension across 65 countries and three levels of spatial aggregation. Journal of Personality and Social Psychology, 113, e18–e39.
Gebauer, J. E., Sedikides, C., Wagner, J., Bleidorn, W., Rentfrow, P. J., Potter, J., & Gosling, S. D. (2015). Cultural norm fulfillment, interpersonal belonging, or getting ahead? A large-scale cross-cultural test of three perspectives on the function of self-esteem. Journal of Personality and Social Psychology, 109, 526–548.


Gorski, P. S., & Altınordu, A. (2008). After secularization? Annual Review of Sociology, 34, 55–85. annurev.soc.34.040507.134740
Hill, P. C., & Hood, R. W. (1999). Measures of religiosity. Birmingham, AL: Religious Education Press.
Hofstede, G., Hofstede, G. J., & Minkov, M. (2010). Cultures and organizations: Software of the mind. New York, NY: McGraw-Hill.
Inglehart, R., Haerpfer, C., Moreno, A., Welzel, C., Kizilova, K., Diez-Medrano, J., . . ., Puranen, B. (Eds.). (2014). World values survey: Round six – country-pooled datafile 2010–2014. Madrid: JD Systems Institute. Retrieved from
Institute for Health Metrics and Evaluation. (2015). GBD compare data visualization. Seattle, WA: IHME, University of Washington. Retrieved from
John, O. P., Donahue, E. M., & Kentle, R. L. (1991). The Big Five Inventory – Versions 4a and 54. Berkeley, CA: University of California at Berkeley, Institute of Personality and Social Research.
Joshanloo, M. (2018). Optimal human functioning around the world: A new index of eudaimonic well-being in 166 nations. British Journal of Psychology, 109, 637–655. https://doi.org/10.1111/bjop.12316
Joshanloo, M. (2019). Cultural religiosity as the moderator of the relationship between affective experience and life satisfaction: A study in 147 countries. Emotion, 19, 629–636. https://doi.org/10.1037/emo0000469
Kay, A. C., Gaucher, D., McGregor, I., & Nash, K. (2010). Religious belief as compensatory control. Personality and Social Psychology Review, 14, 37–48. 1088868309353750
Legatum Institute. (2017). Legatum Prosperity Index 2017: Methodology report. Retrieved from
Muthén, B. O., & Muthén, L. K. (2000). Integrating person-centered and variable-centered analyses: Growth mixture modeling with latent trajectory classes. Alcoholism: Clinical and Experimental Research, 24, 882–891. https://doi.org/10.1111/j.1530-0277.2000.tb02070.x
Muthén, L. K., & Muthén, B. O. (2017). Mplus user’s guide. Los Angeles, CA: Muthén and Muthén.
Neeleman, J., Halpern, D., Leon, D., & Lewis, G. (1997). Tolerance of suicide, religion and suicide rates: An ecological and individual study in 19 Western countries. Psychological Medicine, 27, 1165–1171.
Norris, P., & Inglehart, R. (2011). Sacred and secular: Religion and politics worldwide. Cambridge, UK: Cambridge University Press.
Oishi, S., & Diener, E. (2014). Residents of poor nations have a greater sense of meaning in life than residents of wealthy nations. Psychological Science, 25, 422–430. https://doi.org/10.1177/0956797613507286
Oyserman, D., Coon, H. M., & Kemmelmeier, M. (2002). Rethinking individualism and collectivism: Evaluation of theoretical assumptions and meta-analyses. Psychological Bulletin, 128, 3–72.
Preacher, K. J. (2008). Latent growth curve modeling. Los Angeles, CA: Sage.
Rees, T. J. (2009). Is personal insecurity a cause of cross-national differences in the intensity of religious belief? Journal of Religion and Society, 11, 1–24.
Reeve, C. L. (2009). Expanding the g-nexus: Further evidence regarding the relations among national IQ, religiosity and national health outcomes. Intelligence, 37, 495–505. https://



Rentfrow, P. J. (2010). Statewide differences in personality: Toward a psychological geography of the United States. American Psychologist, 65, 548–558. a0018194
Rentfrow, P. J., Gosling, S. D., & Potter, J. (2008). A theory of the emergence, persistence, and expression of geographic variation in psychological characteristics. Perspectives on Psychological Science, 3, 339–369. 2008.00084.x
Robins, R. W., Hendin, H. M., & Trzesniewski, K. H. (2001). Measuring global self-esteem: Construct validation of a single-item measure and the Rosenberg Self-Esteem Scale. Personality and Social Psychology Bulletin, 27, 151–161. https://doi.org/10.1177/0146167201272002
Saroglou, V. (2010). Religiousness as a cultural adaptation of basic traits: A five-factor model perspective. Personality and Social Psychology Review, 14, 108–125. https://doi.org/10.1177/1088868309352322
Saroglou, V. (2014). Religion, personality, and social behavior. New York, NY: Psychology Press.
Sedikides, C. (2010). Why does religiosity persist? Personality and Social Psychology Review, 14, 3–6. 1088868309352323
Sedikides, C., & Gebauer, J. E. (2010). Religiosity as self-enhancement: A meta-analysis of the relation between socially desirable responding and religiosity. Personality and Social Psychology Review, 14, 17–36. 1088868309351002
Sherkat, D., & Ellison, C. (1999). Recent developments and current controversies in the sociology of religion. Annual Review of Sociology, 25, 363–394. 25.1.363
Stack, S. (1983). The effect of religious commitment on suicide: A cross-national analysis. Journal of Health and Social Behavior, 24, 362–374.
Stark, R., & Iannaccone, L. R. (1994). A supply-side reinterpretation of the “secularization” of Europe. Journal for the Scientific Study of Religion, 33, 230–252. 1386688
Stipak, B., & Hensler, C. (1982). Statistical inference in contextual analysis. American Journal of Political Science, 26, 151–175.
Triandis, H. C. (1995). Individualism and collectivism. Boulder, CO: Westview Press.
Van de Vliert, E. (2013). Climato-economic habitats support patterns of human needs, stresses, and freedoms. Behavioral and Brain Sciences, 36, 465–480. S0140525X12002828
Warner, R. S. (1993). Work in progress toward a new paradigm for the sociological study of religion in the United States. American Journal of Sociology, 98, 1044–1093. 230139
Weber, M. (1930). The Protestant ethic and the spirit of capitalism. New York, NY: Scribner.
Zuckerman, P. (2007). Atheism: Contemporary numbers and patterns. In M. Martin (Ed.), The Cambridge companion to atheism (pp. 47–66). Cambridge, UK: Cambridge University Press.

History
Received August 8, 2018
Revision received April 23, 2019
Accepted June 25, 2019
Published online December 6, 2019

European Psychologist (2020), 25(1), 26–40


Acknowledgments
We would like to thank Professor Robert Dickey for his valuable contributions.

Open Data
The Gallup World Poll data can be purchased at the following link: Data from sources other than Gallup are publicly available and can be obtained from original sources. Detailed information about these data and their sources is provided in the Methods section.

Funding
This work was supported by the Ministry of Education and National Research Foundation of the Republic of Korea (NRF2017S1A3A2066611) and a grant from the German Research Foundation (DFG; Grant GE 2515/6-1).

ORCID
Mohsen Joshanloo

Mohsen Joshanloo
Department of Psychology
Keimyung University
2800 Dalgubeol-Daero
Dalseo-Gu
Daegu 42601
South Korea


M. Joshanloo & J. E. Gebauer, Country-Level Religiosity

Mohsen Joshanloo is an associate professor of psychology at Keimyung University, South Korea. His research focuses on well-being, culture, religion, and measurement. Mohsen Joshanloo completed his PhD in 2013 at Victoria University of Wellington, New Zealand.

Jochen Gebauer is the Heisenberg Professor of Cross-Cultural Social and Personality Psychology at the University of Mannheim and Full Professor of Social and Personality Psychology at Copenhagen University. Jochen received his PhD from Cardiff University, and held postdoc positions at the University of Southampton and the Humboldt University of Berlin.


Original Articles and Reviews

Are Difficult-To-Study Populations too Difficult to Study in a Reliable Way? Lessons Learned From Meta-Analyses in Clinical Neuropsychology

Florian Lange
Behavioral Engineering Research Group, KU Leuven, Belgium

Abstract: Replication studies, pre-registration, and increases in statistical power will likely improve the reliability of scientific evidence. However, these measures face critical limitations in populations that are inherently difficult to study. Members of difficult-to-study populations (e.g., patients, children, non-human animals) are less accessible to researchers, which typically results in small-sample studies that are infeasible to replicate. Nevertheless, meta-analyses on clinical neuropsychological data suggest that difficult-to-study populations can be studied in a reliable way. These analyses often produce unbiased effect-size estimates despite aggregating across severely underpowered original studies. This finding can be attributed to a neuropsychological research culture involving the non-selective reporting of results from standardized and validated test procedures. Consensus guidelines, test manuals, and psychometric evidence constrain the methodological choices made by neuropsychologists, who regularly report the results from neuropsychological test batteries irrespective of their statistical significance or novelty. Comparable shifts toward more standardization and validation, complete result reports, and between-lab collaborations can allow for a meaningful and reliable study of psychological phenomena in other difficult-to-study populations.

Keywords: credibility, replication, standardization, meta-analysis, clinical neuropsychology

European Psychologist (2020), 25(1), 41–50

The combination of publication bias (Ioannidis, 2005; Kühberger, Fritz, & Scherndl, 2014) and researcher degrees of freedom (Gelman & Loken, 2014; Simmons, Nelson, & Simonsohn, 2011) has led to the inflation of false-positive rates and effect sizes in many fields of science. The list of proposed remedies is long and diverse (Munafò et al., 2017). Perhaps most prominently, increases in statistical power, systematic replication efforts, and pre-registrations of study plans have been argued to improve the reliability of scientific evidence (Asendorpf et al., 2013; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). There is little reason to doubt that implementation of these measures will lead to more accurate effect-size estimates in future research. However, scientists are likely to encounter serious problems when attempting to apply these measures in study populations that are inherently difficult to study (Zwaan, Etz, Lucas, & Donnellan, 2018).

The term “difficult-to-study population” does not imply that members of these populations would be less willing to participate in research studies or that their behavior would not lend itself to systematic observation. Instead, it refers to the difficulty of sampling a large number of population members for a given research study, a difficulty that may interfere with the abovementioned attempts to increase scientific reliability. In contrast to the typical sample of undergraduate students, groups of young children, non-human great apes, professional athletes, criminal offenders, or patients with rare clinical disorders are arguably less accessible to researchers. The motor neuron disease amyotrophic lateral sclerosis (ALS), for example, is estimated to affect about 5.4 out of 100,000 Europeans (Chiò et al., 2013). This implies that researchers from Birmingham, Brussels, or Prague who are interested in studying behavioral changes associated with ALS would have to draw from populations of about 50–60 local patients. Even if all these patients were willing, capable, and eligible to participate in a behavioral study, statistical power would be unacceptably low for a wide range of possible research designs and expected effect sizes. Similarly, it would hardly be possible to draw multiple non-overlapping samples from a population that limited in size. This characteristic of difficult-to-study populations likely forms an obstacle to replication attempts, especially when those require participants to be naive to aspects of the study.

Pre-registration of hypotheses and methods before collecting data from difficult-to-study populations remains a viable and recommendable option, but it cannot compensate for the power problems mentioned above. Consider, for example, the possibility of pre-registering a target sample size based on an explicit power analysis. While this practice reduces researcher degrees of freedom in data collection, it does not ensure that the required data can actually be collected. Hence, when pre-registering their study plans, many researchers studying difficult-to-study populations might realize that they are simply unable to study effects of plausible size at desirable levels of statistical power.

Against this background, one may tend to conclude that some difficult-to-study populations are too difficult to study in a reliable way. In the remainder of this article, I shall offer a more optimistic view by describing research practices that have led to the production of reliable evidence in some fields of clinical neuropsychology. Clinical neuropsychologists study brain-behavior relationships by assessing, diagnosing, and treating cognitive alterations in patients with known changes to the central nervous system (American Psychological Association, 2010). Most clinical neuropsychological research involves difficult-to-study populations (e.g., patients with neurological diseases such as ALS), and I will not argue that individual studies sampling from these populations are more reliable than studies in other fields (Demakis, 2006; Gelman & Geurts, 2017). Instead, I will describe how aggregating across these individual studies using meta-analytical methods can lead to powerful tests of research hypotheses and to accurate effect-size estimates in some lines of neuropsychological research. This praise of meta-analyses may be surprising to researchers interested in the current reproducibility debate.
When following this debate, one can get the impression that meta-analyses are more often discussed as part of the problem than as part of the solution (Engber, 2016; Van Elk et al., 2015). This impression holds true for fields of (psychological) science where severe validity issues undermine the usefulness of meta-analytic research. However, in some streams of clinical neuropsychological research, these validity issues seem substantially less pronounced. As a result, meta-analyses in these lines of research can yield largely unbiased and thus useful estimates of true effect sizes.
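The arithmetic behind the ALS example above can be made explicit. The following Python sketch multiplies the cited prevalence (5.4 per 100,000) by a city population and compares the resulting patient pool with the per-group sample size that a simple two-group comparison would need. The population of one million, the effect size of d = 0.5, the 80% power target, and the normal approximation are illustrative assumptions, not values taken from the studies discussed here.

```python
import math
from statistics import NormalDist

PREVALENCE_PER_100K = 5.4  # ALS prevalence estimate cited in the text


def expected_patients(population: int) -> float:
    """Expected number of local patients for a given population size."""
    return PREVALENCE_PER_100K / 100_000 * population


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided, two-group comparison.

    Uses the normal approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


if __name__ == "__main__":
    city_population = 1_000_000  # assumed, hypothetical city size
    pool = expected_patients(city_population)
    needed = n_per_group(d=0.5)  # assumed medium effect
    print(f"expected local patients: {pool:.0f}")
    print(f"patients needed per group: {needed}")
```

Under these assumptions, the patient group required for 80% power (63) already exceeds the entire expected local pool (54), before willingness, capability, or eligibility are taken into account.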

Validity Problems of Meta-Analyses

One of the main goals of meta-analysis is to obtain better estimates of the true size of an effect than could be obtained in an individual study. However, the accuracy of effect-size estimates is critically threatened by a number of problems (Rosenthal & DiMatteo, 2001; Sharpe, 1997). In the following, I will briefly discuss the most prominent of these problems and illustrate their threatening potential by using examples from research on the so-called ego-depletion effect. This example was chosen because of the comparatively large amount of available information on the degree of bias in this literature. The ego-depletion effect refers to performance declines in a self-control task following the completion of another self-control task. A high-powered meta-analysis has estimated this effect to be statistically significant, robust, and medium-to-large (d = 0.62) in size (Hagger, Wood, Stiff, & Chatzisarantis, 2010). However, the database of this meta-analysis seems to be affected by validity issues to such a degree that there might be, in fact, no evidence for a non-zero effect of ego depletion (Carter, Kofler, Forster, & McCullough, 2015; Carter & McCullough, 2013, 2014). Which factors account for this apparent contradiction between the conclusions of Hagger and colleagues and Carter and colleagues?

Comparing Apples and Oranges

If a literature is not exclusively composed of direct replications of a particular study design, it will involve some degree of heterogeneity (e.g., with regard to study population, study design, or construct operationalization). A meta-analytical aggregation across a heterogeneous set of studies can only be taken to apply to the common conceptual denominator of these studies. This is not necessarily problematic. If, for example, all studies included by Hagger and colleagues involved a valid manipulation of the independent variable and a valid measurement of the dependent variable in study designs that allowed studying ego depletion, a meta-analysis of these studies should yield a meaningful test of the ego-depletion hypothesis. In fact, methodological heterogeneity can even open doors to informative moderator analyses that might reveal that an effect is particularly large when studied under particular conditions (Field & Gillett, 2010). The problem arises when heterogeneity extends from the methodological to the conceptual level (e.g., when ego-depletion studies do not manipulate or measure self-control and thus do not test the ego-depletion hypothesis; Lurquin & Miyake, 2017). For example, some ego-depletion studies used self-report measures of participants’ willingness to show helping behavior as a dependent variable (i.e., as the second self-control task). It can be questioned whether responding to such a questionnaire item is a self-control task at all (Lange & Eggert, 2014). More importantly, high willingness to help is regarded as an indicator of good self-control in some and as an indicator of poor self-control in other ego-depletion studies (Carter et al., 2015). Such studies are unlikely to test the same effect and aggregating across them is unlikely to yield a meaningful meta-analytical estimate.

Garbage in, Garbage out

When the individual studies included in a meta-analysis are biased, a meta-analysis of these studies will also produce a biased effect-size estimate. The term “garbage” in this phrase is typically used to refer to studies of inferior methodological quality (Eysenck, 1978). Studies without appropriate controls, validated manipulations, and reliable measurements are unlikely to produce reliable effect-size estimates. When these methodological weaknesses can be identified from the study report, meta-analytical weighting of effect sizes according to study quality can alleviate the “garbage in, garbage out” problem (Rosenthal & DiMatteo, 2001). However, in recent years, it has become apparent that undisclosed researcher degrees of freedom can introduce another, less identifiable, type of “garbage” into the literature. Reports of seemingly excellent study designs include meaningless p values and inflated effect sizes when only one of multiple possible hypothesis tests is reported (Gelman & Loken, 2014). To illustrate, consider the outcome measures reported in studies on ego depletion. In many of these studies, self-control performance is measured with the Stroop task (or rather with one of numerous variants of the Stroop task; MacLeod, 1991). Stroop-task variants can yield a large number of possible outcome measures including, among others, mean or median response times, response-time variability, error rates, and response-time or error-rate differences between incongruent and congruent trials. Importantly, there seems to be a lack of compelling reasons to prefer one of these measures as an indicator of self-control (Carter et al., 2015). As a consequence, researchers are free to choose an outcome measure for their study, as they are free to decide whether and how to exclude trials with extreme values and whether and how to transform their data before analysis.
When ego-depletion researchers limit their analytical freedom beforehand (i.e., by means of pre-registration), they typically fail to find substantial ego-depletion effects despite high power and methodological quality (Hagger et al., 2016; Lurquin et al., 2016). This example suggests that researcher degrees of freedom can introduce a bias on the level of individual studies that then translates into biased meta-analytical effect-size estimates.
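A rough calculation shows why reporting only one of several possible tests undermines the nominal error rate. The sketch below assumes k independent tests at α = .05 with no true effect. Outcome measures derived from the same Stroop task are correlated, so the real inflation would be smaller than computed here, but it would still exceed .05. The function name is hypothetical.

```python
def chance_of_false_positive(k: int, alpha: float = 0.05) -> float:
    """Probability that at least one of k independent null tests is significant."""
    return 1 - (1 - alpha) ** k


if __name__ == "__main__":
    # Number of candidate outcome measures a researcher could choose from.
    for k in (1, 3, 5, 10):
        print(k, round(chance_of_false_positive(k), 3))
```

With five candidate measures, the chance of at least one nominally significant result is already about 23% under these assumptions.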

Publication and Sampling Bias

All studies testing a particular hypothesis should have the same chance of being included in a meta-analysis on this hypothesis. The suspicion that this assumption may be violated in most lines of psychological research is not new (Rosenthal, 1979; Sterling, 1959). Studies with p values just below the significance threshold are clearly overrepresented in many meta-analytical samples; statistically nonsignificant studies with small effect sizes are systematically excluded from the literature (Kühberger et al., 2014). This preference for statistically significant results to be published leads to inflated effect-size estimates in the corresponding meta-analyses. There are several ways to assess the severity of publication bias (or other small-study biases), for example, by examining the relationship between effect sizes and sample sizes of the studies included in a meta-analysis (Begg & Mazumdar, 1994). Effect sizes have to be larger in small studies than in large studies to reach the significance threshold. If passing the significance threshold determines the likelihood of being included in the literature or meta-analysis, one should find a negative correlation between effect size and sample size. This correlation was found to be substantial in the published ego-depletion literature (Carter et al., 2015; Carter & McCullough, 2014). When correcting for this relationship, Carter and colleagues obtained an estimate for the size of the ego-depletion effect that was no longer distinguishable from zero. Of note, these bias-correction techniques have been criticized themselves (Inzlicht, Gervais, & Berkman, 2015) and it is not realistic to expect that they will fix the problems introduced by a biased research literature. In the absence of appropriate ways to correct for publication bias, accumulating evidence for publication biases in many research literatures leaves us with the (not worthless, but often frustrating) insight that the results of many meta-analyses are essentially uninterpretable.
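The logic of the small-study check described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions: observed effect sizes are drawn from a normal distribution with standard error sqrt(2/n), selection is one-sided at z > 1.96, and the true effect, sample-size range, and seed are arbitrary choices. It is not the procedure used in any of the cited analyses.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_studies=5000, true_d=0.3, seed=1):
    """Return (r_all, r_published) between per-group n and observed d."""
    rng = random.Random(seed)
    all_studies, published = [], []
    for _ in range(n_studies):
        n = rng.randint(10, 100)       # participants per group (assumed range)
        se = math.sqrt(2 / n)          # approximate standard error of d
        d_obs = rng.gauss(true_d, se)  # observed effect size
        all_studies.append((n, d_obs))
        if d_obs / se > 1.96:          # only significant results are "published"
            published.append((n, d_obs))
    r_all = pearson_r(*zip(*all_studies))
    r_pub = pearson_r(*zip(*published))
    return r_all, r_pub


if __name__ == "__main__":
    r_all, r_pub = simulate()
    print(f"all studies: r = {r_all:.2f}; published only: r = {r_pub:.2f}")
```

Without selection, sample size and effect size should be roughly uncorrelated. Once only significant results survive, small studies contribute only their largest estimates, and a negative correlation is expected to emerge.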

Apples, Oranges, Garbage, and Biases in Clinical Neuropsychology

In the following, I will argue that the above-mentioned validity problems are mitigated in some streams of clinical neuropsychological research. As a result, meta-analyses of these research literatures produce largely unbiased effect-size estimates. To illustrate this phenomenon, I will draw on evidence obtained with the Wisconsin Card Sorting Test (WCST; Berg, 1948; Grant & Berg, 1948), a neuropsychological test of executive functioning that has been subject to a large number of meta-analyses in the recent past. I will not argue that the lack of bias in this line of research is representative of the whole of clinical neuropsychology. Instead, the present paper aims to shed light on a plausible mechanism underlying such bias-free neuropsychological research literatures. This mechanism involves a combination of standardization and what I call a research culture of incidental replications. These fortunate circumstances are not common to all lines of neuropsychological research, but they can be expected to produce unbiased and credible research literatures wherever they are established.

Standardization in Clinical Neuropsychology

Researchers in clinical neuropsychology do not suffer from a lack of scientific creativity. Creative and rigorous experimentation was necessary to, for example, dissociate the neural substrates of different types of memory (Keane, Gabrieli, Mapstone, Johnson, & Corkin, 1995) or different types of visual capabilities (Goodale et al., 1994) through studying patients with circumscribed brain lesions. However, parallel streams of neuropsychological research occupy themselves with the development, psychometric evaluation, and consistent application of standardized assessment tools (Stuss & Levine, 2002). In contrast to the ad hoc measures used in many other fields (e.g., in many ego-depletion studies), these tools have been tested and optimized based on objective criteria. Neuropsychologists have prepared detailed instructions on how to use them, collected data on their reliability and validity, and conducted normative studies that allow interpreting the results of individual examinees. These investments into standardization can be considered an effective antidote to researcher degrees of freedom. Outcome measures are clearly described in the respective test manuals and if researchers want to limit themselves to the use of only a subset of these measures, they can base their decision on explicit psychometric information. This standardization on the outcome level is accompanied by a similar degree of standardization on the predictor level. In a large proportion of neuropsychological studies, test performance is compared between patients with a particular disorder and healthy control participants (or a different, clinically relevant control group). To be included in the patient group, individuals have to meet a number of diagnostic criteria. These criteria are developed by international consensus groups and often empirically evaluated against alternative criteria (e.g., Geevasinga et al., 2016).
To my knowledge, no comparable efforts exist to ensure that ego-depletion researchers share a common, tangible, and appropriate definition of the concept of a self-control task. As a result of this standardization, it is possible to find, for example, Italian (Palmieri et al., 2009), Dutch (Raaphorst et al., 2010), and French (Kilani et al., 2004) studies that all assessed the same WCST outcome measure according to the same manual in healthy controls and in patients with ALS diagnosed according to the same criteria. Undesired conceptual heterogeneity (“apples and oranges”) and undesired researcher degrees of freedom (“garbage”) should not present a major problem when aggregating across these studies. Of course, these benefits of standardization would be largely irrelevant if WCST results were only reported when being statistically significant or if the likelihood of the corresponding paper being published depended on the statistical significance of the test results. Luckily, this does not seem to be the case.

Incidental Replication in Clinical Neuropsychology

In clinical practice (i.e., when assessing potential cognitive deficits in individual patients), neuropsychologists typically apply a battery of standardized assessment tools rather than relying on results from one individual test (Stuss & Levine, 2002). This allows identifying relative strengths and weaknesses in the profile of a patient (Crawford, 2004). This preference for neuropsychological profile analysis can also be recognized in research studies with a neuropsychological component. For example, the study by Raaphorst and colleagues (2010) mentioned above was not limited to the administration of the WCST, but involved a large test battery of cognitive tests. Obviously, with the typical sample sizes in this field (here: 30 patients with ALS and 24 controls), it is very hard to establish that performance on one of these tests is more severely affected than performance on the other tests. Even finding a group difference that passes an uncorrected .05 significance threshold on a particular test (e.g., the WCST) is rather unlikely. Given meta-analytic effect-size estimates (Lange, Vogts, et al., 2016), the chance to find ALS-related card-sorting impairment at α = .05 is about 43% when examining a sample as large as the one tested by Raaphorst and colleagues. It is thus not particularly surprising that Raaphorst and colleagues did, in fact, not find evidence for impaired card-sorting performance in patients with ALS. Nevertheless, their manuscript was published in one of the top journals in the field. Apparently, the success of their manuscript in the publication process did not hinge on the size or statistical significance of ALS-related impairment on the WCST. This can be taken as a first indicator that some data from standardized neuropsychological tests are not subject to publication bias.
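The kind of power figure quoted above can be approximated with a few lines of Python. This is a minimal sketch using a normal approximation to the two-group comparison. The group sizes (30 and 24) come from the text, whereas d = 0.5 is an assumed medium effect standing in for the meta-analytic estimate, which is not reported numerically here.

```python
import math
from statistics import NormalDist


def approx_power(d: float, n1: int, n2: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power of a two-group comparison (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = d * math.sqrt(n1 * n2 / (n1 + n2))  # expected value of the test statistic
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


if __name__ == "__main__":
    # 30 patients and 24 controls as in the sample described above;
    # d = 0.5 is an assumed medium effect, not the published estimate.
    print(round(approx_power(0.5, 30, 24), 2))
```

Under these assumptions the approximation yields a power in the mid-.40s, the same range as the figure reported above. An exact calculation based on the noncentral t distribution would give a slightly lower value.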
Relatedly, the paper by Raaphorst and colleagues was not rejected based on the fact that ALS-related WCST impairment had already been examined in previous studies (i.e., for being “just a replication”). One of the reasons for this might be that, irrespective of their statistical significance and novelty, these data are considered to be valuable. By definition, data from difficult-to-study populations are difficult to obtain and given the use of standardized assessment tools, data can be compared across multiple studies to accumulate knowledge about these populations. In addition, neuropsychological data might not be subject to publication bias because many reports do not (exclusively) focus on them. For example, Abrahams and colleagues (1996) and Hammer, Vielhaber, Rodriguez-Fornells, Mohammadi, and Münte (2011) compared WCST performance between patients with ALS and healthy controls, but the main focus of their studies lay on data obtained from neuroimaging techniques. It is possible that these studies are affected by publication bias, but this bias would be based on the significance or novelty of the neuroimaging results, not on the significance or novelty of the results from neuropsychological testing. Most likely, the authors of these studies would not describe their work as a neuropsychological replication study. They rather report group comparisons involving data from established neuropsychological tests to provide a comprehensive characterization of their sample. By these means, they incidentally replicate the work of others and contribute to an unbiased and truly cumulative research literature in clinical neuropsychology.

Implications for Meta-Analyses in Clinical Neuropsychology

Thanks to the combination of standardization and incidental replications, some meta-analyses in neuropsychology are largely unaffected by typical validity problems. As a result, these meta-analyses can keep their promise of allowing for powerful tests of research hypotheses and accurate estimations of true effect sizes. As discussed above, individual studies reporting neuropsychological data are often substantially underpowered (Demakis, 2006). Based on the data from Raaphorst and colleagues (2010) alone, we would not be able to conclude whether ALS is associated with impairment on the WCST. In fact, the majority of published studies did not report statistically significant WCST impairment in patients with ALS. However, when aggregating across all available studies, meta-analysts (Lange, Vogts, et al., 2016) found a medium-sized effect with a confidence interval that clearly excluded the value of zero. Importantly, they did not observe a negative correlation between effect sizes and sample sizes across the individual studies, which would have pointed to a distorting influence of publication bias. Hence, this meta-analysis generated robust evidence for ALS-related impairment on the WCST – evidence that cannot be generated in individual studies of this difficult-to-study population.
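How a set of individually inconclusive studies can add up to a clear result is easy to show with inverse-variance weighting. The sketch below uses a simple fixed-effect model and six hypothetical studies. The data are invented for illustration, the variance formula is a common large-sample approximation for Cohen's d, and the cited meta-analyses may well have used different (e.g., random-effects) models.

```python
import math

# Hypothetical (d, n_patients, n_controls) triples; not data from any cited study.
STUDIES = [
    (0.45, 25, 25),
    (0.30, 20, 22),
    (0.50, 30, 24),
    (0.52, 18, 20),
    (0.38, 28, 26),
    (0.55, 22, 25),
]


def variance_of_d(d, n1, n2):
    """Common large-sample approximation of the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def ci95(estimate, se):
    """Approximate 95% confidence interval around an estimate."""
    return estimate - 1.96 * se, estimate + 1.96 * se


def pool_fixed_effect(studies):
    """Inverse-variance weighted mean effect size and its standard error."""
    weights = [1 / variance_of_d(*s) for s in studies]
    pooled = sum(w * s[0] for w, s in zip(weights, studies)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))


if __name__ == "__main__":
    for d, n1, n2 in STUDIES:
        low, high = ci95(d, math.sqrt(variance_of_d(d, n1, n2)))
        print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
    pooled, se = pool_fixed_effect(STUDIES)
    low, high = ci95(pooled, se)
    print(f"pooled d = {pooled:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

In this invented example, every single-study interval includes zero, whereas the pooled interval does not, which mirrors the pattern described above.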


A similar meta-analysis revealed evidence for medium-sized WCST impairment in patients with another neurological disorder (i.e., primary dystonia; Lange, Seer, Salchow, et al., 2016). Five out of six individual studies reporting WCST data from these patients did not find statistically significant group differences, which contributed to the view that “there is evidence of little or no alteration of cognitive functions in primary dystonia” (Stamelou, Edwards, Hallett, & Bhatia, 2012, p. 1672). However, this meta-analysis demonstrated that cognitive alterations as assessed with the WCST are substantial in primary dystonia, albeit not so substantial that they could be detected in individual studies with an average of 25 participants per group. Powerful comparisons of overall test performance between patients and control participants are not the only meta-analytical possibility offered by an unbiased set of effect sizes. Such a database can also be used to test whether one facet of test performance is more affected than another in a particular disorder. For example, Li (2004) conducted a meta-analysis of 59 studies testing whether patients with schizophrenia are more prone to commit perseverative versus non-perseverative errors on the WCST. In another meta-analysis that focused on a different patient population, Lange, Brückner, Knebel, Seer, and Kopp (2018) found a small but statistically significant difference in the size of Parkinson’s disease (PD)-related WCST deficits when comparing different WCST outcome measures. The size of such differential impairments suggests that it might be futile to study comparable questions in individual studies. However, the standardized body of neuropsychological evidence allows addressing them in adequately powered meta-analyses.
In the absence of conceptual heterogeneity (i.e., in sets of studies that all test the same hypothesis), meta-analyses on neuropsychological data can yield meaningful estimates of not only effect sizes, but also method-related effect-size heterogeneity. More importantly, meta-analysts can attempt to explain this heterogeneity by analyzing the moderating role of study or patient characteristics. For example, Demakis (2003) found small-to-medium differences in WCST performance between patients with injuries to the frontal lobe and patients with non-frontal damage. However, the size of this group difference varied as a function of the time that passed between the injury and the administration of the WCST. When tested within the first 12 months after the injury, patients with frontal damage showed large impairment as compared to patients with non-frontal damage. In contrast, across the studies assessing patients more than 12 months after the injury, the difference between the two groups was found to be small and statistically not significant. Similarly, Lange, Seer, Müller-Vahl, and Kopp (2017) have recently reported medium-sized WCST impairment, but also substantial effect-size heterogeneity in patients with Gilles-de-la-Tourette syndrome (GTS). Moderator analysis revealed that a large proportion of this heterogeneity was accounted for by participants’ age. WCST performance deficits were large in children and adolescents with GTS, but small and statistically not significant in adult patients with GTS. Possibilities with regard to the analysis of moderators increase even further when larger sets of studies are available for meta-analysis. The meta-analysis of WCST deficits in patients with PD mentioned above (Lange et al., 2018) involved effect sizes from 161 individual studies. The size of PD-related WCST impairment varied substantially across studies and was found to be moderated by disease duration, motor impairment, and medication status. Relationships such as those between the size of WCST performance deficits and time since frontal-lobe injury, age of patients with GTS, and PD-related disease characteristics are difficult to establish in individual studies, but they have important implications for understanding cognitive changes in the respective disorders.
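The moderator analyses described above rest on quantifying effect-size heterogeneity and checking how much of it a study characteristic accounts for. The following sketch computes Cochran's Q and I² for a hypothetical set of effect sizes split into two age groups. All numbers are invented, the fixed-effect subgroup approach is a deliberately simple stand-in, and it should not be read as the method used in the cited meta-analyses.

```python
# Hypothetical (effect size, sampling variance) pairs; not data from any cited study.
SUBGROUPS = {
    "children": [(0.90, 0.09), (1.10, 0.10), (0.80, 0.08)],
    "adults": [(0.10, 0.08), (0.20, 0.09), (0.05, 0.10)],
}


def pooled_and_q(effects):
    """Fixed-effect pooled estimate and Cochran's Q for (d, variance) pairs."""
    weights = [1 / v for _, v in effects]
    pooled = sum(w * d for w, (d, _) in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, (d, _) in zip(weights, effects))
    return pooled, q


def i_squared(q, df):
    """Share of total variation attributed to heterogeneity rather than chance."""
    return max(0.0, (q - df) / q) if q > 0 else 0.0


if __name__ == "__main__":
    everything = [e for group in SUBGROUPS.values() for e in group]
    _, q_total = pooled_and_q(everything)
    q_within = 0.0
    for name, effects in SUBGROUPS.items():
        pooled, q = pooled_and_q(effects)
        q_within += q
        print(f"{name}: pooled d = {pooled:.2f}")
    print(f"I^2 = {i_squared(q_total, len(everything) - 1):.2f}")
    print(f"Q between subgroups = {q_total - q_within:.2f}")
```

In this invented data set, more than half of the total variation reflects heterogeneity, and nearly all of Q is located between rather than within the age groups, which is the pattern one would expect when a moderator explains most of the heterogeneity.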

Implications for Other Fields of Psychological Science

Clinical neuropsychology has not developed a research culture involving standardization and incidental replications as part of the “credibility revolution” (Vazire, 2018) in psychological science. These research practices should rather be considered to have emerged from the specific historic background and research agenda of clinical neuropsychology (Stuss & Levine, 2002). Nonetheless, the meta-analytical results reviewed above highlight the potential of these practices to contribute to the credibility of scientific evidence in clinical neuropsychology and beyond. Shifts toward more standardization and (the systematic uncovering of) incidental replications can reduce bias in many research literatures, especially in those involving difficult-to-study populations.

Standardization and Validation

One lesson we can learn from clinical neuropsychology relates to the value of standardized and validated measurement tools. Returning to the example of ego depletion (Carter et al., 2015), it can be seen that some other fields of psychological research rather rely on the use of ad hoc measures that have been developed specifically for the research question at hand. Sometimes, these measures are described as being “established” when they have been adopted from a previous study on a similar question, but this should not be confused with them being standardized and validated. When a measure is standardized and validated, we can assume that it has certain psychometric properties when we apply it in a certain way. When neuropsychologists apply the WCST, they know about the reliability and internal factor structure of the test, its sensitivity to demographic and clinical characteristics, and its relationship to other measures of executive functioning (Kopp, Lange, & Steinke, 2019; Strauss, Sherman, & Spreen, 2006). No comparable information is available with regard to most of the self-control tasks used in the ego-depletion literature. This also implies that there is no a priori psychometric information that could constrain the choice of self-control measures. The associated researcher degrees of freedom contribute to the reliability problems of ego-depletion research and they can effectively be reduced by the development of standardized and validated self-control tasks. If, for example, validation studies showed that performance on a (clearly specified variant of the) Stroop task is related to other measures of self-control, but not to measures that are thought to be unrelated to self-control, then this might be a reason to consider this task for a study on ego depletion. If the response-time variability on incongruent trials on this task appeared to be a more reliable measure than the overall error rate, then this information could be taken to justify the choice of the particular outcome measure. In other words, researchers should have the possibility (and responsibility) to show that the measure they use meets objective psychometric criteria of a reliable and valid self-control measure. This call for standardization and validation should not constrain researchers who would like to test relevant theoretical predictions in novel and creative ways. However, in order to efficiently pursue this goal, empirical psychological research has to meet some basic criteria.
For example, studying the effect of engaging in one self-control task on performance in another self-control task might be premature as long as it is not independently verified that these tasks are self-control tasks (see also Lurquin & Miyake, 2017). Similarly, studying small (albeit theoretically interesting) effects in necessarily small samples drawn from difficult-to-study populations seems to be a rather inefficient use of resources. Intermediate steps involving the validation and standardization of procedures might be required before these effects can adequately be approached. A shift toward research using validated and standardized measures and manipulations might thus lead to a deceleration of hypothesis-testing research. Yet, this deceleration would be accompanied by an increase in the reliability of research results. When associated with improvements in measurement precision, it may also enhance the statistical power of research designs (LeBel & Paunonen, 2011), which should be particularly appealing to researchers studying difficult-to-study populations. In addition, the products of the validation and standardization process would be applicable to research questions in other fields. The development of versatile high-quality research tools can often have greater impact on scientific progress than originally envisioned by their developer.

Where this is not sufficient to stimulate costly large-scale studies on the development, validation, and standardization of measures, additional incentives might be necessary. Different incentive changes are currently under discussion or already implemented to promote the reproducibility of science (Munafò et al., 2017), and many of them could be extended to further the development of standardized measures. In addition, it would be possible to spread the costs of rigorous standardization studies across many labs in the context of multi-lab collaborations (see below).
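The link between measurement precision and statistical power noted above can be made concrete with a small sketch. It is not taken from LeBel and Paunonen (2011); it assumes classical test theory (the observed standardized effect equals the true effect multiplied by the square root of the outcome measure's reliability) and a normal approximation to the power of a two-group comparison. The function names `attenuated_d` and `approx_power` are illustrative only.

```python
from statistics import NormalDist


def attenuated_d(true_d, reliability):
    """Observed standardized mean difference under classical test theory.

    Assumption: unreliability inflates the observed SD by 1 / sqrt(reliability),
    so the standardized effect shrinks by sqrt(reliability).
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return true_d * reliability ** 0.5


def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group comparison.

    Only the tail in the direction of the effect is counted; the opposite
    tail is negligible for effects of practical interest.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = d * (n_per_group / 2) ** 0.5
    return 1 - NormalDist().cdf(z_crit - noncentrality)
```

Under these assumptions, a true effect of d = 0.5 studied with 64 participants per group has a power of roughly .81 when the measure is perfectly reliable, but only roughly .52 when its reliability is .50. The numbers are purely illustrative and do not describe any particular task.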

Complete Result Reports

Cumulative research in many fields of psychological science would also benefit from the development of a neuropsychology-like research culture of more complete result reports. In such a culture, potentially valuable data would be routinely reported even when the authors of a publication do not consider them relevant to the research question they want to address. If valuable data are easily accessible, other researchers can use them to pursue research goals other than the ones pursued by the original data collectors (Morey et al., 2016). For example, only a few of the authors of the 161 studies reporting WCST data from patients with PD (meta-analyzed by Lange et al., 2018) might have had the explicit goal of assessing the size of WCST deficits in this clinical population. However, by reporting these data as encouraged by the research culture in clinical neuropsychology, they incidentally contributed to addressing this and more sophisticated research questions. They created a rich and unbiased data set for meta-analysis by reporting data independent of their novelty or statistical significance.

Other fields of research could benefit from adopting similar practices. For example, as long as no psychometric data with regard to the optimal operationalization of self-control are available, ego-depletion researchers could be encouraged to report results on all justifiable operationalizations (see also Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016). Researchers having good reasons to prefer one possible operationalization over the other (e.g., response times over error rates) can, of course, state those reasons (ideally in a pre-registered study protocol) and then focus on the chosen operationalization. However, the usefulness of their report for future research will substantially increase if they also report results from all alternative operationalizations.

The form of these more complete result reports can be very different. Although it seems intuitive that well-documented and standardized sharing of the raw data set should be most informative, other formats might be helpful as well. For example, researchers can present results that are not immediately relevant to the goal of their study (e.g., zero-order correlations between all study variables) in a supplementary table. Similarly, information from supplementary multiverse visualizations that illustrate the sensitivity of results to analytical choices (Steegen et al., 2016) can be of value to future researchers if provided by a sufficient number of studies. When choosing one of these options to broaden the scope of a result report, it is important to ensure that the core message of a scientific communication is still conveyed in a clear way. Improved technical possibilities for data sharing are likely to be helpful in addressing this challenge (Morey et al., 2016). In general, every reduction of bias on the level of results reporting can be expected to lead to a reduction of bias in the overall research literature.

Note that the encouragement of more complete result reports should contribute to less biased research literatures independent of whether or not a shift toward more standardization and validation occurs. Yet, the combined effect of the two practices promises to be more than additive. If more data are reported per study and if parts of the data are more similar between studies, the unbiased aggregation of results across studies should become more fruitful. Especially researchers who want to study difficult-to-study populations in a reliable way might benefit from such a combination of standardization, validation, and complete result reports. Where sample sizes are not sufficient to, for example, contrast alternative explanations for, moderators of, or processes underlying medium-sized effects, it may be advisable to abandon such research goals and to collect more valuable data instead.
As illustrated by the reviewed examples from clinical neuropsychology, such practices can produce reliable research literatures in the long run if adopted by a sufficient number of researchers in a field.
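Reporting results for all justifiable operationalizations, as suggested above, can be as simple as computing the same effect size for every candidate outcome and tabulating the results. The following is a minimal sketch under stated assumptions: two independent groups, Cohen's d with a pooled standard deviation as the effect size, and hypothetical names (`cohens_d`, `report_all_operationalizations`) that do not come from the cited literature.

```python
from statistics import mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference (a minus b) using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def report_all_operationalizations(data):
    """Return one effect size per outcome measure.

    `data` maps an outcome name (e.g., "error_rate") to a pair of score
    lists: (experimental group, control group).
    """
    return {name: round(cohens_d(a, b), 2) for name, (a, b) in data.items()}
```

A supplementary table built from such a dictionary lets future meta-analysts pick the operationalization they need, even if the original authors pre-registered and interpreted only one of them.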

Multi-Lab Collaborations

By providing results obtained from standardized procedures in a non-selective way, present researchers implicitly collaborate with future researchers (e.g., meta-analysts). Another possibility to generate largely unbiased evidence is through the encouragement of more explicit multi-lab collaborations. Multi-lab collaborations have already proven valuable in assessing the reproducibility of psychological effects (Klein et al., 2014; Open Science Collaboration, 2015; see McShane, Tackett, Bockenholt, & Gelman, 2019, for an overview). In addition, this approach might be particularly attractive in research fields involving difficult-to-study populations. Researchers in clinical psychophysiology, for example, collect comparatively expensive data from patients who meet narrow inclusion criteria to examine the physiological correlates of psychological changes in different disorders. As a corollary, psychophysiological studies often have low statistical power while involving numerous researcher degrees of freedom (Baldwin, 2017; Seer, Lange, Georgiev, Jahanshahi, & Kopp, 2016). Researchers interested in the psychophysiology of patients with schizophrenia have addressed these problems by administering standardized psychophysiological assessment protocols to patients at multiple study sites (Light et al., 2015; Turetsky et al., 2015). By pooling resources across sites, they were able to generate robust evidence for schizophrenia-related psychophysiological changes and to investigate moderators that can account for between-site variation in the size of these changes. Similarly, comparative psychologists studying psychological processes in non-human animals typically rely on small sample sizes (Stevens, 2017). MacLean and colleagues (2014) addressed this limitation by administering the same two self-control tasks to different species in numerous international research labs. Based on this unusually large data set, they identified several factors that might account for the evolution of between-species differences in self-control performance.

Notably, all of these examples involved replacing myriads of researcher degrees of freedom with standardized assessment protocols. Not all details of these protocols might be acceptable to all researchers in a field. Researchers may argue that other recording settings would be more appropriate to assess the psychophysiological responses targeted by Light, Turetsky, and colleagues, or that the tasks used by MacLean and colleagues do not provide a valid measure of self-control.
Certainly, there has been some debate (Baumeister & Vohs, 2016; Hagger & Chatzisarantis, 2016) with regard to the appropriateness of the self-control task used in a multi-lab study on ego depletion (Hagger et al., 2016). This discussion illustrates the need for independent validation of the methods used in all fields of (psychological) science. It is, of course, an option to repeat large-scale multi-lab collaborations with alternative methods, but given the often barely constrained number of methodological choices, this approach promises to be highly expensive. In the long run, shifting resources to systematic validation efforts may prove substantially more efficient.
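Pooling evidence across sites while allowing for between-site variation, as in the multi-lab examples above, is commonly done with a random-effects model. The sketch below uses the DerSimonian-Laird estimator as one possible choice; it is not claimed to be the method used in any of the cited studies, and the function name `pool_random_effects` is hypothetical. Inputs are one effect size and one sampling variance per site.

```python
def pool_random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling of per-site effect sizes.

    Returns (pooled effect, its standard error, between-site variance tau^2).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two sites with matching variances")

    # Fixed-effect (inverse-variance) estimate, needed for the Q statistic.
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q and the moment estimate of between-site variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Re-weight each site by its total (sampling + between-site) variance.
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = (1 / sum(w_star)) ** 0.5
    return pooled, se, tau2
```

A between-site variance estimate clearly above zero signals that sites differ by more than sampling error, which is the situation in which moderator analyses of the kind described above become informative.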

Conclusion

Standardization and incidental replications in some fields of clinical neuropsychology mitigate the credibility-threatening impact of researcher degrees of freedom and publication bias. Other fields of psychology could benefit from encouraging similar practices, especially when studying difficult-to-study populations. The proposed combination of method standardization and validation, complete result reports, and multi-lab collaboration can complement other measures (Munafò et al., 2017) in generating more credible psychological evidence.

References Abrahams, S., Goldstein, L. H., Kew, J. J. M., Brooks, D. J., Lloyd, C. M., Frith, C. D., & Leigh, P. N. (1996). Frontal lobe dysfunction in amyotrophic lateral sclerosis A PET study. Brain, 119, 2105–2120. American Psychological Association. (2010). Clinical Neuropsychology. Retrieved from specialize/neuro.aspx Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., . . . Perugini, M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27, 108–119. Baldwin, S. A. (2017). Improving the rigor of psychophysiology research. International Journal of Psychophysiology, 111, 5–16. Baumeister, R. F., & Vohs, K. D. (2016). Misguided effort with elusive implications. Perspectives on Psychological Science, 11, 574–575. Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50, 1088–1101. Berg, E. A. (1948). A simple objective technique for measuring flexibility in thinking. The Journal of General Psychology, 39, 15–22. Carter, E. C., Kofler, L. M., Forster, D. E., & McCullough, M. E. (2015). A series of meta-analytic tests of the depletion effect: Self-control does not seem to rely on a limited resource. Journal of Experimental Psychology: General, 144, 796–815. Carter, E. C., & McCullough, M. E. (2013). Is ego depletion too incredible? Evidence for the overestimation of the depletion effect. Behavioral and Brain Sciences, 36, 683–684. https://doi. org/10.1017/S0140525X13000952 Carter, E. C., & McCullough, M. E. (2014). Publication bias and the limited strength model of self-control: Has the evidence for ego depletion been overestimated? Frontiers in Psychology, 5, 823. Chiò, A., Logroscino, G., Traynor, B. J., Collins, J., Simeone, J. C., Goldstein, L. A., & White, L. A. (2013). Global epidemiology of amyotrophic lateral sclerosis: A systematic review of the published literature. Neuroepidemiology, 41, 118–130. 
https:// Crawford, J. R. (2004). Psychometric foundations of neuropsychological assessment. In L. H. Goldstein & J. E. McNeil (Eds.), Clinical neuropsychology: A practical guide to assessment and management for clinicians (pp. 121–140). Chichester, UK: Wiley. Demakis, G. J. (2003). A meta-analytic review of the sensitivity of the Wisconsin Card Sorting Test to frontal and lateralized frontal brain damage. Neuropsychology, 17, 255–264. https:// Demakis, G. J. (2006). Meta-analysis in neuropsychology: An introduction. The Clinical Neuropsychologist, 20, 5–9. https:// Engber, D. (2016). Everything is crumbling. Slate. Retrieved from story/2016/03/ego_depletion_an_influential_theory_in_psychology_may_have_just_been_debunked.html

F. Lange, Difficult-To-Study Populations

Eysenck, H. J. (1978). An exercise in mega-silliness. American Psychologist, 33, 517. 517.a Field, A. P., & Gillett, R. (2010). How to do a meta-analysis. British Journal of Mathematical and Statistical Psychology, 63, 665–694. Geevasinga, N., Loy, C. T., Menon, P., de Carvalho, M., Swash, M., Schrooten, M., . . . Noto, Y. I. (2016). Awaji criteria improves the diagnostic sensitivity in amyotrophic lateral sclerosis: A systematic review using individual patient data. Clinical Neurophysiology, 127, 2684–2691. 2016.04.005 Gelman, A., & Geurts, H. M. (2017). The statistical crisis in science: How is it relevant to clinical neuropsychology? The Clinical Neuropsychologist, 31, 1000–1014. 10.1080/13854046.2016.1277557 Gelman, A., & Loken, E. (2014). The statistical crisis in science. American Scientist, 102, 460–465. 2014.111.460 Goodale, M. A., Meenan, J. P., Bülthoff, H. H., Nicolle, D. A., Murphy, K. J., & Racicot, C. I. (1994). Separate neural pathways for the visual analysis of object shape in perception and prehension. Current Biology, 4, 604–610. 10.1016/S0960-9822(00)00132-9 Grant, D. A., & Berg, E. (1948). A behavioral analysis of degree of reinforcement and ease of shifting to new responses in a Weigltype card-sorting problem. Journal of Experimental Psychology, 38, 404–411. Hagger, M. S., & Chatzisarantis, N. L. (2016). Commentary: Misguided effort with elusive implications, and sifting signal from noise with replication science. Frontiers in Psychology, 7, 621. Hagger, M. S., Chatzisarantis, N. L., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., . . . Calvillo, D. P. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11, 546–573. 10.1177/1745691616652873 Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. (2010). Ego depletion and the strength model of self-control: A metaanalysis. Psychological Bulletin, 136, 495–525. 
10.1037/a0019486 Hammer, A., Vielhaber, S., Rodriguez-Fornells, A., Mohammadi, B., & Münte, T. F. (2011). A neurophysiological analysis of working memory in amyotrophic lateral sclerosis. Brain Research, 1421, 90–99. Inzlicht, M., Gervais, W., & Berkman, E. (2015). Bias-correction techniques alone cannot determine whether ego depletion is different from zero: Commentary on Carter, Kofler, Forster, & McCullough, 2015. Retrieved from Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2, e124. pmed.0020124 Keane, M. M., Gabrieli, J. D., Mapstone, H. C., Johnson, K. A., & Corkin, S. (1995). Double dissociation of memory capacities after bilateral occipital-lobe or medial temporal-lobe lesions. Brain, 118, 1129–1148. Kilani, M., Micallef, J., Soubrouillard, C., Rey-Lardiller, D., Demattei, C., Dib, M., . . . Blin, O. (2004). A longitudinal study of the evolution of cognitive function and affective state in patients with amyotrophic lateral sclerosis. Amyotrophic Lateral Sclerosis, 5, 46–54. Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B. Jr., Bahník, Š., Bernstein, M. J., . . . Cemalcilar, Z. (2014). Investigating variation in replicability. Social Psychology, 45, 142–152. https://doi. org/10.1027/1864-9335/a000178



Kopp, B., Lange, F., & Steinke, A. (2019). The reliability of the Wisconsin Card Sorting Test in clinical practice. Assessment. Advance online publication. 1073191119866257 Kühberger, A., Fritz, A., & Scherndl, T. (2014). Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS One, 9, e105825. https://doi. org/10.1371/journal.pone.0105825 Lange, F., Brückner, C., Knebel, A., Seer, C., & Kopp, B. (2018). Executive dysfunction in Parkinson’s disease: A meta-analysis on the Wisconsin Card Sorting Test literature. Neuroscience & Biobehavioral Reviews, 93, 38–56. neubiorev.2018.06.014 Lange, F., & Eggert, F. (2014). Sweet delusion. Glucose drinks fail to counteract ego depletion. Appetite, 75, 54–63. https://doi. org/10.1016/j.appet.2013.12.020 Lange, F., Seer, C., Müller-Vahl, K., & Kopp, B. (2017). Cognitive flexibility and its electrophysiological correlates in Gilles de la Tourette syndrome. Developmental Cognitive Neuroscience, 27, 78–90. Lange, F., Seer, C., Salchow, C., Dengler, R., Dressler, D., & Kopp, B. (2016). Meta-analytical and electrophysiological evidence for executive dysfunction in primary dystonia. Cortex, 82, 133–146. Lange, F., Vogts, M.-B., Seer, C., Fürkötter, S., Abdulla, S., Dengler, R., Kopp, B., & Petri, S. (2016). Impaired set-shifting in amyotrophic lateral sclerosis: An event-related potential study of executive function. Neuropsychology, 30, 120–134. LeBel, E. P., & Paunonen, S. V. (2011). Sexy but often unreliable: The impact of unreliability on the replicability of experimental findings with implicit measures. Personality and Social Psychology Bulletin, 37, 570–583. 0146167211400619 Li, C. S. R. (2004). Do schizophrenia patients make more perseverative than non-perseverative errors on the Wisconsin Card Sorting Test? A meta-analytic study. Psychiatry Research, 129, 179–190. 016 Light, G. A., Swerdlow, N. R., Thomas, M. L., Calkins, M. E., Green, M. F., Greenwood, T. A., . . . 
Pela, M. (2015). Validation of mismatch negativity and P3a for use in multi-site studies of schizophrenia: Characterization of demographic, clinical, cognitive, and functional correlates in COGS-2. Schizophrenia Research, 163, 63–72. 09.042 Lurquin, J. H., Michaelson, L. E., Barker, J. E., Gustavson, D. E., Von Bastian, C. C., Carruth, N. P., & Miyake, A. (2016). No evidence of the ego-depletion effect across task characteristics and individual differences: A pre-registered study. PLoS One, 11, e0147770. Lurquin, J. H., & Miyake, A. (2017). Challenges to ego-depletion research go beyond the replication crisis: A need for tackling the conceptual crisis. Frontiers in Psychology, 8, 568. https:// MacLean, E. L., Hare, B., Nunn, C. L., Addessi, E., Amici, F., Anderson, R. C., . . . Boogert, N. J. (2014). The evolution of selfcontrol. Proceedings of the National Academy of Sciences, 111, E2140–E2148. MacLeod, C. M. (1991). Half a century of research on the Stroop effect: An integrative review. Psychological Bulletin, 109, 163–203. McShane, B. B., Tackett, J. L., Bockenholt, U., & Gelman, A. (2019). Large scale replication projects in contemporary psychological research. The American Statistician, 73, 99–105. https://doi. org/10.1080/00031305.2018.1505655

European Psychologist (2020), 25(1), 41–50


Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., . . . Vanpaemel, W. (2016). The Peer Reviewers’ Openness Initiative: Incentivizing open research practices through peer review. Royal Society Open Science, 3, 150547. Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., du Sert, N. P., . . . Ioannidis, J. P. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. https://doi. org/10.1126/science.aac4716 Palmieri, A., Abrahams, S., Sorarù, G., Mattiuzzi, L., D’Ascenzo, C., Pegoraro, E., & Angelini, C. (2009). Emotional lability in MND: Relationship to cognition and psychopathology and impact on caregivers. Journal of the Neurological Sciences, 278, 16–20. Raaphorst, J., de Visser, M., van Tol, M. J., Linssen, W. H., van der Kooi, A. J., de Haan, R. J., . . . Schmand, B. (2010). Cognitive dysfunction in lower motor neuron disease: Executive and memory deficits in progressive muscular atrophy. Journal of Neurology, Neurosurgery & Psychiatry, 82, 170–175. https://doi. org/10.1136/jnnp.2009.204446 Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641. https://doi. org/10.1037/0033-2909.86.3.638 Rosenthal, R., & DiMatteo, M. R. (2001). Meta-analysis: Recent developments in quantitative methods for literature reviews. Annual Review of Psychology, 52, 59–82. 10.1146/annurev.psych.52.1.59 Seer, C., Lange, F., Georgiev, D., Jahanshahi, M., & Kopp, B. (2016). Event-related potentials and cognition in Parkinson’s disease: An integrative review. Neuroscience & Biobehavioral Reviews, 71, 691–714. Sharpe, D. (1997). Of apples and oranges, file drawers and garbage: Why validity issues in meta-analysis will not go away. Clinical Psychology Review, 17, 881–901. 10.1016/S0272-7358(97)00056-1 Simmons, J. P., Nelson, L. 
D., & Simonsohn, U. (2011). Falsepositive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. 0956797611417632 Stamelou, M., Edwards, M. J., Hallett, M., & Bhatia, K. P. (2012). The non-motor syndrome of primary dystonia: Clinical and pathophysiological implications. Brain, 135, 1668–1681. Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11, 702–712. https://doi. org/10.1177/1745691616658637 Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association, 54, 30–34. Stevens, J. R. (2017). Replicability and reproducibility in comparative psychology. Frontiers in Psychology, 8, 862. https://doi. org/10.3389/fpsyg.2017.00862 Strauss, E., Sherman, E., & Spreen, O. (2006). A compendium of neuropsychological tests (3rd ed.). New York, NY: Oxford University Press.


Stuss, D. T., & Levine, B. (2002). Adult clinical neuropsychology: Lessons from studies of the frontal lobes. Annual Review of Psychology, 53, 401–433. psych.53.100901.135220 Turetsky, B. I., Dress, E. M., Braff, D. L., Calkins, M. E., Green, M. F., Greenwood, T. A., . . . Radant, A. D. (2015). The utility of P300 as a schizophrenia endophenotype and predictive biomarker: Clinical and socio-demographic modulators in COGS-2. Schizophrenia Research, 163, 53–62. 10.1016/j.schres.2014.09.024 Van Elk, M., Matzke, D., Gronau, Q., Guang, M., Vandekerckhove, J., & Wagenmakers, E. J. (2015). Meta-analyses are no substitute for registered replications: A skeptical perspective on religious priming. Frontiers in Psychology, 6, 1365. https://doi.org/10.3389/fpsyg.2015.01365 Vazire, S. (2018). Implications of the credibility revolution for productivity, creativity, and progress. Perspectives on Psychological Science, 13, 411–417. 1745691617751884 Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7, 632–638. Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Making replication mainstream. Behavioral and Brain Sciences, 41, e120.

History
Received November 20, 2018
Revision received July 8, 2019
Accepted August 4, 2019
Published online December 6, 2019

Funding
This work received funding from the FWO and European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 665501.

ORCID
Florian Lange

Florian Lange
Behavioral Engineering Research Group
KU Leuven
Naamsestraat 69
3000 Leuven
Belgium

Florian Lange is an FWO [PEGASUS]² Marie Skłodowska-Curie Fellow working in the Behavioral Engineering Research Group at KU Leuven. His research interests range from the neuropsychology of cognitive flexibility to the assessment and promotion of pro-environmental behavior.

© 2019 Hogrefe Publishing

Original Articles and Reviews

A Series of Meta-Analytic Tests of the Efficacy of Long-Term Psychoanalytic Psychotherapy

Christian Franz Josef Woll¹ and Felix D. Schönbrodt²

¹ Department of Psychology, Clinical Psychology of Children and Adolescents and Psychology of Interventions, Ludwig-Maximilians-Universität Munich, Germany
² Department of Psychology, Psychological Methods and Assessment, Ludwig-Maximilians-Universität Munich, Germany

Abstract: Recent meta-analyses come to conflicting conclusions about the efficacy of long-term psychoanalytic psychotherapy (LTPP). Our first goal was to reproduce the most recent meta-analysis by Leichsenring, Abbass, Luyten, Hilsenroth, and Rabung (2013), who found evidence for the efficacy of LTPP in the treatment of complex mental disorders. Our replicated effect sizes were in general slightly smaller. Second, we conducted an updated meta-analysis of randomized controlled trials comparing LTPP (lasting for at least 1 year and 40 sessions) to other forms of psychotherapy in the treatment of complex mental disorders. We focused on a transparent research process according to open science standards and applied a series of elaborated meta-analytic procedures to test and control for publication bias. Our updated meta-analysis comprising 191 effect sizes from 14 eligible studies revealed small, statistically significant effect sizes at post-treatment for the outcome domains psychiatric symptoms, target problems, social functioning, and overall effectiveness (Hedges’ g ranging between 0.24 and 0.35). The effect size for the domain personality functioning (0.24) was not significant (p = .08). No signs of publication bias could be detected. In light of a heterogeneous study set and some methodological shortcomings in the primary studies, these results should be interpreted cautiously. In conclusion, LTPP might be superior to other forms of psychotherapy in the treatment of complex mental disorders. Notably, our effect sizes represent the additional gain of LTPP versus other forms of primarily long-term psychotherapy. In this case, large differences in effect sizes are not to be expected.

Keywords: efficacy, long-term, psychoanalytic psychotherapy, meta-analysis, publication bias

Since its origin at the end of the 19th century, psychoanalysis has always been controversial. Some have criticized its scientific standing during the 20th century (Grünbaum, 1988; Popper, 1972), whereas others have corroborated psychoanalytic concepts with empirical evidence (Masling, 1983; Werner & Langenmayr, 2006). To grasp the complexity of psychoanalysis, one first needs to distinguish between psychoanalysis as a theory of the human mind, as a methodology to study human and social phenomena, and as a therapeutic method to treat mental disorders (Mertens, 2013). In our paper, we solely focus on psychoanalysis as a therapeutic method.

In the introduction of their systematic review on long-term psychoanalytic therapy, de Maat, de Jonghe, Schoevers, and Dekker (2009) list 15 studies dated from 1917 to 1968, showing that efforts have been made from the very beginning to prove the effectiveness of psychoanalytic therapy. Recent meta-analyses come to conflicting conclusions about the efficacy of long-term psychoanalytic psychotherapy (Leichsenring, Abbass, Luyten, Hilsenroth, & Rabung, 2013; Smit et al., 2012). In our study, we aim to address this conflicting evidence by replicating the most recent meta-analysis by Leichsenring et al. (2013) and by conducting an updated meta-analysis of long-term psychoanalytic psychotherapy. However, before opening the discourse in more detail, we first outline some basic definitions and the current status of psychoanalytic psychotherapy research.

Long-term psychoanalytic therapy encompasses long-term psychoanalytic psychotherapy (LTPP) and psychoanalysis proper (de Maat et al., 2013). Short-term psychoanalytic psychotherapy (STPP) is a less time-consuming and more focused treatment modality. Figure 1 gives a schematic overview of the psychoanalytic treatment modalities, also showing the equivalent use of the terms psychoanalytic and psychodynamic. There is no generally accepted definition of the dosage of short-term and long-term psychoanalytic psychotherapy. Abbass et al. (2014) define short-term as lasting 40 or fewer sessions in total. However, some researchers have begun to divide the range up to 40 sessions into short-term and medium-term (Leichsenring, Leweke, Klein, & Steinert, 2015), which might be useful for a growing database (Leichsenring & Rabung, 2011). Gabbard (2017) argues that therapies lasting for 40 or more sessions should be regarded as long-term. The difference between the two long-term modalities is most frequently explained by the therapeutic setting (de Maat et al., 2013). In psychoanalysis proper, the therapist is sitting behind the patient lying on a couch, whereas in LTPP, both of them are sitting on chairs facing each other. The frequency of sessions ranges from two to five sessions per week in psychoanalysis proper and from one to two sessions per week in LTPP.

Figure 1. Psychoanalytic treatment modalities (de Maat et al., 2013, p. 108). Figure available under a CC-BY 4.0 license.

European Psychologist (2020), 25(1), 51–72

The controversial debate about the efficacy of STPP seems settled since there is a wide range of studies fulfilling the criteria of evidence-based psychotherapy (Abbass, Hancock, Henderson, & Kisely, 2006; Abbass et al., 2014; Leichsenring et al., 2015; Leichsenring, Rabung, & Leibing, 2004). However, every form of psychotherapy has to be rescrutinized as methods progress since effect sizes might be overestimated (Cuijpers, Karyotaki, Reijnders, & Ebert, 2019). Nevertheless, one should carefully consider the choice of the control condition when interpreting an effect size (Munder et al., 2019).

For the two long-term treatment modalities LTPP and psychoanalysis proper, the controversial debate is still going on. To begin with, it is worth noting that many patients, in particular those with chronic mental disorders or personality disorders, may need more extended treatments because shorter forms of psychotherapy are not sufficient to cure them effectively and sustainably (Kopta, Howard, Lowry, & Beutler, 1994; Zimmermann et al., 2015). Further findings support this assumption because psychological treatments for mood disorders in adults which offer an additional maintenance and continuation treatment have been found to reduce the risk of recurrence (Hollon & Ponniah, 2010).
These findings are consistent with results from meta-analyses suggesting that LTPP is superior to shorter forms of psychotherapy in the treatment of complex mental disorders, meaning multiple mental disorders, chronic mental disorders, or personality disorders (Leichsenring et al., 2013; Leichsenring & Rabung, 2011). These meta-analyses only included randomized controlled trials (RCTs) and showed positive results for the efficacy of LTPP. With 10 and 12 studies in the meta-analyses from 2011 and 2013, respectively, the number of RCTs is relatively small, though. Based on an overlapping dataset, Smit et al. (2012) conducted another meta-analysis of RCTs, coming to the conclusion that the evidence for an effect of LTPP is limited and conflicting. However, more in line with the psychoanalytic research tradition, there is a higher number of naturalistic, uncontrolled studies revealing moderate to large effects for LTPP, as shown in a systematic review by de Maat et al. (2009).

Considering de Maat et al.’s (2013) meta-analysis of outcome research on psychoanalysis proper, the empirical evidence for this long-term treatment modality is based on a limited number of mainly naturalistic, uncontrolled studies providing evidence for pre-to-post changes in psychoanalysis patients with complex mental disorders. Only one of the included studies was an RCT, conducted by Huber, Zimmermann, Henrich, and Klug (2012). Currently, Beutel et al. (2016) and Benecke, Huber, Staats, et al. (2016) are each running an RCT comparing long-term cognitive behavioral therapy to long-term psychoanalytic therapy (long-term CBT vs. LTPP and psychoanalysis proper). Their study designs follow the highest standards of current psychotherapy research. They will soon provide further information on the outcomes of long-term psychoanalytic therapy.

To summarize, at this stage there is considerable evidence for an effect of STPP and, at the same time, the number of RCTs of psychoanalysis proper is too small to conduct a reasonable meta-analysis. It was, therefore, our goal to make a contribution towards resolving the conflicting evidence for LTPP (Leichsenring et al., 2013; Smit et al., 2012) by conducting a meta-analysis of RCTs.
Focusing on RCTs is required by international guidelines such as the current, updated concept of empirically supported treatments © 2019 Hogrefe Publishing

C. F. J. Woll & F. D. Schönbrodt, Efficacy of LTPP

(ESTs) by the American Psychological Association (APA; Tolin, McKay, Forman, Klonsky, & Thombs, 2015). Notably, even though the concept was updated, it can still be criticized as being too far removed from clinical, psychotherapeutic reality (e.g., neglecting comorbidity and applying random assignment; Orlinsky, 2008; Seligman, 1995). When conducting a meta-analysis, errors are common, and the number of effect sizes that cannot be reproduced is much higher than is desirable (Lakens, Hilgard, & Staaks, 2016). Therefore, transparent, reproducible replications and updates of meta-analyses are needed to increase the credibility of meta-analytic conclusions. Another important issue when combining information across multiple studies is the adjustment for publication bias, meaning that statistically nonsignificant, counterfactual results are less likely to be published (McShane, Böckenholt, & Hansen, 2016). Fanelli (2012) provides overwhelming evidence for publication bias across disciplines and countries. Hence, it is crucial to employ statistical techniques to assess and adjust for publication bias. In the update of their meta-analysis from 2011, Leichsenring et al. (2013) critically reviewed the meta-analysis conducted by Smit et al. (2012), who doubt an effect of LTPP. Leichsenring et al. (2013) found several methodological shortcomings, such as the inclusion of studies whose content does not represent LTPP. Additionally, Smit et al. (2012) included two studies that do not meet their own inclusion criterion for the length of LTPP. Furthermore, Smit et al. (2012) actually compared LTPP to other forms of long-term therapy, as indicated by a mean session ratio of 1.35 (i.e., the mean number of sessions in LTPP divided by the mean number of sessions in the control group). They found LTPP to be as efficacious and, thus, did not really find results contradicting those of the previous meta-analysis by Leichsenring and Rabung (2011), in which the session ratio was 1.96.
Their session ratio of 1.96 indicated that about twice as many sessions were carried out in the LTPP condition as in the control conditions, meaning that they compared LTPP to shorter or less intensive forms of psychotherapy. Since the meta-analysis by Smit et al. (2012) has already been critically reviewed, we decided to review and update Leichsenring et al.’s (2013) most recent meta-analysis of LTPP. In a first step, we tried to reproduce Leichsenring et al.’s (2013) effect sizes using the same set of studies they had used. Second, we conducted our own meta-analysis (a) updating the database with recent RCTs of complex mental disorders, (b) following Lakens et al.’s (2016) recommendations for a transparent research process, (c) applying international guidelines to assess the risk of bias in the primary studies (Higgins et al., 2011), and (d) employing elaborated statistical techniques to assess and adjust for publication bias.


With LTPP causing higher direct financial costs because of a higher number of sessions, its effects need to exceed those of less intensive treatments. We, therefore, compared LTPP to other forms of psychotherapy by regarding the session ratio of each study as a possible continuous moderating variable, implying that a higher session ratio should result in a larger effect size and vice versa. After accounting for publication bias in our meta-analysis, we expected our effect sizes to be smaller than those of Leichsenring et al. (2013) but different from a null effect considering moderate to large effects from naturalistic, uncontrolled studies (de Maat et al., 2009).
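To make the moderator concrete, the following minimal sketch computes a session ratio and places it on a log scale, the transformation that the Statistical Analysis section later applies before meta-regression. The function name is hypothetical and not taken from the authors' scripts (which were written in R); only the two ratios 1.35 and 1.96 come from the text.

```python
import math


def log_session_ratio(mean_sessions_ltpp: float, mean_sessions_control: float) -> float:
    """Natural log of the session ratio (mean LTPP sessions / mean control sessions).

    Hypothetical helper for illustration only. On the log scale, a ratio of 1
    (equal dosage) maps to 0, and a ratio of 2 and a ratio of 1/2 are
    equally far from it.
    """
    if mean_sessions_ltpp <= 0 or mean_sessions_control <= 0:
        raise ValueError("mean session counts must be positive")
    return math.log(mean_sessions_ltpp / mean_sessions_control)


# Mean session ratios quoted in the text for two earlier meta-analyses.
for label, ratio in [("Smit et al. (2012)", 1.35),
                     ("Leichsenring & Rabung (2011)", 1.96)]:
    print(label, round(math.log(ratio), 3))
```

Under the hypothesis stated above, studies with a larger log session ratio would be expected to show larger effect sizes in favor of LTPP.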

Method

To ensure a reproducible research process, we (a) disclosed all of our meta-analytic data and statistical scripts on the Open Science Framework (see vec5d/), (b) specified which effect size calculations were used and which assumptions were made for missing data to facilitate quality control, (c) adhered to reporting guidelines, in our case the Meta-Analysis Reporting Standards (MARS; APA Publications and Communications Board Working Group on Journal Article Reporting Standards, 2008) to guarantee a clear and thorough description, (d) pre-registered our introduction and method section on the Open Science Framework (see before starting with data collection and data analysis to be able to distinguish between confirmatory and exploratory analyses, (e) hereby allow other researchers to re-analyze our data using our data files (including our entire literature hits from databases in common file formats) and statistical scripts, and (f) recruited expertise. We followed the model of the APA by Tolin et al. (2015) by trying to classify the evidential value of our results as very strong, strong, or weak. For this classification, we applied the recommended GRADE system (Atkins et al., 2004; Guyatt et al., 2008) to assess the quality of our meta-analysis. A PICOTS approach was used to clearly define the population of interest (P), the intervention (I), the comparisons considered (C), the outcomes examined (O), the timing of outcome assessment (T), and the setting of treatment (S). However, a complete review process of LTPP according to Tolin et al. (2015) was not feasible in our study because our focus lay on RCTs and we did not review naturalistic, uncontrolled studies, which would be required for a full evaluation.

Definition of Long-Term Psychoanalytic Psychotherapy

We followed the definition of LTPP given in a previous meta-analysis by Leichsenring and Rabung (2011), but differed in the definition of the duration and dosage. Even though some experts regard long-term as lasting for 50 or more sessions (Crits-Christoph & Barber, 2000), we follow a current expert recommendation by Gabbard (2017), who argues that therapies lasting for 40 or more sessions should be included in the overarching rubric of long-term treatment. In accordance with Smit et al. (2012), we defined the duration and dosage of LTPP as lasting for at least 1 year and 40 sessions (in contrast to 1 year or 50 sessions in Leichsenring & Rabung, 2011). Additionally, we clearly distinguished between psychoanalysis and LTPP by considering the setting (de Maat et al., 2013). We, therefore, only included intervention conditions in which the therapist and the patient are in a sitting position.

Inclusion Criteria

In the first step of our study, we analyzed the same set of studies which Leichsenring et al. (2013) had included in their meta-analysis in order to reproduce their effect sizes and results. In our updated meta-analysis, we applied the following inclusion criteria according to the PICOTS approach: (1) Patients/Problems: A clearly delineated sample of patients (≥ 18 years of age) with complex mental disorders (chronic mental disorders, more than one mental disorder, or personality disorder) is examined. (2) Intervention: LTPP meets the definition given above and lasts for at least 1 year and 40 sessions. (3) Comparator: Active control treatments differing substantially from the intervention treatment must be applied (a short-term treatment or a different type of treatment). (4) Outcomes: Reliable and valid outcome measures are used (see Outcome Measures). (5) Timing of outcome assessment: We analyzed pre- and post-treatment assessments (post-treatment defined as the point in time when the longer one of the compared treatments was finished); follow-up assessments were analyzed separately if provided. (6) Setting/Study design: Studies are randomized or quasi-randomized (e.g., randomization by alternation or date of birth) controlled trials of individual therapy. (7) Additionally, the reported data must allow between-group effect sizes to be determined.

Information Sources and Search

We chose the online databases EBSCO (including MEDLINE, PsycINFO, PsycARTICLES), Web of Science, and Cochrane Library for our literature search. Document type was set to academic journals, journals, and dissertations for EBSCO, and articles, reviews, proceedings papers, and meeting abstracts for Web of Science. In Cochrane Library, search limits were set to all Cochrane reviews, other reviews, and trials. We did not deviate in any other way from the default settings of these three databases. We used the same, well-tried search terms as Leichsenring and Rabung (2008, 2011): (psychodynamic OR dynamic OR psychoanalytic* OR transference-focused OR self psychology OR psychology of self) AND (therapy OR psychotherapy OR treatment) AND (study OR studies OR trial*) AND (outcome OR result* OR effect* OR change*) AND (psych* OR mental*) AND (rct* OR control* OR compar*). Additionally, we communicated with experts in the field to search for additional published as well as unpublished data. We had the opportunity to send a request for additional data via an international mailing list of around 500 psychoanalytic researchers. Additional data were accepted until June 11, 2017. We also manually screened reference lists in articles, reviews, and textbooks. Furthermore, all studies included in Leichsenring and Rabung (2011), Leichsenring et al. (2013), and Smit et al. (2012) were checked for inclusion in our meta-analysis.

Study Selection and Data Collection

The main author (CW) selected studies for inclusion. To resolve ambiguity, the second author (FS) was consulted and consensus was reached through discussion. Data extraction was done in duplicate for all raw data. To assess the reliability of data extraction, we calculated Cohen’s κ as recommended by Cooper (2010), keeping the limitations of this measure in mind (Eagan et al., 2017). Before our final analysis, consensus was reached through discussion for all codes. We extracted data for the following variables: author names, publication year, date and place of the trial, publication status, psychiatric disorder of the sample, age and gender of patients, general clinical experience of therapists (years), specific experience with the patient group under study (years), duration of follow-up period, and co-interventions (e.g., use of psychotropic medication). The following variables were extracted for LTPP and the control treatment, respectively: type of treatment, duration, mean number of sessions, drop-outs, and mean number of sessions in completers. In a second database, we collected all relevant data for effect size calculations (see Outcome Measures).
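As an illustration of the agreement statistic mentioned above, the following sketch computes Cohen's κ for two raters who coded the same items. The function name and the example codes are hypothetical; this is the textbook formula, not the authors' own script.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one categorical code per item."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical publication-status codes extracted independently by two coders.
coder_1 = ["published", "published", "unpublished", "published", "unpublished"]
coder_2 = ["published", "published", "unpublished", "unpublished", "unpublished"]
print(cohens_kappa(coder_1, coder_2))
```

κ equals 1 for perfect agreement and 0 when agreement is no better than expected from the raters' marginal frequencies.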

Assessment of the Risk of Bias in Individual Trials

To assess the risk of bias in individual trials, we used the Cochrane Risk of Bias Tool (Higgins et al., 2011). According to the tool, we assigned a judgement of low, high, or unclear risk of bias to our key items random sequence generation, allocation concealment, blinding of outcome assessment, incomplete outcome data, and selective reporting. The item blinding of participants and personnel was rated as unclear for all studies as considered by Higgins et al. (2011). Blinding of participants and personnel is impossible in psychotherapy research, but outcomes may be influenced by participants’ knowledge of which intervention they received. Hence, all studies were treated as being at a possible risk of bias on this item. The following additional biases arose in our research process: (a) ongoing treatments were included, (b) data to calculate the exact effect size and/or session ratio were incomplete, which made an approximation necessary, and (c) therapies in the intervention and control group were carried out by the same therapists. We included these biases as key items. If information concerning a key item was unavailable in a primary study, we rated the item as unclear. Mainly following an example by Higgins et al. (2011), we summarized the assessment of risk of bias within and across trials as shown in Table 1.

Table 1. Summary assessments of risk of bias within and across studies (adapted from Higgins et al., 2011)

Risk of bias | Interpretation | Within trial | Across trials
Low risk of bias | Bias, if present, is unlikely to alter the results seriously | Low risk of bias for all key items | The majority of trials carry a low risk of bias
Unclear risk of bias | A risk of bias raises some doubt about the results | Low or unclear risk of bias for all key items | The majority of trials carry a low or unclear risk of bias
High risk of bias | Bias may alter the results seriously | High risk of bias for one or more key items | The majority of trials carry a high risk of bias

Since we a priori expected a small number of studies in our meta-analysis, we chose Higgins et al.’s (2011) recommended strategy: Present a meta-analysis of all studies and include the summary of the risk of bias across studies in the interpretation of the results. For the sake of relevance, we used six out of eight quality criteria (two criteria are covered by the Cochrane Risk of Bias Tool) proposed by Cuijpers, van Straten, Bohlmeijer, Hollon, and Andersson (2010) to assess important aspects of psychotherapy research.
We checked if (a) patients were diagnosed using a diagnostic system, (b) a treatment manual was used, (c) treatment integrity was checked (supervision or analysis of adherence), (d) therapists were trained for the intervention under study, (e) an intention-to-treat analysis was included, and (f) adequate statistical power and n ≥ 50 were given. Additionally, we documented whether clinically significant change was measured and reported as recommended by Tolin et al. (2015). The exact coding rules for each item are provided in our uploaded Excel file. Higgins et al. (2011) argue against generating summary scores comprising different quality criteria across studies, which is why we solely assessed the single quality criteria across studies and included the findings in the interpretation of our results.
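The within-trial column of Table 1 amounts to a simple decision rule, which can be sketched as follows. The function name, the rating strings, and the example trial are hypothetical illustrations; the authors' actual coding rules are those in their uploaded spreadsheet.

```python
ALLOWED_RATINGS = {"low", "unclear", "high"}


def summarize_within_trial(key_item_ratings):
    """Summary risk of bias for one trial from its key-item ratings (Table 1).

    key_item_ratings maps each key item to "low", "unclear", or "high".
    """
    ratings = set(key_item_ratings.values())
    if not ratings or not ratings <= ALLOWED_RATINGS:
        raise ValueError("every key item needs a rating of low, unclear, or high")
    if "high" in ratings:
        return "high"      # high risk of bias for one or more key items
    if ratings == {"low"}:
        return "low"       # low risk of bias for all key items
    return "unclear"       # low or unclear risk of bias for all key items


# Hypothetical trial: all key items rated low except one unclear item.
example_trial = {
    "random sequence generation": "low",
    "allocation concealment": "unclear",
    "blinding of outcome assessment": "low",
    "incomplete outcome data": "low",
    "selective reporting": "low",
}
print(summarize_within_trial(example_trial))
```

The sketch covers only the within-trial summary; the across-trial summary in Table 1 is stated in terms of majorities and would need an explicit rule for ties.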

Outcome Measures

Following Leichsenring et al. (2013), we assessed effect sizes for the domains general psychiatric symptoms, personality functioning, social functioning, target problems, and overall effectiveness. Leichsenring et al. (2013) argue that recovery rates, which Smit et al. (2012) used, do not seem a reliable measure across studies of LTPP because definitions and measures of recovery are still too heterogeneous. To assess general psychiatric symptoms, we searched the studies for broad symptom checklists such as the Symptom Checklist 90 (SCL-90; Derogatis, 1977) or other more specific symptom measures (e.g., measures of depression or anxiety as well as direct or indirect measures of single symptoms such as suicide attempts or hospitalization/emergency room visits for borderline patients). Personality functioning was assessed by measures of personality structure and personality characteristics focusing on the individual patient such as the Reflective Function Scale (Fonagy, Steele, Steele, & Target, 1998). For the assessment of social functioning, we used measures concerning a patient’s social, interpersonal environment (e.g., the Social Adjustment Scale; Weissman & Bothwell, 1976, or measures assessing work ability or quality of life). To measure target problems, specific symptom measures (e.g., measures of depression for depressed patients, or measures of impulse control for borderline patients), direct and indirect measures of single symptoms, as well as measures of personality and social functioning which were specific to the patient group under study (e.g., measures of personality structure for borderline patients) were included. We first assigned measures to one of the three categories general psychiatric symptoms, personality functioning, or social functioning. Specific symptom measures, measures of single symptoms as well as measures of personality and social functioning which were specific to the patient group under study could then be additionally included in the domain target problems in order not to narrow the data basis in the respective outcome domain. Target problems are, thus, not independent of the three other domains.


In the overarching outcome measure overall effectiveness, we, therefore, only included the three independent domains general psychiatric symptoms, personality functioning, and social functioning by averaging the effect sizes of these domains. We excluded outcome measures for which the direction of the effect could not be clearly defined as unidirectional (e.g., less medication use can generally be regarded as an improvement, but for patients suffering from a bipolar disorder, for example, medication compliance may be regarded as a therapeutic target). The first author (CW) assigned each primary outcome measure to an outcome category. The first half of the assignments was compared with those of a second rater from our working group to discuss and specify the definition of our outcome measures. We then calculated Cohen’s κ for the second half of assignments by the two raters. Before our final analysis, consensus was reached through discussion for all assignments. If a study contained more than one outcome measure for one domain, we assessed the effect size and variance separately for each measure and then calculated the mean effect size and mean variance for these measures. We averaged the variances because we did not know the correlation between the respective outcomes, which constitutes the conservative approach because it implies a correlation of 1.00, leading to an overestimation of the variance and an underestimation of precision (Borenstein, Hedges, Higgins, & Rothstein, 2009). However, this approach also assumes that the underlying true effects are homogeneous, which is not a conservative assumption. Hence, in case of heterogeneity, our effect sizes should be interpreted cautiously. If a study contained more than one intervention or control group, we preferred an outpatient individual LTPP intervention group to an inpatient LTPP intervention group because more trials deal with outpatients. Thus, we tried to reduce heterogeneity.
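The averaging of several outcome measures within one domain, described above, can be related to the general composite formula in Borenstein et al. (2009), in which the variance of a mean of m correlated effect sizes depends on an assumed common correlation r. The sketch below is an illustration under that assumption; the function name and the numbers are hypothetical. When all outcomes have the same variance, setting r = 1 reproduces the mean variance that the authors use, whereas r = 0 would yield a much smaller variance.

```python
import math


def composite_effect(effects, variances, r):
    """Mean effect size and its variance for m outcomes with common correlation r.

    Variance of the mean = (1 / m^2) * (sum of V_i + sum over i != j of
    r * sqrt(V_i) * sqrt(V_j)). The common correlation r is an assumption,
    since primary studies rarely report it.
    """
    if len(effects) != len(variances) or len(effects) == 0:
        raise ValueError("need one variance per effect size")
    m = len(effects)
    mean_effect = sum(effects) / m
    total = sum(variances)
    for i in range(m):
        for j in range(m):
            if i != j:
                total += r * math.sqrt(variances[i]) * math.sqrt(variances[j])
    return mean_effect, total / m ** 2


# Hypothetical study with two depression measures in the same domain.
g_values, v_values = [0.20, 0.40], [0.04, 0.04]
print(composite_effect(g_values, v_values, r=1.0))  # variance stays at the mean variance
print(composite_effect(g_values, v_values, r=0.0))  # variance is halved
```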
If a study contained more than one outpatient intervention group, we assessed the effect size and variance separately for each group and then calculated the mean effect size and mean variance. Notably, this approach makes an implicit assumption of homogeneity which may lead to inaccuracies in case of heterogeneous interventions. We, therefore, tried to strictly follow our definition of LTPP with the goal to reduce heterogeneity. In the case of more than one control group, we chose (1) an evidence-based treatment (e.g., cognitive behavior therapy, interpersonal psychotherapy, or short-term psychoanalytical psychotherapy) over (2) a structured, non-evidence-based treatment with the most similar treatment intensity over (3) a structured, non-evidence-based treatment with the most similar treatment mode (outpatient or inpatient therapy, individual or group therapy) over (4) treatment as usual (TAU) or another non-structured treatment. We are aware of the fact that the approach of selecting the strongest comparator wastes information because data of the excluded comparators are not considered. However, by using this approach we tried to follow the APA recommendations by Tolin et al. (2015) to compare a psychotherapeutic intervention to the strongest comparator possible. Since effect sizes may be overestimated if only patients who completed the treatment are included in the analysis, we collected intention-to-treat (ITT) data, in contrast to completers’ data, where available (Hollis & Campbell, 1999). The concept of ITT analyses is that all randomized patients must be included in the analyses, no matter what happens after randomization (e.g., drop-out, detection of false inclusion). Exclusion must be very well justified. Common strategies for ITT analyses are carrying forward the last observed response or conservatively setting the effects for drop-out patients to zero. We followed Leichsenring and Rabung (2011) by choosing the latter strategy for studies which did not present ITT data. They adjusted a reported completers’ sample by multiplying a pre-post treatment difference of 0.5, for example, by the ratio of patients who completed the study to all included patients. With a completers’ sample of 80 patients and 20 patients who dropped out of the study, the adjusted pre-post difference for the ITT analysis would be $0.5 \times \frac{80}{100} = 0.4$. Furthermore, we assessed effect sizes separately for post-treatment and the longest available follow-up because there is some empirical evidence that, after psychoanalytic psychotherapy was finished, psychotherapeutic gains may not only be maintained, but continue to improve (Town et al., 2012).
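The post hoc ITT adjustment described above is a one-line calculation. The following sketch uses a hypothetical function name and reproduces the worked example from the text (80 completers, 20 drop-outs, completers' pre-post difference of 0.5).

```python
def itt_adjusted_change(completer_change, n_completers, n_dropouts):
    """Scale a completers' pre-post difference to all randomized patients.

    Drop-outs are assumed to contribute a change of zero, so the completers'
    change is multiplied by the share of randomized patients who completed.
    """
    n_randomized = n_completers + n_dropouts
    if n_completers < 0 or n_dropouts < 0 or n_randomized == 0:
        raise ValueError("patient counts must be non-negative and not all zero")
    return completer_change * n_completers / n_randomized


print(itt_adjusted_change(0.5, n_completers=80, n_dropouts=20))  # 0.4
```

Because drop-outs are assigned no change at all, the adjustment can only shrink the pre-post difference, which is why the text calls it conservative.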

Statistical Analysis

We quantified between-group effect sizes by calculating standardized group mean differences bias-corrected for small sample sizes (i.e., Hedges’ g; Hedges & Olkin, 1985). To calculate Hedges’ g and its associated 95% confidence interval (CI), we collected means, sample standard deviations, and sample sizes for each outcome in each study. Our specific calculation was to subtract the mean pre-treatment to post-treatment (or follow-up) difference of the control condition from the respective difference of LTPP. This difference is divided by the pooled pre-treatment standard deviation and multiplied by a coefficient correcting for small sample size (Morris, 2008). We chose this approach in contrast to the standard Hedges’ g calculation (see Table 2 for the different calculations) in order to show the difference in pre-post changes of the compared groups which, thus, comprises more information than the standard calculation. Pre-post correlations are needed for the exact computation of the variance for this outcome measure, but were unavailable for virtually all studies. Only one study (Levy et al., 2006) reported pre-post correlations for four outcome measures (rs = .22, .56, .72, .82). Given this range of pre-post correlations, we generally plugged in a default correlation of .5.

Table 2. Overview of the three different outcome metrics: Standard Hedges’ g, pre-post-control Hedges’ g, and complemented pre-post-control Hedges’ g

Metric | Research question | Label | Formula
Standard Hedges’ g^a | RQ 1: Do participants of the long-term group have better outcomes at the end of the therapy than participants of the control group (ignoring potential group differences at the beginning of the therapy)? | Standard g | (1) $c_a \, \frac{M_{post,C} - M_{post,T}}{SD_{post}}$
Pre-post-control Hedges’ g | RQ 2: Do participants of the long-term group have a stronger increase of positive outcomes from the beginning of the therapy to the end, compared to the increase of participants in the control group (i.e., taking potential group differences at the beginning of the therapy into account)? | ppc g | (2) $c_b \, \frac{(M_{pre,T} - M_{post,T}) - (M_{pre,C} - M_{post,C})}{SD_{pre}}$
Complemented pre-post-control Hedges’ g^b | Same as RQ 2, but complementing 46 outcomes without pre-measurements (out of 191 outcomes), which are missing in RQ 2, by their post group comparison as in RQ 1 (primarily provided for a comparison with Leichsenring et al., 2013) | compl. ppc g | Formula (2); formula (1) if pretest outcomes were not presented

Notes. $c_a$, $c_b$ = coefficients correcting for small sample size; M = means of the pre-treatment (pre, T) and pre-control (pre, C) as well as post-treatment (post, T) and post-control (post, C) outcomes; $SD_{post}$ = the pooled post-treatment standard deviation; $SD_{pre}$ = the pooled pre-treatment standard deviation. ^a Metric applied by Smit et al. (2012). ^b Our main outcome measure, also applied by Leichsenring et al. (2013).

If means and standard deviations are not reported in a primary study, but the t-test or F-test associated with the difference in post means is presented, it is also possible to calculate Hedges’ g from these metrics (Cooper, 2010). In the following, we will refer to our approach of calculating Hedges’ g as pre-post-control (ppc) g complemented by calculations from other metrics as well as standard Hedges’ g calculations if pre-means were not presented (i.e., compl. ppc-g). We complemented the ppc-g in order not to narrow the data basis, since we expected a small set of studies. Notably, with an increasing database, future meta-analysts should not combine these metrics since they test slightly different hypotheses. To provide a robustness check for our approach, we still reported results from standard Hedges’ g calculations and ppc-g with no complementation. We decided to provide these different calculations because the previous meta-analyses by Smit et al. (2012) and Leichsenring et al. (2013) differed in their calculations and we wanted to compare our results to both meta-analyses. Smit et al. (2012) used the standard Hedges’ g, whereas Leichsenring et al. (2013) used the compl. ppc-g. In our case, a positive effect size indicated improvement. Signs were inverted if necessary. We asked the original authors for necessary data if not reported. If only the overall sample size was reported, we assumed sample sizes to be equal across groups (in case of an odd total sample size, we placed the remainder in the LTPP group). The effect sizes were aggregated using a random-effects model because we expected some dispersion or heterogeneity in observed effects due to different types of

disorders and a variety of control treatments (Borenstein et al., 2009). Meta-analytical heterogeneity was assessed by using the χ²-distributed Q statistic, the I² statistic, and the parameter τ² (Borenstein et al., 2009). To directly communicate the amount of heterogeneity in the underlying true effects, we present the 95% prediction intervals (PI) for each outcome measure (IntHout, Ioannidis, Rovers, & Goeman, 2016). Prediction intervals show the expected range of true effects in similar studies. These intervals are broader than the 95% CIs as they additionally take heterogeneity into account. We explored heterogeneity by considering the session ratio as a possible moderating variable. The relation between session ratio and effect size was examined by meta-regression. We log-transformed the ratios, so that they are symmetric around 1.

To assess and adjust for publication bias, we followed Carter, Kofler, Forster, and McCullough (2015) by conducting the Egger test (Egger, Smith, Schneider, & Minder, 1997), the Precision Effect Test (PET), and the Precision Effect Estimation with Standard Error (PEESE; Stanley & Doucouliagos, 2014). Additionally, we applied the p-uniform approach (van Assen, van Aert, & Wicherts, 2015), as well as a three-parameter selection model (3PSM; McShane et al., 2016). First, we examined funnel plot asymmetry by conducting the Egger test as well as p-uniform’s test for publication bias, and by determining the selection parameter of 3PSM. In the presence of publication bias, a funnel plot (i.e., a plot of trials’ effect sizes against their precision) is skewed and asymmetrical with more studies on the right side of an inverted funnel. To quantify the relation of reported effect sizes to their standard errors, Egger et al. (1997) used a linear approximation. By applying a weighted least squares (WLS) regression model in which trials’ effect sizes are regressed on the respective standard error, weighted by the inverse of the variances, publication bias may be quantified as the slope coefficient α of this model (Stanley & Doucouliagos, 2014). Testing H0: α = 0 provided us with the information whether publication bias was present or not. The significance level was set at .10 as recommended by Egger et al. (1997). Additionally, testing H0: γ = 0, with γ as the intercept in this WLS regression model, has been proposed as a test for a real empirical effect beyond publication bias. Stanley and Doucouliagos (2014) proposed that the coefficient γ of this model may serve as an estimate of the effect size adjusted for publication bias, which is called PET. PET provides an attempt to correct for publication bias, but according to recent simulation studies not a very accurate one (Carter, Schönbrodt, Hilgard, & Gervais, 2017). PEESE works with the same method, but uses variances instead of standard errors as the predictor in the WLS regression model. Stanley and Doucouliagos (2014) argued for applying PET and PEESE regardless of the statistical significance of the Egger test because the Egger test notoriously has low power. Furthermore, their simulation studies showed that in the case of a genuine effect, PET tends to underestimate it, meaning that it overcorrects for the influence of publication bias. In case of no underlying effect, PEESE tends to overestimate the effect, meaning that it undercorrects for publication bias. Therefore, Stanley and Doucouliagos (2014) recommended using the estimate of the effect size given by PEESE if PET is statistically significant, and using the estimate given by PET if PET is not statistically significant. We report both estimates to provide a thorough documentation.

Additionally, we applied the p-uniform approach by van Assen et al. (2015) to test and correct for publication bias. The distribution of statistically significant p-values (p < .05) across a set of studies is called p-curve.
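The PET and PEESE regressions described above can be sketched with a closed-form weighted least squares fit. This is an illustrative sketch in plain Python, not the authors' R code: the function names and the example data are hypothetical, and the significance tests on α and γ (which decide between the PET and PEESE estimates) are deliberately left out.

```python
import math


def wls_line(y, x, weights):
    """Weighted least squares fit of y = intercept + slope * x."""
    if not (len(y) == len(x) == len(weights)) or len(y) < 2:
        raise ValueError("need at least two studies with matching inputs")
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    if sxx == 0:
        raise ValueError("predictor has no variation across studies")
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def pet(effects, variances):
    """PET: regress effects on standard errors, weights 1 / variance.

    Returns (gamma, alpha): gamma is the bias-adjusted effect estimate
    (intercept), alpha the small-study slope examined by the Egger test.
    """
    standard_errors = [math.sqrt(v) for v in variances]
    return wls_line(effects, standard_errors, [1 / v for v in variances])


def peese(effects, variances):
    """PEESE: same model, but with the variances as the predictor."""
    return wls_line(effects, variances, [1 / v for v in variances])


# Hypothetical effect sizes (Hedges' g) and sampling variances of four trials.
g = [0.30, 0.45, 0.50, 0.70]
v = [0.01, 0.04, 0.09, 0.16]
print("PET   (gamma, alpha):", pet(g, v))
print("PEESE (gamma, alpha):", peese(g, v))
```

In this made-up data set, less precise trials report larger effects, so both intercepts fall below the smallest observed effect size.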
The shape of the distribution is uniform if there is no effect, and right-skewed if there is one. For a given set of statistically significant p values, it is, thus, possible to diagnose the presence of publication bias and to compute an effect size estimate that corrects for publication bias. The p-curve approach by Simonsohn, Nelson, and Simmons (2014) is based on the same basic idea, but simply differs in implementation. Since both approaches basically yield the same results, we decided to only apply p-uniform because this method also provides confidence intervals.

Our final approach to assess and adjust for publication bias consisted of two selection methods (Hedges, 1984; McShane et al., 2016). Selection methods have two components: A data model and a selection model. The data model characterizes how the data are generated, whereas the selection model characterizes the publication process. Selection models can adapt to different biases in the publication process, such as (a) only studies with statistically significant results are published, (b) only studies with statistically significant and directionally consistent results are published, or (c) studies with non-significant results (or results which are directionally inconsistent) are less likely to be published than studies with significant and directionally consistent results. One-parameter approaches, such as described in Hedges (1984), p-curve, and p-uniform, assume that (a) only studies with statistically significant results are published and (b) effect sizes do not vary across studies (i.e., they are homogeneous). Since both assumptions are almost always false in behavioral research, McShane et al. (2016) showed in a large simulation study that a three-parameter selection model (3PSM) provides better estimates and tests. Such a selection model estimates (a) the underlying effect size adjusted for publication bias, (b) the degree of heterogeneity adjusted for publication bias, and (c) the degree of publication bias, which is formalized as a weight parameter that provides the probability that a study with statistically non-significant results is published relative to a study with statistically significant results. Notably, our averaging of multiple outcomes may violate the assumptions of p-uniform and selection modeling since these methods assume that outcomes are published on the basis of their individual p-values, rather than the composite p-value of averaged outcomes. In case methods disagreed about the presence and/or magnitude of a bias-corrected meta-analytical effect, we pre-registered to give 3PSM the largest weight, as this method, among other reasons, had consistently a better performance in many conditions than other bias-correcting methods (Carter et al., 2017; McShane et al., 2016). Furthermore, we screened for outliers in an exploratory manner and conducted sensitivity analyses where necessary.
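To illustrate how the three parameters of such a selection model enter the likelihood, the sketch below evaluates the log-likelihood of one common parameterization: a normal random-effects data model combined with a single-step weight function at a one-sided p-value of .025. All names are hypothetical, the parameterization is an assumption rather than a transcription of the software the authors used, and the numerical maximization over (mu, tau2, omega) that produces the actual estimates is omitted.

```python
import math

Z_CRIT = 1.959963984540054  # z value for a one-sided p-value of .025


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def log_lik_3psm(effects, variances, mu, tau2, omega):
    """Log-likelihood of a one-step selection model (illustrative sketch).

    mu    = mean true effect (adjusted for publication bias)
    tau2  = between-study variance (heterogeneity)
    omega = publication probability of a non-significant study relative to a
            significant, directionally consistent one (must be > 0)
    """
    if omega <= 0 or tau2 < 0:
        raise ValueError("omega must be positive and tau2 non-negative")
    total = 0.0
    for y, v in zip(effects, variances):
        s = math.sqrt(v + tau2)               # marginal SD of the observed effect
        cutoff = Z_CRIT * math.sqrt(v)        # smallest effect reaching p < .025
        weight = 1.0 if y > cutoff else omega
        log_density = -0.5 * math.log(2 * math.pi * s * s) - (y - mu) ** 2 / (2 * s * s)
        p_nonsig = norm_cdf((cutoff - mu) / s)
        normalizer = (1.0 - p_nonsig) + omega * p_nonsig
        total += math.log(weight) + log_density - math.log(normalizer)
    return total


# Hypothetical data: with omega = 1 the model reduces to the ordinary
# random-effects likelihood; omega < 1 encodes suppression of non-significant results.
g = [0.50, 0.10, 0.45]
v = [0.04, 0.04, 0.02]
print(log_lik_3psm(g, v, mu=0.3, tau2=0.01, omega=1.0))
print(log_lik_3psm(g, v, mu=0.3, tau2=0.01, omega=0.3))
```

A four-parameter variant would add a second cutpoint with its own weight; the step function above would then take three values instead of two.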
Sensitivity analyses and all bias-correcting methods were only conducted for our primary outcome, meaning the post-treatment compl. ppc-g uncorrected for ITT. The data basis for the follow-up assessment was too narrow and the post-hoc correction for ITT, where ITT data were not reported in the primary studies, can only be considered as a very rough estimation. Fortunately, most of the primary studies reported ITT data which made a post-hoc correction unnecessary for these studies. Post hoc, we decided to include a four-parameter selection model (4PSM) for further exploratory analyses. The fourth parameter, in other words, the second selection parameter, provides the likelihood that a study with a negative effect size is published. We used p-value bin thresholds at a one-sided p-value of .025 and .50. Our statistical analyses were conducted using R (version 3.4.0; R Development Core Team, 2008). Data were aggregated using the metafor package (Viechtbauer, 2010). For the assessment of publication bias, we additionally used the puniform package (van Aert, 2017; van Aert, Wicherts,



Table 3. Comparing Leichsenring et al.’s (2013) parameter estimates of the comparison of long-term psychoanalytic psychotherapy (LTPP) versus shorter forms of psychotherapy to our replicated estimates

Domain                    Source                        95% CI
Psychiatric symptoms      Leichsenring et al. (2013)    [0.20, 0.67]
                          Replication                   [0.09, 0.47]
Target problems           Leichsenring et al. (2013)    [0.06, 0.72]
                          Replication                   [0.08, 0.54]
Social functioning        Leichsenring et al. (2013)    [0.23, 0.97]
                          Replication                   [0.14, 0.72]
Personality functioning   Leichsenring et al. (2013)    [−0.18, 1.00]
                          Replication                   [0.18, 0.71]
Overall effectiveness     Leichsenring et al. (2013)    [0.05, 0.74]
                          Replication                   [0.16, 0.53]

Q 13.60

Notes. All data were calculated using a random-effects model. k = number of comparisons; g = estimate of the average underlying effect (pre-post-control Hedges’ g complemented by calculations from t-values and other metrics as well as by standard Hedges’ g calculations); CI = the upper and lower limits of the 95% confidence interval; Q = Q statistic for statistical heterogeneity. †p < .1; *p < .05; **p < .01; ***p < .001.

& van Assen, 2016) and the weightr package (Coburn & Vevea, 2017).
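The step-function selection weights described above for 3PSM and 4PSM can be illustrated with a minimal Python sketch. It is not the weightr implementation: the function names, the normal approximation for the one-sided p-value, and the numeric weights are assumptions chosen for illustration only. Only the bin thresholds (.025 and .50) come from the text.

```python
from statistics import NormalDist


def one_sided_p(g, se):
    """One-sided p-value for a positive effect, using a normal approximation."""
    return 1.0 - NormalDist().cdf(g / se)


def publication_weight(p_one_sided, thresholds, weights):
    """Relative publication probability under a step-function selection model.

    thresholds: ascending one-sided p-value cut points, e.g. (0.025,) or (0.025, 0.50)
    weights:    one weight per p-value bin (len(thresholds) + 1 entries);
                the first bin is the reference bin with weight 1.0.
    """
    if len(weights) != len(thresholds) + 1:
        raise ValueError("need exactly one weight per p-value bin")
    for cut, weight in zip(thresholds, weights):
        if p_one_sided < cut:
            return weight
    return weights[-1]


# Hypothetical weights, chosen only for illustration (not estimates from the paper).
THREE_PSM = ((0.025,), (1.0, 0.5))            # one selection parameter
FOUR_PSM = ((0.025, 0.50), (1.0, 0.5, 0.2))   # second parameter for negative effects

for g, se in [(0.28, 0.10), (0.10, 0.10), (-0.10, 0.10)]:
    p = one_sided_p(g, se)
    print(g, round(p, 3),
          publication_weight(p, *THREE_PSM),
          publication_weight(p, *FOUR_PSM))
```

In the 4PSM setting, a study with a negative effect size has a one-sided p-value above .50 and therefore falls into the third bin, which is what the second selection parameter captures.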

Changes to the Pre-Registration

We did not deviate from our pre-registration in any major way. We present all minor changes in the Electronic Supplementary Material (ESM 1).

Results

Replication of Leichsenring et al. (2013)

For the replication of Leichsenring et al. (2013), we included the same set of 12 studies they had included (see Table 4, but excluding the more recent studies (7), (11), and (13)). For the replication, we also included the study by Dare, Eisler, Russell, Treasure, and Dodge (2001), which did not fulfill our duration and dosage criteria for our updated meta-analysis. Table 3 summarizes the comparison between the outcomes of Leichsenring et al. (2013) and our replication for the five outcome domains. Our classification of outcome domains differed for the study by Giesen-Bloo et al. (2006). We could not clarify with Falk Leichsenring whether he and his team had included data from this article or from a second article of this study (van Asselt et al., 2008). We decided to include data from the latter article because it reported means instead of medians and more treatments were finished than at the time of the first article. In accordance with the definition of the outcome domains, both authors and an independent rater clearly agreed to assign the outcome measures of this study to the domain psychiatric symptoms (the Borderline Personality Disorder Severity Index, BPDSI) and to the domain social functioning (the utility scores),

whereas Leichsenring et al. (2013) assigned both outcome measures to the domain personality functioning. The utility scores capture dimensions such as mobility, self-care, and daily activities, which is why we clearly assigned them to social functioning. The assignment of the BPDSI is debatable. The BPDSI is an indicator of borderline symptoms and represents the DSM-IV borderline personality disorder criteria, which can be considered to reflect personality functioning because they describe a personality disorder. However, we wanted to distinguish between measures which focus more on evaluating the diagnostic criteria, symptoms, or severity of a disorder, which we assigned to the category of psychiatric symptoms, and measures which focus more on personality structure and structural representations, such as the Reflective Function Scale (Fonagy et al., 1998), which we assigned to the category of personality functioning. It was difficult to draw the line for measures of borderline personality disorder, because the criteria of personality disorder reflect personality functioning. Compared to the Reflective Function Scale, we (and an independent second coder) still considered the BPDSI to be rather a measure of psychiatric symptoms. The discrepancy between Leichsenring et al.’s (2013) and our assignment explains the differing number of comparisons for the outcome domains psychiatric symptoms, social functioning, and personality functioning. Consistent with the meta-analysis by Leichsenring et al. (2013), we found LTPP to be significantly superior to comparators in all outcome domains. However, our effect sizes were smaller by 0.15, 0.08, 0.17, and 0.05 for the outcome domains psychiatric symptoms, target problems, social functioning, and overall effectiveness, respectively. For personality functioning, we obtained a slightly higher effect size (by 0.04). Our 95% CIs were narrower for all outcome domains.
The Q statistic provided only one statistically significant result for the outcome domain social functioning



Table 4. Main characteristics of included studies of long-term psychoanalytic psychotherapy

ID    Study                                     Disorder                         LTPP intervention                                                 Control intervention                    n LTPP     n control  SR
(1)   Bachar et al. (1999)                      Bulimia and anorexia             Self psychological treatment                                      Cognitive orientation treatment         17 (3)     17 (5)     1.0
(2)   Bateman and Fonagy (1999)                 BPD                              Mentalization-based therapy                                       General psychiatric outpatient care     22 (3)     22 (3)     1.8
(3)   Bateman and Fonagy (2009)                                                  Mentalization-based therapy                                       Structured clinical management          71 (19)    63 (16)    NA
(4)   Bressi et al. (2010)                      Anxiety or depressive disorder   Psychodynamic psychotherapy                                       Drug treatment and clinical interviews  30 (6)     30 (6)     1.6
(5)   Clarkin et al. (2007)                     BPD                              Transference-focused psychotherapy; dynamic supportive treatment  Dialectical behavior therapy            61 (16)    29 (12)    1.0
(6)   Doering et al. (2010)                                                      Transference-focused psychotherapy                                Experienced community psychotherapy     52 (20)    52 (35)    2.6
(7)   Fonagy et al. (2015)                      Depressive disorder              Psychodynamic psychotherapy                                       Treatment as usual                      67 (16)    62 (16)    3.7
(8)   Giesen-Bloo et al. (2006)                 BPD                              Transference-focused psychotherapy                                Schema-focused therapy                  42 (21)    44 (11)    1.2
(9)   Gregory et al. (2008)                     BPD and alcohol use disorder     Dynamic deconstructive psychotherapy                              Treatment as usual                      15 (5)     15 (6)     0.6
(10)  Huber, Zimmermann, et al. (2012)          Depressive disorder              Psychodynamic psychotherapy                                       Cognitive behavior therapy              35 (5)     41 (10)    2.0
(11)  Jørgensen et al. (2013)                                                    Mentalization-based therapy                                       Supportive group treatment              58 (16)    27 (6)     4.0
(12)  Knekt, Lindfors, Härkänen, et al. (2008)  Anxiety and depressive disorder  Psychodynamic psychotherapy                                       Short-term psychodynamic psychotherapy  128 (47)   101 (13)   5.0
(13)  Poulsen et al. (2014)                                                      Psychoanalytic psychotherapy                                      Cognitive behavior therapy              34 (10)    36 (8)     3.3
(14)  Svartberg et al. (2004)                   Cluster C personality disorder   Psychodynamic psychotherapy                                       Cognitive therapy                       26 (1)     25 (0)     1.0

Notes. Only the main article of a study is listed here. LTPP = long-term psychoanalytic psychotherapy; n = sample size (drop-outs); SR = session ratio of the mean number of sessions of LTPP versus the mean number of sessions of the control group; BPD = borderline personality disorder; NA = not available.

in Leichsenring et al.’s (2013) meta-analysis. In our replication, heterogeneity was statistically significant for all outcome domains except personality functioning. Falk Leichsenring supplied us with their aggregated effect size data of the single studies for the single domains. The 51 single comparisons across all studies and all outcome domains are shown in Table 1 in ESM 1. Of course, this comparative table cannot tell which of the extracted effect sizes are more appropriate, but it highlights the difficulties of producing reproducible meta-analyses (Lakens et al., 2017). To sum up, we replicated the direction and general tendency of results, but our replicated effect sizes were in general slightly smaller and showed higher heterogeneity. The effect size for personality functioning was slightly higher.

Updated Meta-Analysis

Study and Outcome Selection

For our updated meta-analysis, we screened a total of 9,170 records. We excluded 9,144 of them. Figure 2 illustrates our

search, screening, and selection process. Studies were mainly excluded because the intervention did not meet our definition of LTPP or the trial was not randomized and controlled. Worth mentioning are the studies by Linehan et al. (2006) and McMain et al. (2009) because they were included in the meta-analysis by Smit et al. (2012). We agree with Leichsenring et al. (2013) that no LTPP group was examined in these studies. Three studies were excluded because they were not randomized and/or not clearly controlled (Klar, 2005; Korner, Gerull, Meares, & Stevenson, 2006; Puschner, Kraft, Kächele, & Kordy, 2007). We did not include two studies because they did not meet the dosage or the duration criteria for our definition of LTPP (Dare, Eisler, Russell, Treasure, & Dodge, 2001; Zipfel et al., 2014). We found 14 studies described in 26 articles meeting our inclusion criteria. We did not receive any unpublished or additional data from researchers in the field. The 14 main articles are: Bachar, Latzer, Kreitler, and Berry (1999), Bateman and Fonagy (1999, 2009), Bressi, Porcellana, Marinaccio, Nocito, and Magri (2010), Clarkin, Levy, Lenzenweger, and Kernberg (2007), Doering et al.



Figure 2. Flowchart of study search, screening and selection. Figure available at https://, under a CC-BY 4.0 license.

(2010), Fonagy et al. (2015), Giesen-Bloo et al. (2006), Gregory et al. (2008), Huber, Zimmermann, et al. (2012), Jørgensen et al. (2013), Knekt, Lindfors, Härkänen, et al. (2008), Poulsen et al. (2014), and Svartberg, Stiles, and Seltzer (2004). Additional outcome measures or follow-up data were reported in the following 12 articles: Bateman and Fonagy (2008), Gregory, DeLucia-Deranja, and Mogle (2010), Huber, Henrich, Clarkin, and Klug (2013), Huber, Henrich, Gastner, and Klug (2012), Jørgensen et al. (2014), Knekt et al. (2015), Knekt, Lindfors, Laaksonen, et al. (2008), Knekt et al. (2016), Levy et al. (2006), Lindfors, Knekt, Heinonen, Härkänen, and Virtala (2015), Lindfors, Knekt, Virtala, and Laaksonen (2012), and van Asselt et al. (2008). Doering et al. (2010) and Giesen-Bloo et al. (2006) reported on ongoing treatments, which is why we ran a sensitivity analysis without them. The study by Clarkin et al. (2007) did not report means and standard deviations. We contacted the authors twice, but did not receive a reply. Therefore, we calculated effect sizes based on the reported effect sizes r of their regression analyses. Because this calculation is an approximation, we regarded it as another risk of bias in our assessment of the risk of bias within studies. Furthermore, we discussed the inclusion of Jørgensen et al. (2013) because the intervention group and the control group were partly treated by the same therapists, which raises the question whether the difference between the two treatment arms was too narrow. Since the two treatment arms were well-described as applying combined mentalization-based therapy (one individual and one group session per week) in the treatment group and supportive group treatment (one group session biweekly) in the comparison group, we regarded the difference as sufficient and decided to include the study. It is noteworthy that the treatment group was confounded by a group session which is part of the mentalization-based treatment program. The same applied to the studies by Bateman and Fonagy (1999, 2009). With a hopefully increasing data basis, future meta-analysts should examine the effects of mentalization-based therapy in a subgroup analysis. Some studies reported more than one intervention or control group. For the study by Clarkin et al. (2007), we combined the transference-focused psychotherapy group with the dynamic supportive treatment group into one intervention group because both treatments fulfilled our criteria of LTPP. Huber, Zimmermann, et al. (2012) examined three groups: patients receiving (1) CBT, (2) LTPP, or (3) psychoanalysis proper. Leichsenring et al. (2013) combined the LTPP and the psychoanalysis proper groups, whereas we considered it to be more accurate according to our predefined inclusion criteria to exclude the psychoanalysis proper group and to solely examine the comparison of CBT versus LTPP. We chose the cognitive orientation treatment plus nutritional counselling group over the solely nutritional counselling group as the control condition of the study by Bachar et al. (1999), and the STPP group over



Figure 3. Summary assessment of the risk of bias within studies by applying the Cochrane Risk of Bias Tool (Higgins et al., 2011). Figure available at https://, under a CC-BY 4.0 license.

the solution-focused therapy group of Knekt, Lindfors, Härkänen, et al. (2008).

Study Characteristics

Table 4 presents the main characteristics of the 14 included studies. In total, the 14 included studies encompassed 658 patients who received LTPP and 564 patients who were treated with comparative treatments. The mean number of sessions across all studies was 81.8 (SD = 66.1) for the LTPP condition and 49.0 (SD = 45.6) for the treatments in the control groups, implying that we compared LTPP primarily to other forms of long-term psychotherapy. The medians were 58 and 37.5, respectively. The overall session ratio was 1.67, implying that patients treated with LTPP received 1.67 times as many sessions as patients treated with the control treatments. Seven studies reported follow-up data, some in different articles (Bateman & Fonagy, 2008; Fonagy et al., 2015; Gregory et al., 2010; Huber, Zimmermann, et al., 2012; Jørgensen et al., 2014; Knekt et al., 2016; Svartberg et al., 2004). Since many treatments were ongoing in the study by Giesen-Bloo et al. (2006), we extracted data from the paper by van Asselt et al. (2008) and considered them as post-treatment data. More treatments were finished in this second paper, but not all of them.

Assessment of the Risk of Bias in Individual Trials

Figure 3 summarizes our findings for each study across the items of the Cochrane Risk of Bias Tool. We detected a high risk of bias for five studies because these studies presented incomplete outcome data and/or did not blind the outcome assessors. The rest of the studies carried an unclear risk of bias, mainly because they did not report whether the intended blinding of outcome assessment was effective or not, a criterion which is required by the tool. According to our pre-defined summary assessment across trials (see Table 1), we concluded that our set of studies carried an unclear risk of bias, suggesting that the results of the following meta-analysis should be interpreted cautiously.
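The sample totals and the overall session ratio reported under Study Characteristics can be re-derived from the figures in Table 4 and the reported mean session counts. The sketch below is only an arithmetic check. It assumes that the overall session ratio is the ratio of the two reported means, which the text does not state explicitly, and the variable names are illustrative.

```python
# (n LTPP, n control) per study, in the order of Table 4, IDs (1) to (14).
sample_sizes = [
    (17, 17), (22, 22), (71, 63), (30, 30), (61, 29), (52, 52), (67, 62),
    (42, 44), (15, 15), (35, 41), (58, 27), (128, 101), (34, 36), (26, 25),
]

n_ltpp = sum(ltpp for ltpp, _ in sample_sizes)
n_control = sum(control for _, control in sample_sizes)

# Mean number of sessions as reported in the text.
mean_sessions_ltpp = 81.8
mean_sessions_control = 49.0

# Assumption: overall session ratio = ratio of the two means.
overall_session_ratio = round(mean_sessions_ltpp / mean_sessions_control, 2)

print(n_ltpp, n_control, overall_session_ratio)
```

Under this assumption the totals and the ratio agree with the values reported in the text (658, 564, and 1.67).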
The assessment of the six quality criteria by Cuijpers et al. (2010) revealed the following results across the 14

studies for the single items: In 13 studies, diagnostic systems were used to diagnose patients; in 10 studies, treatment manuals were used, which may be regarded as a very positive result given the difficulties with manuals for long-term and especially psychoanalytic treatments; in 13 studies, treatment integrity was checked; in 13 studies, therapists were trained for the intervention under study; in 11 studies, ITT analysis was included; and in 6 studies, an adequate statistical power and a sample size larger than 50 were given. In addition, we identified only five studies which measured and reported clinically significant change as required by Tolin et al. (2015). Considering these quality criteria, we drew the conclusion that a high proportion of studies fulfilled five out of seven criteria, implying that our primary studies abided by the greater part of the required standards. The assessment of the criteria regarding statistical power and clinically significant change, however, revealed poor results.

Reliability of Our Coding

Cohen’s κ was 0.97 for the assignment to the mutually exclusive outcome domains, and 1.00 for the additional assignment to the domain target problems (52% of the outcome measures were additionally included in the domain target problems). The (almost) perfect agreement between the two raters suggests a very thorough definition of our outcome measures. The reliability of our data extraction was substantial. Cohen’s κ was 0.93 for the general data, 0.70 for the outcome data, and 0.73 for the additional metrics. All inconsistencies were resolved by discussion prior to data analysis.

Different Calculations of Hedges’ g and Follow-Up Data

The summary effects of the post data revealed very similar values for the standard g and our primary outcome, the compl. ppc-g, suggesting that our robustness check of the compl. ppc-g as our primary outcome measure was positive (see ESM 1 for all values).
The pure ppc-g calculations yielded to some extent higher results for all outcome



Table 5. Comparing long-term psychoanalytic psychotherapy (LTPP) with other forms of psychotherapy: Parameter estimates for the random-effects models of the main outcome

Domain                     95% CI           Q         95% PI
Psychiatric Symptoms       [0.06, 0.42]     26.71*    [−0.30, 0.78]
Target Problems            [0.02, 0.48]               [−0.57, 1.06]
Social Functioning         [0.11, 0.59]               [−0.50, 1.20]
Personality Functioning    [−0.03, 0.51]              [−0.58, 1.07]
Overall Effectiveness      [0.09, 0.47]               [−0.33, 0.89]

Notes. k = number of comparisons; g = estimate of the average underlying effect (pre-post-control Hedges’ g complemented by calculations from t-values and other metrics as well as by standard Hedges’ g calculations); SE = standard error of the estimate of the average underlying effect; CI = the upper and lower limits of the 95% confidence interval; Q = Q statistic for statistical heterogeneity; τ² = estimate of the between-study variance; I² = percentage of the observed variance which is due to real differences in effect sizes; PI = the upper and lower limits of the 95% prediction interval. †p < .1; *p < .05; **p < .01; ***p < .001.

domains, implying that our calculation constitutes a more conservative effect size estimation. The differences between ITT-corrected values and the respective uncorrected values were negligible, except for the moderation model of personality functioning. The follow-up data yielded to some extent higher effect sizes than the post data for most of the three different calculations (see ESM 1). These data should be interpreted cautiously because they are only based on seven studies. In the following, we will, therefore, only report on further results of our primary outcome, meaning compl. ppc-g uncorrected for ITT and based on the post-treatment data. We chose the outcomes uncorrected for ITT because the difference to the post-hoc corrected values was negligible and because a post-hoc correction can only be considered as a very rough estimation (see Method section).

Main Outcomes

Table 5 presents the parameter estimates for the random-effects models of the main outcome (posttest data of compl. ppc-g). For the outcome domains psychiatric symptoms, target problems, and overall effectiveness, all 14 included studies offered data. The number of comparisons was smaller by one for social functioning because Bachar et al. (1999) did not include outcome measures concerning social functioning. Four studies did not include outcome measures of personality functioning (Bateman & Fonagy, 2009; Bressi et al., 2010; Fonagy et al., 2015; van Asselt et al., 2008). The meta-analytic effect sizes were statistically significant for all outcome domains (ps < .05), except the domain personality functioning (p = .08). According to Cohen’s (1988) benchmarks (0.2 < d < 0.5 for small effects, 0.5 < d < 0.8 for medium effects, d ≥ 0.8 for large effects), the sizes of the effects are regarded as small.
The lower limits of the 95% CIs lay in the range of a negligible effect for all outcome domains, whereas the upper limits lay in the range of a medium effect size for the domains social functioning as well as personality functioning and were higher than

0.40 for the other domains. The Q statistic suggests effect size heterogeneity in all outcome domains. Concerning the I² descriptor (Higgins & Green, 2008), the outcome domains psychiatric symptoms and overall effectiveness fell within the range of moderate heterogeneity (i.e., 30% < I² < 60%) and all outcomes fell within the range of substantial heterogeneity (i.e., 50% < I² < 90%). The estimates of psychiatric symptoms and overall effectiveness lay in the overlap of the two ranges, suggesting moderate to substantial heterogeneity. The moderate to substantial heterogeneity was also expressed by the wide 95% PIs, which included medium to large effects in favor of the LTPP group, but also small to medium effects in the reverse direction (i.e., in favor of the control group) for all outcome measures. The effect sizes of the single studies with their 95% CI are presented in Figure 4 for the outcome domain overall effectiveness. The funnel and forest plots for the other four domains are presented in ESM 1. The studies varied in size and direction of the effect for all outcome domains, with the majority of studies showing positive effect sizes. To conclude, the random-effects models yielded significant but small positive effect sizes for all outcome domains, except the domain personality functioning (p = .08). Considering the moderate to substantial heterogeneity and the range of the 95% CI and 95% PI, these results should be interpreted cautiously.

Sensitivity Analysis

We conducted a sensitivity analysis excluding two studies of ongoing treatments (Doering et al., 2010; Giesen-Bloo et al., 2006). The differences of the effect sizes to our main outcome were very small (see ESM 1 for all parameters). We concluded that despite the inclusion of ongoing treatments, our results of the main outcome are robust.
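The I² descriptor and the two overlapping interpretation ranges used for the main outcomes can be illustrated with a short Python sketch. It uses the Q-based formula I² = (Q − df) / Q with df = k − 1, which is an assumption here: the authors' software may derive I² from the estimated between-study variance instead, so the value need not match their table exactly. The example assumes that the Q value of 26.71 in Table 5 belongs to psychiatric symptoms with k = 14; function names are illustrative.

```python
def i_squared(q, k):
    """Q-based I-squared in percent, with df = k - 1; truncated at zero."""
    df = k - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def heterogeneity_labels(i2):
    """Interpretation ranges quoted in the text; the ranges overlap."""
    labels = []
    if 30 < i2 < 60:
        labels.append("moderate")
    if 50 < i2 < 90:
        labels.append("substantial")
    return labels


i2 = i_squared(26.71, 14)
print(round(i2, 1), heterogeneity_labels(i2))
```

Under these assumptions the value is roughly 51%, which falls into the overlap of the two ranges, in line with the description of psychiatric symptoms as showing moderate to substantial heterogeneity.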
However, in accordance with Leichsenring et al.’s (2013) argumentation and findings that the inclusion of ongoing treatments yields smaller effect sizes, our effect sizes of the main outcome were also slightly smaller for the





Figure 4. Funnel plot (A) and forest plot (B) for the outcome domain overall effectiveness. In the funnel plot, the standard error is presented on the inverted vertical axis and the standardized mean difference (g, in our case meaning the pre-post-control g complemented by t-values and other metrics as well as standard Hedges’ g calculations) on the horizontal axis. Each dot represents one study. The inverted funnel is centered on the random-effects meta-analysis estimate of LTPP. The forest plot shows the effect size g with its associated 95% confidence interval for each study and for the summary effect at the bottom. The boxes show the effect size and the size of the boxes represents the relative weight. Figure available at, under a CC-BY 4.0 license.

domains psychiatric symptoms, target problems, social functioning, and overall effectiveness compared to those of the sensitivity analysis. In addition, we visually screened the funnel plots of all outcome domains for outliers, which we defined as studies clearly lying outside of the inverted funnel, meaning not touching the inverted funnel. After removing the outliers (see ESM 1 for an exact description of each outcome domain), our results differed only very slightly in the size of the effect for the domains psychiatric symptoms, target problems, social functioning, and overall effectiveness. However, our marginally significant result for the domain personality functioning was not robust to our sensitivity analysis.

Moderation by Session Ratio

The outcomes of the meta-regression for our main outcome (compl. ppc-g) revealed that the session ratio was only associated with personality functioning. The regression coefficients for the intercept and the slope of the mixed-effects model were statistically significant (b0 = 0.52, p = .004; b1 = −0.39, p = .03). However, these results were not in line with our hypothesis that an increasing session ratio leads to an increasing effect size. In contrast, with a negative slope the results point in the opposite direction, meaning that the larger the dosage of LTPP compared to the control group is, the smaller the effect size is. These results should be interpreted very cautiously because the model was solely based on 10 studies. Furthermore, the p-value of the slope would not have survived a correction for multiple testing. No association between the session ratio and the effect size was found for the other outcome domains (p-values for the intercepts and slopes were > .07 and > .63, respectively).

Assessment of and Adjustment for Publication Bias

To assess publication bias, we first examined funnel plot asymmetry by conducting the Egger test as well as p-uniform’s test for publication bias and by determining the selection parameter of 3PSM and 4PSM. To illustrate the association of effect size and precision, the funnel plot for the domain overall effectiveness is presented in Figure 4. The Egger test only revealed a statistically significant result for the domain personality functioning (p = .03). Publication bias for the domains psychiatric symptoms, target problems, social functioning, and overall effectiveness was not statistically significant (p = .40, .11, .41, .28, respectively). p-Uniform’s test for publication bias (p = .63, .75, 1.00, .98, .94) as well as the likelihood-ratio test of 3PSM and 4PSM (3PSM: p = .48, .99, .31, .45, .96; 4PSM: p = .72, .65, .13, .70, 1.00; in the order: psychiatric symptoms, target problems, social functioning, personality functioning and overall effectiveness, respectively) yielded no statistically significant results for any outcome domain, suggesting that an adjustment for publication bias was not indicated. However, it is noteworthy that, due to the small number of primary studies, the bias detection tests have low power. It can be useful, though, to consider the adjusted estimate even when the bias test is nonsignificant (see ESM 1 for all bias-corrected estimates). The parameter estimates corrected for publication bias by PET and PEESE are presented in ESM 1. According to Stanley and Doucouliagos’s (2014) recommendation on the conditional PET-PEESE estimator, we should use the estimates given by PET because PET is not statistically significant for any outcome domain (ps > .37). All estimates for the effect size given by PET are smaller than ours and


even in the negative range for personality functioning. However, given the inaccurate performance of this approach shown in recent simulation studies (Carter et al., 2017), these estimations should not be taken seriously (especially with small study-level sample sizes and in light of moderate to substantial heterogeneity). In conclusion, the applied bias detection methods largely suggested the absence of publication bias. Only the Egger test was statistically significant, and solely for the domain personality functioning. This result could be explained by three studies characterized by high standard errors and large effect sizes (Bachar et al., 1999; Bateman & Fonagy, 1999; Gregory et al., 2008). In light of nonsignificant results of p-uniform, 3PSM, and 4PSM for publication bias, we focus on our main outcomes of the random-effects meta-analysis presented in Table 5. We considered them to be the currently best possible estimates, extensively scrutinized by elaborate statistical tests for publication bias. Given the small number of studies and the low power of bias detection tests, however, this should be seen as a preliminary statement, waiting for updates when more primary studies are available.
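The conditional PET-PEESE rule referred to above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: PET is taken to be the intercept of a weighted least squares regression of the effect size on its standard error, PEESE the intercept of the same regression on the squared standard error, both with inverse-variance weights. The function names and the example data are hypothetical, and the p-value of the PET intercept is passed in rather than computed.

```python
def wls_line(x, y, se):
    """Weighted least squares fit of y = b0 + b1 * x with weights 1 / se**2."""
    w = [1.0 / s ** 2 for s in se]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    intercept = (swy - slope * swx) / sw
    return intercept, slope


def pet_estimate(g, se):
    """PET: intercept of the regression of effect sizes on standard errors."""
    return wls_line(se, g, se)[0]


def peese_estimate(g, se):
    """PEESE: intercept of the regression of effect sizes on squared standard errors."""
    return wls_line([s ** 2 for s in se], g, se)[0]


def conditional_pet_peese(pet, peese, pet_p_value, alpha=0.05):
    """Use PEESE only if the PET intercept is statistically significant."""
    return peese if pet_p_value < alpha else pet


# Hypothetical data in which effect sizes grow with the standard error.
se = [0.10, 0.20, 0.30, 0.40]
g = [0.1 + 0.5 * s for s in se]

pet = pet_estimate(g, se)
peese = peese_estimate(g, se)
print(round(pet, 3), conditional_pet_peese(pet, peese, pet_p_value=0.40) == pet)
```

With a nonsignificant PET intercept, as reported for all outcome domains (ps > .37), the rule returns the PET estimate.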

Discussion

The empirical evidence for LTPP is still a controversial issue. Recent meta-analyses come to conflicting conclusions about whether LTPP is more efficacious than other forms of therapy. We aimed to reproduce the most recent meta-analysis by Leichsenring et al. (2013) and to conduct an updated meta-analysis which adds three primary studies and applies recent developments in statistical bias detection and correction.

Replication of Leichsenring et al. (2013)

We replicated the direction and general tendency of the meta-analytic results achieved by Leichsenring et al. (2013), but our replicated effect sizes were slightly smaller and showed higher heterogeneity. The largest difference was in the domain psychiatric symptoms, where we found a g of 0.28 compared to the originally reported 0.43. One reason for the different findings might be that our categorization of outcome measures slightly differed. Additionally, we assumed that Leichsenring et al. (2013) calculated standard Hedges’ g for the study by Bressi et al. (2010) instead of ppc-g as they did for the other studies. Exceptionally, the single standard Hedges’ g calculations for the five outcome domains of this primary study were substantially higher than the ppc-g calculations (see Table 1 in ESM 1), which



could partly explain the higher effect sizes found by Leichsenring et al. (2013). We thank Falk Leichsenring for sending us their interim results and permitting us to make them public (see ESM 1), but without their raw data of the primary studies, it is not possible to fully explain the difference in the size of the effects. In line with Lakens et al. (2017), we recommend that future meta-analysts should (a) clearly indicate which data were used to calculate an effect size, (b) specify all individual effect sizes and equations that are used to calculate them, (c) explain how multiple effect size estimates from the same primary study are combined, and (d) share raw data acquired from original authors or unpublished research reports. We tried to follow all these recommendations and the ones described in the Method section in order to facilitate the reproduction and update of our meta-analysis.

Updated Meta-Analysis

Summary and Discussion of the Level of Evidence

In the second part of our study, the updated meta-analysis, the results were quite similar to our replicated effect sizes, but slightly smaller. For all but one outcome domain, we obtained small, but statistically significant effect sizes (see Table 5). The small effect size for the outcome domain personality functioning could be labeled as marginally significant, but should be interpreted cautiously because it builds on a narrow data basis and was not robust to our sensitivity analysis. Consequently, the effect size of target problems may as well be regarded as marginally insignificant (p = .04). Furthermore, in all outcome domains the effect sizes were qualified by moderate to substantial heterogeneity, suggesting that not a single underlying effect was measured, but that the summary estimate for each domain was an average across measures of multiple effects. This seems plausible because we included different psychiatric disorders, different forms of LTPP, and different forms of control treatments. A possible solution to reduce heterogeneity would have been to conduct separate meta-analyses of subsets, which was not possible at this stage because the data basis was still too narrow for meaningful subgroup analyses. The amount of heterogeneity is illustrated by our prediction intervals, which range from small to medium effects in favor of the control group to medium to large effects in favor of the LTPP group for all outcome measures. However, our prediction intervals should also be interpreted cautiously because the estimate of the prediction interval is imprecise if it is based on imprecise heterogeneity estimates based on only few studies (IntHout et al., 2016). Furthermore, our assessment of the risk of bias within the primary studies suggested that our set of studies carried an unclear risk of bias, resulting from several

European Psychologist (2020), 25(1), 51–72


methodological shortcomings in the primary studies (see Figure 3). Notably, despite particular difficulties for long-term treatments, the majority of studies applied treatment manuals and checked treatment integrity. It is also noteworthy that the primary studies by Bateman and Fonagy (2009) and Fonagy et al. (2015), which applied the most advanced psychoanalytic psychotherapy and research standards, revealed a medium overall effect (0.6 and 0.5, respectively) compared to the control group. Nevertheless, in light of moderate to substantial heterogeneity and an unclear risk of bias due to methodological shortcomings in most of the primary studies, our small, statistically significant summary effects should be interpreted cautiously. To specify the scope of our findings, we conclude that we found small summary effect sizes of LTPP that (1) apply to our predefined combination of disorders called complex mental disorders, (2) are common to different forms of LTPP, and (3) result from a comparison primarily to other forms of long-term psychotherapy (mean session ratio = 1.67). More specific evidence of the efficacy of LTPP cannot be provided at this time, as the data basis is still too narrow. Our findings were robust to an extensive assessment of publication bias and different ways of calculating Hedges’ g. According to the recent APA model by Tolin et al. (2015), a clear recommendation regarding the empirical support of LTPP can thus not be given so far. A classification into the categories very strong, strong, or weak empirical support would at least require meaningful subgroup analyses of specific forms of LTPP (e.g., mentalization-based therapy or transference-focused psychotherapy) applied to specific disorders (e.g., borderline personality disorder) and compared with homogeneous control treatments (at best other specialized forms of psychotherapy).
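The summary effects discussed above are expressed as Hedges’ g. As a reminder of what this statistic is, the following minimal sketch computes the textbook post-test version: a standardized mean difference with the small-sample correction (Hedges & Olkin, 1985; Borenstein et al., 2009). It is an illustration only and not necessarily the computation used in this meta-analysis. The function name, the assumption that higher scores mean better outcomes, and the numbers in the usage line are all hypothetical.

```python
import math


def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference with small-sample correction.

    Assumes two independent groups measured at post-treatment and that
    higher scores indicate a better outcome (illustrative convention).
    """
    df = n_t + n_c - 2
    # Pooled standard deviation of the two groups
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    # Cohen's d: raw mean difference in pooled-SD units
    d = (mean_t - mean_c) / pooled_sd
    # Correction factor J shrinks d slightly in small samples
    j = 1 - 3 / (4 * df - 1)
    return j * d


# Hypothetical values, not taken from any primary study
g = hedges_g(mean_t=12.0, mean_c=10.0, sd_t=4.0, sd_c=4.0, n_t=20, n_c=20)
print(g)  # slightly below the uncorrected d of 0.5
```

The correction factor matters most for small trials, which is one reason different ways of calculating g can yield slightly different summary effects.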
Additionally, the quality of evidence needs to be rated according to the GRADE system (Atkins et al., 2004; Guyatt et al., 2008). First, we cannot adhere to the first quality criterion, as we were not able to include a wide range of studies in our analysis due to the narrow data basis. Furthermore, some studies were also characterized by major limitations. Second, our studies do not vary slightly, as required by the GRADE system, but moderately to substantially. Third, our CIs are narrower than those of Smit et al. (2012) and Leichsenring et al. (2013), but still cross the benchmark of 0.2, possibly suggesting a negligible effect. The quality and certainty of our evidence may, therefore, be considered as moderate to low. Furthermore, only five studies measured and reported clinically significant change as required by the model. In total, Tolin et al. (2015) would describe LTPP as still lacking sufficient evidence of efficacy. However, we still criticize the concept of the APA as not grasping the complexity of mental disorders since the required specificity seems unrealistic, especially in light of

C. F. J. Woll & F. D. Schönbrodt, Efficacy of LTPP

a high co-morbidity of mental disorders (Orlinsky, 2008; Seligman, 1995). Hence, on the one hand, we call for an extension of international guidelines to deal with the complexity especially of severe disorders and with the particular challenges of conducting a long-term study (e.g., treatment manuals, fewer studies possible, etc.; Benecke, Huber, Schauenburg, & Staats, 2016). On the other hand, we urge that more studies of LTPP be conducted in line with current standards of psychotherapy research, such as the LAC depression study (Beutel et al., 2016) and the APD study (Benecke, Huber, Staats, et al., 2016), which are currently being conducted. Given the aforementioned necessity of long-term treatments for patients with complex mental disorders, more funds should be provided for such studies. Regarding our effect of LTPP as a summary effect of multiple effects in a random effects model, we still consider our findings to have some practical and clinical relevance. There is no agreed-upon definition of a clinically relevant effect, though (Steinert, Munder, Rabung, Hoyer, & Leichsenring, 2017). Cuijpers, Turner, Koole, Dijke, and Smit (2014), for example, proposed a threshold of 0.24 for depressive disorders. Leichsenring et al. (2015) recommended 0.5. Since no agreement exists so far, we cautiously assume small clinically relevant effects of LTPP according to Cuijpers et al. (2014). More research is urgently necessary to corroborate these findings. Future research should examine which patients may need long-term psychotherapy and which patients may sufficiently benefit from short-term psychotherapy, irrespective of whether the treatments are rooted in cognitive behavior therapy, psychoanalytic psychotherapy or another bona fide treatment approach (Leichsenring et al., 2013).

Limitations and Future Perspectives

Besides the scarcity of randomized, controlled trials, further limitations of our meta-analysis need to be addressed.
Several primary studies suffered from methodological shortcomings (see Figure 3). Hence, our set of studies carried an unclear risk of bias. In particular, the successful blinding of outcome assessment needs to be addressed in future research. In general, future studies on LTPP should adhere more carefully to international guidelines and quality criteria (Higgins et al., 2011; Tolin et al., 2015). Additionally, only seven studies reported follow-up data. For a discipline assuming that treatment effects persist and even continue to improve after treatment has ended (Huber et al., 2013; Leichsenring & Rabung, 2008; Town et al., 2012), more extensive follow-up data should be provided in the future. Keeping our narrow data basis in mind, we found some evidence that follow-up data may in fact yield higher effect sizes than post-treatment data. Notably, Benecke, Huber, Schauenburg, et al. (2016) reasonably argued that a late follow-up assessment at the same point in time


constitutes the most adequate point in time for comparing short-term and long-term treatments. Given our very limited base of follow-up data, this was not possible in our study. Our post-treatment assessment at the point in time when LTPP was finished might have favored LTPP because, for short-term patients, this post-assessment is actually a follow-up assessment. Short-term patients might have already acquired a more realistic view of their life during this follow-up period. However, the majority of control conditions were other forms of long-term treatments with a more similar duration. Another limitation may be seen in the heterogeneity of control conditions. Five out of 14 control interventions were treatment-as-usual conditions representing a relatively weak comparator. However, the other nine control conditions were other (specialized) forms of psychotherapy providing stronger comparators. Future research on LTPP should only include strong comparators to allow conclusions about the specific mechanisms of change (Chambless & Hollon, 1998; Tolin et al., 2015) or should consider “treatment as usual” versus a more involved control group as a potential moderator. Different forms of LTPP as well as differences in disorders and outcome assessment measures may also be seen as limitations because they probably contributed to the moderate to substantial heterogeneity we identified. Psychotropic medication and other forms of therapy as treatment confounders were found in almost all studies. Especially for severe disorders, pharmacotherapy cannot be excluded. However, we call for a more systematic monitoring of treatment confounders in future research on LTPP. Furthermore, investigator allegiance may distort the results of comparative treatment studies (Munder, Flückiger, Gerger, Wampold, & Barth, 2012). As recommended by Luborsky et al.
(1999), our team represented a mix of different allegiances: the main author (CW) had an allegiance to (long-term) psychoanalytic therapy, whereas the second author (FS) had no allegiance. Thus, we attempted to minimize an allegiance effect in our research process. Most of the primary studies, however, have been conducted by proponents of LTPP, and, therefore, the allegiance of trialists may be a rival explanation for the advantage of LTPP. Finally, future primary studies and meta-analyses should include cost-efficiency analyses to provide a solid argument for implementation in health care systems.

Final Considerations

Compared to those of Leichsenring et al. (2013), our updated effect sizes of LTPP were somewhat smaller (0.28 vs. 0.40 for overall effectiveness). We had hypothesized that we would find smaller effect sizes than Leichsenring et al. (2013) after accounting for publication bias. Since we did not need to adjust for publication


bias, our difference in effect size is probably explained by our smaller session ratio of 1.67 compared to Leichsenring et al.’s (2013) session ratio of 1.96. They compared LTPP primarily to shorter or less intensive forms of psychotherapy, whereas we compared LTPP primarily to other forms of long-term psychotherapy. A clear classification of our comparison conditions as short-term or long-term is not possible, though, because six comparison conditions may be classified as short-term (< 40 sessions) and seven comparison conditions may be classified as long-term (≥ 40 sessions). Gregory et al. (2008) did not report the mean number of sessions for the comparison condition. In sum, the small effect sizes we found represent the additional gain of 81.8 sessions of LTPP versus 49.0 sessions of other, primarily long-term, forms of psychotherapy. Here, large differences in effect sizes are not to be expected. We tried to quantify this gain by considering the session ratio as a continuous moderator. The session ratio was only associated with the effect size of personality functioning, but in the unexpected direction. Therefore, the session ratio might not be regarded as a valid moderator, implying that variables other than the number of sessions might account for the additional gain. With an increasing data base, future meta-analysts should compare LTPP only to other forms of long-term psychotherapy by conducting non-inferiority and equivalence analyses (see Steinert et al., 2017, for such analyses of STPP).
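The session figures in this paragraph can be checked with a few lines of arithmetic. The sketch below recomputes the session ratio from the two mean session counts reported above and applies the 40-session cutoff used to label a comparison condition as short-term or long-term. It assumes the ratio is taken between the two overall means, which reproduces the reported value of 1.67; a mean of study-level ratios would be a different quantity. The function names are hypothetical.

```python
def session_ratio(ltpp_sessions: float, control_sessions: float) -> float:
    """Ratio of mean LTPP sessions to mean comparison-condition sessions."""
    return ltpp_sessions / control_sessions


def classify_duration(mean_sessions: float, cutoff: int = 40) -> str:
    """Label a condition by the 40-session cutoff used in the text."""
    return "long-term" if mean_sessions >= cutoff else "short-term"


# Mean session counts as reported in the paragraph above
ratio = session_ratio(81.8, 49.0)
print(round(ratio, 2))          # 1.67
print(classify_duration(49.0))  # long-term
```

A ratio of 1.67 means that LTPP patients received about two thirds more sessions than patients in the comparison conditions, which is why only a modest additional gain is plausible.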

Conclusion

We found small statistically significant effect sizes for the outcome domains psychiatric symptoms, target problems, social functioning, and overall effectiveness, when comparing LTPP to other, primarily long-term forms of psychotherapy. The effect size for the outcome domain personality functioning was not significant (p = .08). It is noteworthy that large differences in effect sizes are not to be expected since the reported effect sizes represent the additional gain of LTPP versus other forms of primarily long-term psychotherapy. Our effect sizes were robust to an extensive assessment of publication bias and different ways of calculating Hedges’ g, and according to proposed thresholds, we assume some clinical relevance of our findings. Patients suffering from complex mental disorders seem to benefit slightly more from a treatment with LTPP compared to other, primarily long-term forms of psychotherapy. However, in light of heterogeneous data, an unclear risk of bias across the primary studies, and prediction intervals crossing zero, these results should be interpreted cautiously. Further research and improved primary studies are urgently needed to corroborate these results.
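For readers less familiar with prediction intervals, the standard approximate form described by IntHout et al. (2016) is given below as a reminder, not as a restatement of the exact computation used here. The symbols are introduced for illustration: \hat{\mu} is the random-effects summary estimate, \hat{\tau}^{2} the estimated between-study variance, \mathrm{SE}(\hat{\mu}) the standard error of the summary estimate, and k the number of studies.

```latex
\hat{\mu} \;\pm\; t_{k-2,\,1-\alpha/2}\,\sqrt{\hat{\tau}^{2} + \mathrm{SE}(\hat{\mu})^{2}}
```

Because \hat{\tau}^{2} enters the interval directly, an imprecise heterogeneity estimate based on few studies makes the interval itself imprecise, and the interval can cross zero even when the confidence interval around \hat{\mu} does not.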


Plain Language Summary: A Meta-Analysis of the Efficacy of Long-Term Psychoanalytic Psychotherapy

What Is the Aim of This Review?

The aim of this meta-analysis was to find out whether long-term psychoanalytic psychotherapy is more efficacious than other forms of psychotherapy. The authors collected and analyzed all relevant studies to answer this question and found 14 studies.


Key Messages

Long-term psychoanalytic psychotherapy may be more efficacious than other forms of psychotherapy in the treatment of chronic mental disorders, more than one mental disorder, or a personality disorder.

What Was Studied in the Review?

Adult patients suffering from a chronic mental disorder, more than one mental disorder, or a personality disorder might need more extended treatments. A more extended form of psychotherapy is long-term psychoanalytic psychotherapy. It originated from the theories by Sigmund Freud and is defined as a form of psychotherapy in which the patient and the therapist meet once to twice weekly and the therapy takes place in a sitting position for at least 1 year and 40 sessions. As long-term psychoanalytic psychotherapy causes higher direct financial costs because of a higher number of sessions, its positive effects need to exceed those of less intensive treatments.

What Are the Main Results of the Review?

The authors found 14 relevant studies. These studies compared different forms of long-term psychoanalytic psychotherapy (e.g., mentalization-based therapy) to other forms of psychotherapy. Comparators primarily included other (non-psychoanalytic) long-term therapies, such as dialectical behavior therapy, and some forms of short-term treatments, such as short-term cognitive behavior therapy or basic health support. In each study, patients suffering from a chronic mental disorder, more than one mental disorder, or a personality disorder were randomly assigned to either long-term psychoanalytic psychotherapy or a different form of psychotherapy.

The meta-analysis shows that when patients are treated with long-term psychoanalytic psychotherapy, compared to other, primarily long-term forms of psychotherapy:
1. Patients may show slightly fewer psychiatric symptoms (low-certainty evidence).
2. Patients may show better capacities to manage social or interpersonal situations (low- to moderate-certainty evidence).
3. Patients may show slightly fewer problems that are specific to their disorder (e.g., impulse control for borderline patients; low-certainty evidence).
4. We are uncertain whether patients show a higher level of personality structure (very low-certainty evidence).

The range in which the actual effects may lie shows that long-term psychoanalytic psychotherapy may lead to a small additional gain, but may also lead to little or no additional gain when compared to other, primarily long-term forms of psychotherapy. Notably, when compared to primarily long-term forms of psychotherapy, large differences are not to be expected.

How Up-to-Date Is This Review?

The authors searched for studies that had been published up to June 11, 2017.

Electronic Supplementary Material

The electronic supplementary material is available with the online version of the article at 1016-9040/a000385

ESM 1. Supplementary material to the meta-analysis of the efficacy of LTPP

Abbass, A. A., Hancock, J. T., Henderson, J., & Kisely, S. (2006). Short-term psychodynamic psychotherapies for common mental disorders. Cochrane Database of Systematic Reviews, 4, 1–50.
Abbass, A. A., Kisely, S. R., Town, J. M., Leichsenring, F., Driessen, E., De Maat, S., . . . Crowe, E. (2014). Short-term psychodynamic psychotherapies for common mental disorders. Cochrane Database of Systematic Reviews, 7, 1–90. https://doi.org/10.1002/14651858.CD004687.pub4
APA Publications and Communications Board Working Group on Journal Article Reporting Standards. (2008). Reporting standards for research in psychology: Why do we need them? What might they be? American Psychologist, 63, 839–851. https://doi.org/10.1037/0003-066x.63.9.839


© 2019 Hogrefe Publishing


Atkins, D., Eccles, M., Flottorp, S., Guyatt, G. H., Henry, D., Hill, S., . . . The GRADE Working Group. (2004). Systems for grading the quality of evidence and the strength of recommendations I: Critical appraisal of existing approaches. BMC Health Services Research, 4, 38.
Bachar, E., Latzer, Y., Kreitler, S., & Berry, E. M. (1999). Empirical comparison of two psychological therapies: Self psychology and cognitive orientation in the treatment of anorexia and bulimia. The Journal of Psychotherapy Practice and Research, 8, 115–128.
Bateman, A., & Fonagy, P. (1999). Effectiveness of partial hospitalization in the treatment of borderline personality disorder: A randomized controlled trial. American Journal of Psychiatry, 156, 1563–1569.
Bateman, A., & Fonagy, P. (2008). 8-year follow-up of patients treated for borderline personality disorder: Mentalization-based treatment versus treatment as usual. American Journal of Psychiatry, 165, 631–638.
Bateman, A., & Fonagy, P. (2009). Randomized controlled trial of outpatient mentalization-based treatment versus structured clinical management for borderline personality disorder. American Journal of Psychiatry, 166, 1355–1364. 10.1176/appi.ajp.2009.09040539
Benecke, C., Huber, D., Schauenburg, H., & Staats, H. (2016). Wie können Langzeittherapien mit kürzeren Behandlungen verglichen werden? Designprobleme und Lösungsvorschläge am Beispiel der APS-Studie [How can long-term therapies be compared with shorter term treatment? Design problems and solution proposals exemplified by the APD study]. Psychotherapeut, 61, 476–483.
Benecke, C., Huber, D., Staats, H., Zimmermann, J., Henkel, M., Deserno, H., . . . Schauenburg, H. (2016). A comparison of psychoanalytic therapy and cognitive behavioral therapy for anxiety (panic/agoraphobia) and personality disorders (APD study): Presentation of the RCT study design. Zeitschrift für Psychosomatische Medizin und Psychotherapie, 62, 252–269.
Beutel, M. E., Bahrke, U., Fiedler, G., Hautzinger, M., Kallenbach, L., Kaufhold, J., . . . Ernst, M. (2016). LAC-Depressionsstudie [LAC depression study]. Psychotherapeut, 61, 468–475. https://
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. Chichester, UK: Wiley.
Bressi, C., Porcellana, M., Marinaccio, P. M., Nocito, E. P., & Magri, L. (2010). Short-term psychodynamic psychotherapy versus treatment as usual for depressive and anxiety disorders: A randomized clinical trial of efficacy. The Journal of Nervous and Mental Disease, 198, 647–652. nmd.0b013e3181ef3ebb
Carter, E. C., Kofler, L. M., Forster, D. E., & McCullough, M. E. (2015). A series of meta-analytic tests of the depletion effect: Self-control does not seem to rely on a limited resource. Journal of Experimental Psychology: General, 144, 796–815.
Carter, E. C., Schönbrodt, F. D., Gervais, W., & Hilgard, J. (2019). Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science, 2, 115–144. 2515245919847196
Chambless, D. L., & Hollon, S. D. (1998). Defining empirically supported therapies. Journal of Consulting and Clinical Psychology, 66, 7–18.
Clarkin, J. F., Levy, K. N., Lenzenweger, M. F., & Kernberg, O. F. (2007). Evaluating three treatments for borderline personality disorder: A multiwave study. American Journal of Psychiatry, 164, 922–928.



Coburn, K. M., & Vevea, J. L. (2017). weightr: Estimating weight-function models for publication bias. R package version 1.1.2. Retrieved from weightr
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Cooper, H. M. (2010). Research synthesis and meta-analysis: A step-by-step approach. Los Angeles, CA: Sage.
Crits-Christoph, P., & Barber, J. P. (2000). Long-term psychotherapy. In C. R. Snyder & R. E. Ingram (Eds.), Handbook of psychological change: Psychotherapy processes & practices for the 21st century (pp. 455–473). Hoboken, NJ: Wiley. https://
Cuijpers, P., Karyotaki, E., Reijnders, M., & Ebert, D. D. (2019). Was Eysenck right after all? A reassessment of the effects of psychotherapy for adult depression. Epidemiology and Psychiatric Sciences, 28, 21–30. S2045796018000057
Cuijpers, P., Turner, E. H., Koole, S. L., Dijke, A., & Smit, F. (2014). What is the threshold for a clinically relevant effect? The case of major depressive disorders. Depression and Anxiety, 31, 374–378.
Cuijpers, P., van Straten, A., Bohlmeijer, E., Hollon, S. D., & Andersson, G. (2010). The effects of psychotherapy for adult depression are overestimated: A meta-analysis of study quality and effect size. Psychological Medicine, 40, 211–223. https://
Dare, C., Eisler, I., Russell, G., Treasure, J., & Dodge, L. (2001). Psychological therapies for adults with anorexia nervosa. The British Journal of Psychiatry, 178, 216–221. 10.1192/bjp.178.3.216
de Maat, S., de Jonghe, F., de Kraker, R., Leichsenring, F., Abbass, A. A., Luyten, P., . . . Dekker, J. (2013). The current state of the empirical evidence for psychoanalysis: A meta-analytic approach. Harvard Review of Psychiatry, 21, 107–137. https://
de Maat, S., de Jonghe, F., Schoevers, R., & Dekker, J. (2009). The effectiveness of long-term psychoanalytic therapy: A systematic review of empirical studies. Harvard Review of Psychiatry, 17, 1–23.
Derogatis, L. R. (1977). SCL-90-R: Administration, scoring and procedures manual-I for the revised version. Baltimore, MD: Clinical Psychometric Research.
Doering, S., Hörz, S., Rentrop, M., Fischer-Kern, M., Schuster, P., Benecke, C., . . . Buchheim, P. (2010). Transference-focused psychotherapy v. treatment by community psychotherapists for borderline personality disorder: Randomised controlled trial. The British Journal of Psychiatry, 196, 389–395. 10.1192/bjp.bp.109.070177
Eagan, B., Rogers, B., Serlin, R., Ruis, A., Arastoopour Irgens, G., & Williamson Shaffer, D. (2017). Can we rely on IRR? Testing the assumptions of inter-rater reliability. Retrieved from https://
Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. British Medical Journal, 315, 629–634. bmj.315.7109.629
Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891–904. https://
Fonagy, P., Rost, F., Carlyle, J.-A., McPherson, S., Thomas, R., Pasco Fearon, R., . . . Taylor, D. (2015). Pragmatic randomized controlled trial of long-term psychoanalytic psychotherapy for treatment-resistant depression: The Tavistock Adult Depression Study (TADS). World Psychiatry, 14, 312–321. 10.1002/wps.20267



Fonagy, P., Steele, M., Steele, H., & Target, M. (1998). Reflexive-function manual: Version 5.0 for application to the adult attachment interview. Unpublished manual, University College, London, UK.
Gabbard, G. O. (2017). Long-term psychodynamic psychotherapy: A basic text. Arlington, VA: American Psychiatric Association.
Giesen-Bloo, J., van Dyck, R., Spinhoven, P., van Tilburg, W., Dirksen, C., van Asselt, T., . . . Arntz, A. (2006). Outpatient psychotherapy for borderline personality disorder: Randomized trial of schema-focused therapy vs transference-focused psychotherapy. Archives of General Psychiatry, 63, 649–658.
Gregory, R. J., Chlebowski, S., Kang, D., Remen, A. L., Soderberg, M. G., Stepkovitch, J., & Virk, S. (2008). A controlled trial of psychodynamic psychotherapy for co-occurring borderline personality disorder and alcohol use disorder. Psychotherapy: Theory, Research, Practice, Training, 45, 28. 10.1037/0033-3204.45.1.28
Gregory, R. J., DeLucia-Deranja, E., & Mogle, J. A. (2010). Dynamic deconstructive psychotherapy versus optimized community care for borderline personality disorder co-occurring with alcohol use disorders: A 30-month follow-up. The Journal of Nervous and Mental Disease, 198, 292–298. 10.1097/NMD.0b013e3181d6172d
Grünbaum, A. (1988). Die Grundlagen der Psychoanalyse: Eine philosophische Kritik [The foundations of psychoanalysis: A philosophical critique]. Stuttgart, Germany: Reclam.
Guyatt, G. H., Oxman, A. D., Vist, G. E., Kunz, R., Falck-Ytter, Y., Alonso-Coello, P., & Schunemann, H. J. (2008). GRADE: An emerging consensus on rating quality of evidence and strength of recommendations. British Medical Journal, 336, 924–926.
Hedges, L. V. (1984). Estimation of effect size under nonrandom sampling: The effects of censoring studies yielding statistically insignificant mean differences. Journal of Educational Statistics, 9, 61–85.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.
Higgins, J. P. T., Altman, D. G., Gøtzsche, P. C., Jüni, P., Moher, D., Oxman, A. D., . . . Cochrane Statistical Methods Group. (2011). The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. British Medical Journal, 343, d5928. https://
Higgins, J. P. T., & Green, S. (Eds.). (2008). Cochrane handbook for systematic reviews of interventions. Chichester, UK: Wiley-Blackwell.
Hollis, S., & Campbell, F. (1999). What is meant by intention to treat analysis? Survey of published randomised controlled trials. British Medical Journal, 319, 670–674. 10.1136/bmj.319.7211.670
Hollon, S. D., & Ponniah, K. (2010). A review of empirically supported psychological therapies for mood disorders in adults. Depression and Anxiety, 27, 891–932. 10.1002/da.20741
Huber, D., Henrich, G., Clarkin, J. F., & Klug, G. (2013). Psychoanalytic versus psychodynamic therapy for depression: A three-year follow-up study. Psychiatry: Interpersonal & Biological Processes, 76, 132–149.
Huber, D., Henrich, G., Gastner, J., & Klug, G. (2012). Must all have prizes? The Munich psychotherapy study. In R. A. Levy, J. S. Ablon, & H. Kächele (Eds.), Psychodynamic Psychotherapy Research (pp. 51–69). New York, NY: Humana Press. https://
Huber, D., Zimmermann, J., Henrich, G., & Klug, G. (2012). Comparison of cognitive-behaviour therapy with psychoanalytic and psychodynamic therapy for depressed patients – A three-year follow-up study. Zeitschrift für Psychosomatische Medizin und Psychotherapie, 58, 299–316. zptm.2012.58.3.299
IntHout, J., Ioannidis, J. P. A., Rovers, M. M., & Goeman, J. J. (2016). Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open, 6, e010247. bmjopen-2015-010247
Jørgensen, C. R., Bøye, R., Andersen, D., Døssing Blaabjerg, A. H., Freund, C., Jordet, H., & Kjølbye, M. (2014). Eighteen months post-treatment naturalistic follow-up study of mentalization-based therapy and supportive group treatment of borderline personality disorder: Clinical outcomes and functioning. Nordic Psychology, 66, 254–273. 2014.963649
Jørgensen, C. R., Freund, C., Bøye, R., Jordet, H., Andersen, D., & Kjølbye, M. (2013). Outcome of mentalization-based and supportive psychotherapy in patients with borderline personality disorder: A randomized trial. Acta Psychiatrica Scandinavica, 127, 305–317.
Klar, F. J. (2005). Wirksamkeit individualpsychologisch-psychoanalytischer Psychotherapie [The efficacy of individual psychological-psychoanalytical psychotherapy]. Zeitschrift für Individualpsychologie, 30, 28–50.
Knekt, P., Heinonen, E., Härkäpää, K., Järvikoski, A., Virtala, E., Rissanen, J., & Lindfors, O. (2015). Randomized trial on the effectiveness of long- and short-term psychotherapy on psychosocial functioning and quality of life during a 5-year follow-up. Psychiatry Research, 229, 381–388. 10.1016/j.psychres.2015.05.113
Knekt, P., Lindfors, O., Härkänen, T., Välikoski, M., Virtala, E., Laaksonen, M. A., . . . Renlund, C. (2008). Randomized trial on the effectiveness of long- and short-term psychodynamic psychotherapy and solution-focused therapy on psychiatric symptoms during a 3-year follow-up. Psychological Medicine, 38, 689–703.
Knekt, P., Lindfors, O., Laaksonen, M. A., Raitasalo, R., Haaramo, P., & Järvikoski, A. (2008). Effectiveness of short-term and long-term psychotherapy on work ability and functional capacity – A randomized clinical trial on depressive and anxiety disorders. Journal of Affective Disorders, 107, 95–106.
Knekt, P., Virtala, E., Härkänen, T., Vaarama, M., Lehtonen, J., & Lindfors, O. (2016). The outcome of short- and long-term psychotherapy 10 years after start of treatment. Psychological Medicine, 46, 1175–1188. s0033291715002718
Kopta, S. M., Howard, K. I., Lowry, J. L., & Beutler, L. E. (1994). Patterns of symptomatic recovery in psychotherapy. Journal of Consulting and Clinical Psychology, 62, 1009–1016. https://doi.org/10.1037/0022-006X.62.5.1009
Korner, A., Gerull, F., Meares, R., & Stevenson, J. (2006). Borderline personality disorder treated with the conversational model: A replication study. Comprehensive Psychiatry, 47, 406–411.
Lakens, D., Hilgard, J., & Staaks, J. (2016). On the reproducibility of meta-analyses: Six practical recommendations. BMC Psychology, 4, 1–10.
Lakens, D., LeBel, E. P., Page-Gould, E., van Assen, M. A. L. M., Spellman, B., Schönbrodt, F. D., & Hertogs, R. (2017). Examining the reproducibility of meta-analyses in psychology: A preliminary report. Retrieved from
Leichsenring, F., Abbass, A. A., Luyten, P., Hilsenroth, M., & Rabung, S. (2013). The emerging evidence for long-term psychodynamic therapy. Psychodynamic Psychiatry, 41, 361–384.



Leichsenring, F., Leweke, F., Klein, S., & Steinert, C. (2015). The empirical status of psychodynamic psychotherapy – An update: Bambi’s alive and kicking. Psychotherapy and Psychosomatics, 84, 129–148.
Leichsenring, F., & Rabung, S. (2008). Effectiveness of long-term psychodynamic psychotherapy: A meta-analysis. Journal of the American Medical Association, 300, 1551–1565. https://doi.org/10.1001/jama.300.13.1551
Leichsenring, F., & Rabung, S. (2011). Long-term psychodynamic psychotherapy in complex mental disorders: Update of a meta-analysis. The British Journal of Psychiatry, 199, 15–22. https://
Leichsenring, F., Rabung, S., & Leibing, E. (2004). The efficacy of short-term psychodynamic psychotherapy in specific psychiatric disorders: A meta-analysis. Archives of General Psychiatry, 61, 1208–1216.
Levy, K. N., Clarkin, J. F., Yeomans, F. E., Scott, L. N., Wasserman, R. H., & Kernberg, O. F. (2006). The mechanisms of change in the treatment of borderline personality disorder with transference focused psychotherapy. Journal of Clinical Psychology, 62, 481–501.
Lindfors, O., Knekt, P., Heinonen, E., Härkänen, T., & Virtala, E. (2015). The effectiveness of short- and long-term psychotherapy on personality functioning during a 5-year follow-up. Journal of Affective Disorders, 173, 31–38. 10.1016/j.jad.2014.10.039
Lindfors, O., Knekt, P., Virtala, E., & Laaksonen, M. A. (2012). The effectiveness of solution-focused therapy and short- and long-term psychodynamic psychotherapy on self-concept during a 3-year follow-up. The Journal of Nervous and Mental Disease, 200, 946–953.
Linehan, M. M., Comtois, K. A., Murray, A. M., Brown, M. Z., Gallop, R. J., Heard, H. L., . . . Lindenboim, N. (2006). Two-year randomized controlled trial and follow-up of dialectical behavior therapy vs therapy by experts for suicidal behaviors and borderline personality disorder. Archives of General Psychiatry, 63, 757–766.
Luborsky, L., Diguer, L., Seligman, D. A., Rosenthal, R., Krause, E. D., Johnson, S., . . . Schweizer, E. (1999). The researcher’s own therapy allegiances: A “wild card” in comparisons of treatment efficacy. Clinical Psychology: Science and Practice, 6, 95–106.
Masling, J. (Ed.). (1983). Empirical studies of psychoanalytical theories. Hillsdale, NJ: Erlbaum.
McMain, S. F., Links, P. S., Gnam, W. H., Guimond, T., Cardish, R. J., Korman, L., & Streiner, D. L. (2009). A randomized trial of dialectical behavior therapy versus general psychiatric management for borderline personality disorder. American Journal of Psychiatry, 166, 1365–1374. ajp.2009.09010039
McShane, B. B., Böckenholt, U., & Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11, 730–749. 1745691616662243
Mertens, W. (2013). Psychoanalyse als Methode, Theorie und Praxis [Psychoanalysis as a methodology, theory and practice]. In W. Mertens, C. Benecke, L. Gast, & M. Leuzinger-Bohleber (Eds.), Psychoanalyse im 21. Jahrhundert: Eine Standortbestimmung (pp. 13–32). Stuttgart, Germany: Kohlhammer.
Morris, S. B. (2008). Estimating effect sizes from pretest-posttest-control group designs. Organizational Research Methods, 11, 364–386.
Munder, T., Flückiger, C., Gerger, H., Wampold, B. E., & Barth, J. (2012). Is the allegiance effect an epiphenomenon of true efficacy differences between treatments? A meta-analysis. Journal of Counseling Psychology, 59, 631–637. 10.1037/a0029571
Munder, T., Flückiger, C., Leichsenring, F., Abbass, A. A., Hilsenroth, M. J., Luyten, P., & Wampold, B. E. (2019). Is psychotherapy effective? A re-analysis of treatments for depression. Epidemiology and Psychiatric Sciences, 28(3), 268–274. https://
Orlinsky, D. (2008). Die nächsten 10 Jahre Psychotherapieforschung: Eine Kritik des herrschenden Forschungsparadigmas mit Korrekturvorschlägen [The next 10 years of psychotherapy research: A critique of the prevailing research paradigm]. Psychotherapie Psychosomatik Medizinische Psychologie, 58, 345–354.
Popper, K. R. (1972). Conjectures and refutations: The growth of scientific knowledge. London, UK: Routledge and Kegan Paul.
Poulsen, S., Lunn, S., Daniel, S. I., Folke, S., Mathiesen, B. B., Katznelson, H., & Fairburn, C. G. (2014). A randomized controlled trial of psychoanalytic psychotherapy or cognitive-behavioral therapy for bulimia nervosa. American Journal of Psychiatry, 171, 109–116.
Puschner, B., Kraft, S., Kächele, H., & Kordy, H. (2007). Course of improvement over 2 years in psychoanalytic and psychodynamic outpatient psychotherapy. Psychology and Psychotherapy: Theory, Research and Practice, 80, 51–68. 10.1348/147608306x107593
R Development Core Team. (2008). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from
Seligman, M. E. (1995). The effectiveness of psychotherapy: The Consumer Reports study. American Psychologist, 50, 965–974.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). p-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681.
Smit, Y., Huibers, M. J., Ioannidis, J. P., van Dyck, R., van Tilburg, W., & Arntz, A. (2012). The effectiveness of long-term psychoanalytic psychotherapy – A meta-analysis of randomized controlled trials. Clinical Psychology Review, 32, 81–92.
Stanley, T. D., & Doucouliagos, H. (2014). Meta-regression approximations to reduce publication selection bias. Research Synthesis Methods, 5, 60–78. 1095 Steinert, C., Munder, T., Rabung, S., Hoyer, J., & Leichsenring, F. (2017). Psychodynamic therapy: As efficacious as other empirically supported treatments? A meta-analysis testing equivalence of outcomes. American Journal of Psychiatry, 174, 943–953. Svartberg, M., Stiles, T. C., & Seltzer, M. H. (2004). Randomized, controlled trial of the effectiveness of short-term dynamic psychotherapy and cognitive therapy for cluster C personality disorders. American Journal of Psychiatry, 161, 810–817. Tolin, D. F., McKay, D., Forman, E. M., Klonsky, E. D., & Thombs, B. D. (2015). Empirically supported treatment: Recommendations for a new model. Clinical Psychology: Science and Practice, 22, 317–338. Town, J. M., Diener, M. J., Abbass, A. A., Leichsenring, F., Driessen, E., & Rabung, S. (2012). A meta-analysis of psychodynamic psychotherapy outcomes: Evaluating the effects of research-specific procedures. Psychotherapy, 49, 276–290. van Aert, R. C. M. (2017). p-uniform. Retrieved from https://



van Aert, R. C. M., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and p-curve. Perspectives on Psychological Science, 11, 713–729. https://
van Asselt, A. D., Dirksen, C. D., Arntz, A., Giesen-Bloo, J. H., van Dyck, R., Spinhoven, P., . . . Severens, J. L. (2008). Out-patient psychotherapy for borderline personality disorder: Cost-effectiveness of schema-focused therapy v. transference-focused psychotherapy. The British Journal of Psychiatry, 192, 450–457.
van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293–309.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36, 1–48.
Weissman, M. M., & Bothwell, S. (1976). Assessment of social adjustment by patient self-report. Archives of General Psychiatry, 33, 1111–1115. 01770090101010
Werner, C., & Langenmayr, A. (2006). Die Bedeutung der frühen Kindheit. Psychoanalyse und Empirie [The significance of early childhood. Psychoanalysis and Empirical Evidence]. Göttingen, Germany: Vandenhoeck & Ruprecht.
Zimmermann, J., Löffler-Stastka, H., Huber, D., Klug, G., Alhabbo, S., Bock, A., & Benecke, C. (2015). Is it all about the higher dose? Why psychoanalytic therapy is an effective treatment for major depression. Clinical Psychology & Psychotherapy, 22, 469–487.
Zipfel, S., Wild, B., Groß, G., Friederich, H.-C., Teufel, M., Schellberg, D., . . . Herpertz, S. (2014). Focal psychodynamic therapy, cognitive behaviour therapy, and optimised treatment as usual in outpatients with anorexia nervosa (ANTOP study): Randomised controlled trial. The Lancet, 383, 127–137. https://

History
Received March 16, 2018
Revision received July 6, 2019
Accepted August 6, 2019
Published online December 6, 2019

Acknowledgments We thank Falk Leichsenring and all other authors who provided us with additional information and/or data. Furthermore, we are grateful to Wiebke Goltz, Martina Stumpp, and Jens Christian for their comments on an earlier version of the manuscript. We also thank Larissa Sust for double coding our raw data and outcome measures.

European Psychologist (2020), 25(1), 51–72

C. F. J. Woll & F. D. Schönbrodt, Efficacy of LTPP

Open Data
We embrace the values of openness and transparency in science. The preregistration, open data, and reproducible scripts for all data analyses reported in this paper can be accessed at

Christian Franz Josef Woll
Department of Psychology
Clinical Psychology of Children and Adolescents and Psychology of Interventions
Ludwig-Maximilians-Universität Munich
Leopoldstr. 13
80802 Munich
Germany

Christian Franz Josef Woll (MSc, Clinical Psychology) is a research fellow at the Department of Clinical Psychology of Children and Adolescents at the Ludwig-Maximilians-Universität Munich. His major research interests include the impact of psychiatric disorders on caregiver-infant interaction, meta-analytic methods, and current developments in open science. He is also a candidate in psychoanalytic training at the Akademie für Psychoanalyse und Psychotherapie, Munich.

Felix Schönbrodt (PhD) is a principal investigator at the Ludwig-Maximilians-Universität Munich and the managing director of the LMU Open Science Center. He obtained his PhD in psychology in 2010 at the Humboldt-University Berlin and received his habilitation in 2014 at the Ludwig-Maximilians-Universität Munich. His research interests include implicit and explicit motives, quantitative methods in Bayesian statistics and meta-analysis, data visualization, and issues revolving around open science and the replicability of research.

© 2019 Hogrefe Publishing

Erratum

Correction to Nuñez & León, 2015

The article entitled “Autonomy support in the classroom: A review from self-determination theory” by Nuñez, J. L. & León, J. (2015, European Psychologist, 20, 275–283) contained an error. The following funding information is missing on page 280:


Funding This work was supported by the University of Las Palmas de Gran Canaria, Spain (ULPGC 2013-13).

Published online February 24, 2020

Nuñez, J. L., & León, J. (2015). Autonomy support in the classroom: A review from self-determination theory. European Psychologist, 20, 275–283.

The authors regret any inconvenience or confusion this error may have caused.

© 2020 Hogrefe Publishing

European Psychologist (2020), 25(1), 73

EFPA News and Views

Meeting Calendar

April 9–16, 2020
34th EFPSA Congress
Castlebar, Ireland
Contact: EFPSA, Brussels, Belgium, E-mail, Web https://

April 28–29, 2020
Faculty for People with Intellectual Disabilities – Annual Conference
London, UK
Contact: BPS Conferences, https://

May 21–24, 2020
32nd APS Annual Convention
Chicago, IL, USA
Contact:

June 24–27, 2020
10th European Conference on Positive Psychology
Reykjavik, Iceland
Contact: SENA, Reykjavik, Iceland, E-mail, Web https://

European Psychologist (2020), 25(1), 74

July 6–9, 2020
International Congress of Infant Studies
Glasgow, UK
Contact: ICIS Secretariat, Victoria, BC, Canada, Web

July 19–24, 2020
32nd International Congress of Psychology (ICP)
Prague, Czech Republic
Contact: Congress Secretariat, Computer System Group a.s., 5. Kvetna 65, 140 21 Prague, Czech Republic, E-mail, Web

March 25–27, 2021
International Convention of Psychological Science
Brussels, Belgium
Contact: https://www.psychological

September 2021
17th European Congress of Psychology (ECP2021)
Ljubljana, Slovenia
Contact: Slovenian Psychological Association, Web, and European Federation of Psychologists’ Associations, E-mail headoffice@

August 6–9, 2020
128th Annual Convention of the American Psychological Association
Washington, DC, USA
Contact:




European Federation of Psychologists’ Associations

What Is EFPA?

EFPA is the leading Federation of National Psychologists Associations. It provides a forum for European cooperation in a wide range of fields of academic training, psychology practice and research. There are 37 member associations of EFPA representing about 300,000 psychologists. The member organizations of EFPA are concerned with promoting and improving psychology as a profession and as a discipline, particularly, though not exclusively, in applied settings and with emphasis on the training and research associated with such practice. The psychologists in the member associations include practitioners as well as academic and research psychologists. The Federation has as one of its goals the integration of practice with research and the promotion of an integrated discipline of psychology.

A full list of all EFPA member associations can be found at
A full list of all EFPA associate members can be found at

European Psychologist

Members of EFPA Member Associations and other European psychology organizations supporting the European Psychologist are entitled to a special subscription rate of 149.00 per year. The European Psychologist is published quarterly. Only calendar year subscriptions are available. Prices exclude shipping and handling charges. All subscriptions include print issues and access to full-text online.

EFPA Executive Council (EC)

What EFPA Does

• Representation, advocacy, and lobbying at European level
• Promotion of psychology education, research and profession
• EuroPsy certification
• Test User Accreditation
• Support for Member Associations
• European projects
• Advice on professional affairs, work areas, new developments
• Publicity and information sharing
• European Congress of Psychology
• European Psychologist (Official Organ of the EFPA, http://
• The EFPA News Magazine (

EFPA Membership

Membership is open to the national psychologist association of all European countries but there may be only one member association per country. EFPA has no individual members. In countries where there is more than one national psychologist association, the Federation should endeavour to identify the most representative organization and, if appropriate, encourage the development of a national federation in order to promote cooperation among psychological associations. Membership is determined by the General Assembly upon presentation of the applicant association’s articles, statutes, and code of ethics plus details of the membership (see EFPA Statutes Article 5).

As a European federation representing the interests of psychologists in Europe, EFPA is committed to making contact with member associations of psychologists from European countries, which are not yet members of EFPA. This is particularly the case in relation to psychologists’ associations from countries in the east of Europe, with whom EFPA is concerned to make contacts, to share information and to promote collaboration. A member of the EFPA Executive Council has responsibility for developing these links and contacts, and the EC will attempt to help to support initiatives.


President: Christoph Steinebach (2019-2023)
Vice President / Secretary General: Ole Tunold (2019-2023)
Vice President / Treasurer: Nicola Gale (2019-2023)
EC Member: Eleni Karayianni (2019-2023)
EC Member: Anna Leybina (2019-2023)
EC Member: Josip Lopizic (2017-2021)
EC Member: Koen Lowet (2019-2023)

EFPA Head Office Brussels

Director: Sabine Steyaert
Office Manager: Julie Van den Borre
Management Assistant: Ivana Marinovic
Communication Coordinator: Ruth Mozagba

EFPA Head Office
Grasmarkt 105 / 39
1000 Brussels
Belgium
Tel.: +32 2 503-4953
Fax: +32 2 503-3067
E-mail:
Web

For further information on EFPA events please visit the EFPA website:

European Psychologist (2020), 25(1), 75

Instructions to Authors - European Psychologist

European Psychologist is a multidisciplinary journal that serves as the voice of psychology in Europe, seeking to integrate across all specializations in psychology and to provide a general platform for communication and cooperation among psychologists throughout Europe and worldwide.

European Psychologist publishes the following types of articles: Original Articles and Reviews, EFPA News and Views.

Manuscript Submission: Original Articles and Reviews manuscripts should be submitted online at http://www.editorial Items for inclusion in the EFPA News and Views section should be submitted by email to the EFPA News and Views editor Eleni Karayianni. Detailed instructions to authors are provided at http://www.

Copyright Agreement: By submitting an article, the author confirms and guarantees on behalf of him-/herself and any coauthors that he or she holds all copyright in and titles to the submitted contribution, including any figures, photographs, line drawings, plans, maps, sketches and tables, and that the article and its contents do not infringe in any way on the rights of third parties. The author indemnifies and holds harmless the publisher from any third-party claims. The author agrees, upon acceptance of the article for publication, to transfer to the publisher on behalf of him-/herself and any coauthors the exclusive right to reproduce and distribute the article and its contents, both physically and in nonphysical, electronic, and other form, in the journal to which it has been submitted and in other independent publications, with no limits on the number of copies or on the form or the extent of the distribution. These rights are transferred for the duration of copyright as defined by international law. Furthermore, the author transfers to the publisher the following exclusive rights to the article and its contents:


1. The rights to produce advance copies, reprints, or offprints of the article, in full or in part, to undertake or allow translations into other languages, to distribute other forms or modified versions of the article, and to produce and distribute summaries or abstracts.
2. The rights to microfilm and microfiche editions or similar, to the use of the article and its contents in videotext, teletext, and similar systems, to recordings or reproduction using other media, digital or analog, including electronic, magnetic, and optical media, and in multimedia form, as well as for public broadcasting in radio, television, or other forms of broadcast.
3. The rights to store the article and its content in machine-readable or electronic form on all media (such as computer disks, compact disks, magnetic tape), to store the article and its contents in online databases belonging to the publisher or third parties for viewing or downloading by third parties, and to present or reproduce the article or its contents on visual display screens, monitors, and similar devices, either directly or via data transmission.
4. The rights to reproduce and distribute the article and its contents by all other means, including photomechanical and similar processes (such as photocopying or facsimile), and as part of so-called document delivery services.
5. The right to transfer any or all rights mentioned in this agreement, as well as rights retained by the relevant copyright clearing centers, including royalty rights to third parties.

Online Rights for Journal Articles: Guidelines on authors’ rights to archive electronic versions of their manuscripts online are given in the document “Guidelines on sharing and use of articles in Hogrefe journals” on the journals’ web page at http://

November 2016


Cultural diversity – challenge and opportunity “It’s a book that we were all waiting for, and will be useful not only to psychologist practitioners and students, but also to stakeholders and policy makers in education.” Bruna Zani, Professor of Social and Community Psychology, Department of Psychology, Alma Mater Studiorum-University of Bologna, Bologna, Italy; EFPA Executive Council Member

Alexander Thomas (Editor)

Cultural and Ethnic Diversity
How European Psychologists Can Meet the Challenges
2018, x + 222 pp. US $56.00 / € 44.95 ISBN 978-0-88937-490-4 Also available as eBook

Culture and diversity are both challenge and opportunity. This volume looks at what psychologists are and can be doing to help society meet the challenges and grasp the opportunities in education, at work, and in clinical practice. The increasingly international and globalized nature of modern societies means that psychologists in particular face new challenges and have new opportunities in all areas of practice and research. The contributions from leading European experts cover relevant intercultural issues and topics in areas as diverse as personality, education and training, work and organizational psychology, clinical and counselling psychology, migration and international youth exchanges. As well as looking at the new challenges and opportunities that psychologists face in dealing with people from increasingly varied cultural backgrounds, perhaps more importantly, they also explain and discuss how psychologists can deepen and acquire the intercultural competencies that are now needed in our professional lives.

32nd International Congress of Psychology

July 19 - 24, 2020 Prague, Czech Republic 5-day Scientific Programme Over 25 State-of-the-Art Lectures Over 100 Keynote Addresses Over 190 Invited Symposia Over 5 Controversial Debates and much more …

Represent your country and join us at the ICP 2020! Follow us on Facebook, Twitter, Instagram!