
Leibniz Universität Hannover
Institut für Politische Wissenschaft
Magister thesis

Propensity Score Matching for Improved Aid Effectiveness
The Use of Rigorous Impact Evaluation Tools in the Context of the International Aid Effectiveness Discussion

Claudia Schwegmann
Matriculation number 26909307
Birkenkamp 41
30900 Wedemark

Degree programme: Magister
Subjects: Political Science, Economics, Catholic Theology
First examiner: Prof. Dr. Markus Klein
Second examiner: Prof. Dr. Christiane Lemke


Table of contents

1 Introduction
2 The current state of aid - in Germany and beyond
  2.1 Why development aid?
  2.2 The goals of German development aid
    2.2.1 The Millennium Development process
    2.2.2 National aid policy in Germany
  2.3 The financial scale of aid
  2.4 The primary agents of development aid in Germany
  2.5 Does aid work? – The challenge of aid effectiveness
  2.6 Principal Agent Theory in development cooperation
    2.6.1 The basic elements of the principal agent theory
    2.6.2 Information asymmetry and conflicts of interest in development cooperation
    2.6.3 Obstacles to reform in development cooperation: Incentives
    2.6.4 Rigorous impact evaluation as one solution to aid effectiveness?
3 Evaluation
  3.1 Evaluation in development cooperation
  3.2 Constituting elements of evaluation
    3.2.1 Data collection and analysis
    3.2.2 Value-based judgement
    3.2.3 Programme theory
    3.2.4 Credible knowledge
    3.2.5 Use of evaluation findings
  3.3 Definition of evaluation
  3.4 Concepts and Classifications of evaluation
  3.5 The call for rigorous impact evaluations in development cooperation
    3.5.1 Evidence-based policy
    3.5.2 International initiatives promoting RIE
    3.5.3 Arguments in favour of RIE?
    3.5.4 Causality and Certainty
4 Propensity Score Matching for hard data
  4.1 Theoretical basis for causal analysis
    4.1.1 The Rubin Causal Model
    4.1.2 Basic assumptions for causal analysis based on the RCM
    4.1.3 Approaches to establish a counterfactual
  4.2 Propensity Score Matching and its limits
    4.2.1 Description of the method
    4.2.2 Limitations of PSM
  4.3 Limitations of RIE generally
    4.3.1 Limitations in Research Design
    4.3.2 Limitations in data collection
    4.3.3 Limitations in data analysis
    4.3.4 Limitations in data interpretation
  4.4 Do Propensity Score Matching and Rigorous Impact Evaluation produce hard facts?
5 Can evaluation findings improve aid policy – theoretical analysis and empirical evidence of use
    5.1.1 The concepts of use and the crisis of utilisation
  5.2 Instrumental use
  5.3 Conceptual use
  5.4 Political use
  5.5 Factors
    5.5.1 Internal incentives
    5.5.2 The political window of opportunity
    5.5.3 Communication of findings
    5.5.4 Timing of findings
    5.5.5 Relevance and quality
  5.6 Use of evaluation findings for policy development in the light of the principal-agent theory?
6 Propensity Score Matching for better aid?
  6.1 Will RIEs produce hard facts?
  6.2 Will more scientific evidence on impact lead to better aid?
  6.3 Aid as key lever for poverty reduction
  6.4 Conclusions
7 References
  7.1 Books and monographs
  7.2 Articles in anthologies
  7.3 Articles in journals
  7.4 Internet resources

“The authors of this Guidance document believe that the ultimate reason for promoting impact evaluations is to learn about 'what works and what doesn't and why' and thus to contribute to the effectiveness of future development interventions.” (Leeuw/Vaessen 2009:xxi)

“The policy-making process is a political process, with the basic aim of reconciling interests in order to negotiate a consensus, not of implementing logic and truth” (Weiss, 1977, p533).


1 Introduction

Few policy fields in Germany face as much fundamental criticism as development cooperation. Highly publicized critics such as William Easterly (2006) and Dambisa Moyo (2009) have fueled public debate about the merit of development cooperation. In September 2008 the “Bonner Aufruf” (Neudeck et al., 2008), which condemned Germany's development cooperation as a failure, caused a major stir among development practitioners and experts.1 Politicians and practitioners are trying to justify aid. But as one development decade follows the next and news of dire poverty keeps coming back, the intensity of the criticism is increasing.

1 In the field of development cooperation the term “development practitioner” is frequently used. It designates people working in development projects in developing countries, staff of development agencies in developed countries, and consultants, trainers and evaluators who frequently work at the project level. The term “expert” is used here to designate researchers working in development cooperation.

The key question in public and academic debate is whether aid is really working. This question has also been central to the trends in development cooperation over the last 10 years. Starting with the Millennium Development Summit in 2000, numerous publications, workshops, international fora and conventions have dealt with the effectiveness of aid and have probably made “aid effectiveness” the development cooperation buzzword of the decade. Aid effectiveness concerns political and administrative decisions in the countries giving aid as well as in those receiving it. It has become one of the central issues both for the German Ministry for Economic Cooperation and Development (BMZ, 2010a) and for international fora such as the Organisation for Economic Cooperation and Development (OECD). Two documents, the Paris Declaration (2005) and the Accra Agenda for Action (2008), set the aid effectiveness agenda for most national and multilateral actors in development cooperation and shape the international debate (OECD-DAC, 2008a). Amongst other strategies, donors and recipient governments agreed in the Paris Declaration to focus more on the results of development cooperation and to promote information-based decision making.

“Managing for results (MfDR) means managing and implementing aid in a way that focuses on the desired results and uses information to improve decision-making.” (ibid., article No. 43: 7)


In the drive for more evidence-based development policy, evaluations are considered to be of key importance. Indeed, there is a strong movement in development cooperation promoting more and better evaluations to improve decision making. Evaluations should be more rigorous and should provide clear evidence about impact. The call for rigorous impact evaluations (RIE) is supported by international stakeholders such as the OECD and the World Bank, by researchers, and by national stakeholders in Germany such as the German Ministry for Economic Cooperation and Development (BMZ).

The assumption underlying this call for more RIE is that better evidence will improve the knowledge base of decision making and increase accountability. Better knowledge and higher accountability, it is assumed, will result in improved development policies, which in turn will result in more effective aid (Leeuw/Vaessen 2009:xxi). Finally, it is assumed that more effective aid will indeed reduce poverty. The ambition of this thesis is to analyse the first two steps in this chain of underlying assumptions.

The first assumption concerns the quality of evidence that RIEs can provide. In political debate, evidence based on quantitative research is often regarded as “hard facts”; this is also true for development cooperation. Chapter four of this thesis will therefore explore the quality of evidence that can be achieved with RIEs. RIE comprises a large number of different tools and approaches. Given the limited scope of this thesis, only one RIE tool will be analysed in more detail: propensity score matching. Quantitative research in development cooperation faces several important challenges in terms of data quality. One important challenge is the difficulty of establishing a comparison group (control group). Propensity score matching (PSM) is frequently cited in the RIE literature as an adequate tool to meet this challenge (e.g. Baker 2000:6; Ravallion XX; Leeuw/Vaessen 2009:25; Bamberger 2006:1; Ruprah 2008:6; Caspari/Barbu 2008:13; White 2006:14-16; Jones et al. 2009). The focus of chapter four will be on how PSM can contribute to evidence. Other tools, often used in combination with RIE, will only be discussed briefly.

The second assumption underlying the RIE movement is that better evidence will lead to better policy. In the context of this thesis the term “policy” will refer to strategic decisions in development cooperation at the project level, the agency level and the governmental level. Chapter five will analyse this assumption in detail from the perspective of how RIE findings are used in policy decision-making. This discussion of use will be based on the theoretical concept of the Principal-Agent Theory: how does the information provided by RIE influence the aid system?

The last two elements of the causal chain mentioned above (good policy leads to better aid; better aid leads to less poverty) are beyond the scope of this thesis. However, it is worthwhile to keep the whole causal chain in mind.

One might reasonably question the relevance of such a methodological research topic, particularly in the realm of political science. There are three reasons to discuss the role of RIEs in development cooperation. The first reason is that evaluation is an inherently political activity and evaluation designs have an impact on the development agenda.

“Evaluation is by its nature a political activity. It serves decisionmakers, results in reallocations of resources, and legitimizes who gets what. It is intimately implicated in the distribution of basic goods in society. It is more than a statement of ideas; it is a social mechanism for distribution, one which aspires to institutional status.” (House 1980:121)2

If methodological choices are political, then a strong movement in the international and national development community to promote one particular evaluation approach to improve aid effectiveness is politically relevant.

A second motive for this research topic is that evaluations need to be cost-effective. RIEs in development cooperation often require extensive surveys and can be very expensive. Just as investment in social programmes needs justification, the call to invest much more money in RIE needs justification.
The third and most important motive for this research question concerns the political agenda in the aid effectiveness debate. The debate about what needs to be done to improve aid and to reduce poverty has been ongoing for many years, and a number of researchers have highlighted the structural deficits of the aid system, particularly the given structure of incentives (Martens 2008; Faust/Messner 2007; Borrmann/Michaelowa 2005; Barder 2009). The question is whether a substantial investment in more RIEs will change this system and really enhance aid effectiveness, or whether it is just one more symptom of a dysfunctional system.

2 According to Weiss, evaluation is political in three ways: “(1) Programs and policies are 'creatures of political decisions' so evaluations implicitly judge those decisions; (2) evaluations feed political decision making and compete with other perspectives in the political process; and (3) evaluation is inherently political by its very nature because of the issues it addresses and the conclusions it reaches.” (Weiss 1993, cited in Patton 1997:343)

In order to address the above questions, this thesis is structured as follows. The second chapter briefly describes the current state of aid: it introduces the goals and priorities of governmental development cooperation, its financial volume and its key stakeholders, and discusses the aid effectiveness debate. Chapter III provides an overview of evaluation theory, the definition of evaluation and evaluation standards in development cooperation. Chapters IV and V are the core chapters of the thesis. Chapter IV analyses in detail, in the context of statistical theory, the contribution of PSM and of RIE in general to “hard evidence”. Chapter V looks at the use of scientific evidence for policy making from a theoretical point of view and presents the limited evidence on the use of evaluation findings in development cooperation. The final chapter VI formulates the conclusions of the analysis for the aid effectiveness discussion. Some general comments and questions on the wider context of aid effectiveness and poverty reduction conclude the thesis.

In this thesis the terms development cooperation and aid will be used interchangeably. The focus of the thesis will be on governmental development cooperation and particularly German development cooperation.


2 The current state of aid - in Germany and beyond

The core question of this thesis is how RIE can enhance aid effectiveness. To address this question, the following section will provide a brief overview of the current state of aid and of the aid effectiveness debate. Development cooperation comprises financial support, technical cooperation, policy dialogue at the political level and a number of other elements. RIE and evaluations generally are almost exclusively used for the financial and technical support of projects in developing countries. Therefore other areas of development cooperation will not be considered here. Other distinctions in development cooperation include bilateral versus multilateral aid, governmental aid (ODA) versus non-governmental aid, and development cooperation by different donor countries. The focus of this thesis will be the bilateral, governmental development cooperation of Germany.

2.1 Why development aid?

Why is Germany investing in development cooperation? In official government publications and international declarations, the moral responsibility of rich nations to contribute to poverty reduction is often highlighted. According to the German Ministry for Economic Cooperation and Development (BMZ 2010b), development aid in the face of hunger, epidemics, poverty and environmental degradation is an act of humanity. This motive is to a large extent backed by the German public (Faust/Messner 2007:2) and by non-governmental agencies such as ONE, Kindernothilfe, Ärzte ohne Grenzen or Deutsche Welthungerhilfe, which lobby for more and better official development aid.

While poverty reduction is the most prominent goal, German development cooperation has, from its beginnings in 1961, been undertaken for a number of different motives, some more openly than others.3 Numerous studies have analysed the motives of governmental development cooperation in great detail and confirm the diversity of motives for aid (Martens 2008:295; Barder 2009:9).4 The most important motives apart from poverty reduction are political goals, environmental concerns, and commercial and ideological interests.

3 In 1952 Germany made its first financial contribution to a development programme of the United Nations. In 1956 the first fund for development aid was created. The Ministry for Economic Cooperation and Development (BMZ) was founded in 1961 and Walter Scheel was nominated its first minister (BMZ 2010b).

4 Specifically, the “www literature” (who gives aid to whom and why) identified colonial ties, commercial interests and “genuine aid interests” as guiding motives for governmental development cooperation (Martens 2008:295).

Strategic interests and political goals nearly always guide development cooperation (Barder 2009:9). Global economic, environmental and security threats are often caused by poverty and inequality (BMZ 2001; Faust/Messner 2007:3). In this sense development cooperation promotes the national interest of Germany. In recent times, political scientists and politicians alike have particularly stressed the importance of aid projects in containing civil unrest, war among developing nations and global terrorism. Development aid has thus become a key aspect of international security policy (Nuscheler 2008:30). Likewise, environmental protection measures in developing countries are in the national interest of Germany and an important sector of development cooperation (Andersen 2005:54-55; BMZ 2010b).

Development aid has also always been pursued for economic reasons. Collaboration with developing nations is supposed to secure access to primary resources, investment opportunities and consumer markets (Nuscheler 2008:22-24; Faust/Messner 2007:3; BMZ 2010c). Following the worldwide financial crisis of 2008, the stabilisation of international financial markets and of stocks of primary resources has also been highlighted (ibid.). The new German government elected in 2009 has emphasised the promotion of German commercial interests in development cooperation (Niebel 2009).

Development cooperation can also be ideologically motivated. In the past, western nations used development aid extensively as a foreign policy instrument in the Cold War (Nuscheler 1991:220-222). By funding developing nations, the influence of the Soviet Union was to be contained. Germany in particular used development aid to prevent the international recognition of the German Democratic Republic (Andersen 2005:54-56). Development aid to contain socialism is no longer relevant today; however, western governments and multinational organisations still put forward ideological arguments for a strong commitment to development aid. In the last decade the number of projects and programmes that directly promote democracy and good governance has multiplied. While these arguments may be valid in the sense that democracies are more accountable to their citizens for government services, they are still ideological in nature.

The different motives for giving aid are reflected in the choice of recipient countries and the amounts of net ODA. The current state of aid, as far as the level of investment is concerned, is illustrated further below.

2.2 The goals of German development aid

The goals of German development aid are shaped by German national priorities as well as by international trends in development cooperation. In recent years the questionable effectiveness of aid and the Millennium Development Goals have dominated the international debate. This debate has resulted in commitments at the level of the United Nations and the OECD, which have also strongly influenced German development policy (Reade 2009:273; Kevenhörster/van den Boom 2009:34-36).

2.2.1 The Millennium Development process

Since the beginnings of official development aid in the 1950s, there have been numerous international conferences, declarations and action programmes produced by politicians, practitioners and researchers. Poverty-related cooperation between governments and institutions gained momentum in the 1990s with the organisation of many high-profile conferences on socio-economic development. The most important conferences of that decade were the Paris Conference on the Least Developed Countries in 1990, the Rio conference on environment in 1992, the Vienna conference on human rights in 1993, the Cairo conference on world population in 1994, the Copenhagen conference on social development, the World Climate conference in Berlin and the World Conference on Women in Beijing, all in 1995, as well as the Istanbul conference on habitat in 1996. Despite the impressive rhetoric of these conferences, the results in terms of less poverty and more development fell short of expectations (Andersen 2005:46-47; Birdsall 2008:515).


Given the serious challenges still facing the developing world, the 189 heads of state meeting at the United Nations Millennium Summit in New York in September 2000 adopted the Millennium Declaration “to establish a just and lasting peace all over the world” and to solve the “international problems of an economic, social, cultural or humanitarian character” (UN 2000a). Based on the agreements reached at the United Nations conferences in the 1990s, and on the values and principles espoused by the United Nations, the Millennium Declaration outlines four major fields of cooperation for the international community: 1) Peace, Security and Disarmament, 2) Development and Poverty Eradication, 3) Protecting our Common Environment and 4) Human Rights, Democracy and Good Governance (ibid.).

In chapter III of the declaration, “Development and poverty eradication”, the signing governments committed themselves to eight measurable goals, the so-called Millennium Development Goals (MDGs). Each of these goals consists of several concrete targets, most of them measurable in quantitative terms. According to the Millennium Declaration, a process to monitor the achievement of these goals and targets is to be set up, and the goals are to be reached by 2015. In fact, the Millennium Declaration and the MDGs have become a global reference point for development cooperation.

“With the adoption of the Millennium Declaration in September 2000 and the Millennium Development Goals (MDGs) later derived from it, the international community has for the first time achieved a consensus on the road map to chart the way out of poverty and global injustice, towards greater environmental sustainability, democracy, equality and peace.” (BMZ 2008a:Preface)

The MDGs comprise the following goals:
1. Eradicate extreme poverty and hunger.
2. Achieve universal primary education.
3. Promote gender equality and empower women.
4. Reduce child mortality.
5. Improve maternal health.
6. Combat HIV/AIDS, malaria and other diseases.
7. Ensure environmental sustainability.
8. Develop a global partnership for development.


In an effort to ensure the implementation of the Millennium Summit, the international community agreed both on more aid and on better aid. Following the Millennium Development Summit in 2000, a series of conferences was organised that shaped the international discussion on the levels of funding and on aid effectiveness.

The participants of the International Conference on Financing for Development in Monterrey, Mexico (Monterrey 2002), called for donors to invest considerably more in development aid. At the European level, the European Union decided in 2005 to gradually increase the ODA rate of its members in order to implement the so-called Monterrey Consensus. According to this plan, all members that joined the European Union before 2002 commit to reach an ODA rate of 0.7% of GNP by 2015 (BMZ 2009a). The call for more aid was supported by the “big push” theory of Jeffrey Sachs5, director of the United Nations Millennium Project from 2002 to 2006, and led to more promises of development funding at the G-8 Summits in Gleneagles, Scotland, in 2005, Heiligendamm, Germany, in 2007 and L'Aquila, Italy, in 2009 (G8 Information Centre).

5 Jeffrey Sachs is a former Harvard Professor in Economics and Director of the Earth Institute at Columbia University. Currently he is also the Special Advisor to United Nations Secretary General Ban Ki-Moon, and the founder and co-President of the Millennium Promise Alliance. He is the author of “The End of Poverty”, which argues that a sufficiently high amount of development aid can abolish poverty. Researchers in development aid tend to discredit the “big push” theory and point to more pertinent levers of poverty reduction (Easterly 2006; Easterly 2008; Nuscheler 2008:27.35; Reinikka 2008:194-195; Faust/Messner 2007). Nevertheless it still has a strong influence on aid politics and is promoted by vocal lobby organisations.

As for better aid, a first follow-up conference of the Millennium Summit was held in Rome in 2003: the High-Level Forum on Harmonization. At this meeting the donors “committed to take action to improve the management and effectiveness of aid” and to foster harmonization of aid activities (HLF 2010a). Also in 2003 the OECD/DAC created the Working Party on aid effectiveness and donor practices, in which donor countries and recipient countries work in five clusters to improve the aid system (OECD-DAC 2010a). In 2005 the High Level Forum on Aid Effectiveness in Paris elaborated the Paris Declaration, which has become the central document of the aid effectiveness debate and probably a milestone in development cooperation as a whole. Considering the major flaws of the aid system, the Paris Declaration formulates five principles of effective development cooperation:

“Ownership: Partner countries exercise effective leadership over their development policies and strategies, and coordinate development actions.
Alignment: Donors base their overall support on partner countries' national development strategies, institutions, and procedures.
Harmonization: Donors' actions are more harmonized, transparent, and collectively effective.
Managing for results: Managing resources and improving decision making for development results.
Mutual accountability: Donors and partners are accountable for development results.” (OECD-DAC 2008a)

At the third High Level Forum on Aid Effectiveness in 2008, about 1700 participants met in the capital of Ghana to follow up on the Millennium process and to review progress made on the principles agreed upon in the Paris Declaration. The participating representatives of governments of developed and developing countries, of donor agencies, of UN and multilateral institutions, as well as of 80 civil society organizations, adopted the Accra Agenda for Action (AAA), another key document on aid effectiveness (ibid.). The main points of the AAA are the predictability and transparency of aid, the need to review aid conditionality and to untie aid, the importance of ownership and of the use of country systems, the fostering of south-south partnerships and the reduction of aid fragmentation.

2.2.2 National aid policy in Germany

The guiding principles of German development policy are the reduction of poverty worldwide, the protection of the natural environment, and the promotion of peace, democracy and equitable forms of globalisation (BMZ 2010d). In line with these guiding principles, Germany has been actively involved in the Millennium Development process and supports the MDGs, the Paris Declaration and the AAA. All three strategy papers have had a major impact on German development policy.

“Germany bases its development policy on international agreements and commitments, particularly the Millennium Declaration and the Millennium Development Goals. The German government is also working at national and international level to implement the provisions of the Paris Declaration on Aid Effectiveness.” (BMZ 2008a:12)

In April 2001, the German government adopted the “Programme of Action 2015”, geared towards the implementation of the MDGs. According to this programme, the eradication of poverty as envisaged by the MDGs is the overarching task of German development policy (BMZ 2001:2). The Programme of Action specifies the approaches chosen by the German government to achieve the MDGs. In 2004 the BMZ published a strategy paper that emphasized the importance of the MDGs and outlined the mandatory process of adapting German development policy to the MDGs (BMZ 2004). This process was to have an impact on programming, controlling, monitoring, impact assessment, strategy, allocation of resources and coordination among ministries (BMZ 2005). To this day the Programme of Action, as the implementation plan of the MDGs, is one of the key policy documents of the BMZ. The 13th Development Policy Report of the German Government in 2008 evaluates progress in terms of the achievement rate of the MDGs (BMZ 2008a). The new German government elected in 2009 explicitly supports the MDGs and the aid effectiveness process (BMZ 2010e).

2.3 The financial scale of aid

The core question of aid effectiveness is whether the investments in foreign aid are achieving their goals at a reasonable price. In order to approach the issue of aid effectiveness it is therefore indispensable to get a general idea of how much the German government is spending on official development aid (ODA)6 in relation to overall budgets in both donor and recipient countries. While the figures presented in this section give a general overview, it needs to be highlighted that the OECD-DAC definition of ODA is disputed.7

6 ODA is defined as a transfer by the government of a donor country of money, goods or services to developing countries or citizens of developing countries, aiming at the socio-economic development of these countries (OECD-DAC 2010d). ODA includes expenses for students or migrants from developing countries in the donor country, scientific research benefiting primarily developing countries, expenses for awareness-raising programmes on development issues in donor countries, debt cancellations, administration of development aid as well as loans to developing countries comprising a grant element of at least 25%. In the case of such loans, not just the 25% grant element but the whole loan is counted as ODA. Net ODA, as opposed to ODA, is the sum of total ODA transfers minus the loan repayments. The OECD-DAC has established a list of countries considered developing countries in the calculation of ODA as well as a definition of what constitutes ODA and what does not (BMZ 2010g).

7 Non-governmental organisations and members of civil society in Germany and abroad in particular criticise that the official ODA figures provide a distorted and embellished picture of development aid (Martens, J. 2008:49-51; Martens 2007). It also needs to be highlighted that net ODA figures do not represent the amount of money transferred to developing countries. The OECD developed a new financial measure for aid, country programmable aid (CPA), to designate real transfers, "fresh money". The gap between CPA and ODA can be considerable (OECD-DAC 2008c).

According to the most recent data available on the OECD website, the official development aid (ODA) reported to the OECD-DAC amounted to US$128 billion in 2006, US$126 billion in 2007 and US$158 billion in 2008 (OECD-DAC 2010c). In absolute terms the countries giving most aid are the USA, France, Germany and the United Kingdom. In relative terms Sweden, Norway, Luxembourg, Denmark and the Netherlands allocate the highest percentage of their GNP to development aid (BMZ 2009b). In 2008 Germany's net ODA was 9,692.9 million euros (0.38% of GNP), with about one third of this sum being transferred to multilateral institutions (BMZ 2009c; BMZ 2009e). Among the German government ministries participating in development aid the BMZ is the main donor with 5,217.9 million euros (53.8% of German ODA). The Foreign Ministry contributes 6.6% to German ODA, the Länder 7.1%. Other ministries contribute less than 1% to ODA (BMZ 2009d). In the Federal Government Budget for 2009, 5,813 million euros, about 2.45% of the federal budget, are allocated to the BMZ. Among all German government ministries the BMZ thus occupies the seventh rank in budget volume, preceded by the Ministry of Economics and Technology and followed by the Ministry of the Interior (REG online 2009).

There are 60 recipient countries of German development aid. The list of the countries benefiting most from German ODA reflects the diversity of motives for official development work. Just four countries received nearly 30% of total German ODA in 2007. Considering the five reasons for Germany to invest in development aid, the choice of countries receiving the most ODA is to a large extent guided by foreign and security policy, economic and environmental interests.

Table: Germany's strategic interests in development aid

Recipient country | ODA 2007 | % of ODA | % cum. | Germany's strategic interests and focus of cooperation, (BMZ 3, 2009; GTZ 1,
1. Iraq | | | 17.13 | Foreign and security policy, democracy, resources.
2. Cameroon | | | 23.55 | Democracy, environmental protection, former.
3. China | | | 26.36 | Economic relations, foreign/security policy, democracy, environment, renewable energy sources, poverty reduction.
4. Afghanistan | | | 29.02 | Foreign and security policy (terrorism and drugs), basic needs: water, energy; democracy.
5. India | | | 31.09 | Economic relations, foreign and security policy, environmental protection, poverty reduction.
6. Ethiopia | | | 33.07 | Poverty reduction, public administration.
7. Morocco | | | 34.84 | Economic cooperation, environmental protection, renewable energy sources, water.
8. Egypt | | | 36.59 | Foreign and security policy, economic interests, environmental protection.
9. Tanzania | | | 38.30 | Poverty reduction.
10. Pakistan | | | 39.99 | Foreign and security policy, fight against terrorism, poverty reduction, good governance.
11. Vietnam | | | 41.67 | Economic cooperation, foreign policy, poverty reduction.
12. Palestine | | | 43.23 | Foreign and security policy, poverty reduction, nation building.
13. Mozambique | | | 44.51 | Poverty reduction, formerly containment policy.
14. Bangladesh | | | 45.73 | Democracy, poverty reduction (health), environmental protection and promotion of renewable energy sources.
15. Serbia | | | 46.91 | Foreign and security policy, conflict resolution, democracy, economic cooperation.
16. South Africa | | | 48.05 | Foreign and security policy, economic interests, poverty reduction, renewable energy sources.
17. Uganda | | | 49.16 | Poverty reduction.
18. D. R. Congo | | | 50.25 | Foreign and security policy, poverty reduction.
Total: 50.25 100.49

Source: BMZ 2009f


In 2008 the distribution of ODA was more balanced, with only Iraq receiving a large share (13.3% of German ODA), followed by China, Botswana and Afghanistan receiving about 3% of German ODA each. As of 2009 the Democratic Republic of Congo and Iraq are no longer partner countries of the BMZ. Data about aid dependency is available for 128 developing countries. Of these 128 countries, 28 received aid amounting to more than 10% of their gross national income (GNI), and 11 countries received more than 20% of their GNI in aid. Liberia, Afghanistan, Burundi and the Solomon Islands receive over 40% of their GNI in aid (World Bank 2010e). In discussions about development aid it needs to be kept in mind that official development cooperation is only a fraction of the funds flowing into developing countries when compared to private donations, remittances that migrants living in industrialised countries send home, and private investments (Miller 2010:14; Martens, J. 2008:22-25). For some donor countries like the USA remittances can represent up to 60% of all financial flows to developing countries (Miller 2010:14).

2.4 The primary agents of development aid in Germany

Agents in development aid are traditionally the governments of developed countries, multilateral institutions representing developed countries and civil society organisations. In recent years businesses have become more important due to increased investments in developing countries and public-private partnerships. The institutional setting of development aid in Germany is complex. There is a large number of actors, complex relationships of funding and cooperation and a lack of strong coordination. According to the 2005 OECD-DAC peer review, German development cooperation is highly compartmentalised and a gain in efficiency is not possible given the complex structure (OECD-DAC 2005).8

8 In an effort to improve the national and international development cooperation of all DAC members, the OECD-DAC has been organising peer reviews of the national aid systems of its member nations since 1996 (OECD-DAC 2010e). The last peer review of the German aid system was in 2005 and the next one is planned for 2010.


[Figure: Actors of German aid. Source: BMZ 2008, Weißbuch]

In Germany the main actor in development aid in terms of budget and mandate is the German Ministry for Economic Cooperation and Development (BMZ). The BMZ finances development aid through different channels and coordinates development aid at the national level. Other responsibilities include supporting aid- and development-related research, safeguarding the overall coherence of government policies, supporting awareness-raising activities among the German public and coordinating policy at the international level. For the execution of development aid, the BMZ collaborates with five main subcontractors in Germany, which are partially or entirely owned by the German government.9 A large part of the German development aid budget is channelled through international organisations.10 While the funding aspect of this multilateral cooperation is important, the political coordination among donor nations and between donors and recipients of aid is paramount. The international discussion process on aid effectiveness, which will be described further below, is a good example of the weight of political coordination at the international level. In addition to governmental, quasi-governmental and multilateral institutions, a large number of civil society organisations are active in development aid.11 Many of the non-governmental aid organisations in Germany are financed partly by the German government and partly by member contributions, donations and endowments. Some larger civil society organisations have representatives in developing countries to execute and supervise projects and programmes; however, most organisations limit their activities to the funding of projects in developing countries and awareness raising on development issues in Germany.

9 The Kreditanstalt für Wiederaufbau (KfW) and the Deutsche Investitions- und Entwicklungsgesellschaft (DEG) execute programmes of financial cooperation. The Deutsche Gesellschaft für Technische Zusammenarbeit (GTZ), the German Development Service (DED) and InWEnt – Capacity Building International are contracted by the BMZ for programmes of technical cooperation. Currently over 5000 development workers are working on behalf of the German government in programmes of technical cooperation in developing countries.

10 Alongside other developed countries, Germany finances and supports many organisations of the United Nations (UNDP, WHO, UNICEF, UNEP, WFP, etc.), the World Bank (WB), the International Monetary Fund (IMF), the Organisation for Economic Co-operation and Development (OECD), regional development banks and the different development aid channels of the European Union.

11 The Association of German Development Non-Governmental Organisations (VENRO) brings together most of the non-governmental aid institutions and currently has 118 member organisations. Some of the largest civil society organisations in Germany are church-affiliated, like MISEREOR, Christoffel-Blindenmission e.V. and Evangelischer Entwicklungsdienst. Some are attached to political parties, like the Friedrich-Ebert-Stiftung and the Heinrich-Böll-Stiftung. Other major civil society organisations in development aid are the Deutsche Welthungerhilfe, Terre des hommes and Kindernothilfe.

2.5 Does aid work? – The challenge of aid effectiveness

Given five decades of development cooperation and the substantial amount of money invested, proponents and critics alike are asking "Does aid work?" Aid effectiveness is a vast area of contentious research, practically impossible to cover in one publication, let alone in one subchapter. The purpose of this section is therefore limited to sketching the key areas of discussion within the aid effectiveness debate and highlighting the relevance of this debate for the subject of this thesis, RIE.

There are two different levels within the aid effectiveness discussion. On a fundamental level the debate focuses on the question of whether aid is effective at all, whether it is without effect, whether it possibly has negative effects and whether policy fields such as trade, financial markets and immigration policy have much more impact on poverty reduction than aid. A second level of debate assumes that aid is an important lever of poverty reduction and that the challenge is to make aid more effective.

Development cooperation has been a policy field for about 50 years in most Western countries, and large amounts of money, expertise and goods have been transferred to developing countries. Nevertheless progress, especially in Africa, seems to be slow, and a large number of authors criticise the aid business for not achieving its goals (Nuscheler 2008:6-7). Highly critical publications on development cooperation are not new. Gunnar Myrdal (1984) and Brigitte Erler (1985) strongly influenced the aid debate in the 80s and 90s. In recent years former World Bank economist and university professor William Easterly (2006) pointed out the apparent failure of five decades of development aid to have a major impact on poverty or growth. Zambian economist Dambisa Moyo (2009) is the most prominent voice among the authors who point out the devastating impact of development aid on corruption and governance in Africa; her work sparked a heated international debate on the positive and negative effects of aid (Financial Times 2009). Economists of the International Monetary Fund (IMF) even reported a negative impact of development aid on growth because of the adverse effects of a high influx of money on wages and employment in labour-intensive and export sectors (Rajan/Subramanian 2005). Negative impacts of aid on institutions and on accountability in recipient countries have been confirmed by a number of studies (Barder 2009:3; Brautigam/Knack 2004; Knack 2001). In Germany, over 100 former diplomats, development experts, politicians, media representatives and academics, all with at times extensive experience in development cooperation, published an open letter in September 2008, the "Bonner Aufruf", denouncing the ineffectiveness of German development aid and calling for drastic changes in Germany's development policy (Neudeck et al. 2008). Kenyan economist James Shikwati and Ugandan journalist Andrew Mwenda fuelled the discussion in the German media by declaring that development aid not only aggravated but actually caused poverty in Africa (Nuscheler 2008:6).

Proponents of development cooperation responded to the criticism by pointing out successes in individual cases such as the fight against smallpox and polio and the Green Revolution in Asia.12 Researchers in development cooperation also warn against sweeping criticism of aid and urge differentiation among different aid recipient countries, different types of development cooperation and different concepts of "positive impact" (Nuscheler 2008:8).

12 "Aid has often worked. In many of the successes, improvements in outcomes such as vaccination rates were so dramatic, and their importance to people's welfare so obvious, that statistical analysis has easily confirmed their value. The World Health Organization, for example, led the successful campaign to rid the world of smallpox and the Pan American Health Organization did the same for eliminating polio from the Western Hemisphere. Donors also played an important supporting role in the Green Revolution. Twentieth-century advances in the manipulation of nature made both of these successes possible. Where science is less central, as in education, it is harder to find such spectacular achievements. The late sociologist Peter Rossi noted that '[i]n the social program field, nothing has yet been invented which is as effective in its way as the small pox vaccine was for the field of public health.'" (Roodman 2007:2-3)

Many publications in this decade treated the issue of aid effectiveness in terms of the impact of development cooperation on growth. A study by Burnside and Dollar (1997:33) found a positive impact of aid on economic growth for recipient countries with good governance and good policy environments. This study triggered a vigorous debate in scientific circles about the statistical methods used by Burnside and Dollar and the robustness of their findings (Deaton et al. 2006:52-57; Easterly 2008:15-17; Nuscheler 2008:23-24; Rajan/Subramanian 2005; Roodman 2007 and Roodman 2008). Banerjee and He do not find reliable evidence of a positive impact of aid on growth: "In the end, there is little that is reliable that we can say about donor performance. The most we can say is that we found no prima facie evidence of great effectiveness." (2008:56) David Roodman concurs and argues that macroeconomic analysis of aid effectiveness "has repeatedly offered hope and repeatedly disappointed" (2007:20). Instead, he argues, aid effectiveness should be assessed at the micro level by rigorous impact evaluations and qualitative case studies: "Attacking smaller, practical questions, such as about the effects in various contexts of microfinance or roads, is more likely to achieve what ought to be the primary purpose of studying aid effectiveness, which is to improve it." (Roodman 2007:21) The call for RIE is directly linked to the aid effectiveness debate and the hitherto unsuccessful attempts of aid agencies and academics to prove that aid on the whole is effective in reducing poverty. Nuscheler also challenges the aid-growth debate and points out that aid effectiveness, and thus development, cannot be reduced to economic growth and that aid effectiveness at the local and regional level may not be visible at the national level (Nuscheler 2008:25).

Not only critics but also proponents of development cooperation, representatives of aid agencies, aid professionals and politicians recognise that aid effectiveness is a problem. Official sources within donor and multilateral agencies already acknowledge that the Millennium Development Goals will not be achieved by 2015 (UN 2008).
The problem of aid effectiveness does not only concern bad governance and corruption in developing countries but also considerable deficiencies in the donor system (Nuscheler 2008:13; Easterly/Pfutze 2008). While extreme critics call for development cooperation to be abolished, some of its proponents call for structural reform along the lines laid out by the Paris Declaration and the AAA (harmonisation of donor activities, recipient country ownership, untying of aid13, etc.). However, as Barder points out, many weaknesses of the aid system have been known since the Pearson Report of 1969.14 "In principle these problems could be ameliorated if donors simply chose to live up to their existing commitments to improve the way that they give aid. The necessary steps have been identified and donors have committed themselves to change. Yet progress has been extremely slow." (Barder 2009:5-6) To address this apparent paradox and to advance the aid effectiveness debate, a number of researchers have proposed framing the aid system from the perspective of the principal-agent model of political economy. The principal-agent model represents important theoretical progress in the aid effectiveness debate and will be used later in this thesis to assess the possible contribution of RIE to aid effectiveness. The following section will therefore describe the principal-agent model applied to the aid system.

13 Tied aid is understood as the provision of grants or loans by donors which oblige the recipients to buy services and goods from the donor country.

14 In 1968 World Bank president Robert McNamara commissioned former Canadian Prime Minister Lester Pearson to review development cooperation in the 50s and 60s and develop recommendations to improve the aid system. The commission headed by Lester Pearson published its report in 1969 (World Bank 2010b).

2.6 Principal Agent Theory in development cooperation

2.6.1 The basic elements of the principal agent theory

The principal agent theory is one element of the agency theory of institutional economics and is based on the contract theories of the firm by Ronald Coase (1937). Agency theory studies contractual relationships between stakeholders in situations of conflicting interests and information asymmetry. It assumes a relationship in which one stakeholder, the agent, is contracted to act on behalf of another stakeholder, the principal. The agent usually has specific knowledge or competence, which is the basis for the contract. Agency theory further assumes that individuals are generally motivated by self-interest. Therefore the agent in a contractual situation is assumed to act first and foremost in his own interest and not automatically in the interest of the principal. The agency problem consists of the challenge for the principal to ensure that the agent fulfils his contract as much as possible in the interest of the principal despite the given information asymmetry.

If the principal cannot observe the agent at all times, for cost and other reasons, he may not know exactly to what extent the contract was fulfilled (hidden action). Furthermore, the agent may have context information that affects his contractual obligations and is unknown to the principal (hidden information). The possibility of hidden action and hidden information results in moral hazard: a certain degree of risk on the part of the principal that the agent may unduly benefit from the information asymmetry. In addition, the agent may be aware of characteristics of his own (competence, ability, quality of service) that are relevant for the contract and unknown to the principal (hidden characteristics). This third form of information asymmetry may result in adverse selection, where the principal would not have concluded the contract had he known these characteristics.

While there is no optimal solution for the principal in this model, the second-best option for the principal is to design a system of incentives to motivate the agent to act in the best interest of the principal. This system of incentives implies, according to the principal agent theory, possible costs for both the principal and the agent (agency costs) and is laid down in the contract. Principal agent problems exist in many relationships, for example between an employer and an employee, between a doctor and a patient or between the government and bureaucracies implementing policy. Michaelowa/Borrmann

2.6.2 Information asymmetry and conflicts of interest in development cooperation

In recent years the principal-agent problem and the structure of incentives have been applied to the aid effectiveness discussion by a number of authors (Barder 2009; Martens, B. 2008; Faust/Messner 2008; Michaelowa/Borrmann 2005 and Martens et al. 2002). The principal-agent problem is useful to describe the current aid system and to explain the hitherto unsuccessful efforts at reforming development cooperation. According to this model, the core problem of the aid system is its incentive structure: what donors and other stakeholders do and what they avoid doing in order to make the aid system more effective is inherently related to the given structure of incentives (Svensson 2008:330). In a principal-agent model of aid there is a chain of contractual relationships. There are a number of stakeholders (principals) that contract other stakeholders (agents) for their purposes. One stakeholder, e.g. an organisation, can be an agent and a principal at the same time if the organisation signs a contract with, e.g., a government ministry to provide certain services and in turn subcontracts part of the work to, e.g., an NGO.

The first main principal is the electorate in the donor countries, for example in Germany. The electorate delegates responsibility for development cooperation to the government, which in turn delegates it to the ministry and subsequent implementers. For simplicity, only the ministry and one implementer will be assumed here. To allow collective action, the government and the electorate jointly delegate decision-making to a common agent, the implementing agency. The implementing agency cooperates with a project in the recipient country (other forms of development cooperation will be neglected here for simplicity).


The project may be a local government department, a local NGO or an international agency. Ideally this project reports not only to the donor agency but also to the local population, which is the second main principal in the system (Faust/Messner 2007:2). While the populations in the donor and recipient countries are the primary principals, other stakeholders can be both principal and agent. The BMZ in Germany is an agent for the government and the electorate and at the same time the principal vis-à-vis the implementing agency GTZ. The principal could expect the agent to fully comply with the contract if they had the same interests. However, as described in chapter 2, this is not the case in development cooperation, just as in many principal-agent relationships. Principals and agents may have some shared interests but also conflicting interests. The electorate in donor countries generally approves of government spending on aid for the purpose of poverty reduction, the promotion of human rights and environmental protection. The strongest coherence of interests therefore is between the populations in donor and recipient countries, the two principals (Faust/Messner 2007:2; Martens, B. 2008:296). Other interests are political and economic. "In reality, however, objectives are likely to differ. Individuals pursue careers, incomes and their own preferences; they may but do not necessarily have a genuine interest in alleviating poverty. Politicians pursue political objectives, agencies pursue the perpetuation and expansion of their remit and budget, consultants seek the next contract, etc." (Martens, B. 2008:300; also Michaelowa/Borrmann 2005:3) Organisational interests are likely to differ from individual interests.
Governments, ministries, bilateral and multilateral donor organisations, implementing agencies and consultants "do not act only on behalf of the collective goals formulated but also have special interests of their own" (Faust/Messner 2007:2). Michaelowa and Borrmann present a detailed description of specific incentives for consultants, agency staff and agencies' management in the business of project evaluation (Michaelowa/Borrmann 2005). The commercial interests of private companies and the security interests of governments can also play an important role (Faust/Messner 2007:3; Barder 2009:4).


The problem of the principal is how to ensure that the agent complies fully with the contract according to the interests of the principal. Martens argues that the purpose of aid agencies is precisely to mediate between multiple interests, including their own institutional interests. "On the other hand, while these multiple principals have different objectives and are likely to give incoherent instructions to the agency, the agency may be in a position to make its own proposals, play off different interest groups against each other, forge coalitions in support of the policies it proposes, induce collective action among members and, in general, achieve objectives that individual members would not be able to achieve on their own (Martimort, 1991). Consequently, one important task of bilateral donor agencies is to mediate between donor interest groups at home." (Martens 2008:297) In some cases different interests can be accommodated; in other cases conflicting interests lead to suboptimal investment of aid monies (Barder 2009:3).

If information flows among the different stakeholders were perfect, the principal would know to what extent a contract has been fulfilled by the agent in accordance with the interests of the principal. Full information would expose non-compliance on the part of the agent. However, this is not the case. The aid system is characterised by large information asymmetries. In classic principal agent relationships, such as between an employer and an employee or between a retailer and a customer, the market can to some degree balance the information asymmetries. If the product bought by a customer is not worth its price, the customer may go to a different shop next time. If the employee does not do his work, the employer will find out eventually. In development aid the principal-agent relationship is hampered by large information asymmetries and a lack of accountability. Donors in development aid often don't know whether aid agencies provide good services on the ground. The geographical distance creates imperfect information for all stakeholders except the project management. Cultural differences and the need for sector-specific and country-specific expertise increase the information imbalance for the principal in the donor country. Low accessibility of financial and organisational information and language barriers increase the information asymmetry for the principals in the donor and recipient countries. "While domestic aid agencies redistribute income between donors and recipients living in the same political constituency, foreign aid agencies target recipients living outside the donor's constituency, usually in developing countries. Donors need to invest considerable money in improving the information asymmetries, for example by commissioning evaluations." High information asymmetries result in high transaction costs (agency costs) (Martens, B. 2008:292; Barder 2009:9-10). Information asymmetries cannot be removed entirely, with the result that agents will use their remaining room for manoeuvre in their own interests (Martens, B. 2008:286).

Apart from very imperfect information, the aid system also suffers from skewed accountability. Employers can sanction employees who don't work, customers can change their favourite retailer, citizens can vote against the current government. But recipients in developing countries have no mechanism to sanction aid that is not delivered properly. The feedback process in foreign aid is broken. Those who pay for the aid are not those who receive it, and the information flow between donors and aid beneficiaries is highly insufficient (Martens, B. 2008:285). The diversity of interests and the insufficient information do not only concern the vertical chain of delivery in aid (from the voter down to the beneficiary) but also horizontal cooperation (Barder 2009:9-10). Many development activities require collective action, for example if two donors cooperate with the health sector in Zambia. However, diversity of interests and competition for funds hamper coordination and cooperation and result in high transaction costs and suboptimal effectiveness. Faust and Messner cite a UNDP evaluation in which "turf-fighting" among donor organisations was described as being among the biggest obstacles to aid effectiveness in crisis countries (Faust/Messner 2007:4).


How the development agencies within the aid system accommodate the diversity of interests depends, according to the principal-agent model, on the incentives for the different agents to act in the interest of the primary principals, the populations in donor and recipient countries. Due to the very large information asymmetries, the incentives to act in the interest of the primary principals are relatively low (Faust/Messner 2007:3). "Taxpayers of donor countries normally have no possibility to get in touch with aid recipients. And foreign beneficiaries have no voting rights in donor countries and thus no political leverage on donor politicians. As a result, there is a broken feedback loop that induces a performance bias in aid programmes" (Martens et al. 2002:154-155; also Barder 2009:11.19-20, Michaelowa/Borrmann 2005:3.18). Citizens in the donor countries, too, have only limited information about development cooperation and limited possibilities to sanction an ineffective aid bureaucracy.

2.6.3 Obstacles to reform in development cooperation: Incentives

The different stakeholders have little incentives to meet the expectations of the two primary principals, the citizens in the donor and recipient countries. The principal agent relationships, the diversity of interests, competition for scarce resources and the information asymmetry within the aid system result in two prominent features of development cooperation: 1) The tendency on the part of the principals to increase control over their respective agents by procedural routines, focus on planning and control mechanisms. The goal of more control is to be able to report success to the higher 28

level principal (Faust/Messner 2007:5) 2) The tendency on the part of the agents to develop ever new approaches and present innovation in order to justify their existence and attract more funds. (Faust/Messner 2007:4) The focus on control processes and innovation does not however remedy the lack of information on the part of the primary principals, the citizens in donor and recipient countries. To the contrary, chances are that the ever increasing complexity of the aid system exacerbates the information assymmetry. The public in recipient and donor countries are less able to hold their governments, implementing agencies and subcontracted companies to account. (Faust/Messner 2007:5-6) The Paris Declaration and the AAA recognize that the reform within the aid system is necessary to increase aid effectiveness. However, the changes proposed by these two guiding documents (ownership, alignment, harmonization, managing for results and mutual accountability, predictability, transparency and reduction of aid fragmentation) are not in the short term interest of those who have the power to change the system, the donors and implementing agencies (Faust/Messner 2007:12-15). To grant more ownership to recipient country governments would result in reduced control for the donors and thus more uncertainty about outcomes and a threat to their need to proof success (Reinikka 2008:180). Subordinating donors interests to partner country interests or to the collective donor community is impossible as long as aid agencies need to balance the interests of a multiplicity of stakeholders (Bobba/Powell 2007:24). 
In addition, donors are only accountable to their own governments, and a negative reputation due to lack of cooperation is not much of a problem if there are many donors and if most other donors are equally resistant to cooperation (Barder 2009:10). Instead of discouraging donor fragmentation, the aid architecture sets incentives for every donor to „have a presence in every sector in every country“ (Barder 2009:11). The focus on management for results and mutual accountability increases the competition among donors and implementing agencies and may require painful internal change processes and streamlining. Likewise, predictability of funding limits the flexibility of the donor system, and transparency risks exposing inadequate processes and negative results. For many researchers donor fragmentation15 is among the most important problems in the aid system (Barder 2009:4; Faust/Messner 2007:15.18-19; Nuscheler 2008:14). From the perspective of a single donor it is clearly not in its interest to be made redundant. „At least as regards poverty reduction, peace building, and the promotion of democracy, implementing agencies are ultimately expected to provide an effective contribution to making themselves superfluous. (...) Instead, insights of modern organizational theory and evidence from the policy field clearly show that organizations – as collective actors – have a major and fundamental interest in securing both their own survival and the greatest possible autonomy for their actions.“ (Faust/Messner 2008:3) Therefore all stakeholders within the aid system seek to maximise aid flows (Martens 2008:296). So, while there is consensus about the type of reforms needed, donors and implementing agencies have very little incentive to implement them. This situation is reflected in the relatively slow implementation of the Paris Declaration (Barder 2009:6). For the case of Germany, a recent evaluation of the implementation of the Paris Declaration confirmed the existence of disincentives. „The following disincentives were reported: shortage of staff, aggravated by an additional workload as a result of increasing transaction costs due to the Paris Declaration, BMZ‘s fast staff rotation, complexity of the German aid system (requiring considerable coordination efforts), interfering political priorities, call for visibility of German aid contributions, and institutional self-interests.“ (Ashoff et al. 2008:12) Recipient countries could theoretically press for more vigorous reforms, but the given imbalance of power is not conducive to successful pressure (Nuscheler 2008:18).
Apart from that, there are also stakeholders within the recipient countries who benefit from the current situation, particularly from the lack of accountability vis-à-vis the beneficiaries (Nuscheler 2008:19; Faust/Messner 2007:3).

15 The term fragmentation refers to the high number of donors and implementing agencies in the aid system, each of which has its own interest in maintaining and possibly increasing its funding. Currently there are more than 40 bilateral donors, about 20 global and regional financing institutions, 15 United Nations agencies and a large number of global funds. In the health sector alone, there are over 90 global funds. Fragmentation results in very high transaction costs and lack of coordination (Nuscheler 2008:14).


Because of this apparent contradiction, Barder proposes market and network mechanisms to change the incentives within the system gradually. „We are reaching the limits to what can be achieved by better planning to improve aid. Greater use of market and network mechanisms, by contrast, would help to alter the incentives and political constraints faced by aid agencies. (...) In the long run, improvements in the aid architecture are likely to be the result of evolution not design. Reform should focus not on a grand new design of the aid system, but a set of technical and apparently innocuous reforms which, over time, create stronger political pressures for evolutionary improvements in the aid system.“ (Barder 2009:3)

2.6.4 Rigorous impact evaluation as one solution to aid effectiveness?

Many authors regard RIE as one solution to the problem of aid effectiveness. Some argue from a knowledge perspective that aid has not been effective until now because in many sectors it is not known „what works“ and what doesn't (Banerjee/He 2008:56). There is too little rigorous evaluation and too little testing of new ideas; successes are rarely taken to scale; failures are not learned from and abandoned; results from rigorous research are not systematically collected and communicated (Barder 2009:5). In this line of argumentation RIE improves the information flow from the project level to the donor level for the sake of learning. The other line of argumentation suggests that an improved evaluation practice as well as more transparency about development aid generally enhances the incentives for actors within the aid business to improve its effectiveness (Barder 2008:20; Banerjee/He 2008:56; Reinikka 2008:193). According to Martens et al., the broken feedback loop in development cooperation makes evaluations more necessary than in other policy fields (2002). While Faust and Messner point out that it is not only the rigour of evaluations that will improve aid effectiveness, but that the whole evaluation system needs to change (the independence of evaluations, the transparency of findings, the comparability, etc.), they also propose RIEs to improve accountability and change the incentives of stakeholders within the aid business (Faust/Messner 2007:20; Faust 2010). Nuscheler highlights the need for more transparency within the aid business and in evaluation in particular. Aid agencies need to prove their impact, he argues, although he does not explicitly call for RIE (Nuscheler 2008:36). From the perspective of the principal-agent-model the purpose of RIE is mainly to hold the stakeholders within the aid system to account. In the course of this thesis the aspect of learning and the aspect of accountability will be further discussed. The question of this thesis is whether RIE can indeed improve the information within the aid system, increase accountability to primary principals and change incentives, or whether RIE is just a clever tool to negotiate information flows and to sell „impact“. To put it more bluntly: is RIE just one more symptom of a self-referential donor system (Faust/Messner 2007:5) or is it a lever to change the system? To answer the question it is important to analyse how much room for manoeuvre stakeholders have to maintain the biased flow of information and the broken feedback cycle. The issues of „hard data“ and use of evidence are critical to determine this room for manoeuvre.

3 Evaluation

Rigorous impact evaluation (RIE) and statistical methods such as propensity score matching (PSM) are an important trend in the evaluation of development cooperation. In order to discuss PSM, RIE and their relevance for aid effectiveness, a general introduction to the field of evaluation is necessary. This chapter will present a brief overview of the theoretical and conceptual discussions in the field of evaluation as well as the key functions, classifications and standards of evaluation. To start with, the role of evaluation in development cooperation will be outlined.

3.1 Evaluation in development cooperation

Evaluations, evaluation research and evaluation as a profession are relatively new in many policy fields in Germany (Widmer et al. 2009; Konzendorf 2009:27). In the 1960s the first scientific evaluations took place in Europe, but evaluation played only a minor role in public policy until the 1990s with the movement for „New Public Management“ (Stockmann 2004a:25-26). The last two decades have seen an ever increasing interest in evaluating public projects and programs, the publishing of a wide range of methodological textbooks, the establishment of evaluation courses at universities and the creation of national and international evaluation societies (Stockmann 2004a; Widmer/Beywl 2009).16 Development cooperation has been among the first policy fields to commission evaluations on a regular basis (Stockmann 2004a:28). Already in the 1970s the German Ministry for Economic Cooperation and Development (BMZ) established an „inspection office“ responsible for evaluating the progress of development projects (Stockmann 2004b:375). Many governmental and non-governmental agencies in development cooperation have been involved in intensive evaluation debates for at least a decade and have reached a high level of institutionalisation of evaluation (Reade 2004:275; Zintl 2004:245-246). Most national and international development agencies beyond a minimum size have their own evaluation units and departments, and large multilateral organisations have created independent evaluation bodies, such as the Independent Evaluation Group (IEG) of the World Bank and the DAC Network on Development Evaluation of the OECD. The importance of evaluations as a management tool within development cooperation is illustrated by the fact that since 1997 the BMZ has commissioned three large-scale studies on how evaluations are done within the German aid sector. The goals of these studies were to improve and streamline evaluation practice and evaluation systems. The last of these so-called „Systemprüfungen“ was published in June 2009 (Borrmann/Stockmann 2009). The purpose of evaluation is to provide information for aid management on different levels and with different focuses. The goals of evaluation are captured by a statement of the IEG: „The goals of evaluation are to learn from experience, to provide an objective basis for assessing the results of the Bank's work, and to provide accountability in the achievement of its objectives. It also improves Bank work by identifying and disseminating the lessons learned from experience and by framing recommendations drawn from evaluation findings.“ (World Bank 2010c; BMZ 2010f) The level of professionalisation of evaluation is illustrated by the development of the OECD-DAC evaluation standards and principles for development cooperation as well as the DeGEval standards on a national German level for evaluation in general (DeGEval 2010). The key DeGEval standards are utility, feasibility, propriety and accuracy. The key OECD-DAC principles from 1991 are impartiality and independence, credibility, usefulness, the participation of donors and recipients of aid and the cooperation among donors in evaluation (OECD 1991:5.8).

16 The universities of Bonn and Saarbrücken in Germany and the university of Zürich in Switzerland have set up evaluation programs, and the Gesellschaft für Evaluation (DeGEval), created in 1997, attracts many new members every year.

3.2 Constituting elements of evaluation

Definitions of evaluation abound and different definitions highlight different aspects of evaluation.17 In 1991 Shadish, Cook and Leviton undertook one of the very few attempts at analysing the theoretical foundations of evaluation. In their work on evaluation theory Shadish et al. found a striking “imbalance in evaluation between the great attention to methods and the small attention to theoretical issues that guide method choice” (1991:34). In response, they did not offer a definition or a comprehensive theory of evaluation. Instead they proposed five issues that evaluation theory needs to address. Shadish et al. acknowledged that evaluators can be guided by very different theories, but each theory, they hold, needs to address these five issues:

• methodological design for data collection and assessment,
• value-based judgement of data,
• programme theory,
• credible knowledge and
• use of findings.

Although the publication by Shadish et al. is more than twenty years old, it is still frequently cited and can be regarded as one of the key references in evaluation theory. The definition of evaluation for this thesis will be elaborated in this section based on these five issues.

3.2.1 Data collection and analysis

Most authors highlight the systematic collection and assessment of data as one key element of evaluation. Michael Quinn Patton, one of the most widely cited authors on evaluation, describes evaluation in very concrete terms as "the systematic collection of information about the activities, characteristics, and outcomes of programs to make judgements about the program, improve program effectiveness, and/or inform decisions about future programming" (1997:23). Definitions of evaluation in German textbooks focus on the technical aspect of evaluation and stress its foundation in applied social sciences (Bortz 2008:96; Kromrey 2006:103; Stockmann 2004a:14) or describe it as a scientific approach to generate and assess knowledge based on the principles of empirical social research (Beywl 2009). Shadish et al. refer to the methodological design of evaluation as the „theory of practice“ (1991:57-64). The process of collecting and analysing data is the most visible component of evaluation and has thus received considerable attention from the evaluation community. The discussion about RIE is also to a large extent focused on methods, such as propensity score matching (PSM). However, as Shadish et al. point out, the design of evaluations and the choice of methods implicitly or explicitly touch upon other critical issues. “Evaluation theory”, they argued, “is about methods, but not just methods. To inform evaluators about choosing methods, it needs to discuss philosophy of science, public policy, value theory, and theory of use.“ (1991:31) These four issues will be discussed here briefly as relevant components of evaluation and as issues to be considered in the course of this thesis.

17 The term „evaluation“ will be used in this thesis to designate the process of data collection, analysis and judgement as well as the product of such a process. The term „evaluation research“ will refer to research about evaluation and about evaluation methods.

3.2.2 Value-based judgement

The assessment of the information collected is the second constituting element of evaluation, referred to by almost all authors. Peter Rossi and Howard Freeman, authors of a classical textbook on evaluation, define evaluation as the „systematic application of social research procedures for assessing the conceptualization, design, implementation, and utility of social intervention programs“ (Rossi/Freeman 1993:5). Donna Mertens' more abstract definition as “the systematic investigation of the merit or worth of an object (program) for the purpose of reducing uncertainty in decision making” is repeatedly cited in evaluation literature (1998:219 cited in Lee 2004:137). Because evaluation is about judgement, Shadish et al. highlight the role of values in evaluation. „But experience showed that it is impossible to make choices in the political world of social programming without values becoming salient in choices about evaluative criteria, performance standards, or criteria weightings. Evaluators will do a better job of judging the value of programs if they explicitly consider the questions (on values) (...).“ (Shadish et al. 1991:455) In the early years of the evaluation profession evaluators claimed to make value-free judgements by focussing on how well program objectives were met. But, as Lee points out, „programs themselves are designed around specific value decisions about what is important, and (that) the questions asked about programs will differ depending on the particular group and the stake they have in the outcome (...).“ (Lee 2004:154) This position is also shared by authors on impact evaluation. „Some of the main tasks of an impact evaluation are, therefore, to be clear about who decides what the right aims are and to ensure that the legitimate different perspectives of different stakeholders are given adequate weight.“ (Leeuw/Vaessen 2009:11) Evaluators may decide to advocate a given set of criteria, such as accepted professional standards, against which to judge a programme (prescriptive approach), or describe the values of the programme stakeholders and assess the programme according to these values (descriptive approach). Different strategies to define criteria for judgement and their respective advantages and disadvantages are discussed in the literature (Shadish et al. 1991:455–463; also Beywl 2009).

3.2.3 Programme theory

According to Kromrey it is the foundation in sociological theory that distinguishes empirical social research from a random collection of isolated data. It is social theory that renders social empirical research systematic (Kromrey 2006:52). Evaluation being one form of empirical social research, it is or should be based on sociological theory (Caracelli 2004:175). In the context of evaluation the relevant theory is how a programme is supposed to produce the desired social outputs or impacts. „Program evaluation assumes that social problem solving can be improved by incremental improvements in existing programs, better design of new programs, or terminating bad programs and replacing them with better ones. If these conditions do not hold, evaluation cannot achieve its purpose. Theories of social programming, therefore, must show if and how these things can be done.“ (Shadish et al. 1991:37) Theories of social programming spell out in what way, how and to what extent a social programme in a given context will produce the desired outcome. Evaluations are often commissioned to determine whether or not this theory holds (Leeuw/Vaessen 2009:15). In current evaluation publications the theoretical underpinning of a programme is often referred to as the theory of change, and there is a large consensus among practitioners and theorists alike that evaluation needs to be based on a theory of change (ibid.; Baker 2000:12-13; Caspari/Barbu 2008:17). The assumed theory of social programming will guide the methodological design of an evaluation and have a strong impact on which methods are considered appropriate.18

3.2.4 Credible knowledge

The product of evaluators is credible knowledge. Particularly in the political context of policy decisions the credibility of evaluation findings is a key issue (Caracelli 2004:178). “The two roots of evaluation are information gathered according to some plan which ensures its credibility and objectivity, and a process by which value is assigned.” (Lee 2004:137) Credibility hinges on the methodological design of an evaluation, which, according to Shadish et al., is closely connected to the underlying theories of knowledge. Depending on theoretical positions with regard to the nature of reality (ontology) and the nature of knowledge (epistemology), evaluators will favour different concepts of causality and choose different methodological designs (Lee 2004:150; also Patton 1997:265-297). In the last decades there have been fierce debates within the evaluation profession about which kind of knowledge is credible. This debate, the 'paradigm-war', reflects to a large extent the discussion about rational positivism and constructivism and revolves around the question „to what extent social research can capture the nature and complexity of reality and whether, indeed, an objective reality exists.“ (Lee 2004:150; see also Caracelli 2004:179)19

Within this ontological and epistemological debate, the possibility of evaluating a project objectively is a key issue. Can an evaluation produce an objective judgement or do results always reflect whose questions are asked, which values are assumed and whose interests are promoted (Caracelli 2004:180)? Most authors on evaluation would contend that the war of paradigms has to a large extent subsided and the profession has reached a common ground of promoting a variety of methods depending on the issues at hand (Stockmann 2004a:21; Lee 2004:152; Patton 1997:291). Concerns formulated by constructivist positions about power, interests and values are often taken into account, and evaluation is often acknowledged as a highly political activity (Stockmann 2004a:19). In the RIE debate, too, this position is frequently advocated and the mixed-methods approach is frequently recommended in publications on RIE (White 2008; Leeuw/Vaessen 2009:xv). While Caspari and Barbu (2008:6) explicitly highlight that RIE is not restricted to quantitative approaches, the RIE examples cited refer almost exclusively to predominantly quantitative studies. Jones et al. point out that an intense and highly polarised epistemological debate surrounding RIE is still ongoing (2009:3-4) and that many publications implicitly assume that credible knowledge first and foremost is knowledge developed using experimental and quasi-experimental methods, with qualitative analysis being an add-on.20 Preference for one methodological tradition also determines which methodological schools are funded. Jones et al. even suggest that methodological preferences may result in funding less complex projects, which are more amenable to RIE methods (2009:31.35). So, even if mixed-methods approaches are promoted, most proponents of RIE regard experimental and quasi-experimental designs using counterfactuals (for explanations see below) as a gold standard.

18 Leeuw and Vaessen distinguish between two broad approaches on how to formulate program theory (Leeuw/Vaessen 2009:19). The first approach is to build a „causal story“ based on existing theories and describe how an intervention produces results. The second approach is to explicitly use a social theory as a benchmark to formally test theoretical assumptions. In this case the theory also largely determines the choice of methods. „This approach is typically applied in statistical analysis but is not in any way restricted to this type of method.“ (ibid.)

19 The scientific measurement paradigm assumes the existence of only one objective truth which can be detected and described by the appropriate measures. A common procedure of evaluations following this positivist paradigm is to describe objectives of a social program and measure to what extent these objectives have been achieved. While there are a number of different theoretical approaches based on this paradigm, common major concerns for evaluators using this paradigm are to establish a causal link between a program and outcomes and to gain credibility through objectivity, reliability and validity (Lee 2004:144-145; Stockmann 2004a:20; Kromrey 2006:104-105). Contrasting with this scientific paradigm of social sciences is the constructivist paradigm, which is also called the naturalistic or the interpretive paradigm (Lee 2004:151). Constructivism is a large body of theory from a range of different fields. In the context of the discussion of RIE in development aid evaluation the key issue of constructivism is the assumption that there is no objective truth. Instead, reality is a personal construction of each individual and can become a shared construction, but it is never an objective reality independent of personal viewpoints, interests and values (Shadish et al. 1991:43; Naudet et al. 2009:26).

20 „Debates over impact evaluation reflect the more general debate over the relative roles of qualitative and quantitative methods in social research. Participatory impact evaluation grew rapidly in the 1980s, and is still going strong especially amongst non-governmental organizations (NGOs). The proponents of the participatory approach are skeptical of the econometricians’ attempts to reduce the impact of complex social interventions to a single number. But the econometricians reject analyses which fail to build on a well-designed sample of project and comparison groups which allow statements to be made with a degree of scientific confidence about the behavior of indicators with versus without the intervention. The reports and literature from these different approaches are in general developing in parallel, with rare attempts at dialogue to establish common ground let alone methodological fusion.“ (White 2006:2)

3.2.5 Use of evaluation findings

One constituting element of evaluation is its use. „Society justifies large funding for evaluation partly expecting that some more or less immediate payoffs will accrue. If evaluation is not useful, those funds could be used for more programs, for reducing the deficit, or for other alternatives that might yield more immediate results.“ (Shadish et al. 1991:52) While evaluation makes use of scientific methods, the aspect of immediate use is often highlighted as the key difference between fundamental research and evaluation (Stockmann 2004:14). Fundamental research does not follow clear objectives and the potential uses of knowledge are not defined. The practice of evaluation originated in social politics (Bortz 2008:96; Lee 2004:138; Patton 1997:10) and is designed to provide knowledge for a range of specific uses and functions external to scientific interests (Stockmann 2004:14). „In utilization-focused evaluation, the primary criterion by which an evaluation is judged is intended use by intended users.“ (italics in original text, Patton 1997:63) Unlike in fundamental research, data collection in evaluation is designed to answer specific questions, in a given time frame, for clearly defined users. Different theorists emphasize different uses of evaluation, but there is a general agreement that different functions of evaluation findings are possible and legitimate. Even though „use“ is one constituting element, if not the key element of evaluations, there is controversy in the literature regarding the degree to which evaluation findings are actually used. One of the early theorists highlighting the critical issue of use and non-use was Carol Weiss. „Weiss's use of decision theory was recognition that there is a general resistance of social programs to change, even when there is good information about what needs to be changed provided by evaluation. This happens not only because those working in the program are invested in its survival, but because the social-decision making process is notoriously slow and dependent on political, rather than rational, processes.“ (Lee 2004:158) Given the substantial investments and the important policy goals at stake in evaluation, this point is critical for this thesis and will be considered in detail further below.

3.3 Definition of evaluation

In the preceding section five constituting elements of evaluation theory have been discussed. Based on these elements the following working definition of evaluation is proposed for this thesis: Evaluation is the systematic collection and analysis of data and its value-based judgement. The evaluation design is based on given theories of social programming and of knowledge construction. The purpose of evaluations is the use of their findings in social policy.

3.4 Concepts and Classifications of evaluation

Definitions of evaluation as well as underlying conceptual paradigms are closely linked to its perceived functions. While authors in the literature reviewed may highlight different functions of evaluation to a varying degree, there seems to be a general consensus that there are three key functions of evaluation: conceptual learning and decision making, instrumental learning and decision making, and political use of evaluations (Bortz 2008:97-98; Stockmann 2004:37; Kromrey 2006:103-104; Patton 1997:76; Caracelli 2004:176; Rossi/Freeman 1993:34-36; Shadish et al. 1991:52; Naudet et al. 2009:6). Evaluations with a conceptual function are primarily used to accumulate general knowledge about development cooperation in a certain sector, in a certain region or using a certain approach. Evaluation findings in this context will be used to further theory and possibly to influence mid-term or long-term strategic decisions in an area of work. Evaluations have an instrumental purpose if their findings are supposed to influence short- and mid-term decisions in a specific project or possibly in a specific donor agency's department (Forss/Bandstein 2008:12). The key difference is that evidence for instrumental purposes is not accumulated but used for the project being evaluated. Most evaluations are commissioned for instrumental use. Many evaluations are also used for political purposes. Political purposes can range from legitimizing public aid funding generally or policy decisions specifically, through using evaluations as ammunition in power struggles, to a ritualistic use that creates an image of evidence-based and rational policy making (Forss/Bandstein 2008:11; Lee 2004:144; Rossi/Freeman 1993:48; Shadish et al. 1991:53). Stockmann (2004:18-19) acknowledges the existence of such evaluations but classifies these “ornamental evaluations” as pathological aberrations of evaluation. However, recent research by Boswell (2009) on the political use of expert knowledge demonstrates how political functions may be more than aberrations and indeed play an important role in government policies in the context of non-use of evaluation findings. While it is possible and common practice to combine several functions in one evaluation design, the dominant function of an evaluation will have consequences for its design, its sampling, the degree of stakeholder involvement, the choice of methods and its subsequent use (Stockmann 2004a:17). Evaluations are often classified relative to the point in time at which an evaluation is performed (ibid.). Ex-ante evaluations of a proposed program aim to inform the conceptualisation of a program by studying its overall conditions and theoretical assumptions. The objective of ex-ante evaluations is to assure a high quality of a program concept. On-going evaluations are done to assess a program during its implementation phase. The objective of on-going evaluations is to guide the program management, to point out weaknesses and to increase the program performance. Another term for these evaluations is “formative evaluations” (Bortz 2008:109). Most evaluations in development cooperation are formative evaluations.
After the completion of a program an ex-post evaluation may be commissioned in order to assess the overall value of a program in relation to its espoused goals as well as unintended impact (Stockmann 2004a:16). Such an evaluation is often called a summative evaluation (Bortz 2008:109; Naudet et al. 2009:31). Both formative and summative evaluations can focus more on processes or on results. Evaluations that look beyond immediate output (e.g. bridge is built or children are vaccinated) to assess mid- and long-term impact are called impact evaluations. In recent years the evaluation of impact has dominated the debate about evaluation of development cooperation (Reade 2004:274-276; Zintl 2004:251). A different classification of evaluation designs is the distinction between self-evaluations and external evaluations. Self-evaluations often have a strong learning focus, while external evaluations are often commissioned to serve control or legitimization needs. Evaluations generally are focused on projects, programs or approaches. In some cases, such evaluations are used to perform a meta-evaluation. Meta-evaluations use comparative analysis to gain a deeper and more general knowledge on theoretical concepts or program approaches. Evaluations focussing on impact have been done for many years. They can be participatory or external, and based more on quantitative or on qualitative tools. The subject of this thesis is rigorous impact evaluations. While in the literature reviewed there is no commonly agreed distinction between impact evaluations and rigorous impact evaluations, there are a number of criteria which are used to designate rigorous impact evaluations. These criteria will be discussed further below.

3.5 The call for rigorous impact evaluations in development cooperation 3.5.1 Evidence-based policy Rigorous impact evaluation is one element of a larger trend, particularly in the UK and the USA, towards evidence-based policy in politics (Rieper et al. 2009:11-26; Naudet et al. 2009:7.27). Originally developed in the field of medicine, evidencebased policy refers to a process of policy formulation largely informed by scientific research. According to a document on evidence based policy published by the UK Cabinets Office in 2001 evidence-based policy requires a review of existing research, commissions of new research, the use of expert consultancy and the comparison of a wide range of properly costed and appraised options (UK CO 2001:14). Policy is supposed to be based on rigorous and credible research indicating which policy approaches will produce the best results. Evidence about 42

what works and what doesn't becomes a key criterion for policy. „We will improve our use of evidence and research so that we understand better the problems we are trying to address. We must make more use of pilot schemes to encourage innovations and test whether they work. We will ensure that all policies and programmes are clearly specified and evaluated, and the lessons of success and failure are communicated and acted upon. Feedback from those who implement and deliver policies and services is essential too.“ (UK CO 1999) A very similar striving for evidence-based policy and rigorous evaluations is demonstrated in a White House Memorandum of 2009 (USA EOP 2009): „Rigorous, independent program evaluations can be a key resource in determining whether government programs are achieving their intended outcomes as well as possible and at the lowest possible cost. Evaluations can help policymakers and agency managers strengthen the design and operation of programs. Ultimately, evaluations can help the Administration determine how to spend taxpayer dollars effectively and efficiently - investing more in what works and less in what does not.“ In the USA a non-governmental organisation, the Coalition for Evidence-Based Policy (USA CEBP 2010), has been created with the sole purpose of promoting evidence-based

policy. More than other policy fields, development cooperation is often the object of harsh criticism. The public debate about aid is informed by highly critical publications such as „Dead Aid“ by Dambisa Moyo (2009), and development cooperation is often perceived as being without effects and marred by corruption. “(...) many outside of development agencies believe that achievement of results has been poor, or at best not convincingly established. Many development interventions appear to leave no trace of sustained positive change after they have been terminated, and it is hard to determine the extent to which interventions are making a difference.“ (Leeuw/Vaessen 2009:xx) Given this critical perception of development cooperation and the need to justify public spending on aid, the concept of evidence-based policy has become widely accepted in the sector of development cooperation. The trend for evidence-based policy has been translated into the aid effectiveness debate as „Management for Development Results“


(MfDR) and as a call for more rigorous impact evaluations (Jones et al.:3).21 „The MfDR approach embodies generally accepted tenets of good governance – setting clear objectives, evidence-based decision making, transparency, and continuous adaptation and improvement. (...) The focus on results makes MfDR central to the entire aid effectiveness agenda.“ (OECD-DAC 2008b; see also Ito et al 2008:1).

3.5.2 International initiatives promoting RIE

If evaluation is a “growth industry” (Leeuw), rigorous impact evaluation (RIE) seems to be its hottest product at the moment (Bamberger/Kirk 2009:4). According to the OECD-DAC Principles for Evaluation of Development Assistance the assessment of impact is one of the five relevant criteria that should be considered in evaluations (OECD-DAC 2010b:13).22 In 2004 an influential research institute in the USA, the Center for Global Development (CGD), convened the Evaluation Gap Working Group consisting of high-profile researchers to investigate the reasons for the lack of RIE. In 2006 this Working Group published a document vigorously arguing for more RIE and for structural changes within development cooperation to facilitate a rise in RIE: „Addressing this (RIE) gap, and systematically building evidence about what works in social development, would make it possible to improve the effectiveness of domestic spending and development assistance by bringing vital knowledge into the service of policymaking and program design.“ (Savedoff et al. 2006:1)

In 2006 bilateral and multilateral development agencies created the “Network of Networks for Impact Evaluation” (NONIE) in order to foster more and better impact evaluations by its members. Again, the underlying claim is that RIE will contribute to aid effectiveness. The proponents of RIE „believe that the ultimate reason for promoting impact evaluations is to learn about 'what works and what doesn't and why' and thus to contribute to the effectiveness of future development interventions.“ (Leeuw/Vaessen 2009:xxi)

In 2007 the Center for Global Development, development agencies such as the Millennium Challenge Corporation, the Bill & Melinda Gates Foundation, the Hewlett Foundation and representatives of developing countries created the International Initiative for Impact Evaluation (3IE) in order to „channel funds to high-quality, independent impact evaluations around key questions that confront policymakers“ (CDG 2007; Naudet et al. 2009:28). The World Bank shares the Center for Global Development's call for more RIE and created the Development Impact Evaluation (DIME) initiative. This initiative aims at increasing the number of RIEs commissioned by the World Bank in selected policy fields, training World Bank staff in designing RIEs and promoting systematic learning from RIEs (World Bank 2010d).23

A number of other international networks and organisations promote RIE. The International Organisation for Cooperation in Evaluation (IOCE), a network of national evaluation associations, facilitates methodological exchange. The Poverty Action Lab at the Massachusetts Institute of Technology evaluates poverty programmes using randomized designs, disseminates findings and formulates policy recommendations. The Campbell Collaboration primarily offers systematic reviews of rigorous quantitative studies to inform policy (Savedoff et al. 2006:58). One of the recommendations of the Systemprüfung 2009 in Germany encourages the use of more RIE in German development cooperation (Borrmann/Stockmann 2009:167). Other researchers in aid effectiveness also highlight the importance of RIEs (Easterly 2006; Faust 2007; Nuscheler 2008; Reinikka 2008; Pritchett 2008; Easterly 2008; Barder 2009).

21 One early example of this trend is again from the UK. In development cooperation the British Overseas Development Institute (ODI), a renowned government-funded research institute, created an evidence-based policy in development network (UK ODI).

22 The other criteria are relevance, effectiveness, efficiency and sustainability.

3.5.3 Arguments in favour of RIE?

„In the United States no one can market a prescription medicine for male pattern baldness without evidence it is 'safe and effective.' (...) Yet the nonprofit market is flooded with a continual new stream of proposed programs and interventions. Few public sector actions, even those of tremendous importance, are ever evaluated to the standard required of even the most trivial medicine.“ (Pritchett 2008:122)

As described above, there are a number of possible evaluation purposes and functions, each requiring different designs. The concept of MfDR outlined above does not explicitly demand RIEs. However, the third of its five core components is „monitoring and evaluating whether the resources allocated are making the intended difference“ and thus implicitly supports the trend for more RIEs (OECD-DAC 2008b).24 The key argument for the promotion of RIEs is that RIEs help identify what works in development cooperation. „For development practitioners, impact evaluations play a key role in the drive for better evidence on results and development effectiveness. They are particularly well suited to answer important questions about whether development interventions do or do not work, whether they make a difference and how cost-effective they are.“ (Leeuw/Vaessen 2009:ix; Teller 2008:3) Better knowledge, the assumption is, will improve aid effectiveness. „The value of impact evaluation is best understood as part of a broad scientific enterprise of learning, in which evidence is built over time and across different contexts, forming the basis for better policymaking and program design.“ (Savedoff et al. 2006:13) Existing studies on impact, Savedoff et al. argue, do not provide robust information about what works and what doesn't.

Apart from learning, proving impact is also regarded as essential to legitimate public spending and to ensure a continued commitment of governments to finance development cooperation. “Decision makers need better evidence on impact and its causes to ensure that resources are allocated where they can have most impact and to maintain future public funding for international development. The pressures for this are already strong and will increase as resources are scaled up for international development.” (Leeuw/Vaessen 2009:xxi) The need to prove impact is particularly relevant in the context of calls for increased aid funding.

23 Priority policy fields for DIME are different areas in education (early childhood development, service delivery and conditional cash transfer), health programs focusing on HIV/AIDS and malaria, rural infrastructure and youth employment. Until now 45 RIEs are published on the DIME website.
„In the current environment, calls for increased aid spending are only credible if it can be shown that current spending is indeed contributing toward the attainment of the Millennium Development Goals.“ (White 2006:1)

Visualisation of causal chain of RIE a) for learning and b) for accountability?

3.5.4 Causality and Certainty

There is a relatively large consensus among evaluation authors about what impact evaluation is.25 “Impact evaluation is one form of evaluation that assesses the net effect of a program by comparing program outcomes with an estimate of what would have happened in the absence of the program.” (e.g. Caracelli 2004:185; Rossi/Freeman 1993:215-222) Definitions of RIE usually consist of two key elements: attribution and counterfactual (Caspari/Barbu 2008:10; Baker 2000:2).26

Attribution refers to establishing a clear link between the impact and a project, e.g. a change in the income level of a community that can be clearly and exclusively attributed to a micro-finance program. Attribution seeks to isolate the added value of a program, the net impact. To determine the net impact of a program, RIE needs to identify a clear causal link between program input and perceived output and impact. Causality, therefore, is central to RIE (Leeuw/Vaessen 2009:21-23).27 In RIE, general statements about a program contributing to e.g. increased income in a given community are not sufficient; instead, the goal is to quantify the extent to which a program caused an increase (or decrease) in income.

The second element of RIE, the counterfactual, is related to the concept of attribution and is supposed to capture what would have happened if the program had not been executed. The „factual“ is the program situation. The counterfactual refers to the „without program“ situation (Leeuw/Vaessen 2009:ix; also White 2006:3; Duflo/Kremer 2008:94). Comparing the impact of a program with the impact of not having the program is geared towards attaining certainty that the perceived outcome and impact are really attributable to the program. In practice the comparison between treatment group and counterfactual is faced with two major problems: contamination and sample selection bias, which will be discussed further below (White 2006:3). The use of statistical tools is generally held to be suitable to determine the degree of certainty of an established causal link between a program and perceived impact. „OVE accepted that to determine 'what works and what does not' requires a quantitative approach, and within the quantitative approach, accepted the emerging consensus of a hierarchy of empirical evidence.“ (Ruprah 2008:6)

24 Impact according to the OECD-DAC is defined as „the positive and negative, primary and secondary, long-term effects produced by a development intervention, directly or indirectly, intended or unintended.“ (OECD-DAC 2009c)

25 In recent years there has been a substantial number of publications on RIE. Most of these publications focus either on the need for RIE (When will we ever learn) or on the feasibility and reliability of statistical methods. In particular, issues of random sampling, the construction of control groups in quasi-experimental studies and sampling bias are dealt with. Another thread of the discussion elaborates on how to do RIE in real-life situations, how to overcome institutional blockages against RIE and how to entice donors to do more RIEs.

26 The Independent Evaluation Group of the World Bank (IEG) explicitly talks about „rigorous“ impact evaluation, which implies the use of appropriate technical procedures (White, IEG, 2006, 2). What these appropriate methods are is contentious. White refers to the longstanding debate between participatory and mostly qualitative impact evaluation on the one hand and scientific, mostly quantitative evaluation on the other, and advocates mixed methods to obtain the most rigorous results.

27 Gangl and DiPrete highlight the central role of causal inferences in social sciences as the crucial link between theoretical modelling and theory-based empirical social research (2004:397).

4 Propensity Score Matching for hard data

Having looked at the aid effectiveness debate in chapter 2 and the role of evaluations in accountability and learning in chapter 3, this chapter will analyse the potential of RIE to increase aid effectiveness on a methodological level, using the example of propensity score matching (PSM). The rigor of RIEs is to a large extent linked to statistics, so this chapter will be about statistics. However, the purpose of this chapter is not to provide a statistics textbook; therefore some technical details will be ignored. Instead, the ambition of this chapter is to shed some light on the quality of evidence PSM and other RIE can and cannot give.

4.1 Theoretical basis for causal analysis

Social policies, for example in development cooperation, are often designed to bring about behavioural changes in people. To assess whether these policies are successful, researchers analyse whether changes in behaviour have been caused by the policies designed for this purpose. However, behaviour is always subject to the intentions of individuals and is only partially observable. Therefore researchers, e.g. in development cooperation, are faced with the challenge of determining with certainty the effects caused by a given policy. To meet this challenge, the Rubin Causal Model (RCM) has come to be the accepted model to analyse causality in empirical social sciences.28 (Gangl/DiPrete 2004:397) All matching approaches, including Propensity Score Matching, are based on the RCM, which therefore will be described briefly. This description will follow the article of Gangl and DiPrete (2004).

28 For an overview of philosophical theories on causality and their relevance to sociological research see Winship/Sobel (2001:6-12).


4.1.1 The Rubin Causal Model

There are three generally accepted criteria for causality underlying the RCM: a) the cause has to precede the effect in time, b) cause and effect have to covary without the covariation being caused by another factor and c) the effect has to be triggered by the cause. Based on these criteria of causality the RCM defines causality as the difference between the factual and the counterfactual. In the RCM two situations Y are compared in relation to an influencing factor T, for example a social policy. In one situation T is present (T = 1), in the other situation it is not (T = 0). Thus, the situations are called Y1i and Y0i respectively; one is the factual, the other the counterfactual. The key assumption of the RCM is that exposing a group of people to a treatment (e.g. a social policy) will produce different results than not exposing this group of people to this treatment (Winship/Morgan 1999:662; Winship/Sobel 2001:12-19). In the RCM a causal effect (also called the unit effect δ) is the difference between these two situations with respect to the parameter of interest, for example the difference between the income level of a target group after a micro-credit program and the income level of the same target group without a micro-credit program.

Given this model of causality the challenge for social empirical research is obvious: it is not possible to observe the factual and the counterfactual at the same time for the same group of people (the treatment group). Rosenbaum and Rubin describe this challenge as a problem of missing data (Rosenbaum/Rubin 1983:41; Winship/Sobel 2001:3). The solution is to find another group of people (the control group) that is equal to the treatment group in everything but the treatment and to estimate the effect of not being exposed to the treatment T. So in order to determine a causal link (the unit effect δ) empirical social research needs to construct a counterfactual estimation (Gangl/DiPrete 2004:402).
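The missing data problem described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions: all numbers, the function names `simulate_units` and `observed_outcome`, and the use of random assignment to treatment are hypothetical choices made for illustration and are not taken from the literature cited here. In a simulation both potential outcomes are known, so the true average effect can be compared with the estimate obtained from observed outcomes alone.

```python
import random


def simulate_units(n, seed=0):
    """Create hypothetical units for which BOTH potential outcomes are known."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        y0 = rng.gauss(100.0, 10.0)   # outcome without the program (Y0i)
        delta = rng.gauss(5.0, 2.0)   # unit effect; only knowable in a simulation
        t = rng.random() < 0.5        # assumption: treatment assigned at random
        units.append({"y0": y0, "y1": y0 + delta, "t": t})
    return units


def observed_outcome(unit):
    """In real data only one potential outcome is observed; the other is missing."""
    return unit["y1"] if unit["t"] else unit["y0"]


units = simulate_units(10_000)

# True average effect: requires both potential outcomes for every unit.
true_ate = sum(u["y1"] - u["y0"] for u in units) / len(units)

# Estimate from observed data: the control group stands in for the counterfactual.
treated = [observed_outcome(u) for u in units if u["t"]]
control = [observed_outcome(u) for u in units if not u["t"]]
estimate = sum(treated) / len(treated) - sum(control) / len(control)

print(round(true_ate, 2), round(estimate, 2))
```

Because assignment is random in this sketch, the control group is on average equal to the treatment group, and the estimate lies close to the true average effect. With non-random assignment this would not generally be the case.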

4.1.2 Basic assumptions for causal analysis based on the RCM

To be able to establish a causal link between a factor T and a situation Y, three basic assumptions need to hold. These assumptions will be illustrated using the following formula, which will be explained in detail below.


$$\underbrace{Y_{1i,\, i \in E} - Y_{0i,\, i \in C}}_{1} = \underbrace{\delta_T}_{2} + \underbrace{(Y_{0i,\, i \in E} - Y_{0i,\, i \in C})}_{3} + \underbrace{(\delta_{i \in E} - \delta_{i \in C})}_{4}$$

1) $Y_{1i,\, i \in E} - Y_{0i,\, i \in C}$: The difference between the situation Y of an individual (i) belonging to the treatment group (E) and an individual (i) belonging to the control group (C)

2) $\delta_T$: The supposedly true effect of factor T, for example the participation in a microfinance program

3) $(Y_{0i,\, i \in E} - Y_{0i,\, i \in C})$: The difference between the initial situation of the treatment group and the initial situation of the control group, i.e. before the program

4) $(\delta_{i \in E} - \delta_{i \in C})$: The difference between the potential effect δ on an individual belonging to the treatment group and the potential effect δ on an individual belonging to the control group

1) The first element of the formula is the unit effect (the effect caused by e.g. a program). The other three elements represent three conditions that have to be met in order to establish a causal effect.

2) To identify a causal effect in the RCM it is assumed that there is an effect caused by the treatment and nothing else, a „true“ effect (δT). This effect needs to be stable over time and not influenced by interdependencies between the control group and the treatment group.29 This basic assumption is called the stable unit treatment value assumption (SUTVA). Causal analysis based on regression and matching techniques requires that this assumption holds. In practice, however, it is often violated due to interdependencies or to major changes in the environment (e.g. the macro-economic situation). Ignoring this first assumption invalidates a causal estimation. “When such complex effects are present, the powerful simplicity of the counterfactual framework vanishes.” (Winship/Morgan 1999:663)

29 One possible undesirable interdependency could be the success of a program for the treatment group influencing the control group (Rosenbaum/Rubin 1983:41). For example the increased access of farmers in the treatment group to credit could increase the demand for and thus the price of land and thus negatively influence the other farmers who do not get a credit.

3) A second critical condition for causal analysis is that the treatment group and the control group are in fact equal in all relevant aspects except the treatment (T). This condition is expressed by the third element of the formula: $(Y_{0i,\, i \in E} - Y_{0i,\, i \in C})$. The difference between the initial situation of the treatment group and the initial situation of the control group should ideally be zero (Winship/Morgan 1999:661). If there is an initial difference between the two groups, there is a bias in the subsequent estimation. This possible source of bias is called baseline difference (Winship/Sobel 2004:503). This second condition or required assumption is called unit homogeneity. In order to determine whether unit homogeneity is given, the observable covariates are analysed.

4) The third condition for causal analysis refers to the potentially different effect of factor T on the treatment group and the control group (differential treatment effect) and touches on the selection problem. Ideally, there should be zero difference between the effect of a program on the treatment group and the potential effect on the control group. This condition is expressed in the fourth term

$(\delta_{i \in E} - \delta_{i \in C})$. In non-experimental data the assignment of individuals to either the treatment or the control group is not random. The effects of T on the actual treatment group may thus be different from the hypothetical effect of T on those individuals that were not selected. For example, in development projects individuals may be chosen to participate in a microfinance project because of certain characteristics or they may decide themselves to participate. The effect of e.g. a microfinance project will often partly depend on these selection criteria. The necessary assumption for causal analysis is that there is no such difference of effect in the empirical data. This assumption is called the conditional independence assumption (CIA) or, following Rosenbaum and Rubin, the strict ignorability assumption (Gangl/DiPrete 2004:403; Rosenbaum/Rubin 1983:43; Winship/Morgan 1999:661). According to Winship and Morgan this condition is often violated. “The second source of bias (…), the difference in the treatment effect for those in the treatment and control groups, is often not considered, even though it is likely to be present when there are recognized incentives for individuals (…) to select

into the treatment group. Instead, many researchers (…) assume that the treatment effect is constant in the population, even when common sense dictates that the assumption is clearly implausible.” (Winship/Morgan 1999:667)30

Out of the three required conditions to estimate causal effects, two are directly concerned with the selection of an appropriate control group. If the control group is not equal to the treatment group in all relevant aspects (baseline difference) or if the control group would react differently to the treatment than the treatment group (differential treatment effect), there is selection bias in the estimation.

As noted above, δ designates the unit effect of a factor T, i.e. the effect of a program on one unit of analysis. In social sciences the unit effect is only occasionally the relevant parameter. Instead, average values are calculated: the average treatment effect on the treated (ATT) for the treatment group, $\bar{\delta}_{i \in T} = \bar{Y}^{t}_{i \in T} - \bar{Y}^{c}_{i \in T}$, or the average treatment effect (ATE) for the total population (control group and treatment group combined), $\bar{\delta} = \bar{Y}^{t} - \bar{Y}^{c}$. The equation for the average effect is:

$$\bar{Y}^{t}_{i \in T} - \bar{Y}^{c}_{i \in C} = \bar{\delta} + (\bar{Y}^{c}_{i \in T} - \bar{Y}^{c}_{i \in C}) + (1 - \pi)(\bar{\delta}_{i \in T} - \bar{\delta}_{i \in C})$$

Here the superscripts t and c denote the outcome with and without treatment, T and C denote the treatment group and the control group, and π is the share of the population that belongs to the treatment group.
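The decomposition of the observed difference in means into the average effect, the baseline difference and the differential treatment effect can be checked numerically. The following is a minimal sketch with invented data: the variable `ability`, the coefficients and the selection rule are assumptions made purely for illustration. Units with a higher `ability` have both a better untreated outcome and a larger unit effect, and they are more likely to select into treatment, so both sources of selection bias are present.

```python
import random


def mean(values):
    return sum(values) / len(values)


rng = random.Random(1)
units = []
for _ in range(20_000):
    ability = rng.gauss(0.0, 1.0)                      # hypothetical confounder
    y_c = 50.0 + 5.0 * ability + rng.gauss(0.0, 1.0)   # outcome without treatment
    delta = 2.0 + 1.0 * ability                        # unit effect of the treatment
    selected = ability + rng.gauss(0.0, 1.0) > 0.0     # self-selection into treatment
    units.append((y_c, y_c + delta, selected))

group_t = [u for u in units if u[2]]       # treatment group
group_c = [u for u in units if not u[2]]   # control group
pi = len(group_t) / len(units)             # share of the population treated

# Left-hand side: observed difference in means between the two groups.
naive = mean([y_t for _, y_t, _ in group_t]) - mean([y_c for y_c, _, _ in group_c])

# Right-hand side terms (computable only because the data are simulated).
ate = mean([y_t - y_c for y_c, y_t, _ in units])
baseline = mean([y_c for y_c, _, _ in group_t]) - mean([y_c for y_c, _, _ in group_c])
effect_t = mean([y_t - y_c for y_c, y_t, _ in group_t])
effect_c = mean([y_t - y_c for y_c, y_t, _ in group_c])
differential = (1 - pi) * (effect_t - effect_c)

print(round(naive, 3), round(ate + baseline + differential, 3))
```

In this sketch the observed difference overstates the average effect, because both the baseline difference and the differential treatment effect are positive. Setting the selection rule to pure chance would drive both bias terms towards zero.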

Without selection bias and interdependencies ATE and ATT are equal (Gangl/DiPrete 2004:407).

4.1.3 Approaches to establish a counterfactual

In empirical social research based on quantitative data there are four main ways of constructing a valid control group and controlling selection bias. In experimental designs individuals, groups or other units of research are randomly assigned to either the treatment or the control group. This randomisation assures that the characteristics of the treatment and the control group are as similar as possible. Selection bias is very likely to be minimal. Therefore some researchers strongly advocate the use of randomisation in RIEs in the context of development cooperation (e.g. Duflo/Kremer 2008:93-118). Problems of experimental designs are high costs, changes within the control group during the course of the experiment, the difficulty in some circumstances of truly implementing a random design and the long time necessary to execute an evaluation (Baker 2000:3). Ravallion lists a number of other problems of randomization, including the threat of the method changing the implementation of a project (25).

30 Winship and Morgan offer a very intuitive illustration of the three assumptions underlying the RCM for averages: “To clarify this decomposition, consider a substantive example – the effect of education on an individual's mental ability. Assume that the treatment is college attendance. After administering a test to a group of young adults, we find that individuals who have attended college score higher than individuals who have not attended college. There are three possible reasons that we might observe for this finding. First, attending college might make individuals smarter on average. This effect is the average treatment effect represented by $\bar{\delta}$ (…). Second, individuals who attend college might have been smarter in the first place. This source of bias is the baseline difference represented by $(\bar{Y}^{c}_{i \in T} - \bar{Y}^{c}_{i \in C})$ in the equation (…). Third, the mental ability of those who attend college may increase more than would the mental ability of those who did not attend college had they in fact attended college. This source of bias is the differential effect of treatment, represented by $(\bar{\delta}_{i \in T} - \bar{\delta}_{i \in C})$ in (the) Equation (…).” (Winship/Morgan 1999:667)

For all RIE using non-experimental designs selection bias remains a problem (Leeuw/Vaessen 2009:23). The most commonly used strategies to control selection bias in non-experimental data include regression discontinuity analysis, the use of longitudinal data (difference-in-difference approach), regressions and matching. In quasi-experimental designs the treatment and control group are usually selected after the implementation of a project (Baker 2000:2).

The regression discontinuity analysis makes use of the fact that programs often include a certain threshold. For example, in a microfinance project only families earning less than US$ 200 a month may be eligible for a credit. In such a case US$ 200 would be a cut-off point. Families earning US$ 205 may be very much like the participating families but are not able to benefit from the project. In such a case families just above the cut-off point would be used as a control group to determine the impact of the microfinance project (Leeuw/Vaessen 2009:28). The problem with this approach is that thresholds are often not respected and that only the impact around the cut-off point is assessed (Ngyen/Bloom 2006:14). Its advantage is that effects are not very likely to be biased by unobserved variables (Leeuw/Vaessen 2009:28).

The difference in difference approach collects data of the treatment and the control group before a project and after a project.
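With invented numbers, the arithmetic behind such a before/after comparison can be sketched as follows. The function name `diff_in_diff` and the income figures are hypothetical and serve only as an illustration of the approach.

```python
def diff_in_diff(treat_before, treat_after, control_before, control_after):
    """Change in the treatment group minus change in the control group."""
    return (treat_after - treat_before) - (control_after - control_before)


# Hypothetical average monthly incomes (US$) before and after a project.
estimate = diff_in_diff(
    treat_before=100.0, treat_after=130.0,
    control_before=90.0, control_after=105.0,
)
print(estimate)  # 15.0
```

In this example the treatment group gains 30 and the control group gains 15, so 15 is attributed to the project. The two groups may start from different levels (100 versus 90), since only their changes are compared.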
These data then make it possible to compare the scope of changes within the treatment group with the scope of changes within the control group and thus to deduce the impact of e.g. a project (Leeuw/Vaessen 2008:26; Winship/Sobel 2001:45). The difference in change is attributed to the program. Winship and Sobel point out that this approach is not the panacea it is often considered to be and that underlying assumptions are often not tested (Winship/Sobel 2001:40). Important assumptions of this approach are that in the case of non-treatment the variables of interest would change to the same extent in both groups (Winship/Sobel 2001:46) or that the differences between treatment and control group are stable over time (Leeuw/Vaessen 2009:26). Both assumptions are not necessarily met. Duflo/Kremer also warn against considerable bias produced using the difference-in-difference approach (2008:96-97).

Regression: Regression analysis, mostly parametric OLS regression, is frequently used in social sciences to make causal inferences. This method cannot be described in detail within the scope of this thesis. Its basic concept is that an assumption is made about the relationship between different variables (Diaz-Bone 2006:185-227). Based on this assumption the impact of a number of „input“ (independent) variables (e.g. class room size, school fees, household income, participation in a school feeding project) on an „output“ (dependent) variable (e.g. number of years children spend at school) is calculated. Parametric OLS regression requires a number of conditions to be met (Diaz-Bone 2006:223-227; Boslaugh 2008:231), including data characterised by a linear function, the specification of a correct model and a control for self-selection.31 These conditions can be difficult to meet, and violating them results in less robust estimations (Winship/Morgan 1999:673). In this context Winship and Sobel strongly warn against too much confidence in causal inference using regressions or other non-experimental methods, particularly if only one model is used, and specify the necessary requirements to use regressions (2001:35). A common problem with regression analysis is endogeneity, which means that the independent variable correlates with the error term and produces a biased result.
Roodman holds that endogeneity cannot be removed fully and that it is also impossible to know exactly how much endogeneity has been removed by statistical instruments. Instruments may or may not be valid, and instrument tests are a helpful but imperfect means of testing the validity of instruments (Roodman 2007:8-9). Regression analysis is often used in combination with PSM, so the conditions for applying regressions often need to be met in analysis involving PSM (Ravallion 2005:34).

31 Self-selection is a term used in literature on rigorous impact evaluation and social empirical research. It denotes the fact that participants of a project can choose to participate in a project. If self-selection is possible, participants are likely to be more motivated and thus produce higher 'project impact' than in situations where self-selection is not possible.

Matching is the fourth option for constructing valid control groups and thus allowing causal analysis. In matching the challenge is to find for each unit of the treatment group one or several units of the control group that are as similar as possible with regard to all relevant variables. According to Gangl and DiPrete one advantage of matching approaches is the need to focus on one key causal factor and establish a hierarchy of influencing factors, as opposed to regression, where a number of potentially causal factors can be considered (Gangl/DiPrete 2004:14). “(...) because fewer parameters are estimated than in a regression model, matching is more efficient. Efficiency can be important with small samples” (Winship/Morgan 1999:674). One disadvantage is that a high number of relevant variables may result in finding only very few matches. In evaluation literature in the field of development cooperation propensity score matching, one specific matching approach, is frequently cited as a good solution to this problem. Propensity Score Matching is described below. Another disadvantage is that matching approaches generally need treatment and control group data with considerable overlap of relevant characteristics. This overlap is called “common support” and is illustrated in the graphic below.

[Figure: overlap between the treatment group and the control group, the group of common support. Source: EU 2010: Group of common support [15.06.2010]]

If the treatment group and the control group are very dissimilar, the overlap, i.e. the common support where such matches can be found, may be very small. As a consequence the sample size of the analysis becomes small, the power of the analysis suffers and the degree of certainty in the use of the results (often called the estimates) is reduced.
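How the requirement of common support reduces the sample can be sketched for the simple case in which units are compared on a single score between 0 and 1, such as the propensity score introduced in the next section. The minimum-maximum rule used here is one common convention and is an assumption of this sketch; the function names `common_support` and `trim` and the scores are invented for illustration.

```python
def common_support(treated_scores, control_scores):
    """Return the overlapping score range of both groups, or None if there is none."""
    low = max(min(treated_scores), min(control_scores))
    high = min(max(treated_scores), max(control_scores))
    if low > high:
        return None
    return low, high


def trim(scores, bounds):
    """Keep only the units whose score lies inside the common support."""
    low, high = bounds
    return [s for s in scores if low <= s <= high]


# Hypothetical scores for five treated units and six control units.
treated = [0.35, 0.50, 0.62, 0.80, 0.95]
control = [0.05, 0.10, 0.30, 0.40, 0.55, 0.70]

bounds = common_support(treated, control)
kept_treated = trim(treated, bounds)
kept_control = trim(control, bounds)
print(bounds, kept_treated, kept_control)
```

Of the eleven units only six remain inside the common support, which illustrates how quickly the usable sample shrinks when the two groups are dissimilar.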

4.2 Propensity Score Matching and its limits

4.2.1 Description of the method

The objective of PSM is to create a control group that resembles the treatment group as much as possible on all relevant variables so that a comparison of both groups allows the estimation of the causal effect of the treatment. The procedure of how to assign individual units (persons, groups, etc.) to either the treatment or control group is called the assignment model. An assignment model is based on theoretical assumptions about which characteristics of the units are relevant and which are not. All variables that are likely to influence either the selection into the treatment group or the outcome of the treatment should be included in the

assignment model. For example, in an evaluation of an educational project supporting children of poor families, relevant variables for the selection of targeted children may be family income, number of siblings, gender and distance to school, whereas relevant outcome variables may be school attendance, grades and years of schooling. The number of relevant variables may be very high, which makes exact matching, for example of one person in the treatment group with one person in the control group with exactly the same characteristics, difficult. Even with large data sets such one-to-one matching may be impossible for many cases. “Ideally, treated and control units would be exactly matched on all covariates x, so that the sample distribution of x in the two groups would be identical. (...) Unfortunately, exact matches even on a scalar balancing score are often impossible to obtain, so methods which seek approximate matches must be used.” (Rosenbaum/Rubin 1983:49; also Baker 2000:48-51) PSM is a solution to this problem. In 1983 Rosenbaum and Rubin proved that it is possible to match on one single characteristic: the propensity score.32 The propensity score is a variable that can be calculated for each unit of analysis and that aggregates all variables of interest into a single number. It is the probability that a person (or another unit of analysis, such as a group or a village) is part of the treatment group. For example, in an educational project the propensity score would be the probability of a child being selected as a beneficiary of the project. To establish this propensity score, data for participants and non-participants of the schooling project are pooled.
32 The mathematical proof that propensity score matching is a valid approach to form a control group is based on five theorems: “(i) The propensity score is a balancing score. (ii) Any score that is 'finer' than the propensity score is a balancing score; moreover, x is the finest balancing score and the propensity score is the coarsest. (iii) If treatment assignment is strongly ignorable given x, then it is strongly ignorable given any balancing score. (iv) At any value of a balancing score, the difference between the treatment and control means is an unbiased estimate of the average treatment effect at that value of the balancing score if treatment assignment is strongly ignorable. Consequently, with strongly ignorable treatment assignment, pair matching on a balancing score, subclassification on a balancing score and covariance adjustment on a balancing score can all produce unbiased estimates of treatment effects. (v) Using sample estimates of balancing scores can produce sample balance on x.” (Rosenbaum/Rubin 1983:43-44 and 44-48)

Then all known relevant variables that influence whether or not a child participates in this project are used to estimate the probability of each child in the treatment and the control group benefiting from the educational project. This is usually done with a logistic or probit regression, so the propensity score is an estimate based on statistical calculations (Ravallion 28). Each child is thus given one particular score. The important advantage of PSM is that, once each unit has been given a score, matching can be done on one single dimension. Instead of having to consider a potentially large number of variables, only one variable (the propensity score) is used (Winship/Morgan 1999:676; White 2006:14f.).33 Once the propensity score for each unit has been calculated, a matching strategy has to be chosen. There are several possible matching strategies, each using a specific algorithm to match a unit of the treatment group to one or several units of the control group. The objective of these matching strategies is to adjust the control group as closely as possible to the treatment group. The three most common algorithms are stratification, nearest-neighbor matching and kernel matching. Both the stratification algorithm and the nearest-neighbor algorithm use only the control data that most resembles the treatment data. All units of analysis in the control data whose propensity scores differ greatly from those in the treatment data are eliminated (Baker 2000:51). In stratification, units are sorted into intervals (strata) of the propensity score, and each treatment group unit is compared to an average of the control group units in the same stratum. In nearest-neighbor matching, each treatment group unit is compared to the single control group unit whose propensity score is closest. Kernel matching uses the complete control data: each treatment group unit is compared to a weighted average of all control group units, with weights that decline as the distance between the propensity scores grows (Gangl/di Prete 2004:17-19). Other variations of PSM include the estimation of several propensity scores based on different sets of selection criteria and the combination of matching on important selection criteria and the propensity score. 
(Winship/Morgan 1999:677; Baker 2000:51) It is interesting to note that Rosenbaum and Rubin warn against discarding data for which no appropriate matches can be found, since these unmatched cases may differ systematically from the matched cases and excluding them may yield biased results (1985:34).

33 Heckman et al. (1997) highlight the difference between the “true” propensity score and the estimated propensity score, which may challenge the validity of the approach. The literature reviewed does not offer a conclusive position on whether this difference threatens the propensity score approach as such.

Whether or not the construction of a valid control group through PSM has been successful can be assessed with a balancing test. This test compares the mean values of key variables of the treatment and control group. However, a balancing test can only determine the quality of the matching algorithm, i.e. whether or not the matches actually have very similar characteristics with respect to the chosen variables. The assignment model as such, i.e. whether the chosen variables are really the relevant ones, cannot be tested (Gangl/diPrete 2004:19). The last step in causal analysis using PSM is to calculate the impact of the project for all units selected in the matching process. To do so, the average value of the relevant outcome variable, for example number of years at school, of all selected treatment units is compared to the average value of the same variable for all selected control units. Alternatively, the median value or the quartile distribution can be compared (Gangl/di Prete 2004:20).34 According to Rosenbaum and Rubin, the use of PSM holds a number of advantages. The major advantage is that matching can be done on only one variable that aggregates many other variables. Another important advantage is that it is easy to apply and persuasive even to nontechnical audiences (Rosenbaum/Rubin 1985:33). PSM can also be applied in combination with other statistical tools and can improve the results of model-based statistical tools such as regression. Finally, PSM is much less dependent on large sample sizes than multivariate matching, which often results in a small final analysis group (Rosenbaum/Rubin 1983:48-49).35

34 If the numbers of years at school of all control group children were ranked, the median would be the value of the child in the middle of this ranking, which may differ considerably from the mean (average) value. The quartile distribution classifies all available values into the 25% lowest values, the 25% second lowest, the 25% third lowest and the 25% highest values for number of years at school.

35 Another relevant issue in matching is whether or not matching approaches are equal-per-cent-bias-reducing. If a matching approach is equal-per-cent-bias-reducing, the bias in each coordinate of x, i.e. in each covariate, is reduced by the same percentage. For example, if in an education project matching reduces the bias in family income, an equal-per-cent-bias-reducing method reduces the bias in the other covariates, such as distance to school, by the same percentage. If a matching approach is not equal-per-cent-bias-reducing, it may actually increase instead of reduce the bias for some linear functions of the covariates. In their paper, Rosenbaum and Rubin offer the mathematical proof that propensity score matching is in fact equal-per-cent-bias-reducing (Rosenbaum/Rubin 1983:49).
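The steps described in this section (estimating the propensity score, nearest-neighbor matching, a simple balancing check and the comparison of mean outcomes) can be sketched on simulated data. Everything in the sketch is an assumption made for illustration: the data are generated artificially, the variable names (`income`, `distance`, `years`) and the true effect of one additional year of schooling are invented, and the logistic regression is fitted with a plain gradient ascent loop rather than with a statistical package. It is not the procedure of any of the studies cited.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def propensity(w, xi):
    """Estimated probability of belonging to the treatment group."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))


def fit_logit(X, t, steps=500, lr=0.5):
    """Logistic regression of treatment status t on covariates X (gradient ascent)."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)  # w[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            err = ti - propensity(w, xi)
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def match_nearest(scores, t):
    """Pair each treated unit with the control unit whose score is closest
    (one nearest neighbor, with replacement)."""
    control_idx = [i for i, ti in enumerate(t) if ti == 0]
    return [(i, min(control_idx, key=lambda j: abs(scores[j] - scores[i])))
            for i, ti in enumerate(t) if ti == 1]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Simulated children: poorer children are more likely to be selected,
# and income also raises years of schooling (a confounder).
rng = random.Random(42)
TRUE_EFFECT = 1.0  # assumed effect of the project, in years of schooling
X, t, y = [], [], []
for _ in range(300):
    income, distance = rng.gauss(0, 1), rng.gauss(0, 1)  # standardised covariates
    selected = 1 if rng.random() < sigmoid(-1.5 * income + 0.5 * distance) else 0
    years = 6.0 + 1.0 * income + TRUE_EFFECT * selected + rng.gauss(0, 0.5)
    X.append((income, distance))
    t.append(selected)
    y.append(years)

w = fit_logit(X, t)
scores = [propensity(w, xi) for xi in X]
pairs = match_nearest(scores, t)

# Naive comparison of all treated with all controls versus matched comparison.
naive = (mean(yi for yi, ti in zip(y, t) if ti == 1)
         - mean(yi for yi, ti in zip(y, t) if ti == 0))
att = mean(y[i] - y[j] for i, j in pairs)

# Balancing check on income: gap in means before and after matching.
gap_before = (mean(x[0] for x, ti in zip(X, t) if ti == 1)
              - mean(x[0] for x, ti in zip(X, t) if ti == 0))
gap_after = mean(X[i][0] - X[j][0] for i, j in pairs)

print("naive difference:", round(naive, 2), "matched estimate:", round(att, 2))
print("income gap before:", round(gap_before, 2), "after:", round(gap_after, 2))
```

Because the simulation includes only observed covariates in the selection process, matching can remove most of the bias here. The sketch says nothing about the case of unobserved selection factors, which is discussed in the next section.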


4.2.2 Limitations of PSM

PSM is widely cited as a useful approach to RIE in development cooperation. However, there are a number of issues that threaten the accuracy of estimates based on PSM. Some of these threats concern the given data sets and circumstances of an evaluation; others are general in nature and concern PSM as such.

Use of different data sets: One limiting factor is the quality of data required to use the PSM approach. In the literature reviewed, the quality of data necessary to apply PSM is contentious. Heckman et al. (1997) published research on a national job training program in the USA using PSM and formulated crucial requirements on data quality in order to estimate valid treatment effects. These requirements are “(i) the same data sources (…) are used for participants and nonparticipants, so that earnings and other characteristics are measured in an analogous way, (ii) participants and nonparticipants reside in the same local labor markets, and (iii) the data contain a rich set of variables relevant to modeling the program-participation decision.” (Smith/Todd 2001:112) According to Heckman, the performance of the estimators, i.e. the validity of an impact assessment, diminishes greatly if these requirements are not met in the control group (Ravallion 18). Dehejia and Wahba (1998) analysed the same national job training program again using PSM and concluded that estimations of treatment effects can perform satisfactorily even if the data quality of the control group does not meet Heckman's criteria. Smith and Todd (2001) refuted this position and found that Dehejia and Wahba's results are not robust, but highly sensitive “both to changes in the sample composition and to changes in the variables included in the propensity-score model.” (Smith/Todd 2001:117)

Small group of common support: In addition to the problem of possibly different sources of data for control and treatment units, the use of PSM is often limited by treatment group data and control group data that are not very comparable, i.e. by a very small group of common support. PSM and other matching approaches are based on the assumption that a sufficiently large number of matching pairs can be found in the treatment and the control group. There may


be cases where no matches for treatment units can be found among the control group. These unmatched treatment units cannot be considered in the impact estimation. If a substantial part of the treatment data has to be discarded for this reason, “the estimated average causal effect only applies to cases of the sample of treated cases for which there are samples.” (Winship/Sobel 2001:33) In this case the impact found in a project would only be valid for a subsection of the beneficiaries. White (2006:11) also highlights the challenge of finding sufficient matches and points out the need for large sample sizes to address this challenge. Ravallion points out, however, that limiting the analysis to the group of common support increases the robustness of impact estimates (32).

Selection on observables: Another limiting factor of PSM is more fundamental in nature and concerns the variables considered in PSM. The propensity score is calculated using a number of variables, such as family income or number of household members in the educational project example, that are supposed to influence the probability of a person or a group benefiting from a project. The choice of these variables is made on the basis of theoretical considerations as to which factors are important and which factors can be observed. However, there may be other factors that influence, for example, the participation of children in the school project and that are not known to the researchers. Such factors are called “unobservables”: factors that are not taken into consideration. The possibility of unobservables influencing the selection of individuals into the treatment group is seen by many authors as the key weakness of PSM (Baker 2000:52-53; Leeuw/Vaessen 2009:25; Rosenbaum/Rubin 1985:35; DiPrete/Engelhardt 2004:502). The results of an impact analysis using PSM are invalid if there are unobserved factors that influence, for example, whether or not a child can participate in a school project and that also influence how well the children in the project are doing. If such unobserved factors exist, the outcome of a project is not attributable to the project alone, but also to the choice of participants. In such a case the result of an impact evaluation will be biased and, if the unobserved factors favour the participants, exaggerated. PSM implies the assumption that such unobserved factors do not exist. According to a number of authors this is a very strong assumption (Ravallion 30; Winship/Morgan 1999:695-696 and Heckman et al. 1998:1071; Heckman et al. 1994; White


2006:10-12). In the literature reviewed there is no straightforward strategy to assess whether or not this assumption holds. As Ravallion explains, the use of PSM might even increase the bias of results. This could be the case if at the outset there are several factors influencing impact, some positive, others negative. If, for example, the correction of bias through PSM reduces only the factors negatively influencing the impact of a project, the overall estimate of impact will be more biased than before the matching process (Ravallion 31). Ravallion points out, however, that in practice such cases have not been observed (ibid.). DiPrete and Engelhardt hold that PSM can produce more reliable results than regressions, without however claiming that it can eliminate selection bias (2004:21). Other authors found no difference in the effectiveness of these two methods (Ravallion 33). Ravallion also points out that the fact that PSM estimates are based on observable variables increases the requirements in terms of quantity and quality of data (27).

Lack of control for differential treatment effect: Without explicitly referring to PSM, Winship and Sobel (2001:25) warn that there are statistical methods that attempt to eliminate the baseline-difference component of selection bias, but very few techniques that adjust for the differential-treatment-effect component. The propensity score, too, is a variable designed to adjust for baseline differences. There may or may not be additional selection bias due to a differential treatment effect. Winship and Sobel conclude that in many research projects based on non-experimental data, such as most RIEs, only the treatment effect on the treated can be estimated, not the potential treatment effect on the entire population of interest. 
Duflo and Kremer (2008:106) concur with the critical assessment of nonexperimental designs and argue that many non-experimental designs are biased because not all relevant variables are included in analysis („omitted-variable bias“). According to these authors „non-experimental estimators often produce results dramatically different from those of randomized evaluations, that the estimated bias is often large, and that no strategy seems to perform consistently well.“ (ibid.)
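The distinction between baseline differences and differential treatment effects can be written out in potential-outcomes notation. The notation is an assumption of this sketch and is not taken from the sources cited: $Y^1$ and $Y^0$ denote a unit's outcome with and without the project, $D=1$ marks participants, and $\pi$ is the share of participants in the population of interest.

```latex
\begin{aligned}
\underbrace{E[Y^1 \mid D=1] - E[Y^0 \mid D=0]}_{\text{naive comparison}}
  &= \underbrace{E[Y^1 - Y^0]}_{\text{average treatment effect}}
   + \underbrace{E[Y^0 \mid D=1] - E[Y^0 \mid D=0]}_{\text{baseline difference}} \\
  &\quad + (1-\pi)\,
     \underbrace{\bigl(E[Y^1 - Y^0 \mid D=1] - E[Y^1 - Y^0 \mid D=0]\bigr)}_{\text{differential treatment effect}}
\end{aligned}
```

The first two terms on the right sum to the naive comparison whenever participants and non-participants would benefit equally. A method that removes only the baseline difference therefore recovers the effect on the treated, $E[Y^1 - Y^0 \mid D=1]$, but not the average effect for the whole population unless the last term is zero.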


4.3 Limitations of RIE generally

PSM is one of many quantitative research tools. In the debate about RIE, quantitative methods involving a counterfactual are often treated as the gold standard (JJSD), and non-statisticians in particular often link quantitative research results to hard data and certainty. An in-depth look at particular quantitative tools such as PSM, and at quantitative methods of empirical social research generally, provides a more nuanced picture of the degree of certainty that can be achieved using these tools. This subchapter outlines some of the limitations faced by quantitative approaches in each of the four stages of research: design, data collection, data analysis and interpretation. Some of these limitations are not exclusively relevant for quantitative methods; others are directly linked to the standards of reliability and validity used in statistical analysis.36 The purpose is not to discredit quantitative tools but to encourage a more realistic discussion about the merits of RIE.

36 A result of statistical analysis is reliable if a repetition of the same analysis would yield the same results, i.e. if the tools or instruments chosen measure consistently. For example, if a scale to measure body weight is not functioning well, it is unreliable. A result is valid if a research tool actually measures what it is supposed to measure, for example if a welfare index is appropriate to measure the concept of welfare. (Kromrey 2006; Boslaugh 2008)

4.3.1 Limitations in Research Design

Whose indicators? The quality of an RIE is to a large extent dependent on its design. The evaluation design has to ensure that the core evaluation issues are adequately operationalised into valid indicators, measurement scales and the correct unit of analysis (individual, household, village, etc.) (Boslaugh 2008:97; Kromrey 2006:200, 236-238, 400-402). In some evaluations the definition of indicators can be straightforward; in other cases the choice of indicators may be contentious, strongly related to particular interests and anything but objective (Salais, Hornbostel). In such cases indicators cannot be objectively right or wrong, but only appropriate or inappropriate from a given point of view. The choice of measurement scales is also relevant for the quality of an evaluation. For example, in an evaluation of a school project it matters whether impact is measured as a) positive impact/no impact, b) high impact/medium impact/low impact or c) impact rated on a scale from 1 to 10. The use of many statistical tools requires measurement on a scale, but such a scale may not be appropriate for the issue at hand (Kromrey 2006:247).

Sample design: Part of an evaluation design using quantitative measures is the sampling process. The size of the sample depends on many factors, such as the share of the target population in the total population, the degree of precision required in the analysis and the expected percentage of non-response. Another important factor for the calculation of the sample size is the expected value of the outcome variable (Bortz 2008:52-53). In the case of an RIE the evaluator would need an estimate of the scope of the impact in order to determine the sample size. This is often not possible. For many statistical tools relatively large samples are necessary. For example, PSM requires large samples to ensure a sufficiently large group of common support. In social research effects are generally small, and relatively large sample sizes are needed to detect them (Rossi/Freeman 1993:264). At the same time sample sizes can be too large, with the result that small “differences appear to be statistically significant, when they are substantively unimportant” (ibid. 1993:229). Apart from the sample size, the sample composition is important, and weighting of subsamples may be necessary. (The United Nations: Designing Household Survey Samples: Practical Guidelines, Department of Economic and Social Affairs, Statistic Division, Studies in Methods, Series F, No. 98, New York 2005.) Other crucial aspects to be considered in sample design for RIEs are possible spill-over effects between project beneficiaries and non-beneficiaries (Leeuw/Vaessen 2009:23-24; White 2006:4; Ravallion 24).
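Why the expected size of the impact matters for the sample size can be illustrated with a common textbook approximation for comparing the means of two equally sized groups. The formula (based on the normal distribution), the function name and the default values for significance level and power are assumptions of this sketch. They are not taken from the sources cited, and the sketch ignores non-response, clustering and the loss of units outside the common support.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate number of units needed in each group to detect a difference
    in means of size `effect` with a two-sided test, given outcome standard
    deviation `sd` (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)


# Hypothetical effects, expressed in standard deviations of the outcome.
for effect in (0.2, 0.5, 1.0):
    print(effect, sample_size_per_group(effect, sd=1.0))
```

Under these assumptions an expected effect of half a standard deviation calls for about 63 units per group, whereas an effect of 0.2 standard deviations calls for about 393. An evaluator who cannot estimate the scope of the impact in advance therefore cannot determine the required sample with any precision.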

4.3.2 Limitations in data collection

There are a number of error sources in data collection, and the quality of data, particularly in the context of development cooperation, is often wanting (Lipsey 1988:22). For example, the tools to collect data may not be valid or may not be administered properly by trained interviewers (Kromrey 204). Answers in surveys can be subject to response bias, for example if people do not understand questions correctly, if they refuse to give correct answers (e.g. on income, educational level, prevalence of diseases, etc.) or if they adjust their answers to what they expect the interviewer wants to hear (Boslaugh 2008:108; Ravallion 19; White 2006:18-19). In research where people are supposed to be interviewed more than once, attrition, i.e. the fact that people cannot be interviewed a second time (because they have moved, have died, refuse to answer, etc.), is a common problem (Ravallion 19).

4.3.3 Limitations in data analysis

Statistics are to a large extent concerned with probabilities, not certainties. Statements made in statistical analysis vary in their degree of uncertainty and in the number of conditions under which they are made. Because of that, statisticians would not support any claim of certainty in RIE. To the contrary, IMF scholar and researcher in econometric methods Edward Leamer is adamant in highlighting the limits of statistical data analysis: “If we want to make progress, the first step we must take is to discard the counterproductive goal of objective inference. The dictionary defines an inference as a logical conclusion based on a set of facts. The "facts" used for statistical inference about θ are first the data, symbolized by x, second a conditional probability density, known as a sampling distribution, f(x|θ), and, third, explicitly for a Bayesian and implicitly for "all others," a marginal or prior probability density function f(θ). Because both the sampling distribution and the prior distribution are actually opinions and not facts, a statistical inference is and must forever remain an opinion.” (1983:37) Given this understanding of statistical analysis in general, additional limitations of data analysis in concrete cases can relate to two issues: a) whether the statistical tools chosen are appropriate for the question at hand and b) whether the assumptions underlying these tools hold and the conditions attached to them are met. With reference to the choice of models, Leamer contradicts the belief in objectivity in data analysis and highlights the common practice of researchers to adapt the research process to their scientific interests. “The false idol of objectivity has done great damage to economic science. Theoretical econometricians have interpreted scientific objectivity to mean that an economist must identify exactly the variables in the model, the functional form, and the distribution of the errors. Given these assumptions, and given a data set, the econometric method produces an objective inference from a data set, unencumbered by the subjective opinions of the researcher. This advice could be treated as ludicrous, except that it fills all the econometric textbooks. Fortunately, it is ignored by applied econometricians.” (ibid.) In the context of parametric OLS regressions examples of such

requirements are a linear function of the data and low collinearity (Roodman 2007:11; Diaz-Bone 2006:225). Limitations in the area of data analysis have been described above for the case of PSM and regressions. In a discussion of limitations of statistical tools, Roodman points out that many new estimators are only valid for infinite data sets but not for real-world data sets (Roodman 2007:12). Even if good data is available, if credible assumptions are made, if valid instruments exist and if the conditions for applying certain statistical tools are met, statistical analysis can usually only partially explain social phenomena. Effects in empirical social research tend to be small (Rossi/Freeman; Weiss). The “unknown” is usually bigger than the “known”. One example of this is a common measure in regression analysis, r², which is rarely beyond 0.5. For example, if an analysis tries to identify the effect of a health project and of other factors such as agricultural production, national economic growth, etc. on the prevalence of malnutrition, the r² value would quantify to what extent all factors together explain the existing variation in malnutrition. If this r² value is rarely beyond 0.5, regression analysis usually leaves half or more of the variation in malnutrition unexplained (Diaz-Bone 2006:213). Another example is the coefficient r, which indicates to what extent two variables correlate. According to Diaz-Bone, correlations in empirical social research tend to be low. With the absolute value of r ranging from 0 to 1, correlations between 0.2 and 0.5 are already considered medium and correlations above 0.5 strong (ibid.). In the context of evaluations, both Lee (2004:148) and Caracelli (2004:185) highlight the fact that even with powerful statistical tools some uncertainty always remains.
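How r and r² relate to the share of unexplained variation can be shown with a short calculation. The six data points are invented for illustration, and the sketch relies on the fact that in a regression with a single explanatory variable r² equals the square of the correlation coefficient r.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical observations of an explanatory variable x and an outcome y.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 2, 5, 4]

r = pearson_r(x, y)
r_squared = r ** 2            # share of the variation in y explained by x
unexplained = 1 - r_squared   # share left unexplained

print(round(r, 2), round(r_squared, 2), round(unexplained, 2))
```

In this invented example a correlation of about 0.69, which would count as strong by the standards quoted above, still leaves slightly more than half of the variation in the outcome unexplained.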

4.3.4 Limitations in data interpretation

Once a set of data has been analysed, it needs to be interpreted. Interpretation, just like the choice of indicators or the choice of statistical models, depends on the theoretical stances and preferences of researchers and as such is not objective (Jones/Baumgartner 2005:12). Lack of objectivity is not a problem in itself. To present interpretations as objective is problematic. Two critical issues in data interpretation that are relevant for RIE are robustness and external validity.

A statistical result can be considered robust and credible if different methods and different models have been used to analyse the same project and arrive at similar conclusions. Testing a result for robustness is particularly important in scientific areas where there is a lack of commonly agreed theory (Winship/Sobel 2001). Causal analysis in RIE based on only one model or only one statistical approach is not reliable. “We view the all too common attempt when one has nonexperimental data to make causal inferences about a series of X's within a single model as hazardous. The threats to the validity of such claims are simply too great in most circumstances to make such inferences plausible.” (Winship/Sobel 2001:35; also Boslaugh 2008:99; Winship/Morgan 1999:695 and 703) Rossi and Freeman concur and recommend a repetition of evaluation studies where it is important to be certain about effectiveness (Rossi/Freeman 1993:229). In the context of RIE, Easterly points out that the incentives within development cooperation to replicate studies are very low (2006:22). One argument often made in the RIE discussion is that “hard evidence” is needed in development cooperation so that successful projects can be either scaled up or replicated elsewhere. The underlying assumption is that all factors that contributed to a project being successful are also given in a larger or in a different setting. Among evaluation authors this aspect is discussed as external validity. An RIE result for a development project has high external validity if the same result can be expected in another project in a comparable situation. A number of authors question the external validity of RIE results (e.g. Lee 2004:172; Duflo/Kremer 2008:112). Claims of external validity are hard to test and are oftentimes casual (Easterly 2006:21-22). 
One argument against external validity is that projects which are evaluated using RIE are often highly controlled, and even in the same project less control may lead to different results. Upscaling a project also increases implementation challenges, which may be reflected in less impact (Easterly 2006:24-25). Results of RIE risk being time-limited and liable to change with short-term changes within a project or within the socioeconomic and political environment (Rossi/Freeman 1993:28). “Both the object of evaluation and its environment are moving targets, changing even as they are observed and described. Conclusions based on an evaluation, therefore, must be appropriately modest.” (Lee 2004:168)

4.4 Do Propensity Score Matching and Rigorous Impact Evaluation produce hard facts?

The term “facts” is often used to designate irrefutable and objective truth beyond doubt or dispute. Neither PSM nor RIE nor any other statistical tool can produce such “hard facts”. As illustrated by the comments of Roodman and Leamer, statisticians do not claim to produce such facts. However, in the context of development cooperation there is strong pressure to prove effectiveness and a tendency to treat statistical evidence as “hard facts”. While statistical analysis is very valuable and can reduce uncertainty in social policy, it cannot eliminate uncertainty. Four forms of uncertainty are linked to RIE.

1) Statistical uncertainty: At the most basic level the probabilistic approach of RIE has an inherent uncertainty. Results based on RIE are only valid in a given statistical model, under given assumptions and with varying degrees of certainty. Uncertainty is increased by the lack of data quality often encountered in development cooperation (Neubert 2009).

2) Interpretation: Results in empirical social research are always dependent on theoretical concepts, choice of indicators, personal preferences, priorities, values and interests. Methods, statistical or other, cannot eliminate the room for interpretation on the part of the researcher, possibly conflicting viewpoints and thus the uncertainty as to what the results would be with other theoretical frameworks. “The concepts of unbiasedness, consistency, efficiency, maximum-likelihood estimation, in fact, all the concepts of traditional theory, utterly lose their meaning by the time an applied researcher pulls from the bramble of computer output the one thorn of a model he likes best, the one he chooses to portray as a rose.” (Leamer 1983:36; Jones/Baumgartner 2005:10)

3) Complexity: Statistical tools by design try to reduce complexity. 
How complexity is reduced and how well reality is represented in a reduced model is difficult to evaluate and remains an area of uncertainty. RIE of highly complex efforts in development cooperation such as policy advice, budget support and country-wide programs involve a high degree of uncertainty as to the appropriate reduction of reality and are not useful (Forss/Bandstein 2008:12; Jones et al. 2009:4-7). In this context several authors highlight the importance of mixed methods and the appropriateness of different methods in different situations (Roodman 2007:21). “This is relevant in the context of it (statistical analysis) being treated as a 'gold standard' for impact evaluation. This is like saying that a hammer is the 'gold standard' of tools: that is only the case if all your problems are shaped like nails.” (Jones et al. 2009:6)

4) Uncertainty of statistical rigor for non-statisticians: One last aspect of uncertainty is related to the quality of statistical analysis in RIE. Statistical analysis is riddled with caveats, and in many cases even experts disagree on the validity of results, the appropriateness of models, the fulfilment of conditions, etc. Whether or not a result from an RIE is really as reliable as possible is practically impossible to assess for anybody who is not a statistician. There is therefore a high degree of uncertainty for users of RIE untrained in statistics. This uncertainty is increased if only one study has been done or if methodological aspects relating to RIE tools are disputed by experts. This aspect is particularly relevant since proponents of RIE claim that RIE will improve accountability and feedback.

5 Can evaluation findings improve aid policy – theoretical analysis and empirical evidence of use

Proponents of RIE hold that findings of such studies can have an important impact on aid effectiveness by improving the knowledge base for decision making and by creating greater incentives to opt for effective aid policies. This chapter discusses the policy potential of RIE from a theoretical point of view and reviews the existing evidence on the use of evaluations for policy development.

5.1.1 The concepts of use and the crisis of utilisation

This thesis cannot do justice to the existing literature on the impact of scientific knowledge in policy decision making. A literature review on research and policy decision making by Almeida and Báscolo refers particularly to the work of Carol Weiss and Trostle et al. and identifies three main models of how scientific


research and policy making relate: rational, strategic and enlightenment (Almeida/Báscolo 2006:10). The rational model of knowledge utilisation assumes a direct and linear influence of research on policy. It is knowledge-driven and instrumental for problem solving. The strategic model „views research as a kind of ammunition in support or critical of certain positions“ (Almeida, 10). The term „enlightenment“ was coined by Weiss and refers to diffuse processes in which research influences policy. This model assumes a long-term impact of research on policy while at the same time acknowledging the embeddedness of both research and policy making in a complex web of processes (Almeida, 10). An alternative classification found in the literature review of Almeida and Báscolo is that of Kirkhart, who proposes to analyse the relationship between research and policy with the broader concept of influence (Almeida, 11), which looks at issues such as the initial cause of influence, its cognitive, affective and political aspects and intent in the relationship between research and policy. Since the focus of this thesis is evaluation, these parameters proposed by Kirkhart are already set and do not allow a better understanding of use. In a more recent research project on information in political policy-making Jones and Baumgartner (2005) describe the high complexity of how information enters political processes. Their research discredits the rational model of decision making. Instead, Jones and Baumgartner confirm that research is used strategically but that it also impacts policy-making along the lines of Weiss' enlightenment model. Jones and Baumgartner identify multiple factors that influence if, when, to what extent and how information and scientific evidence enter policy decisions. While their analysis provides a lot of insight for this thesis, their degree of detail is less relevant in this context.
Therefore the three models of research influencing policy distinguished by Trostle et al. (rational, strategic and enlightenment) will be used to interpret the evidence on utilisation of evaluation findings in development cooperation. The use of evaluation findings for policy has been an important issue for evaluators for a long time. Carol Weiss was the first researcher to analyse the critical aspect of use in the 1970s. In the 1990s Michael Quinn Patton described the practice of evaluation as being confronted with a „crisis of utilization“ and has since then focused his research efforts on promoting the use of evaluation findings


(1996). Lee discusses the challenge of use at some length and is generally pessimistic about it. „Michael Scriven and others still place their greatest faith in science and in the willingness of policy makers and the public to use objectively gathered information as the primary basis for decisions in spite of considerable evidence that this simply does not happen.“ (Lee 141) She points out that evaluations have to be assessed from the perspective of cost-effectiveness. The contribution of evaluations, she argues, needs to be of a higher value than the costs of doing the evaluation (ibid.). This aspect of cost-effectiveness is particularly critical for RIE, which are generally much more expensive than non-rigorous evaluations. From a management perspective and a perspective of aid effectiveness the expenses for evaluations can only be justified if they have a tangible impact on policy processes and policy implementation. The use of evaluation findings is also considered to be a problem in international development cooperation (Delarue, Naudet and Sauvat (DNS), 2008). A study on evaluation systems in German development cooperation by Stockmann (2009, xxx) stressed the insufficient use of evaluation findings. In a recent paper Faust points out the strategic gap between evaluations and decision makers both on the operational and the political level (Faust, 2010, 46). Given the complex processes of decision making and the multiplicity of interests in aid, even strong advocates of RIE agree on the relevance of use. „That is, to what degree have any decisions actually been made differently as a result – has the impact evaluation had any impact?
Decisionmakers are exposed to many different sources of information, advice and pressure, of which the evaluation is only one – and usually not the most significant.“ (Smart Policy, 8) For example, former World Bank economists Pritchett and Easterly, both supporters of more RIE, highlight that the claims regarding the impact of randomized studies on development policy are bold and completely unsupported (Easterly, 20). Easterly (23) encourages the use of scientific evidence in policy, but acknowledges that „much aid practice doesn't bother with seeking objective evidence, or ignores evidence that does exist.“ The evidence about the use of evaluation findings for policy development is


sparse and not conclusive.37 „In keeping with the broader literature on impact evaluations, there is little systematic analysis of how impact evaluations have been used in policy processes about human and social development and, in turn, the efficacy of such efforts.“ (JJSD, 25) Conclusions on the basis of this evidence can therefore only be tentative and preliminary. This is particularly true for RIEs, since there have not been many RIEs so far (JJSD, 3). In this chapter the available evidence on use in the literature reviewed will be summarised with reference to different types of use: instrumental use, conceptual use and political use. The two guiding questions for this review will be: To what extent are evaluation findings generally used for policy development? Is a substantial investment in RIEs likely to increase the current use of evaluation findings for policy development? Obviously these questions are to some extent problematic, because it is not evident how „impact on policy“ should be defined, which measures should be used and which sources of information should be regarded as valid to answer the questions. There is, for example, no clear line between use and „non-use“. An evaluation, rigorous or not, may have an influence on policy, but it is hard to determine whether this influence is large or small. One such example may be the highly praised evaluation on deworming of school children in Kenya (Miguel, Kremer 2003a). Worm infections are among the most prevalent diseases in all developing countries, and mass deworming of children has a strong impact, for example on school attendance, not only for the treated children but also for other children in the area, since the risk of contamination is lower. Deworming is very effective and very cheap (Duflo, Kremer, 101f.). There are a number of organisations promoting deworming around the world. Some of these organisations actually cite the evidence of the study mentioned above.
However, deworming is far from being a priority in development cooperation. On the contrary, the World Health Organisation counts worms among the neglected tropical diseases, by which about 1 billion people are affected but which do not receive adequate attention. An assessment of

37 While researchers are increasingly encouraged to demonstrate impact of their work by donors, there are few incentives for implementing agencies to communicate on how evaluation findings shaped decision-making processes; instead, researchers often get to hear about such effects ‘informally, through second-hand channels’. (JJSD, 25, 3)


the use of this case would probably be characterised by conflictive views (JJSD, 25, xxx). Given these problems, the objective of this chapter is not to present an in-depth assessment of use but rather to illustrate that use of evaluation findings for policy development is not self-evident. For this purpose the following subchapters on instrumental, conceptual and political use will each present some theoretical reflections and existing empirical evidence on the use of evaluation findings with particular reference to RIE findings. Both theory and empirical evidence suggest that use is dependent on a number of factors. The factors highlighted in the literature reviewed will be described below. This chapter concludes with an interpretation of the empirical evidence on use in the light of the principal agent theory and an analysis of how the issue of use impacts on the discussion about the use of RIE to improve aid effectiveness.

5.2 Instrumental use

Instrumental use of evaluation findings will be defined here as the short- to mid-term use of evaluation findings in decision-making processes regarding the project evaluated or other projects. Some authors suggest that RIE are highly suited for providing input to ongoing projects and guiding policy decisions such as whether to continue or discontinue a project, to modify an approach or to scale a project (JJSD, 11). These authors assume a more or less direct link between research and policy. However, from a theoretical stance this position has several limitations. First of all, as Weiss points out, knowledge in social sciences does not often lend itself to immediate implementation; it is generally less compelling and authoritative than knowledge from the natural sciences (The many meanings of .., 427). Other limitations of the rational model of decision making have been discussed in chapter 4 with regard to RIEs and relate to the assumptions usually attached to RIE findings and external validity (JJSD, 25). Also, timing of RIEs is particularly crucial if there is a high turnover of key decision makers during the lifetime of an RIE (JJSD, 25). Furthermore, for RIE to determine what works and what doesn't, several possible solutions to a problem would need to be studied in comparison. This is often not

feasible due to technical and financial limitations (ibid.). Because of these shortcomings JJSD argue that instrumental use is not very compatible with RIEs (JJSD, 25). The most important theoretical caveat concerning the rational model is the multitude of interests, goals and influences in political decision making (The many meanings, 427). Decision making, according to Weiss, is influenced by many factors. „Scientific research knowledge is but one influencing factor in decision making. Other factors can be belief systems, interests and power relations.“ (cited in DNS, 14) Weiss, a strong advocate of evidence-based policy, worked for over four decades on the use of research knowledge for policy and holds that empirically instrumental use of research is quite rare (1998, 1986, p.280) and happens only under certain conditions (see chapter 5.4. for more details). „Even a cursory review of the fate of social science research, including policy research on government-defined issues, suggests that these kinds of expectations (instrumental use) are wildly optimistic.“38 (The many meanings, 428) More recent studies of use arrive at similar conclusions (JJSD, 13; Peta Sandison; XX; Nutley (2003), Furubo (1994, 2005)): evidence about what works and what doesn't is in fact rarely or only inadequately translated into policy. „The early hope was that use would happen with little effort because evaluation results were compelling, because stakeholders eagerly awaited scientific data about programs, or because policymaking was a rational problem-solving endeavor. (...)
But those hopes were dashed, because evaluation results were seldom compelling relative to the interests and ideologies of stakeholders, stakeholders usually regarded scientific input as minor in decision making, and problem solving is far from a rational endeavor.“ (Shadish, Cook, Leviton: 54) With respect to educational programs in the USA Weiss found that „a considerable

38 „It probably takes an extraordinary concatenation of circumstances for research to influence policy decisions directly: a well defined decision situation, a set of policy actors who have responsibility and jurisdiction for making the decision, an issue whose resolution depends at least to some extent on information, identification of the requisite informational need, research that provides the information in terms that match the circumstances within which choices will be made, research findings that are clear-cut, unambiguous, firmly supported, and powerful, that reach decision-makers at the time they are wrestling with the issues, that are comprehensible and understood, and that do not run counter to strong political interests.“ (The many meanings, 428)


amount of ineffectiveness may be tolerated if a program fits well with prevailing values, if it satisfies voters, or if it pays off political debts.“ (Weiss, 1973b, 40) Likewise social programmes may be terminated despite positive findings in evaluations (Shadish, Cook, Leviton, 56). Shadish, Cook and Leviton concluded that evaluation findings were often not used in the short term because of given policy circumstances (53) or that findings were not used at all. JJSD interviewed 63 experts on RIE and also arrived at a negative conclusion with reference to instrumental use. Key informants in these interviews suggested that decisions on scaling up programmes tend to follow fashions and hypes rather than being based on evidence (JJSD, 31). Rossi et al. (2004) are less sceptical about instrumental use and point out that systematic studies about use or non-use of evaluation findings for policy have only been done in the last two decades. They assume a fair degree of instrumental utilisation despite the widely held pessimistic view on this issue (DNS, 20; Rossi, xxx). A more recent US study from 2008 about a results-based-management program including compulsory RIEs confirms the negative position and found very little use of the RIEs (JJSD, 14). In the field of development cooperation, however, DNS confirm the sceptical position on use (DNS, 6, JJSD, 13; Stockmann et al, 2009, xxx). In contrast, Zintl (253), the head of the evaluation department of the BMZ, considers instrumental use to be generally adequate, even though the degree of use varies. In 2009, 12 RIE were the subject of analysis in a World Bank workshop on evaluation use for policy. The workshop paper clearly stresses that there is no hard evidence on use of these 12 evaluations.
Instead the evaluators and authors of the evaluations presented their subjective perceptions on how their findings had fed into policy decisions (Smart Policy, 50).39 The result of this cursory „use assessment“ was rather positive. Evaluators reported that some policy recommendations had been implemented by governments and that some programs were replicated in other settings. The Center for Global Development paper on

39 „The twelve impact evaluations discussed in this report were utilized and had influence in three broad areas: project implementation and administration; providing political support for or against a program; and promoting a culture of evaluation and strengthening national capacity to commission, implement, and use evaluations.“ (Smart Policy, 65)


RIE mentions a few examples of how RIE have had an influence on programmes, albeit without details and without assessing the overall impact of RIEs on policy so far. „Impact evaluations have played critical roles in helping NGOs modify education programs in India to improve student achievement, protected and expanded national conditional cash transfer programs in several Latin American countries, and demonstrated the impact of inexpensive health interventions in improving school attendance in Africa.“ (CDG, 3) Given the hitherto limited number of RIE in evaluation of development cooperation, there are a number of exemplary RIE studies that are cited in many publications on RIE (JJSD, 25). The most prominent RIE is the evaluation of the Progresa program in Mexico starting in 1998 (Caspari, Barbu, 10f.; Ravallion, (mystery), Ravallion, Evaluating Anti-Poverty Programs e.g. 36f., 71; ADB, 8, Baker, 140f, When will we ever learn, 18.26; ODI paper 300, 23-27; Duflo, Kremer, 99f., in: Easterly).40 „PROGRESA is a multisectoral program aimed at fighting extreme poverty in Mexico by providing an integrated package of health, nutrition, and educational services to poor families. The Mexican government will provide monetary assistance, nutritional supplements, educational grants, and a basic health package for at least three consecutive years.“ (Baker, 140) One key element of the program is a conditional cash transfer (CCT) to parents for sending their children to school and to health services. The RIE, using a randomised design, found a positive effect of the program on school enrolment and child health and recommended the up-scaling of the program. Many authors claim that the Progresa evaluation is a clear example of how an RIE can produce credible findings and how these findings have a major impact on social policy and thus on poverty reduction.
„The impact evaluation of Progresa in the mid-1990s is widely credited with preserving that social program in the transition to an opposition administration (the program was retained and expanded, and the name was changed to Oportunidades). Furthermore, the Progresa evaluation influenced the adoption of similar conditional cash transfer programs in many other countries.“ (When will we ever learn, 26; also Morley, Coady, 2003)

40 The Progresa Study is probably not only the most cited RIE, but also the most comprehensive study. In fact there are more than 20, at times large, studies that have been carried out on the basis of the original Progresa data. References:


Progresa has in fact been up-scaled to the whole country and to urban centres. However Easterly, a proponent of RIE, cautions that not the RIE but political factors may have been the decisive factor for up-scaling Progresa. „Of course, there were also political factors. Green (2005) found that, despite the attempt to depoliticize Progresa, municipalities that had previously voted for the party in power were more likely to have their localities enrolled in the program. Diaz-Cayeros et al. (2008) dispute that finding, but found that even a non-discretionary Progresa/OPORTUNIDADES program paid off at the polls for the incumbent in both the 2000 and 2006 elections. They also point out that President Vicente Fox’s decision to expand OPORTUNIDADES from rural areas to the cities made political sense since his party’s political base was urban.“ (Easterly, 21) JJSD share this more nuanced view of evaluation use in the case of Progresa and stress that the extension of the Progresa program was largely due to substantial political bargaining, both within Mexico and with international stakeholders like the World Bank and the IFPRI. A number of factors, described in more detail below, fostered the use of the Progresa evaluation for policy development and for higher effectiveness of poverty reduction programs. However, there is also evidence that in other cases findings about program impact were not used for policy development and decisions were made contrary to RIE findings. JJSD cite the CCT program in Nicaragua that was discontinued despite positive evidence of impact (26). Another example is a widely praised RIE of a successful school vouchers program in Colombia that has nevertheless been cancelled (Easterly, 20). DNS cite the example of cereal banks in West Africa, where decision makers did not terminate the programmes despite evidence that they do not work (DNS, 29).
As DNS (2008, 22) and Patton (1997, 90) point out, instrumental use (or non-use) may occur not only at the level of decision makers, but also at the level of staff and beneficiaries. Patton refers to this use as „process use“ (ibid). The training of staff in RIE is one form of process use suggested by the participants of the World Bank workshop on evaluation use (Smart Policy, XXX). JJSD (8) hold that internal learning processes, i.e. process use, are unlikely in RIE because of the rather detached and primarily quantitative orientation of RIEs.


Interpretation and self-evidence of evidence: Instrumental use of RIE findings presupposes that the analysis of a program can be interpreted in a relatively straightforward way and can be translated into relatively clear policy recommendations. Nutley et al. point out, however, that the translation of research findings into recommendations is not straightforward and depends on interests and contexts of application. „There is no such thing as the body of evidence: evidence is a contested domain and is in a constant state of becoming. Research is rarely self-evident to the practitioner but varies according to the context in which it is received.“ (Nutley, et al, cited in DNS, 10). Example: Duflo, Kremer, 100f. While it seems relatively fair to say that hitherto RIE have had only limited impact on policy development, the evaluation experts interviewed by JJSD argue that instrumental use is not very compatible with RIE and is not its main purpose. „Because of frequent turnover of key policy decision makers during the lifecycle of an impact evaluation, some argue that it is difficult to influence specific programmes as they unfold, especially because the findings are often delivered at the end of the project (Muraldharan, interview 2008).“ (JJSD, 25) For many of these experts, conceptual use is the main purpose of RIE.

5.3 Conceptual use

The proponents of RIE hold that there is a strong need for better general knowledge about what works and what doesn't at the program level. This knowledge will be used to influence concepts of development programs around the world and help invest development monies more effectively. „(...), a dearth of rigorous studies on teacher training, student retention, health financing approaches, methods for effectively conveying public health messages, microfinance programs, and many other important programs leave decisionmakers with good intentions and ideas, but little real evidence of how to effectively spend resources to reach worthy goals.“ (Savedoff et al. 2006:1) The question how much need for evidence there is at the project level, the agency

level and the government level cannot be resolved here. Jones and Baumgartner point out that at the government level there is usually too much information in policy making rather than too little (2005:9). Despite these caveats the usefulness of RIE for conceptual purposes seems convincing. The primary users for this type of use would be experts and scientists, who can critically assess the validity of findings and their underlying assumptions. Conceptual use necessitates an effort on behalf of the scientific community to manage the existing knowledge and to improve knowledge sharing. Much additional evidence could certainly be generated simply by aggregating existing knowledge in different organisations and countries. The Campbell Collaboration and 3ie are positive examples in this context (Jones et al. 2009:15). While proponents of RIE expect a relatively strong link between evaluation findings and policy even for conceptual use, Weiss proposes a much more diffuse model, which does not assume that policy makers actively search for evidence or are receptive to research. „The imagery is that of social science generalizations and orientations percolating through informed publics and coming to shape the way in which people think about social issues. (...) It helps to change parameters within which policy solutions are sought. In the long run, along with other influences, it often redefines the policy agenda.“ (Weiss: The many ways, 429f.) According to Weiss the conceptual use to create social awareness, or as she calls it, „enlightenment“, is the major goal of evaluation (Lee, 158f.). The degree of influence depends on the political circumstances and the political interests at play (Interview). This position is confirmed by the extensive research of Jones and Baumgartner on policy processes in the USA (2005). On an empirical level there is very little data available that would support or refute the existence of conceptual use in the context of development cooperation.
This is partly due to the diffuse nature of conceptual use. The consultancy company Ernst and Young did a study on the utility of evaluation findings in public policy in 2008, where 44% of 1000 respondents considered evaluation findings as useful to improve the general understanding of the sector. (DNS, 20f.) The current director of the BMZ evaluation department acknowledges


that there is a lot of potential to increase the strategic use of evaluations (not only RIE) (Zintl, 253). Birdsall et al. give examples both from the USA and Latin American countries of evaluations of successful programs prompting replications in other regions. They stress that particularly randomised control trials can be easily presented to decision makers and suggest that this advantage will favour the use of findings (DCG, 25f.). From a time perspective the conceptual use of evaluation findings is clearly more likely than instrumental use, since RIE not only take a long time, but also focus on long-term development. „Impact studies generally take considerably longer still, which is one reason why they can seldom be made to dovetail into important policy decisions: rather, they provide long-range feedback on broad strategies (like success of poverty alleviation) more often than on specific policies.“ (Cracknell, 2000, p.205) Conceptual use of evaluation for policy development hinges on the general relevance of findings from a specific context. In evaluation literature this general relevance is called the external validity of evaluation findings. One of the main differences between scientific research generally and evaluations in particular is the fact that evaluations are by definition context-specific (Cousins, 2006, 2; xxx). „After all, the findings of an impact evaluation for a particular program in a particular context will be most relevant and useful in that specific instance. Nevertheless, some of what is learned will be useful to countries with similar contexts.“ (CDG, 14) Birdsall et al. go on to argue that a well done RIE will provide sufficient contextual information on a program to allow practitioners in other contexts to assess the relevance of the findings for their work. Lee is much less optimistic about external validity and refers to Cronbach's work on this issue.
„Cronbach also recognized that generalization or replication is virtually impossible, expressing a deep respect for the specificity of context effects, which is an idea that should be more widely accepted, ....“ (Lee 158) Ito, Kobayashi and Wada (IKW, 14) emphasize the importance of external validity in aid evaluations and deplore the insufficient efforts of the evaluation community to establish external validity. The World Bank workshop report on the use of RIE supports the relevance of context to external validity and use and suggests that


context-specific RIEs are less relevant for conceptual use and more relevant for instrumental use (Smart Policy, 16). DNS consider the likelihood of evaluation findings resulting in a growing body of knowledge and warn that the high diversity of politics and contexts in development cooperation may allow the accumulation of knowledge only for a few isolated policy fields (DNS, 28). The 3ie initiative recognizes this challenge to some extent by concentrating its efforts on a few critical areas of development policy (XXX; Smart Policy, 70).

5.4 Political use

Evaluations are a political tool in a political context, as many evaluation researchers have repeatedly pointed out (Weiss, yyy; Rossi (2004), Stockmann, 34, DNS, 19). Accountability is one possible political use of evaluations, which is often mentioned in evaluation handbooks. Accountability is referred to here as the objective and comprehensive reporting of organisations to their superiors. Theoretically, RIE are well suited to provide upward accountability. „Where accountability to donors is the priority, the virtues of rigour, independence and efficiency are prized. (...) (RIE) fit these requirements very well, providing robust and independent evidence 'proving' the effects of the intervention.“ (JJSD, 8) In contrast, RIE are not at all suitable to promote downward accountability towards beneficiaries (ibid.), because only those with substantial statistical training are in a position to assess the quality and caveats of statistical analysis. Other, very frequent forms of political use of evaluations are non-use, symbolic use or biased use in order to legitimize one's work, to assure continued funding, to support management decisions already taken or to further other interests. Some authors like Rossi and Freeman use the term accountability to include biased use of evaluation. RIE are very suitable for this type of use, since the impact of a project can often be reduced to a single number and marketed effectively. Scientific caveats and assumptions can easily be ignored in such marketing, since only experts are aware of the existence of such caveats and their impact on statistical findings. Weiss argues that the use of research for predetermined positions is legitimate as long as findings are not distorted or misinterpreted and as long as all stakeholders have access to the evidence (The many meanings, 429). However,

statistical findings are highly dependent on a statistical model and underlying assumptions.41 JJSD hold that the most frequent use of RIEs is political and to legitimize development projects (16). „The human and social service industry is not only huge in dollar volume and in the number of persons employed but is also laden with ideological and emotional baggage. Programs are often supported or opposed by armies of vocal community members; indeed, the social program sector is comparable only to the defense industry in its lobbying efforts, and the stands that politicians take with respect to particular programs are believed often to determine their fates in elections. Accountability information is the major weapon that stakeholders use in their battles as advocates and antagonists.“ (Rossi/Freeman, 1993, 181) In existing conflicts of interests, research becomes „ammunition“ to support one's cause, with research findings often being taken out of context and with caveats and underlying assumptions being ignored (Weiss, The many meanings, 429). The World Bank report on policy use of evaluation findings is very explicit on the political character of evaluations. Evaluations, the report argues, are frequently employed to „provide support for decisions that agencies have already decided upon or would like to make, mobilize political support for high profile or controversial programs, provide independent support for terminating a politically sensitive program, and to provide political or managerial accountability.“ (Smart Policy, 8; 68f.) According to this report it is even the „potential political benefit or detriment that causes decision makers to embrace or avoid evaluations“ (Smart Policy, ibid.). Shulha and Cousins made an effort more than 10 years ago to study the political use of evaluations and found that it is hard to study empirically.
Their research identified three types of political (mis)use of evaluation findings: „(a) the commissioning of evaluation for purely symbolic purposes; (b) the conscious subversion of the evaluation process by program practitioners; and (c) the purposeful non-use of high quality information.“ (Shulha and Cousins, 1997, p. 18)

41 Often opposing camps could probably commission rigorous evaluations specified in such ways as to support their respective positions. The investment in such research would seem a waste from the point of view of beneficiaries and taxpayers.


With reference to the current trend towards more RIEs, DNS (29) point to the fact that development cooperation is under considerable pressure to legitimate its activities. Given this pressure they perceive a strong need on the part of aid agencies to provide quantitative and highly credible proof of their success.42 Given this pressure for legitimization there seems to be a high prevalence of political use and misuse of evaluations in general and of RIEs in particular. „There is a lot of evidence of legitimating and symbolic use of experimental IEs. For example in CGIAR, it was felt by many donors that 'defence of budgets' was one of the most crucial roles it had to play (...). Some interviewees argued that it in fact dominates over other functions, used as a marketing device to prove the aid organisation's successful work.“ (JJSD, 14, 16) Michaelowa and Borrmann concur with this analysis. „(Evaluations) are simultaneously used as an instrument of transparency and control, accountability, legitimization and institutional learning. With respect to the legitimization function, evaluation can be thought of as marketing device to prove the aid organization's successful work to the general public. (...) However (...) the legitimization function seems to be dominating. Transparency and legitimization are clearly conflicting objectives in all cases in which actual development outcomes are not fully satisfactory.“ (Michaelowa/Borrmann 2005, 1; also Ruprah, 14; Weiss, yyy; Rossi (2004), Stockmann, 34, DNS, 19) According to Michaelowa/Borrmann and other authors (for example Ruprah, 14; Teller, 2008, 3; Pritchett, 121ff., in Easterly) these conflicting interests of transparency and legitimization are inherent in the aid system and are not likely to disappear. DNS confirm this conflict of interests and suggest that evaluations done predominantly for political purposes are often not useful for instrumental or conceptual learning and vice versa (xxx).
Political use of evaluation findings often takes the form of selective use, i.e. only those empirical findings are communicated that suit the political interest of the agency or department that commissioned the evaluation. In 2006 Banerjee et al. evaluated the World Bank's research conducted between 1998 and 2005. While the general outcome of this study was positive, Banerjee et al. were highly critical of the selective use of research findings to promote World Bank policy. „But the panel had substantial criticisms of the way that this research was used to proselytize on behalf of the Bank policy, often without taking a balanced view of the evidence, and without expressing appropriate skepticism. Internal research that was favorable to Bank positions was given great prominence, and unfavorable research ignored. There were similar criticisms of the Bank's work on pensions, which produced a great deal that was useful, but where balance was lost in favor of advocacy.“ (Banerjee et al., 2006, 6) Teller confirms this critical finding for the health sector and cites a PEPFAR evaluation from 2007 as an example of „policy-driven evidence seeking“ instead of „evidence-informed policymaking“ (JJSD, 26).43 Kremer and Duflo hold that selective use, or publication bias as they call it, is a serious issue that needs to be addressed. „Positive results naturally tend to receive a large amount of publicity: agencies that implement programs seek publicity for their successful projects and academics are much more interested in and able to publish positive results than more modest or insignificant results. (...) Available evidence suggests the publication bias problem is severe (DeLong and Lang 1992) and especially significant with studies that employ nonexperimental methods.“ (Duflo/Kremer, 110; also JJSD, 38) This position is confirmed by JJSD, who established a database of available RIE studies and found that hardly any of these evaluations reported no impact or negative impact (JJSD, 9, 26). „Experimental IEs are more likely to be carried out when they are expected to generate positive results. Like other types of evaluation, they tend to be published only if they demonstrate positive results.“ (JJSD, 9) Enforcing RIEs and even publishing them will not change the vested interests. Highly technical evaluations based on sophisticated statistical tools and bold assumptions could prove to be an efficient alternative strategy for aid agencies to cover up failures.

42 „The extensive communication around this example shows that persuasive use is potentially an important effect of these evaluations. In a field such as development, whose legitimacy is fragile, it is not surprising to see advocacy for quantitative evaluations grow. Donors, politicians and taxpayers alike, need to see convincing results in order to continue their engagement. In this sense, the current movement corresponds both to a challenge to the 'operational' status quo, by demanding that one limit oneself to, and concentrate on, what has been proven to work, and to the legitimation of continued development financing itself, by objectively establishing convincing results.“ (D,N,S, 29; translated from the French)

43 President George W. Bush made the fight against HIV/AIDS in Africa a priority and created the high-profile President's Emergency Plan for AIDS Relief (PEPFAR) in 2003.
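The publication bias that Duflo and Kremer describe can be made concrete with a small simulation. The following is only an illustrative sketch and is not taken from any of the studies cited: all numbers (a true effect of zero, the standard error, the number of studies) are hypothetical assumptions, as is the simple rule that only significantly positive results get published.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 0.0      # hypothetical: the program has no effect at all
STANDARD_ERROR = 0.14  # hypothetical sampling error of each single study
N_STUDIES = 2000       # hypothetical number of evaluations carried out

# Each study estimates the same true effect, disturbed by sampling noise.
estimates = [random.gauss(TRUE_EFFECT, STANDARD_ERROR) for _ in range(N_STUDIES)]

# Assumed selection rule: only significantly positive results (z > 1.96)
# are written up and published.
threshold = 1.96 * STANDARD_ERROR
published = [e for e in estimates if e > threshold]

mean_all = statistics.mean(estimates)
mean_published = statistics.mean(published)

print(f"studies run: {N_STUDIES}, published: {len(published)}")
print(f"mean estimated effect, all studies:       {mean_all:.3f}")
print(f"mean estimated effect, published studies: {mean_published:.3f}")
```

Under these assumptions a reader who sees only the published studies would conclude that the program works, although every single study was carried out correctly. The distortion arises from selection alone, which is why publishing RIEs does not by itself remove the problem.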

5.5 Factors

The literature reviewed provides little evidence on the extent to which evaluations are used or not used for policy development. Findings appear to be used efficiently for instrumental or conceptual learning in some cases and partially or not at all in others, and there is generally a great deal of political use and misuse of evaluation findings. The literature does, however, provide much more information about the factors influencing use or non-use. The insights regarding these factors should inform the debate about RIEs for more effective aid. The key factors influencing the use of evaluation findings are presented here (JJSD, 12-14).

5.5.1 Internal incentives

Whether an evaluation is commissioned, what type of evaluation is commissioned and whether its findings are used is to a large extent a matter of internal incentives within the organisations active in development cooperation. With reference to RIEs, Lant Pritchett assesses the incentives of project managers and agency staff to commission an RIE (Lant Pritchett, in Easterly, 125-140). According to Pritchett, in the current aid system these stakeholders have a strong interest in promoting their project and very weak incentives to produce knowledge that may demonstrate its ineffectiveness. The strategically wise decision, Pritchett concludes, is for agency and project staff to avoid any serious evaluation (Pritchett, 140ff). „If a program can already generate sufficient support to be adequately funded, then knowledge is a danger. No advocate would want to engage in research that potentially undermines support for his or her program. Endless, but less than compelling, controversy is preferred to knowing for sure that the answer is no.“ (Pritchett, 142) Ruprah concurs with reference to his experience in the Inter-American Development Bank, where staff members have very little incentive to commission such evaluations (16). Speaking for evaluations in the health sector, Teller (2008:3, 9) confirms these incentives within the aid system to hide the actual effectiveness of projects. The current culture in development cooperation rewards successes and feigned successes and punishes failure. The use of evaluation findings is thus discouraged in all cases where findings are not positive. „When there are 'bad' results, the proper means and context for presentation and discussion may make the difference between a rejection or suppression of the results and beneficial reforms and future use of impact evaluations and other evidence for policy making.“ (Smart Policy, 52) Given this culture the World Bank workshop report recommends a greater involvement of stakeholders within the aid agencies in evaluations and sensitive communication of findings. „A sensitive handling of negative findings can increase use and strengthen evaluation culture.“ (Smart Policy, 68) There are disincentives within organisations to expose possibly unsuccessful work. These disincentives concern both the commissioning and the use of evaluations and are deeply embedded in the aid system. Pritchett, Easterly, Birdsall et al. and others argue that RIEs should be commissioned. However, they fail to explain how the misuse of RIEs (selective use and non-use) can be prevented. Even when many projects are subject to rigorous evaluations, the structure of incentives will remain. In their analysis of evaluation politics within the aid sector Borrmann and Michaelowa refer to another aspect of internal incentives regarding evaluations. They hold that staff at a higher level of decision making have many more projects to oversee and are therefore less dependent on the results of individual evaluations to prove their success.
Thus the use of evaluation findings by staff higher up in the hierarchy is probably less biased towards legitimization and more towards learning (Michaelowa/Borrmann, 14). In their paper about the usefulness of evaluations DNS present the findings of John Garcia about the factors influencing the uptake of scientific evidence into policy making in the field of public health. Garcia confirms the relevance of internal organisational factors. In addition to the points made above he cites a number of other relevant internal factors such as current organisational constraints, communication patterns and networks within the commissioning organisation, the history and age of an organisation and internal competition (DNS, 24).

5.5.2 The political window of opportunity

Evaluation experts generally agree on the political nature of evaluations. The workshop document of the World Bank explicitly recommends marketing evaluation as a political tool in order to enhance use (Smart Policy, 15). It is therefore not surprising to find that political interests strongly influence the use of evaluation findings. The political environment of a project, of an aid agency or even of a nation determines to a large extent if and how evaluation findings are used. „In our study of how federal health evaluations were used, we found that use was affected by intra- and interagency rivalries, budgetary fights with the Office of Management and Budget and Congress, power struggles between Washington administrators and local program personnel, and internal debates about the purposes or accomplishments of pet programs. Budgetary battles seemed to be the most political, followed by turf battles over who had the power to act, but political considerations intruded in some way into every evaluation we examined.“ (Patton, 1997, 343) Stakeholders with decision making power will generally use evaluations in line with their overall interests. Advocates of projects are likely to use positive findings and project critics are likely to use negative findings. Evaluation recommendations can only be implemented if they fall within the current window of political opportunity. The best options for a project and its beneficiaries may lie outside that window. For example, the abolition of unsuccessful cereal banks mentioned above may be the best option in terms of cost-effectiveness and aid effectiveness, but it has not been politically feasible because of the vested interests of several stakeholders. Generally, evaluation findings do not have the power to end unsuccessful projects since all projects and organisations tend to strive for self-preservation (DNS?).
The implementation of evaluation findings against the political interest of stakeholders with decision-making power is very unlikely, and investing in RIEs under such circumstances may be a waste of scarce resources (Smart Policy, 50). The external incentives are influenced by the scale of a project. The bigger a project, the more stakeholders are likely to be involved and the higher the stakes for the decision makers involved (DNS). Lee found that „(...) the larger the scale of the social problem and program solution, the more difficult it is to see any change at all in response to evaluation findings.“ (Lee, 158) This does not imply that findings are not used, but that they have to compete with other sources of evidence and with diverse interests. Findings „may well become part of the evidence that will later have a cumulative effect on policy.“ (Lee, 158) In their summary of Garcia's findings DNS give some more concrete examples of influencing factors in the external political environment (DNS, 24). According to Garcia the visibility of an evaluation, the degree of politicization, inter-organisational competition, and the size and political importance of constituencies can have an important impact on the use of scientific evidence for policy development. Carol Weiss, who has studied the use of evaluation findings for policy for over three decades, also highlights the strong influence of the external (and internal) political environment on the use of evaluation findings. „At least instrumental use is common under three conditions: (i) if the implications of the findings are relatively uncontroversial, neither provoking rifts in the organization nor running into conflicting interests, (ii) if the changes that are implied are within the program’s existing repertoire and are relatively small-scale, and (iii) if the environment of the program is relatively stable, without big changes in leadership, budget, types of clients served, or public support.“ (Weiss, 1998, p. 24)

5.5.3 Communication of findings

Quantitative results seem to be easier to communicate: they appear unambiguous and straightforward and are often reduced to a single number. Qualitative findings cannot be condensed in this way and tend to be more complex and ambiguous. However, this simplicity is often misleading, since the reduction neglects all the assumptions and caveats attached. The communication of findings is one key factor influencing the use of evaluation findings. This applies to all types of evaluations and has been highlighted for many years by Patton (1997). For people to use evaluation findings, they need to understand the findings and the process by which they were obtained. „In other cases, if findings are presented in a manner that is too technically complex for its audience, decision makers may either misinterpret the findings, leading to misinformed choices, or ignore the findings altogether.“ (Smart Policy, 8, 67; also DNS 27 and 24 (Garcia)) In communicating the findings it is equally important that users understand the methods applied to arrive at them. Not understanding the methods may result in distrust and non-use. „Similarly, people may not trust evidence (especially evidence contrary to their beliefs) that comes from methods they do not understand, so training in or exposure to impact evaluation as well as the use of easy-to-understand methods may make evaluation results more convincing.“ (Smart Policy, 48; also JJSD, 37) The need for an in-depth understanding of methods is particularly relevant when findings are based on weak evidence or strong assumptions, which may often be the case with suboptimal data and nonexperimental statistical methods such as PSM. „There are other situations in which potentially important but controversial findings may be based on weak evidence (for example with small sample sizes and low statistical power). While researchers may understand that such findings must be interpreted with caution, the mass media or political supporters or critics of a program may ignore these caveats, perhaps jumping to conclusions that a program should be terminated or an innovative approach should receive major funding.“ (Smart Policy, 71f.) The greater the conflicts of interest surrounding a program and its evaluation, the more difficult it is to communicate findings appropriately. RIEs are often advocated as a means to improve the feedback loops and thus accountability within development aid (Faust, Leuw, Vaessens). This argument underscores the necessity of making evaluation reports comprehensible also for those stakeholders hitherto excluded from the feedback loop, the citizens in the North and the South.
The World Bank workshop report mentions training for donors and policy makers in understanding evaluation findings (Smart Policy, 48). While this effort is certainly worthwhile, it is not clear how RIEs can help citizens to hold governments and aid agencies more to account. Finally, use of evaluation findings is influenced by the degree to which findings are communicated in terms of clear policy recommendations (Smart Policy, 67). Dissemination is also an issue (Smart Policy, JJSD, 37).

5.5.4 Timing of findings

Timing is seen by several authors as one critical factor influencing use or non-use of evaluation findings for policy (e.g. DNS, 24, Garcia). To be relevant for the evaluated project, evaluation findings must be available well before the project ends and while the project still commands the attention of decision makers. At the same time quick results may lead to reduced quality and credibility and reduced relevance for long-term learning (Smart Policy, 13, 56). DNS (28) regard the time needed to complete an RIE as one of the reasons for the limited instrumental use of RIEs. In the same vein Zintl (253) acknowledges that some evaluation results come in too late. Although timing is recognized as crucial, especially for instrumental use, DNS hold that this aspect of evaluations may be difficult to manage. The management of an evaluation can be cumbersome and involves a lot of stakeholders. Largely quantitative studies, and among those especially experimental ones, take a long time. Policy interests may have changed by the time findings are released (D,N,S, 30f.). One solution to the issue of timing and relevance is for RIEs to focus exclusively on highly relevant policy questions. This approach is chosen by the 3IE initiative (XXX).

5.5.5 Relevance and quality

Obviously, the relevance of the issue analysed and the methodological quality of an evaluation have an impact on the degree and type of use (Smart Policy, 58f., JJSD, 27). „A key element in the successful utilization is developing a system for the selection of evaluations that address key policy issues and for analysis, dissemination, and utilization of the results.“ (Smart Policy, 70) According to Zintl (253), neither relevance nor quality is always assured. The strategy of the 3IE initiative to focus on highly relevant issues is a useful approach to address this challenge. As far as quality is concerned there is little knowledge about its impact on use. In his study Garcia found that the credibility, reputation, independence, personality and communication skills of the evaluator can influence the use or non-use of scientific evidence for policy development (DNS, 24). It is interesting to note that the participants of the World Bank workshop on use of RIEs point out that quality may be important for those stakeholders that disagree with findings. „While stakeholders may be willing to “trust the experts” if an evaluation offers results that support what they want to hear, there may be a reasonable tendency to distrust results – and particularly methods – that they don’t understand. People tend to trust or distrust evidence based on what they already believe, looking for results that confirm what they believe and looking for ways to discredit contrary information. Perhaps one reason is that it is difficult to distinguish between good and bad evidence.“ (Smart Policy, 67)

5.6 Use of evaluation findings for policy development in the light of the principal-agent theory?

Policy formulation is not a rational process in which scientific evidence has a strong influence on decision making. Scientific knowledge usually plays only a minor part in policy development. Therefore the use of evaluation findings for policy development is not self-evident. While it is safe to say that evaluations can at best play a small role in the decision making process, the degree and the kind of use of evaluation findings depend on a large number of factors. Some factors are more technical in nature: the timing and quality of evaluations as well as the effort put into communicating and disseminating the findings. Other factors are more political in nature: the incentives within organisations and in the external environment to protect institutional, political and personal interests. The analysis of the aid sector from the perspective of the principal-agent theory described above focuses on the incentives inherent in the aid system. This analysis can now be applied to the field of evaluation in development cooperation. Poverty reduction is certainly one important motivation for individuals and organisations within the aid system. However, both the theoretical analysis and the limited empirical evidence suggest that the use of evaluation findings is subject to political incentives.


The stakeholders concerned with evaluations are primarily the donor government, the implementing agencies in the North and the South, the consultants and the project. The use of RIEs touches differently on the incentives of the agents and the incentives of the principals. As Pritchett and others clearly point out, the agents do not have strong incentives to report anything but successful work. In a context of competition for scarce resources among agents and of organisational interests in self-preservation and growth, agents have an interest in appearing successful in order to justify their existence and possibly increase funding. RIEs do not change these incentives, but they represent a tighter control mechanism imposed on the agents and reduce the scope for fake success-marketing.44 It can be assumed that RIEs in fact lead to a more realistic assessment of what aid projects achieve. The use of evaluation reports, however, is in the realm of the principals. Whether and how the principals use this more realistic assessment of projects depends on their specific interests in a specific situation. Evaluations of low political importance are more likely to be used for instrumental or conceptual learning. Evaluations touching upon contentious issues are more likely to be used for political purposes. Positive results are more likely to be used than negative results unless the principal that commissioned the evaluation is a critic of the program. Small scale recommendations are more likely to be implemented than large scale recommendations. Recommendations that confirm the position of the principal are more likely to be implemented than recommendations contradicting the beliefs, convictions and assessments of the principal. The main principals, the electorates in the North and the South, are to a large extent excluded from the realm of evaluation in development cooperation. If evaluation reports are published by donors (as is now widely called for by evaluation experts, Stockmann, 2009), the volume, comprehensibility and at times language of reports are high barriers for citizens in the North and the South. The current system of incentives does not favour the publication of possibly critical information for citizens (Michaelowa/Borrmann, xx). This is reflected in the minimal effort on the part of donors to actively disseminate project information that does not fall into the legitimisation category. Some claim that the push for more evidence (Faust?) in development cooperation through RIEs would increase the accountability of stakeholders to primary principals. The theoretical analysis and the limited empirical evidence lend little support to this view. From a systemic point of view the crucial question in relation to RIEs is whether RIEs will just be another element of interest-guided information within the system, or whether RIEs are actually able to change the structure of incentives.

44 „Because experimental IEs can go both ways – demonstrate positive or negative impact – an organisation that conducts them runs the risk of findings that could embarrass individuals, projects or programmes, and could undercut its ability to raise funds (Levine and Savedoff, 2006). There are many such disincentives to finding out 'bad news' existing in various organisations, and this is likely to inhibit the commissioning and/or publishing of many studies (Ravallion, 2005).“ (JJSD, 8)
While the demand for tangible results in the aid business is the right response to the persistence of poverty in the world despite five decades of development cooperation, the ambition to measure impact may not contribute much to achieving impact. On the contrary, it may lead to an ever more self-referential system which excludes the primary stakeholders from holding the aid professionals to account.


6 Propensity Score Matching for better aid?

6.1 Will RIEs produce hard facts?

If „hard facts“ imply the notion of eliminating doubt and uncertainty and assuring objectivity, then PSM or any other RIE cannot produce hard facts.45 Econometricians and statisticians are the first to acknowledge this. With regard to objectivity, any design of empirical social research, quantitative or qualitative, is the product of theoretical assumptions, scientific preferences and value statements. Designs should clarify these theoretical assumptions and provide arguments in favour of their credibility, but designs as such cannot be objective (Lee 2004:165; Hornbostel XX:77). With regard to certainty, statistical tools can produce evidence which is characterised by a high level of probability under certain conditions: The data have to be of sufficient quality, for example in terms of sampling strategy, sample size and possible response bias. The underlying assumptions of a given statistical model have to hold. The conditions for using certain statistical tools have to be met. In the case of PSM there are several conditions. Creating a control group using PSM in order to evaluate the impact of, for example, a health project will only provide valid results if the three conditions posed by Rosenbaum and Rubin are met: 1) The effect of the health project has to be stable over time and there must be no interdependencies between the control group and the treatment group (e.g. higher salaries in the project attracting qualified staff from control group areas). 2) The control group and the treatment group are identical in all aspects except treatment and there are no critical unobserved differences between the two groups. 3) The effect of the health project would be exactly the same in the control group if it was executed in their area.
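The basic mechanics of PSM, and its dependence on the second condition, can be illustrated with a minimal sketch on simulated data. Everything in it is a hypothetical assumption made for illustration: the single observed covariate, the simulated health outcome, the true effect of 1.0, and all function names. The propensity score is estimated here with a simple logistic regression, and each treated unit is matched to the control unit with the nearest score (with replacement); this is one common variant, not necessarily the one used in any study cited here.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n=400, true_effect=1.0, seed=0):
    """Simulated units: better-off units (high x) are more likely to be treated."""
    rng = random.Random(seed)
    xs, treated, outcome = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                   # observed covariate
        t = 1 if rng.random() < sigmoid(x) else 0  # selection depends on x
        y = 2.0 * x + true_effect * t + rng.gauss(0.0, 0.5)
        xs.append(x)
        treated.append(t)
        outcome.append(y)
    return xs, treated, outcome


def fit_propensity(xs, treated, steps=500, lr=0.5):
    """Logistic regression P(treated | x), fitted by plain gradient ascent."""
    b0, b1, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, t in zip(xs, treated):
            err = t - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def mean(values):
    return sum(values) / len(values)


def naive_difference(treated, outcome):
    """Unmatched comparison of treated and untreated units."""
    y1 = [y for y, t in zip(outcome, treated) if t == 1]
    y0 = [y for y, t in zip(outcome, treated) if t == 0]
    return mean(y1) - mean(y0)


def matched_effect(xs, treated, outcome):
    """Average effect on the treated via nearest-neighbour matching on the score."""
    b0, b1 = fit_propensity(xs, treated)
    score = [sigmoid(b0 + b1 * x) for x in xs]
    controls = [i for i, t in enumerate(treated) if t == 0]
    diffs = []
    for i, t in enumerate(treated):
        if t == 1:
            j = min(controls, key=lambda c: abs(score[c] - score[i]))
            diffs.append(outcome[i] - outcome[j])
    return mean(diffs)


xs, treated, outcome = simulate()
naive = naive_difference(treated, outcome)
att = matched_effect(xs, treated, outcome)
print(f"naive difference: {naive:.2f}, matched estimate: {att:.2f}, true effect: 1.00")
```

In this simulation the unmatched comparison overstates the effect because treated units differ systematically from untreated ones, while the matched estimate comes much closer to the assumed true effect. The sketch works only because the variable driving selection is observed and included in the score. If selection depended on a variable that was not recorded, the matched estimate would be biased as well, and nothing in the output would reveal it.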
Particularly the potential effect of unobserved variables is a threat to the validity of PSM. Qualitative analysis will certainly be helpful to determine if there may be hitherto unobserved differences between the control group and the treatment group. Such analysis can increase the certainty, but not eliminate uncertainty completely (Caracelli 2004:185). If PSM is used in combination with regression analysis, all conditions linked to this tool will need to be met (for example the data can be characterised by a linear function and there are no strong correlations among independent variables). Lipsey, a strongly quantitatively oriented evaluator, warns that too many quantitative evaluations are done under unfavourable circumstances: „Much less evaluation research in the quantitative-comparative mode should be done. Though it is difficult to ignore the attractiveness of assessing treatment effects via formal measurement and controlled design, it is increasingly clear that doing research of this sort well is quite difficult and should be undertaken only under methodologically favorable circumstances, and only then with extensive prior testing regarding measures, treatment theory, and so forth. The field of evaluation research and the individual treatments evaluated would generally be better served by a thorough descriptive, perhaps qualitative, study as a basis for forming better concepts about treatment, or a good management information system that provides feedback for program improvement, or a variety of other approaches rather than by a superficially impressive but largely invalid experimental study.“ (Lipsey, 1988, 22f.) Even under optimal conditions and with good data sets for empirical social research, which are usually not attained in development cooperation, the strength of evidence is usually low, as has been exemplified with the R² statistic. Even within their respective models statistical results leave a large part of reality unexplained.

45 Forss and Bandstein discuss the definition of „evidence“ and warn against the notion of „evidence“ being exclusively linked to quantitative research. „Historical research has provided good evidence for the decline and fall of the Roman Empire, palaeontologists have evidence for the extinction of dinosaurs, etc. Evidence for the success or failure of efforts to combat HIV/AIDS in southern Africa is not likely to be generated by the same research methods that generate evidence on the effectiveness of a vaccine or drugs for the infected.“ (Forss/Bandstein 2008:15)
In addition, statistical models are already a reduction of complexity to a limited number of variables, which needs to be justified given the lessons learned about the complexity of social change (Jones et al. 2009:14).46 Often RIEs will not be adequate to evaluate large and complex projects such as budget support, country-wide or cross-country programs and policy advice (Neubert 2009; Jones et al. 2009:34; Faust 2009:14-17). Naudet et al. (2009:12) also warn against a narrow definition of evidence and point to non-rigorous forms of highly relevant evidence that can be very useful for decision makers. Given all these caveats it is clear that evidence produced by PSM and by RIEs in general is not perfect. Under certain conditions and for certain research questions it is the best evidence possible, but it is not infallible. To communicate this evidence appropriately to people with little or no statistical training is extremely challenging. Once this type of evidence leaves the realm of scientific research and enters the arena of political bargaining there is a high risk that all these caveats will be neglected or misunderstood. Adequate communication of results can make the difference between good evidence and fake evidence. The audience of RIEs is very mixed and many of the stakeholders in development cooperation are very likely to be overwhelmed by the statistical details. If one purpose of RIEs is to provide desk officers, project managers and citizens with certainty about what works and what doesn't, this would imply that these stakeholders can ascertain the quality of the evidence given to them. PSM has the advantage of being relatively straightforward in the comparison of impacts between control group and treatment group; however, PSM is always embedded in a larger research design with many issues of validity and reliability. In the RIE debate the merits of RIEs are often highlighted by citing individual examples. However, statistics textbooks and experts stress the importance of robustness of results. A high degree of certainty in quantitative analysis is dependent on repeated studies using various methods and models. Limitations of validity do not only concern quantitative methods.

46 „Donor-driven and top-down as opposed to reflecting local needs, with intake according to local control. Project-based aid and technical/output focused as opposed to multiple components; interventions responding to context; and the importance of advocacy, sector-wide approaches (SWAps), etc. for sustainable change. Emphasis on economic indicators as opposed to recognition of the multidimensional nature of poverty and change. Standardised, rigid projects vs. frontline flexibility, evolving in response to changing conditions. Attributing effects to individual actors as opposed to a focus on working in partnership, harmonisation.“ (JJSD, 14)
Qualitative approaches have their own limitations and biases (Caracelli 2004:193). Because of that many authors advocate for the mix of quantitative and qualitative tools and underscore the importance for qualitative methods for context analysis (Faust 2009:14-19; Bamberger/Kirk 2006:71; White probits, logits; Jones et al. 2009). However even with a multimethod approach results may inconclusive (Caracelli 2004:195) and there are probably limits of how much can be known in social


empirical research and how much certainty can be achieved. (How much can we know in social sciences in terms of certainty and causality (Volker Schneider!) While there are a number of challenges facing RIEs in terms of internal validity, external validity is even more difficult to achieve and particularly to prove. There are certainly cases where external validity can be achieved, but unless a project is replicated in a different setting and reevaluated external validity remains uncertain. While the elimination of doubt and uncertainty is unlikely, another relevant question in the context of applied social research is how much evidence is actually needed. Rossi/Freeman argue that RIE are needed in social sciences because the state of knowledge in social sciences is often inadequate (Rossi/Freeman 1993:353). Looking at some examples of recent RIE this argument is not convincing. Duflo and Kremer (2008:97) cite an example of a RIE where the impact of an educational project was assessed. The evaluation compared the learning results of classes having two teachers with classes having only one teacher. The RIE proved that classes with two teachers had better learning results. It would seem that educational science in the last decades has advanced enough to make such an RIE unnecessary. Neubert (2009) argues along these lines and points out that the results of many RIEs are in fact trivial. Statistical approaches in social empirical research have long been established as valuable tools to provide evidence about social phenomena. The movement for more RIE in development cooperation is an important and positive step towards more knowledge about development. However, RIEs are not the magic wand to establish certainty in development cooperation. RIEs are not suitable for all type of development activities, they are always dependent on assumptions, conditions and multiple replication to assure validity and robustness. RIEs can only provide a limited degree of certainty.

6.2 Will more scientific evidence on impact lead to better aid?

In the current RIE debate it is often assumed that scientific evidence is necessary to improve aid effectiveness. However, just as there are many caveats attached to the production of scientific evidence, there are many caveats linked to the use of scientific evidence in development cooperation. Theories of political decision making strongly discourage the idea that research and scientific evidence directly influence policy-making (Jones/Baumgartner 2005). Political systems tend to resist information that requires a change of policy, and policy making is fraught with uncertainties, conflicts of interest, limitations of understanding and the high number of issues on the agendas of decision makers. The review of the limited amount of literature available for development cooperation confirms this position and indicates that the use of scientific evidence for policy making is limited. For short-term decisions many authors report non-use or selective use of evaluations. Evaluations are one of many influencing factors in the arena of political negotiation and are often used as ammunition to support political interests. Evaluation findings that are not in line with the political interests of decision makers or that require major changes are very unlikely to have any impact on policy decisions. The publication bias referred to by Duflo and Kremer (2008) is indicative of this use. In contrast, evaluations, and particularly accumulated evidence, seem to have an impact on long-term policy making.

Several recent publications about the aid system and its reform use the principal-agent model to highlight the structural deficits of the aid system. According to these publications the incentives within this system encourage biased reporting and discourage an open flow of information and accountability. Advocates of RIE argue that good evidence about the impact of development cooperation will change the incentives of stakeholders and limit the room for biased reporting. If good evidence about impact were available, they suggest, policy makers could no longer get away with low aid effectiveness. RIE is thus regarded as a tool to limit the room for manoeuvre of stakeholders within the aid system to pursue particular interests.
As chapters IV and V have shown, this is not necessarily the case. RIE can produce reliable data, but neither PSM nor other rigorous tools can exclude the influence of partisan interests in the research design. In addition, the risk of bias in data collection, unmet assumptions in data interpretation or unwarranted interpretations cannot be entirely avoided. In scientific research there are peer processes that enhance the quality of evidence. Such peer processes do not exist in development cooperation (Bamberger/Kirk 2009:71). To assess the quality of evidence a high degree of expertise is necessary. Most stakeholders within the aid system do not have this expertise. This is particularly true for the stakeholders that have so far been excluded from the feedback cycles.47 Without an objective assessment of this kind, true and false claims are hard to distinguish, and RIEs are unlikely to enhance accountability towards citizens in donor and recipient countries or to strengthen feedback loops within the aid system (Jones et al. 2009:34). Even if measures were taken to assure an objective assessment of the technical quality of RIEs, an independent assessment of theoretical underpinnings, indicators used and questions asked seems very unrealistic. RIE may promote upward accountability and increase control mechanisms to some extent. But the effects on aid effectiveness will be marginal, because the given system of incentives is not touched and the stakeholders just have to be a bit cleverer to prove effectiveness. What is lacking in the system is downward accountability. But civil society is unlikely to benefit from an increased investment in RIE (ibid.:8). Analysis of the aid system from a principal-agent perspective also reveals the lack of feedback from aid beneficiaries to the rest of the aid system. RIEs do not remedy this lack of feedback. On the contrary, large quantitative research relies on standardised tools and does not provide room for feedback.

Apart from accountability, RIEs are promoted to enhance learning in development cooperation. While short-term learning based on RIE is not supported by empirical evidence, RIEs do seem to have potential for long-term learning. A crucial condition for long-term learning is the aggregation of knowledge and meta-analysis.
Important stakeholders such as the World Bank and 3ie are promoting the synthesis of knowledge, but there is much room for improvement on this level (Ito et al. 2008:13-14; Savedoff et al. 2006:30; Leeuw/Vaessen 2009:41). A better understanding of development and long-term learning in this respect is certainly desirable; however, lack of knowledge does not seem to be the main obstacle to aid effectiveness. In addition, impacts at the project level do not seem to address the core issues of the aid effectiveness agenda. Both knowledge and positive impact estimates at the project level established by rigorous evaluations can very well go along with an overall ineffective aid system. Even if the choice of projects to evaluate is not guided by political interests, which is unlikely, it is the overall system of aid that suffers from lack of effectiveness. The Paris Declaration and the AAA clearly point to systemic issues such as donor fragmentation and proliferation, lack of collaboration, lack of country ownership and alignment, lack of transparency and predictability and widespread corruption that hamper aid effectiveness (Ito et al. 2008:10; Jones et al. 2009:12). Evaluating a single project with rigorous methods cannot provide any information about these systemic problems, and progress on these issues cannot be evaluated with rigorous methods for lack of a counterfactual. Promoting RIE is certainly a good thing, but its potential to improve aid effectiveness through greater accountability and better knowledge seems to be minimal.

Given this assessment of RIE and its potential for aid effectiveness, it is questionable whether RIEs really deserve so much attention. The aid system is a market where interests and resources are negotiated (Jones et al. 2009:11; Faust) and there is considerable competition for scarce resources. The call for more investment in RIE should also be seen in this context of conflicting interests and market shares. While it is uncertain, to say the least, that citizens in poor countries will benefit from RIEs, consultants, research institutes and evaluation departments are immediate beneficiaries of more investment in RIEs.48 Jones et al. (2009) warn that too much emphasis on RIE might even have detrimental effects on development policy because of the competition for resources among stakeholders within the aid system. "If funds are influenced to a large extent by the ability to demonstrate impact using experimental IEs, it will lead to development policy and practice being skewed towards those types of projects most suitable to this methodology (...)."49 Bamberger and Kirk advance a similar argument with reference to evaluators preferring to use rigorous methods in order to be able to publish their work in prestigious journals (2006:71).

The increase of knowledge and understanding about development processes and aid mechanisms is valuable. But RIEs are far from being key levers of aid effectiveness. While it may be attractive to view the aid system as an altruistic enterprise to combat poverty in the world, with policy makers just waiting for good evidence on what to do, this perception is not realistic. Instead, knowledge does not seem to be a key factor in policy making, and even rigorously produced knowledge can be adjusted to political interests, which are often in conflict. Aid effectiveness is not a problem of knowledge, but a problem of political will and incentives (Nuscheler 2008:11). Statistical evidence, though attached to many caveats, may nevertheless be presented as hard facts and suits the interests of some stakeholders. It will not, however, change the system of incentives or empower citizens to hold the aid system accountable. Reform of the aid system is unlikely to be triggered by RIEs or by evaluations generally.

47 Another condition for this argument to hold is that evaluations, whether rigorous or not, are all published in a timely manner, in the languages of all stakeholders, and are easy to find (Savedoff et al. 2006:33). This condition has long been recognised by initiatives such as NONIE, 3ie and also by the recent study on evaluation in German development cooperation. The availability of evaluation reports, however, is not sufficient.

48 It is interesting to note in this context that some authors treat the advancement of the evaluation "culture" as a positive impact of RIEs (Bamberger/Kirk 2006). From the perspective of aid effectiveness no management tool such as evaluation should be considered a goal in itself.

6.3 Aid as key lever for poverty reduction

Many debates, publications, conferences and even organisations are concerned with the effectiveness of aid. In donor and recipient countries many people are employed in the aid business, and many people around the world are directly concerned with aid effectiveness or the lack thereof. However, despite the large number of lives touched by aid, it should not go unmentioned that aid is not a key lever of poverty reduction. The importance of RIE in the overall picture is therefore even smaller. As was mentioned briefly above, econometricians are still arguing about the overall impact of aid on development. On the one hand there is some evidence that aid actually has a negative impact on development. Aid seems to have a negative impact on governance (Knack, Faust), it increases corruption in ethnically diverse countries (xxx) and hampers the export sectors of recipient countries (Faust, Rajan/Subramanian 2005). On the other hand there are other areas of policy which hold much larger promise to reduce global poverty than aid. Trade policy, control mechanisms for financial markets, patent laws, policies geared towards environmental protection, anti-corruption measures, export regulations for weapons, immigration laws and fundamental research on poverty-related issues (health, agriculture) seem to be much bigger levers of poverty reduction in the mid-term and long-term. However, these policy areas are much more closely linked to the interests of donor countries and are fraught with vested interests and controversy. It is much easier for the German government to mobilise funds for development aid and the development industry than to push for changes in the above-mentioned policy areas. "These solutions are seldom pursued with the zeal that they deserve, in part because they are more difficult to support politically, and in part because that zeal which is essential to overcome the difficulties gets diverted toward, well, to calling for more aid." (Subramanian 2007)

Aid will hopefully become more effective in the coming years; however, this will not be decisive in the fight against poverty. Politicians, managers and researchers in development cooperation should be clear about the key levers of aid effectiveness and the key levers of poverty reduction. RIEs can certainly be useful for long-term learning. But RIEs should not become another fashion within development cooperation which keeps the aid system running but does not address the crucial barriers to aid effectiveness.

49 Jones et al. (2009) use the term "experimental IEs" to include quasi-experimental IE, so their term is equivalent to the term RIE used in this thesis.


7 References

7.1 Books and monographs

1. Ashoff, Guido, Beate Barthel, Nathalie Bouchez, Sven Grimm, Stefan Leiderer and Martina Vatterodt, 2008: The Paris Declaration: Case Study of Germany. Evaluation Reports 032. Bonn: Federal Ministry for Economic Cooperation and Development.
2. Baker, Judy, 2000: Evaluating the Impact of Development Projects on Poverty. A Handbook for Practitioners. Directions in Development. Washington D.C.: World Bank.
3. Bamberger, Michael and Angeli Kirk (Edt.), 2009: Making Smart Policy: Using Impact Evaluation for Policy Making. Case Studies on Evaluations that Influenced Policy. Doing Impact Evaluation No. 14 (June 2009). Washington D.C.: The World Bank.
4. Bamberger, Michael, 2006: Conducting Quality Impact Evaluations Under Budget, Time And Data Constraints. Washington D.C.: IEG-ECD, World Bank.
5. Banerjee, Abhijit, Angus Deaton, Nora Lustig and Ken Rogoff, 2006: An Evaluation of World Bank Research, 1998-2005. Washington D.C.: The World Bank.
6. Barder, Owen, 2009: Beyond Planning: Markets and Networks for Better Aid. Global Economy & Development Working Paper 185. Washington D.C.: Brookings Institute.
7. BMZ 2001: Armutsbekämpfung – eine globale Aufgabe. Aktionsprogramm 2015. Der Beitrag der Bundesregierung zur weltweiten Halbierung extremer Armut. BMZ Materialien Nr. 106. Bonn: Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung.
8. BMZ 2008a: Weißbuch zur Entwicklungspolitik. 13. Entwicklungspolitischer Bericht der Bundesregierung. Stand Juni 2008. Bonn: Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung.
9. BMZ 2009g: Entwicklungspolitische Bilanz der 16. Legislaturperiode. Bonn: Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung.
10. Bobba, Matteo and Andrew Powell, 2007: Aid Effectiveness: Politics Matter. Washington D.C.: Inter-American Development Bank.
11. Borrmann, Axel and Reinhard Stockmann, 2009: Evaluation in der deutschen Entwicklungszusammenarbeit. Band 1, Systemanalyse. Sozialwissenschaftliche Evaluationsforschung Bd. 8. Münster: Waxmann.
12. Boslaugh, Sarah and Paul Watters, 2008: Statistics In A Nutshell. A Desktop Quick Reference. Sebastopol, CA: O'Reilly Media.
13. Burnside, Craig and David Dollar, 1997: Aid, Policies, and Growth. Policy Research Working Paper 1777. Washington D.C.: The World Bank. Policy Research Department.
14. Caspari, Alexandra and Ragnhild Barbu, 2008: Wirkungsevaluierungen: Zum Stand der internationalen Diskussion und dessen Relevanz für Evaluierungen der deutschen Entwicklungszusammenarbeit. Evaluation Working Papers. Bonn: Bundesministerium für wirtschaftliche Zusammenarbeit und Entwicklung.
15. Deaton, Angus, 2010: Instruments, randomization, and learning about development. Research Program in Development Studies. Princeton: Princeton University.
16. Easterly, William and Tobias Pfutze, 2008: Where Does The Money Go? Best and Worst Practices in Foreign Aid. Global Economy & Development Working Paper 21. Washington D.C.: Brookings Institute.
17. Easterly, William, 2006: The white man's burden. Why the West's efforts to Aid the Rest Have done So Much Ill and So Little Good. London: Penguin.
18. Easterly, William, 2008: Can the West save Africa. Global Economy & Development Working Paper 27. Washington D.C.: Brookings Institute.
19. Erler, Brigitte, 1990: Tödliche Hilfe. Bericht meiner letzten Dienstreise in Sachen Entwicklungshilfe. Köln: Dreisam Verlag.
20. Faust, Jörg and Dirk Messner, 2007: Organizational Challenges for an Effective Aid Architecture – Traditional Deficits, the Paris Agenda and Beyond. Discussion Paper 20/2007. Bonn: Deutsches Institut für Entwicklungspolitik.
21. Forss, Kim and Sarah Bandstein: Evidence-based Evaluation of Development Cooperation: Possible? Feasible? Desirable? NONIE Working Paper No. 8. Washington D.C.: NONIE.
22. House, Ernest, 1980: Evaluating with validity. Beverly Hills: Sage.
23. Ito, Seiro, Nobuyuki Kobayashi and Yoshio Wada, 2008: Learning to Evaluate the Impact of Aid. NONIE Working Paper No. 6. Washington D.C.: NONIE.
24. Jones, Bryan and Frank Baumgartner, 2005: The politics of attention. How Government Prioritizes Problems. Chicago, London: The University of Chicago Press.
25. Jones, Nicola, Harry Jones, Liesbet Steer and Ajoy Datta, 2009: Improving impact evaluation production and use. ODI Working Paper 300. London: Overseas Development Institute.
26. Kevenhörster, Paul and Dirk van den Boom, 2009: Entwicklungspolitik Lehrbuch. Elemente der Politik. Wiesbaden: Verlag für Sozialwissenschaften.
27. Kharas, Homi, 2007: Trends And Issues In Development Aid. Working Paper No. 1. Washington D.C.: Wolfensohn Center for Development at the Brookings Institution.
28. Leeuw, Frans and Jos Vaessen, 2009: Impact Evaluation and Development: NONIE Guidance on Impact Evaluation. Washington D.C.: NONIE.
29. Martens, Bertin, Uwe Mummert, Peter Murrell and Paul Seabright (Edts), 2002: The institutional economics of foreign aid. Cambridge: University Press.
30. Martens, Jens, 2007: Armutszeugnis. Die Millenniumsentwicklungsziele der Vereinten Nationen. Bonn: Global Policy Forum Europe.
31. Martens, Jens, 2008: Die Wirklichkeit der Entwicklungshilfe. Eine kritische Bestandsaufnahme der deutschen Entwicklungspolitik. Sechzehnter Bericht 2007/2008. Bonn: Deutsche Welthungerhilfe, Terre des Hommes.
32. Martens, Jens, 2008: Kassensturz in der Entwicklungszusammenarbeit. Kosten und Finanzierung der internationalen Entwicklungsziele. Bonn: Global Policy Forum Europe.
33. Michaelowa, Katharina and Axel Borrmann, 2005: What determines Evaluation Outcomes? Evidence from Bi- and Multilateral Development Cooperation. HWWA Discussion Paper 310. Hamburg: Hamburgisches Weltwirtschaftsinstitut.
34. Miller, Patricia, 2010: The Index of Global Philanthropy and Remittances 2010. Washington D.C.: Hudson Institute. Center for Global Prosperity.
35. Moyo, Dambisa, 2009: Dead Aid. Why Aid Is Not Working And How There Is Another Way For Africa. London: Penguin.
36. Myrdal, Gunnar, 1984: Politisches Manifest über die Armut in der Welt. 4th edt. Frankfurt: Suhrkamp.
37. Naudet, Jean-David, Jocelyne Delarue and Véronique Sauvat, 2009: Les évaluations sont-elles utiles? Revue de littérature sur 'connaissance et décisions'. Série Notes méthodologiques no. 3. Paris: Agence Française de Développement (AFD).
38. Nguyen, Binh and Erik Bloom, 2006: Impact Evaluation. Methodological and Operational Issues. Manila: Asian Development Bank.
39. Nuscheler, Franz, 1991: Lern- und Arbeitsbuch Entwicklungspolitik. 3rd edition. Bonn: Verlag Dietz.
40. Nuscheler, Franz, 2008: Die umstrittene Wirksamkeit der Entwicklungszusammenarbeit. Duisburg: Institut für Entwicklung und Frieden, Universität Duisburg-Essen (INEF-Report 93/2008).
41. Patton, Michael Quinn, 1997: Utilization-Focused Evaluation. The New Century Text. 3rd edition. Thousand Oaks: Sage.
42. Rajan, Raghuram and Arvind Subramanian, 2005: What undermines Aid's Impact on Growth? IMF Working Paper WP/05/126. Washington D.C.: International Monetary Fund.
43. Ravallion, Martin, 2005: Evaluating Anti-Poverty Programs. World Bank Policy Research Working Paper No. 3625. Washington D.C.: The World Bank.
44. Reuke, Ludger and Sandra Albers, 2008: Alles in ODA [o:da]. Wider Die Unordnung in der Anrechnung Deutscher 'Offizieller Unterstützung' 2003 bis 2006/07. Bonn: Germanwatch.
45. Rieper, Olaf, Frans Leeuw and Tom Ling, 2009: The Evidence Book. Concepts, Generation and Use of Evidence. New Brunswick, NJ: Transaction Publishers.
46. Roodman, David, 2007: Macro Aid Effectiveness Research: A Guide for the Perplexed. Working Paper Number 134. Washington D.C.: Center for Global Development.
47. Roodman, David, 2008: Through the Looking Glass, and What OLS Found There: On Growth, Foreign Aid, and Reverse Causality. Working Paper Number 137. Washington D.C.: Center for Global Development.
48. Ruprah, Inder Jit, 2008: 'You can get it if you really want': Impact Evaluation Experience of the Office of Evaluation and Oversight of the Inter-American Development Bank. NONIE Working Paper No. 3. Washington D.C.: NONIE.
49. Shadish, William, Thomas Cook and Laura Leviton, 1991: Foundations of Program Evaluation. Theories of Practice. Newbury Park: Sage.
50. Teller, Charles, 2008: Are We Learning About What Really Works? Lost Opportunities and Constraints to Producing Rigorous Evaluation Designs of Health Project Impact. NONIE Working Paper No. 9. Washington D.C.:
51. The United Nations, 2005: Designing Household Survey Samples: Practical Guidelines. Department of Economic and Social Affairs, Statistic Division. Studies in Methods. Series F, No. 98. New York.
52. White, Howard, 2006: Impact Evaluation – The experience of the Independent Evaluation Group of the World Bank. Washington D.C.: IEG-ECD.
53. White, Howard, 2008: Of Probits and Participation: The Use of Mixed Methods in Quantitative Impact Evaluation. NONIE Working Paper No. 7. Washington D.C.: NONIE.

7.2 Articles in anthologies

54. Martens, Bertin, 2008: Why do Aid Agencies Exist? 285-320. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
55. Andersen, Uwe, 2005: Deutschlands Entwicklungspolitik im internationalen Vergleich. 54-65. In: Entwicklung und Entwicklungspolitik. Informationen zur politischen Bildung 286. Bonn: Bundeszentrale für politische Bildung.
56. Banerjee, Abhijit and Ruimin He, 2008: Making Aid Work. 47-92. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
57. Beywl, Wolfgang and Thomas Widmer, 2009: Evaluation in Expansion: Ausgangslage für den intersektoralen Dreiländer-Vergleich. 13-26. In: Thomas Widmer, Wolfgang Beywl and Carlo Fabian (Hg.), 2009: Evaluation. Ein systematisches Handbuch. Wiesbaden: Verlag für Sozialwissenschaften.
58. Birdsall, Nancy, 2008: Seven Deadly Sins: Reflections on Donor Failings. 515-552. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
59. Caracelli, Valerie, 2004: Methodology: Building Bridges to Knowledge. 175-201. In: Reinhard Stockmann (Hg.): Evaluationsforschung. Grundlagen und ausgewählte Forschungsfelder. 2. Auflage. Sozialwissenschaftliche Evaluationsforschung Band 1. Opladen: Leske+Budrich.
60. Duflo, Esther and Michael Kremer, 2008: Use of Randomization in the Evaluation of Development Effectiveness. 93-120. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
61. Easterly, William, 2008: Introduction: Can't take it anymore. 1-44. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
62. Konzendorf, Götz, 2009: Institutionelle Einbettung der Evaluationsfunktion in Politik und Verwaltung in Deutschland. 27-39. In: Thomas Widmer, Wolfgang Beywl and Carlo Fabian (Hg.), 2009: Evaluation. Ein systematisches Handbuch. Wiesbaden: Verlag für Sozialwissenschaften.
63. Kremer, Michael and Edward Miguel, 2008: The Illusion of Sustainability. 201-254. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
64. Lee, Barbara, 2004: Theories of Evaluation. 135-173. In: Reinhard Stockmann (Hg.): Evaluationsforschung. Grundlagen und ausgewählte Forschungsfelder. 2. Auflage. Sozialwissenschaftliche Evaluationsforschung Band 1. Opladen: Leske+Budrich.
65. Pritchett, Lant, 2008: It pays to be ignorant. 121-144. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
66. Pritchett, Lant and Michael Woolcock, 2008: Solutions When the Solution Is the Problem: Arraying the Disarray in Development. 147-178. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
67. Reade, Nicolà, 2009: Ländervergleich: Evaluierung in der Entwicklungszusammenarbeit. In: Thomas Widmer, Wolfgang Beywl and Carlo Fabian (Hg.), 2009: Evaluation. Ein systematisches Handbuch. Wiesbaden: Verlag für Sozialwissenschaften.
68. Reinikka, Ritva, 2008: Donors and Service Delivery. 179-200. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
69. Schneider, Volker, 2007: Komplexität, politische Steuerung, und evidenz-basiertes Policy-Making. 55-70. In: Frank Janning and Katrin Toens (Hg.): Die Zukunft der Policy-Forschung. Theorieentwicklung, Methodenfragen und Anwendungsaspekte. Wiesbaden: VS-Verlag.
70. Stockmann, Reinhard, 2004a: Evaluation in Deutschland. 13-43. In: Reinhard Stockmann (Hg.): Evaluationsforschung. Grundlagen und ausgewählte Forschungsfelder. 2. Auflage. Sozialwissenschaftliche Evaluationsforschung Band 1. Opladen: Leske+Budrich.
71. Stockmann, Reinhard, 2004b: Evaluation staatlicher Entwicklungspolitik. 375-410. In: Reinhard Stockmann (Hg.): Evaluationsforschung. Grundlagen und ausgewählte Forschungsfelder. 2. Auflage. Sozialwissenschaftliche Evaluationsforschung Band 1. Opladen: Leske+Budrich.
72. Svensson, Jakob, 2008: Absorption Capacity and Disbursement Constraints. 311-332. In: William Easterly (Edt.) 2008: Reinventing Foreign Aid. Cambridge Massachusetts: Massachusetts Institute of Technology.
73. Winship, Christopher and Michael Sobel, 2004: Causal Inference in Sociological Studies. 482-503. In: Hardy, Melissa and Alan Bryman (Edt.) Handbook of Data Analysis. London: Sage.
74. Zintl, Michaela, 2009: Evaluierung in der deutschen Entwicklungszusammenarbeit. In: Thomas Widmer, Wolfgang Beywl and Carlo Fabian (Hg.), 2009: Evaluation. Ein systematisches Handbuch. Wiesbaden: Verlag für Sozialwissenschaften.

7.3 Articles in journals

75. Almeida, Celia and Ernesto Báscolo, 2006: Use of research results in policy decision making, formulation, and implementation: a review of the literature. In: Cadernos de Saúde Pública. Vol. 22, Rio de Janeiro: 7-19.
76. Brautigam, Deborah and Stephen Knack, 2004: Foreign Aid, Institutions, and Governance in Sub-Saharan Africa. In: Economic Development and Cultural Change. University of Chicago Press. Vol. 52(2): 255-285.
77. Briggs, Derek, 2004: Causal Inference and the Heckman Model. Journal of Educational and Behavioral Statistics. Vol. 29 (4): 397-420.
78. Coase, Ronald, 1937: The nature of the firm. Economica 4 (16): 386-405.
79. Dehejia, Rajeev and Sadek Wahba, 1999: Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. The Journal of the American Statistical Association. Vol. 94, No. 448 (December 1999): 1053-1062.
80. Dehejia, Rajeev and Sadek Wahba, 2002: Propensity Score Matching Methods for Nonexperimental Causal Studies. The Review of Economics and Statistics. Vol. 84, No. 1 (February 2002): 151-161.
81. DiPrete, Thomas and Henriette Engelhardt, 2004: Estimating Causal Effects with Matching Methods in the Presence and Absence of Bias Cancellation. Sociological Methods and Research. Vol. 32, No. 4 (2004): 501-528.
82. Gangl, Markus and Thomas DiPrete, 2004: Kausalanalyse durch Matchingverfahren. In: Diekman, Andreas (Hg.): Methoden der Sozialforschung. Kölner Zeitschrift für Soziologie und Sozialpsychologie. Sonderheft 44/2004: 396-420.
83. Heckman, James, 1997: Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations. The Journal of Human Resources. Vol. 32, No. 3.
84. Heckman, James, Hidehiko Ichimura and Petra Todd, 1998: Matching as an Econometric Evaluation Estimator. Review of Economic Studies (1998): 261-294.
85. Heckman, James, Hidehiko Ichimura, Jeffrey Smith and Petra Todd, 1998: Characterizing Selection Bias Using Experimental Data. Econometrica. Vol. 66, No. 5 (Sept. 1998): 1017-1098.
86. Knack, Stephen, 2001: Aid Dependence and the Quality of Governance: Cross-Country Empirical Tests. In: Southern Economic Journal. Vol. 68(2): 310-329.
87. Rosenbaum, Paul and Donald Rubin, 1983: The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika. Vol. 70, No. 1 (April 1983): 41-55.
88. Rosenbaum, Paul and Donald Rubin, 1985: Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score. The American Statistician. Vol. 39, No. 1 (Feb. 1985): 33-38.
89. Smith, Jeffrey and Petra Todd, 2001: Reconciling Conflicting Evidence on the Performance of Propensity Score Matching Methods. The American Economic Review. Vol. 91, No. 2 (May 2001), Papers and Proceedings of the Hundred Thirteenth Annual Meeting of the American Economic Association: 112-118.
90. Weiss, Carol, 1979: The many meanings of research utilization. In: Public Administration Review. Sept/Oct 1979: 426-431.
91. Winship, Christopher and Stephen Morgan, 1999: The Estimation of Causal Effects from Observational Data. Annual Review of Sociology. 1999, 25: 659-707.

7.4 Internet resources

92. BMZ 2009a: ODA Stufenplan. Site:; Retrieved: 08 June 2010.
93. BMZ 2009b: Geber im Vergleich 2009. Site:; Retrieved: 08 June 2010.
94. BMZ 2009c: Bi- und multilaterale Netto-ODA. Site:; Retrieved: 08 June 2010.
95. BMZ 2009d: Mittelherkunft der bi- und multilateralen ODA 2007-2008. Site:; Retrieved: 08 June 2010.
96. BMZ 2009e: Deutsche Netto-ODA 2003-2008. Site:; Retrieved: 08 June 2010.
97. BMZ 2009f: Bi- und multilaterale Netto-ODA nach Ländern 2004-2008. Site:; Retrieved: 25 June 2010.
98. BMZ 2010a: Die Aid-Effectiveness-Agenda – Wirksamkeit der Zusammenarbeit steigern. Site:; Retrieved: 10 June 2010.
99. BMZ 2010b: Die Geschichte des Ministeriums. Site:; Retrieved: 08 June 2010.
100. BMZ 2010c: Grundsätze: Warum brauchen wir Entwicklungspolitik. Site:; Retrieved: 10 June 2010.
101. BMZ 2010d: Priority areas of German development cooperation. Site:; Retrieved: 08 June 2010.
102. BMZ 2010f: Vergangenheit prüfen, Zukunft gestalten: Ziele, Grundsätze und Verfahren der Evaluierung. Site:; Retrieved: 08 June 2010.
103. BMZ 2010g: Was ist ODA. Site: 1_Leitfaden_Was_ist_ODA.pdf; Retrieved: 23 June 2010.
104. Cairo 1994: United Nations International Conference on Population and Development (ICPD). Site:; Retrieved: 08 June 2010.
105. CDG 2007: Designing a New Entity For Impact Evaluation: Meeting Report, February 2007. Site: %20gap/Bellagio_07_Meeting_Report.pdf; Retrieved: 08 June 2010.
106. DeGEval 2010: Standards für Evaluation. Site:; Retrieved: 10 June 2010.
107. EU 2010: Group of Common Support. Site: /method_techniques/counterfactual_impact_evaluation/propensity/propensity_details_en.htm; Retrieved: 08 June 2010.
108. G8 Information Centre: G8 Information Centre. Site:; Retrieved: 10 June 2010.
109. HLF 2010a: Accra High Level Forum on Aid Effectiveness Milestones in Aid Effectiveness. Site:,,contentMDK:21690872~menuPK:64861438~pagePK:64861884~piPK:64860737~theSitePK:4700791,00.html; Retrieved: 08 June 2010.
110. Monterrey 2002: International Conference on Financing for Development. Site:; Retrieved: 08 June 2010.
111. Neudeck et al., 2008: Bonner Aufruf, Eine andere Entwicklungspolitik. Site:; Retrieved: 10 June 2010.
112. New York 2000: United Nations Millennium Declaration. Site:; Retrieved: 10 June 2010.
113. OECD-DAC 1991: Principles For Evaluation Of Development Assistance. Site:; Retrieved: 10 June 2010.
114. OECD-DAC 2005: DAC Peer Review (2005): Germany. Site:,3343,en_2649_34603_35878945_1_1_1_1, ml; Retrieved: 23 June 2010.
115. OECD-DAC 2008a: The Paris Declaration on Aid Effectiveness and the Accra Agenda for Action. Site:; Retrieved: 08 June 2010.
116. OECD-DAC 2008b: Management for Development Results Information Sheet. Site:; Retrieved: 09 June 2010.



117. OECD-DAC 2008c: Scaling Up: Aid Fragmentation, Aid Allocation and Aid Predictability. Site:; Retrieved: 10 June 2010.
118. OECD-DAC 2008d: Survey on Monitoring the Paris Declaration. Site:; Retrieved: 22 June 2010.
119. OECD-DAC 2008e: Is it ODA? Site:; Retrieved: 22 June 2010.
120. OECD-DAC 2009a: Policy Brief 2009: Managing for results. Site:; Retrieved: 22 June 2010.
121. OECD-DAC 2009b: Round table 4_MfDR_final report. Site:; Retrieved: 22 June 2010.
122. OECD-DAC 2009c: Glossary of Key Terms in Evaluation and Results-based Management. Site:; Retrieved: 22 June 2010.
123. OECD-DAC 2010a: Working Party on Aid Effectiveness. Site:,3343,en_2649_3236398_43382307_1_1_1_1,00.html; Retrieved: 10 June 2010.
124. OECD-DAC 2010b: Evaluating Development Co-operation. Summary of Key norms and standards. Site:; Retrieved: 08 June 2010.
125. OECD-DAC 2010c: Creditor Reporting System. Site:; Retrieved: 09 June 2010.
126. OECD-DAC 2010d: ODA – OECD-DAC Glossary. Site:,3343,en_2649_33721_42632800_1_1_1_1, ml#ODA; Retrieved: 10 June 2010.
127. OECD-DAC 2010e: Peer Reviews. Site:,3347,en_2649_34603_1_1_1_1_1,00.html; Retrieved: 23 June 2010.
128. REG online 2009: Bundeshaushalt 2009. Site:; Retrieved: 10 June 2010.
129. Rio 1992: UN Conference on Environment and Development. Site:; Retrieved: 10 June 2010.
130. UK CO 1999: Modernising Government – Whitepaper: 2. Policy Making. Site:; Retrieved: 10 June 2010.



131. UK CO 2001: Better policy making. Site:; Retrieved: 15 June 2010.
132. UK ODI: Evidence-Based Policy in Development Network. Site:; Retrieved: 22 June 2010.
133. UN 2000a: United Nations Millennium Declaration. Site:; Retrieved: 22 June 2010.
134. UN 2008: The Millennium Development Goals Report 2008. Site: 2008_En.pdf#page=22; Retrieved: 22 June 2010.
135. USA CEBP 2010: Coalition for Evidence-Based Policy. Site:; Retrieved: 22 June 2010.
136. USA EOP 2009: Memorandum for the Heads of Executive Departments and Agencies, Oct 7, 2009. Site:; Retrieved: 22 June 2010.
137. World Bank 2010a: NONIE – Network of Networks on Impact Evaluation. Site:; Retrieved: 22 June 2010.
138. World Bank 2010b: Pages from World Bank History: The Pearson Commission. Site: 0,,contentMDK:20121526~pagePK:36726~piPK:36092~theSitePK:29506,00.html; Retrieved: 22 June 2010.
139. World Bank 2010c: Improving Development Results Through Excellence in Evaluation. Site: theSitePK=1324361&pagePK=64253958&contentMDK=20999016&menuPK=64253130&piPK=64252979; Retrieved: 22 June 2010.
140. World Bank 2010d: The Development Impact Evaluation (DIME) Initiative. Site:,,menuPK:3998281~pagePK:64168427~piPK:64168435~theSitePK:3998212,00.html?placeholder; Retrieved: 22 June 2010.
141. World Bank 2010e: Net ODA received (% of GNI). Site: display=default; Retrieved: 22 June 2010.
142. World Bank 2010f: Outlook for Remittance Flows 2010-11. Site: 1110315015165/MigrationAndDevelopmentBrief12.pdf; Retrieved: 22 June 2010.
143.

Magisterarbeit 25.Juni  