ESTADÍSTICA
volumen 67, números 188 y 189
Junio y Diciembre 2015

REVISTA SEMESTRAL DEL INSTITUTO INTERAMERICANO DE ESTADÍSTICA
BIANNUAL JOURNAL OF THE INTER-AMERICAN STATISTICAL INSTITUTE



CUERPO EDITORIAL / EDITORIAL BOARD

APLICACIONES / APPLICATIONS
CLYDE CHARRE DE TRABUCHI
French 2740, 5º A, 1425 Buenos Aires, Argentina
Tel (54-11) 4824-2315, e-mail fliatrab@hotmail.com

TEORÍA Y MÉTODOS / THEORY AND METHODS
GRACIELA BOENTE
Departamento de Matemáticas/IMAS, Facultad de Ciencias Exactas y Naturales,
Ciudad Universitaria, Pabellón 1, 1428 Buenos Aires, Argentina
Tel (54-11) 4576-3335, e-mail gboente@dm.uba

EDITORES ASOCIADOS / ASSOCIATE EDITORS
M. AGUILERA, CEPAL, CHILE
R. ASSUNCAO, Univ. Fed. Minas Gerais, BRASIL
L. BECCARIA, Univ. Gral. Sarmiento, ARGENTINA
A. BIANCO, Univ. Nac. Buenos Aires, ARGENTINA
M. BLACONA, Univ. Nac. Rosario, ARGENTINA
O. BUSTOS, Univ. Nac. Córdoba, ARGENTINA
G. M. CORDEIRO, Univ. Fed. Pernambuco, BRASIL
A. CUEVAS, Univ. Autónoma de Madrid, ESPAÑA
E. DAGUM, Consultor/Consultant, CANADA
E. de ALBA, INEGI, MEXICO
L. ESCOBAR, Louisiana State Univ., USA
W. GONZALEZ MANTEIGA, Univ. Santiago de Compostela, ESPAÑA
V. GUERRERO GUZMAN, ITAM, MEXICO
M. I. LEAL DE CARVALHO GOMES, Univ. de Lisboa, PORTUGAL
L. MELO VELANDIA, Banco de la República, COLOMBIA
P. MORETTIN, Univ. de Sao Paulo, BRASIL
F. NIETO SANCHEZ, Univ. Nac. Colombia, COLOMBIA
M. SALIBIAN BARRERA, Univ. British Columbia, CANADA
E. SENRA DIAZ, Univ. de Alcalá, ESPAÑA
F. TIBALDI, Scientific Institute of Public Health, BELGICA
L. TRUJILLO OYOLA, Univ. Nac. Colombia, COLOMBIA
P. VERDE, University of Düsseldorf, ALEMANIA



ESTADÍSTICA (2015), 67, 188 y 189, pp. 5-6 © Instituto Interamericano de Estadística

NOTA DE LA OFICINA EDITORIAL

En este volumen, coincidente con el 75° aniversario de la creación del Instituto Interamericano de Estadística, se presentan algunos artículos especiales sobre los que quisiéramos hacer una referencia particular.

En primer lugar debemos señalar que, dado que después de muchos años habría de celebrarse en la región latinoamericana un Congreso Mundial de Estadística del Instituto Internacional de Estadística (ISI), el IASI tuvo oportunidad de hacerse presente de varias maneras. Se trató del 60° Congreso Mundial de Estadística, realizado en Río de Janeiro, Brasil, del 26 al 31 de julio de 2015. El IASI participó directamente en el programa del Congreso mediante la organización y realización de una sesión de trabajos invitados y dos sesiones sobre tópicos especiales. Además, realizó en Río de Janeiro la sesión anual 2015 de su Comité Ejecutivo, organizó un concurso extraordinario del Premio IASI a la Excelencia, cuyo ganador fue presentado en el Congreso, y mantuvo una serie de contactos tendientes a la cooperación con otras instituciones, uno de los cuales condujo muy pronto a la formalización de un Memorándum de Entendimiento con la Asociación Internacional para las Estadísticas Oficiales (International Association for Official Statistics, IAOS) para la selección de artículos que serían publicados tanto en la revista Estadística como en el Statistical Journal de la IAOS.

Las actividades que formaron parte del programa del Congreso fueron: (1) Sesión de trabajos invitados IPS088, “Desafíos para el Desarrollo de la Estadística en Países de América Latina”, organizada por Pedro A. Morettin, ex presidente del IASI y actual miembro del Consejo Consultivo del Instituto; (2) Sesión sobre tópicos especiales STS049, “El censo de población sin papel: solución para el futuro”, organizada por Evelio O. Fabbroni, Director Ejecutivo del IASI; y (3) Sesión sobre tópicos especiales STS048, “Salvaguarda de la integridad de las estadísticas y de la independencia de los estadísticos: Producción de estadísticas de acuerdo con consideraciones estrictamente profesionales”, organizada por Juan Carlos Abril, en ese momento Presidente del Instituto.



En esta última sesión se presentaron cuatro ponencias; una de ellas, “Las estadísticas como instrumento para sociedades democráticas, prósperas y transparentes”, de Carlo Malaguerra y Alphonse MacDonald, se presenta en este volumen y, con la finalidad de facilitar su mayor divulgación, se publica en inglés y en español.

Como se señaló anteriormente, en el 60° Congreso Mundial de Estadística se entregó el Premio IASI a la Excelencia, otorgado mediante concurso extraordinario convocado en celebración de la fundación del Instituto. Este Premio busca identificar y reconocer nuevos talentos en el área de Estadística en la región de las Américas, atraer su atención hacia el Instituto, estimular su actuación en favor del desarrollo de la Estadística en la región y facilitar la divulgación de la producción de trabajos relevantes de candidatos jóvenes. El ganador del concurso, cuyo trabajo se publica en este volumen, fue Christian E. Galarza, de la Escuela Superior Politécnica del Litoral, Guayaquil, Ecuador, con la coautoría de Víctor H. Lachos, Universidade Estadual de Campinas, Campinas, Brasil, quien fue su Director de Tesis, por su trabajo “Likelihood based inference for quantile regression in nonlinear mixed effects models”.

El Director Ejecutivo del IASI, Evelio O. Fabbroni, ha escrito unas palabras en conmemoración de las Bodas de Platino del Instituto Interamericano de Estadística, haciendo una breve pero sustanciosa referencia histórica a la creación de nuestro Instituto y un resumen de las actividades que actualmente desarrolla el IASI.

Al cierre de esta Nota queremos agradecer de modo especial a la Lic. Delia Keller y a la Dra. Alicia Picco por la valiosa colaboración brindada en la edición de dos artículos incluidos en la presente entrega de la revista, los de Edmundo Berumen Torres y de Christian E. Galarza, respectivamente.

Diciembre 2015


ESTADÍSTICA (2015), 67, 188 y 189, pp. 7-8 © Instituto Interamericano de Estadística

NOTE FROM THE EDITORIAL OFFICE

This volume, which coincides with the 75th anniversary of the creation of the Inter-American Statistical Institute, presents some special papers to which we would like to draw particular attention.

First, we should note that, after many years, a World Statistics Congress of the International Statistical Institute (ISI) was to be held in the Latin American region, and IASI had the opportunity to participate in several ways. The event was the 60th World Statistics Congress, held in Rio de Janeiro, Brazil, from 26 to 31 July 2015. IASI participated directly in the program of the Congress through the organization and holding of one invited paper session and two special topic sessions. In addition, the 2015 annual session of its Executive Committee was held in Rio de Janeiro; an extraordinary contest of the IASI Award for Excellence was organized, whose winner was presented at the Congress; and a series of contacts aimed at cooperation with other institutions was maintained, one of which soon led to the formalization of a Memorandum of Understanding with the International Association for Official Statistics (IAOS) for the selection of papers to be published both in the journal Estadística and in the Statistical Journal of the IAOS.

The activities that formed part of the program of the Congress were: (1) Invited Paper Session IPS088, “Challenges for the Development of Statistics in Latin American Countries”, organized by Pedro A. Morettin, former President of IASI and current member of the Advisory Board of the Institute; (2) Special Topic Session STS049, “The paperless population census: solution for the future”, organized by Evelio O. Fabbroni, Executive Director of IASI; and (3) Special Topic Session STS048, “Safeguarding the integrity of statistics and the independence of statisticians: Producing statistics according to strictly professional considerations”, organized by Juan Carlos Abril, at that time President of the Institute.



In this last session four papers were presented; one of them, “Statistics as instruments for prosperous, transparent and democratic societies”, by Carlo Malaguerra and Alphonse MacDonald, is presented in this volume and, in order to facilitate its wider dissemination, appears both in English and in Spanish.

As indicated above, the IASI Award for Excellence, granted through an extraordinary contest held in celebration of the foundation of the Institute, was presented at the 60th World Statistics Congress. This award aims to identify and recognize new talents in the area of Statistics in the region of the Americas, attract their attention towards the Institute, stimulate their action in favor of statistical development in the region, and facilitate the dissemination of relevant work by young statisticians. The winner of the contest, whose work is published in this volume, was Christian E. Galarza, of the Escuela Superior Politécnica del Litoral, Guayaquil, Ecuador, with the co-authorship of Víctor H. Lachos, Universidade Estadual de Campinas, Campinas, Brazil, who was his thesis advisor, for his work “Likelihood based inference for quantile regression in nonlinear mixed effects models”.

The Executive Director of IASI, Evelio O. Fabbroni, has written a few words in commemoration of the Platinum Jubilee of the Inter-American Statistical Institute, making a brief but substantial historical reference to the creation of our Institute and giving a summary of the activities currently carried out by IASI.

At the close of this Note, we want to give special thanks to Lic. Delia Keller and Dr. Alicia Picco for the invaluable assistance provided in the editing of two papers included in this issue of the journal, those by Edmundo Berumen Torres and Christian E. Galarza, respectively.

December 2015


ESTADÍSTICA (2015), 67, 188 y 189, pp. 9-19 © Instituto Interamericano de Estadística

STATISTICS AS INSTRUMENTS FOR PROSPEROUS, TRANSPARENT AND DEMOCRATIC SOCIETIES

CARLO MALAGUERRA
Former Director General of the Swiss Federal Statistical Office, Sion, Valais, Switzerland
carlo.malaguerra@gmail.com

ALPHONSE L. MACDONALD
Former Senior Official of the United Nations Population Fund, UNFPA, New York, USA
fonzhan@gmail.com

ABSTRACT

In January 2014 the General Assembly of the United Nations endorsed the Fundamental Principles of Official Statistics, which were adopted by the Statistical Commission of the United Nations in April 1994, following an initiative of the Conference of European Statisticians. Valid and reliable information is essential for the management of the affairs of a democratic society aiming at generalised wellbeing and prosperity. It is important that users and stakeholders of official statistics and the citizens at large have total confidence in statistics. To produce valid and reliable statistics it is necessary that governments provide the legal framework and resources to the statistical systems of their countries to allow statisticians to produce the required statistical information, without interference, using the best available methodology and techniques from the best suited sources of information. Respondents, be they individuals, enterprises or organisations, have to provide the required information truthfully and as completely as possible. Official statistics have to guarantee that such individual information will be used for statistical purposes only. Moreover, the results of statistical enquiries have to be made available to all users without distinction. Such basic requirements of official statistics were not respected in the centrally planned economies before 1989, and even in some of the countries with market economies. During the transition of the countries of Eastern and Central Europe toward democracies and market economies, it was recognized that official statistics play an essential role in preserving democracy and that their special and unique role should be recognized by governments and the public at large. At the request of one of the Eastern European countries, the Conference of European Statisticians proposed a Charter called “Fundamental Principles of Official Statistics”, establishing the parameters to guarantee the production of valid and reliable official statistics. As the years passed, it was recognized that these “Principles” should have universal validity. This was achieved in 2014 with the endorsement of the “Principles” by the United Nations General Assembly. Consequently, the Fundamental Principles of Official Statistics have universal acceptance and should be adhered to by all nations and societies. Suggestions are made to ensure that the Fundamental Principles continue to be adhered to.

Keywords: Fundamental Principles of Official Statistics, democratic society, functional independence

This paper was presented during the 60th World Statistics Congress of the International Statistical Institute, ISI2015, which took place in Rio de Janeiro, Brazil, from 26 to 31 July 2015.

RESUMEN

En enero de 2014 la Asamblea General de las Naciones Unidas dio su respaldo a los Principios Fundamentales de las Estadísticas Oficiales, que fueron adoptados por la Comisión de Estadística de las Naciones Unidas en abril de 1994, a raíz de una iniciativa de la Conferencia de Estadísticos Europeos. La información válida y confiable es esencial para la gestión de los asuntos orientados al bienestar generalizado y la prosperidad en una sociedad democrática. Es importante que los usuarios, los interesados en las estadísticas oficiales y los ciudadanos en general tengan absoluta confianza en las estadísticas. Para producir estadísticas válidas y fiables es necesario que los gobiernos establezcan el marco legal y los recursos para el sistema estadístico de sus países, para permitir a los estadísticos producir la información estadística necesaria sin interferencia, utilizando la mejor metodología y técnicas disponibles de las fuentes más adecuadas de información. Los informantes, ya sean individuos, empresas u organizaciones, tienen que proporcionar la información requerida con veracidad y en la forma más completa posible. Las oficinas estadísticas tienen que garantizar que dicha información individual será utilizada únicamente con fines estadísticos. Además, los resultados de encuestas estadísticas han de ponerse a disposición de todos los usuarios sin distinción. Tales requisitos básicos de las estadísticas oficiales no se respetaron en las economías de planificación centralizada antes de 1989, e incluso en algunos de los países con economías de mercado. Durante el proceso de transición hacia democracias y economías de mercado de los países de Europa Oriental y Central, se reconoció que las estadísticas oficiales desempeñan un papel esencial para la preservación de la democracia y que su papel especial y único debe ser reconocido por los gobiernos y el público en general. A petición de uno de los países de Europa del Este, la Conferencia de Estadísticos Europeos propuso una Carta denominada “Principios Fundamentales de las Estadísticas Oficiales” que estableciera los parámetros para garantizar la producción de estadísticas oficiales válidas y fiables. Con el paso de los años se reconoció que estos “Principios” debían tener una validez universal. Esto se alcanzó en 2014 con el respaldo de los “Principios” por parte de la Asamblea General de las Naciones Unidas. Por consiguiente, los Principios Fundamentales de las Estadísticas Oficiales tienen aceptación universal y deben ser respetados por todas las naciones y sociedades. Se hacen sugerencias para asegurar la permanente adhesión a los Principios Fundamentales.

Palabras clave: Principios Fundamentales de las Estadísticas Oficiales, sociedad democrática, independencia funcional

Introduction

Currently, citizens of most, if not all, countries expect to have access to up-to-date, valid and reliable statistical information about their society and the world at large. Valid and reliable statistical information is essential for the management of the affairs of a democratic society aiming at generalised wellbeing and prosperity. It is therefore important that users of official statistics and the citizens at large have total confidence in the quality of the statistics. To produce valid and reliable statistics it is necessary that governments provide the legal framework and resources to the statistical systems of their countries to allow statisticians to produce the required statistical information, without interference, using the best available methodology and techniques from the best suited sources of information. Respondents, be they individuals, enterprises or organisations, have to provide the required information truthfully and as completely as possible. National laws on the statistical system have to guarantee that such individual information will be used for statistical purposes only. Moreover, the results of statistical enquiries have to be made available to all users without distinction. This appreciation of statistical information and of the way it is produced is of recent origin. Twenty-five years or so ago the situation was very different in many countries.

Statistics: nature and early developments

Historically, all societies of a certain complexity have required information that allows and enables them to regulate their affairs.


There is physical and literary evidence that classical historical societies, such as Babylon and Egypt, had well developed mathematical systems which were used to prepare records on population, the size of agricultural units, production, trade patterns and transactions. It is said that numeracy predates literacy; this is exemplified by the Inca Empire in South America, which had no writing but a well-developed accountancy system based on the quipu, a mnemonic device consisting of ‘rows of strings in which the colour of the threads and the loops of the knots represented arithmetical units or recording categories’ (Hemming, 1972, p. 61). After the conquest of Peru, until about 1600, the Spanish authorities recognised the quipus as valid records in judicial processes and allowed them to be used as instruments of data collection by native clerks employed in their administration (Loza, 1998).

In medieval Europe the administrators of city states, duchies and kingdoms kept records, in more or less systematic fashion, on issues that were of interest to the rulers, mainly for taxation and defence (able-bodied men) purposes [1]. This was of most importance after the conquest of new lands, and probably the most exact and comprehensive data collection exercise ever carried out was the compilation of the so-called Domesday Book [2], covering much of England and part of Wales after the Norman Conquest of 1066. Ecclesiastical authorities kept detailed records on their parishioners, including births, deaths and marriages.

From the Enlightenment onwards, persons interested in the advancement of knowledge, science and society established “learned societies” in which topics of scientific and societal interest were discussed. Individual scholars carried out numerical studies on a wide range of population, social, economic and health phenomena, which gave rise to the development of ‘political arithmetic’, a term introduced by William Petty in England (Meitzen & Falkner, 1891, p. 30 and Stigler, 2005, p. 223). Their studies were an early form of descriptive statistics, or accountancy of the nation, considered to be of national interest and a contribution to its power and prestige. Political arithmetic was an instrument for the enhancement of the public administration of a nation. The best known scholar in this field is John Graunt (Glass, 1964), who is considered to be the founder of demography, and whose studies on the causes of mortality (Observations upon the Bills of Mortality) show sound methodology and an ability to work with incomplete, inconsistent and error-plagued data that withstands modern scrutiny.

The findings of these studies were not always to the liking of those in power, as exemplified by the case of Johan Heinrich Waser, a “burgher” (citizen) and scientist of the city of Zürich, Switzerland, whose study on buildings and the risks of fire was considered an act of treason. He was decapitated on 27 March 1780 (Graber, 1980).


The Founding Fathers of the United States of America were probably the first to establish a link between democracy, governance and statistics when they formulated the Constitution. In Article 1, Section 2, clause 3 [3] they mandated that the number of Representatives and the direct taxes should be based on the number of residents in each State, which would be established by a decennial enumeration.

Parallel to the compilation and analysis of the numerical information of nations, statistical techniques and methods were devised, and in mathematics the theories of probability and of errors were developed, which would provide the theoretical underpinning of the emerging modern science of statistics. Given the political and cultural fragmentation of Europe, the early statistical compilations used a wide variety of methods, and studies on the same subject were very often not comparable.

Standardisation, impact, convergence and divergence

The data requirements of Napoleonic and post-Napoleonic France and of Prussia had important consequences for the development of public (official) statistics. The 19th century was a period of rapid and profound political, economic and social change, which had important consequences for the development of statistics. In the early 19th century the establishment of statistical societies (the Société de Statistique de Paris, 1803, the Royal Statistical Society of London, 1834, and the American Statistical Association, 1839) promoted and contributed to the standardisation of methods and procedures. These efforts culminated in a series of International Statistical Conferences between 1853 and 1885, originally organised under the dynamic leadership of Adolphe Quételet, which in 1885 led to the establishment of the International Statistical Institute (ISI) (De Neumann Spallart, 1886).

The results of statistical enquiries carried out by independent scholars influenced the growth of national consciousness (Switzerland) and contributed to the creation of nation states (Italy and Germany). In several European countries the creation of National Statistical Commissions had a positive impact on the acceptance of statistical information and its usefulness for governance. Several countries started to include statistics in their public administration, by creating statistical units in ministries or, later, by establishing national and sub-national statistical offices. This confirmed the importance of statistical information for policy development and governance. However, at the same time it changed the status of statisticians, who up to then had mostly been independent scholars. Statisticians in government service became civil servants subject to the rules and regulations of the civil service, which could endanger their scientific independence. The replacement of authoritarian regimes by liberal democratic ones changed the nature of official statistics.


Official statistics were not only necessary for policy formulation and development; they also became means by which parliament, the electorate and ultimately the population at large could verify governmental compliance with policies.

The birth of the Fundamental Principles

In the twentieth century important developments in theoretical and applied statistics were realised. Statistical development, in both “administrative and scientific statistics”, was promoted by the International Statistical Institute (ISI). In the early 20th century the League of Nations provided a forum for the Directors of all National Statistical Offices to discuss issues concerning statistical standards, which became especially relevant through the first International Conference of Economic Statistics in 1928. This avenue for technical dialogue was continued by the Conference of European Statisticians (CES) [4], established in 1953. Up until the nineteen-sixties the main actors in the development of statistics were Europeans, North Americans and some members of the British Commonwealth, with limited contributions from Latin Americans and Asians.

The creation of the Soviet Union at the beginning of the 20th century and the establishment of the ‘Eastern bloc’ after World War II created two main antagonistic political blocs: the “liberal free market economies” and the “centrally planned socialist economies”. These had differing political views and practices in the use of science in general and statistics in particular. After a period of antagonism they developed a modus vivendi, in which the divergence in ideologies and in the objectives and application of statistics was mutually respected. The activities of the Conference of European Statisticians were concentrated on technical issues of common interest, and the political and ideological issues underlying the two systems and the way statistics were collected and used were avoided. It was common knowledge that in the socialist countries statistics served mainly the interest of the government (the party), that information could be falsified, distorted or suppressed, and that only selected information was made available to the research community and the wider public (Anderson et al., 1994). Statistical offices in the democratic countries were thought to be guided by what would become the fundamental principles of official statistics. However, similar and other issues affecting their statistical offices could equally not be discussed within the activities of the CES. Therefore issues related to the independence of statistical offices and statisticians were simply not discussed. As a result, the public debate on the role of statistics, statisticians and statistical offices did not take place.


The fall of the Berlin Wall on 9 November 1989 altered the geopolitical structure of the world; it engendered public optimism about future developments based on the principles of democracy, including transparency and accountability, and created a fundamental change in the mind-set of the members of the CES. Three months after the fall of the Berlin Wall, in an extraordinary meeting of the CES, its members started the discussion about the consequences of the geopolitical changes for international statistical cooperation. The consequences of the change were considered for both the former socialist countries and the free market economy countries. Thereafter, statisticians discussed and reached a preliminary consensus on the nature of statistical information as a public good and on its role in governance and in the democratisation of society. More importantly, it was the statisticians who brought the importance of statistics for governance to the attention of the politicians and the public at large.

The CES was one of the first institutions, if not the first, to draw conclusions from the changed world outlook and to propose new arrangements for the profession within the newly emerging geopolitical structure. It was the Polish delegation, through Mr. Jozef Olenski, who, to the surprise of some of the delegations of the democratic countries, requested that the CES develop and proclaim an international convention on official statistics. In the ensuing discussions the idea attracted considerable support, but it was decided that what was needed was a Charter of fundamental principles, applicable to all countries, not only the countries in transition from a socialist to a free market economy. The Bureau of the CES requested the Polish National Statistical Office to develop a draft of the Fundamental Principles of Official Statistics for the next annual (1990) meeting.

The formulation of the Fundamental Principles involved considerable reflection and discussion, as within the CES there were two streams of thought: on the one hand, those who would have liked to see a formalisation of the functions, duties and privileges of statisticians and statistical offices; on the other, those who were in favour of a more flexible, practical political approach. Members of the CES and staff of the UNECE secretariat spent many hours discussing proposals and counterproposals. A compromise between these two approaches and on the language of the document was achieved, and at the 1991 meeting of the Conference of European Statisticians the Fundamental Principles of Official Statistics were approved.

The statisticians had completed their task of agreeing on common principles for their profession to operate within society, establishing the parameters of behaviour for all stakeholders in the process. Governments should provide the legal framework and resources to the statistical systems of their countries to allow statisticians to produce the required statistical information, without interference, using the best available methodology and techniques from the best suited sources of information. Respondents, be they individuals, enterprises or organisations, have to provide the required information truthfully and as completely as possible.


Official statistics legislation has to guarantee that such individual information will be used for statistical purposes only. Moreover, the results of statistical enquiries have to be made available to all users without distinction. It was then up to the political leadership to play their part, first in Europe and afterwards globally, in establishing the universal applicability of these Principles. This was achieved in January 2014 [5].

Conclusions

Following the adoption of the Fundamental Principles by the United Nations Economic Commission for Europe (UNECE) in 1992, the Statistical Commission of the United Nations endorsed them in 1994, thereby indicating the global validity of these principles. Although the Fundamental Principles were originally conceived to assist former socialist economies in modernising their national statistical systems, they appealed to statisticians in the developing countries as well. Many of these countries had only achieved national independence after World War II and had inherited an antiquated, sometimes authoritarian, statistical service which was based on the interests of the colonial power and was not oriented towards national development. Consequently, many developing countries adopted the Fundamental Principles as guidelines for the organisation of their national statistical systems and the execution of their statistical programmes. In 2009 the African Union adopted the African Charter on Statistics [6], which according to its Article 3 fully incorporates the Fundamental Principles. In 2013 the Economic and Social Council (ECOSOC) endorsed the Fundamental Principles of Official Statistics (E/2013/21), and on 29 January 2014 they were endorsed by the General Assembly of the United Nations, thereby giving them universal applicability.

Up-to-date, valid and reliable statistics are essential for the management of a democratic society. In pre- and non-democratic societies information, and hence statistics, was part of the political power base of the rulers: kings, princes, dictators or colonising powers. The people had no access to statistical information. The democratisation of regimes changed the role of official statistics. They are now considered public goods, shared with the population, and serve as means of verifying governmental compliance with policies. Information is power: who owns information owns power.

Political leaders, including parliamentarians, statisticians and the public at large are well advised to consider the injunction of an American statistician on the role of statistics in governance and world peace.


That role of statistics in promoting (global) governance and world peace was evoked by Dr. S.N.D. North, Assistant Secretary and Statistician of the Carnegie Endowment for International Peace, at the commemoration of the seventy-fifth anniversary of the American Statistical Association in 1918, when he stated: “Statistics is the twin sister of international law, in multiplying the ways and methods of mutual help, cooperation and understanding between the nations. Both sciences supply indispensable links in the lengthening chain of world unity.” (Koren, 1918).

To ensure that the Fundamental Principles continue to be adhered to and respected, we repeat a suggestion made by Carlo Malaguerra in his presentation to the 2014 session of the UN Statistical Commission (Malaguerra, 3 March 2014): the international organisations dealing with statistics should regularly discuss the Fundamental Principles and their role in the democratic process, and should give Chief Statisticians the possibility to present their views. A second suggestion is that, in line with the principles of transparency and accountability, an inventory be made of the compliance of national statistical offices with the provisions of the Fundamental Principles on the basis of data in the public domain. This exercise should be carried out by an independent research institute or scientific foundation, and the member States of the CES should be dealt with first, because the Fundamental Principles were developed on their initiative.

References

ANDERSON, B.A.; KATUS, K. and SILVER, B.D. (1994). “Development and prospects for population statistics in countries of the former Soviet Union”. Population Index. 60 (1): 4-20.

CONFERENCE OF EUROPEAN STATISTICIANS (2014). Members’ Guidebook. http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/bur/2014/October/CES_Members_Guidebook.pdf

DE NEUMANN SPALLART, M.F.X. (1886). “La fondation de l’Institut International de Statistique: Aperçu historique”. Bulletin de l’Institut International de Statistique. I (1-2): 1-34.

GLASS, D.V. (1964). “John Graunt and His Natural and Political Observations”. Notes and Records of the Royal Society of London. 19 (1): 63-100. http://www.jstor.org/stable/3519862. Accessed 26-03-2015 17:00 UTC.

GRABER, R. (1980). “Der Wasser-Handel”. Revue Suisse d’histoire. 39: 321-356.


HEMMING, J. (1972). The Conquest of the Incas. First Abacus edition. Sphere Books Ltd. London.

KOREN, J. (1918). The History of Statistics: Their Development and Progress in Many Countries. Memoirs to commemorate the seventy-fifth anniversary of the American Statistical Association. The Macmillan Company. New York.

LOZA, C.B. (1998). “Du bon usage des quipus face à l’administration coloniale espagnole (1550-1600)”. Population (édition française, Institut National d’Études Démographiques). 53 (1-2, Population et Histoire): 139-159.

MALAGUERRA, C. (2012). Keynote speech at the Conference of European Statisticians. Paris, France. http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/2012/Malaguerra_keynote_address.pdf

MALAGUERRA, C. (2014). Video presentation at the United Nations Statistical Commission, High Level Forum on Official Statistics: UN Fundamental Principles of Official Statistics. 3 March 2014. New York. http://unstats.un.org/unsd/statcom/statcom_2014/seminars/High_Level_Forum/default.html

MEITZEN, A. and FALKNER, R.P. (1891). “History, Theory, and Technique of Statistics. Part First: History of Statistics”. The Annals of the American Academy of Political and Social Science. 1 (2, pt. 1): 1-100. http://www.jstor.org/stable/1008943. Accessed 27-03-2015 20:07 UTC.

STIGLER, S.M. (2005). “Statistics and the Wealth of Nations”. International Statistical Review. 73 (2): 223-225.


Notes

[1] An excellent overview of the development of statistics up to the end of the 19th century can be found in: Meitzen, A. and Falkner, R.P., “History, Theory, and Technique of Statistics. Part First: History of Statistics”, The Annals of the American Academy of Political and Social Science, Vol. 1, Supplement 2, Part 1 (Mar., 1891), pp. 1-100 (stable URL: http://www.jstor.org/stable/1008943, accessed 27-03-2015 20:07 UTC).

[2] The original documents are at the National Archives in London, England, and are accessible online through http://www.nationalarchives.gov.uk/domesday/

[3] See http://constitution.findlaw.com/articles.html. This article was modified by the 13th (1865) and 14th (1868) Amendments. See http://www.senate.gov/civics/constitution_item/constitution.htm#amendments

[4] In spite of its name, the Conference is open to the Directors of the National Statistical Offices of all countries, provided they participate regularly in the activities of the Conference. At present the regular membership consists of all member states of the United Nations Economic Commission for Europe (UNECE), which include Canada and the USA, the OECD member countries, and some other countries outside the region, for example Brazil, China, Colombia and Mongolia. See Conference of European Statisticians, Members’ Guidebook, at http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/bur/2014/October/CES_Members_Guidebook.pdf

[5] This section is based on the keynote speech by Carlo Malaguerra at the Conference of European Statisticians, Paris, France; see http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/2012/Malaguerra_keynote_address.pdf

[6] See http://www.au.int/en/sites/default/files/AFRICAN_CHARTER_ON_STATISTICS.pdf

Invited Paper
Received September 2015
Revised December 2015



ESTADÍSTICA (2015), 67, 188 y 189, pp. 21-32 © Instituto Interamericano de Estadística

LAS ESTADÍSTICAS COMO INSTRUMENTO PARA SOCIEDADES DEMOCRÁTICAS, PRÓSPERAS Y TRANSPARENTES

CARLO MALAGUERRA
Ex Director General de la Oficina Federal de Estadística de Suiza, Sion, Valais, Suiza
carlo.malaguerra@gmail.com

ALPHONSE L. MACDONALD
Ex alto funcionario del Fondo de Población de las Naciones Unidas, FNUAP, Nueva York, EE.UU.
fonzhan@gmail.com

RESUMEN

En enero de 2014 la Asamblea General de las Naciones Unidas dio su respaldo a los Principios Fundamentales de las Estadísticas Oficiales, que fueron adoptados por la Comisión de Estadística de las Naciones Unidas en abril de 1994, a raíz de una iniciativa de la Conferencia de Estadísticos Europeos. La información válida y fiable es esencial para la gestión de los asuntos orientados al bienestar generalizado y la prosperidad en una sociedad democrática. Es importante que los usuarios, los interesados en las estadísticas oficiales y los ciudadanos en general tengan absoluta confianza en las estadísticas. Para producir estadísticas válidas y fiables es necesario que los gobiernos establezcan el marco legal y provean los recursos para el sistema estadístico de sus países, para permitir a los estadísticos producir la información estadística necesaria sin interferencia, utilizando la mejor metodología y técnicas disponibles de las fuentes más adecuadas de información. Los informantes, ya sean individuos, empresas u organizaciones, tienen que proporcionar la información requerida con veracidad y en la forma más completa posible. Las oficinas estadísticas tienen que garantizar que dicha información individual será utilizada únicamente con fines estadísticos. Además, los resultados de encuestas estadísticas han de ponerse a disposición de todos los usuarios sin distinción.

Tales requisitos básicos de las estadísticas oficiales no se respetaron en las economías de planificación centralizada antes de 1989, e incluso en algunos de los países con economías de mercado. Durante el proceso de transición hacia democracias y economías de mercado de los países de Europa Oriental y Central, se reconoció que las estadísticas oficiales desempeñan un papel esencial para la preservación de la democracia y que su papel especial y único debe ser reconocido por los gobiernos y el público en general. A petición de uno de los países de Europa del Este, la Conferencia de Estadísticos Europeos propuso una Carta denominada “Principios Fundamentales de las Estadísticas Oficiales” que estableciera los parámetros para garantizar la producción de estadísticas oficiales válidas y fiables. Con el paso de los años se reconoció que estos “Principios” debían tener una validez universal. Esto se alcanzó en 2014 con el respaldo de los “Principios” por parte de la Asamblea General de las Naciones Unidas. Por consiguiente, los Principios Fundamentales de las Estadísticas Oficiales tienen aceptación universal y deben ser respetados por todas las naciones y sociedades. Se hacen sugerencias para asegurar la permanente adhesión a los Principios Fundamentales.

Palabras clave: Principios Fundamentales de las Estadísticas Oficiales, sociedad democrática, independencia funcional

Este artículo fue presentado durante el 60° Congreso Mundial de Estadística del Instituto Internacional de Estadística, ISI2015, que tuvo lugar en Río de Janeiro, Brasil, del 26 al 31 de julio de 2015.

ABSTRACT

In January 2014 the General Assembly of the United Nations endorsed the Fundamental Principles of Official Statistics, which were adopted by the Statistical Commission of the United Nations in April 1994, following an initiative of the Conference of European Statisticians. Valid and reliable information is essential for the management of the affairs of a democratic society aiming at generalised wellbeing and prosperity. It is important that users and stakeholders of official statistics and the citizens at large have total confidence in statistics. To produce valid and reliable statistics it is necessary that governments provide the legal framework and resources to the statistical systems of their countries to allow statisticians to produce the required statistical information, without interference, using the best available methodology and techniques from the best suited sources of information. Respondents, be they individuals, enterprises or organisations, have to provide the required information truthfully and as completely as possible. Official statistics have to guarantee that such individual information will be used for statistical purposes only. Moreover, the results of statistical enquiries have to be made available to all users without distinction. Such basic requirements of official statistics were not respected in the centrally planned economies before 1989, and even in some of the countries with market economies. During the transition of the countries of Eastern and Central Europe toward democracies and market economies, it was recognized that official statistics play an essential role in preserving democracy and that their special and unique role should be recognized by governments and the public at large. At the request of one of the Eastern European countries, the Conference of European Statisticians proposed a Charter called “Fundamental Principles of Official Statistics”, establishing the parameters to guarantee the production of valid and reliable official statistics. As the years passed, it was recognized that these “Principles” should have universal validity. This was achieved in 2014 with the endorsement of the “Principles” by the United Nations General Assembly. Consequently, the Fundamental Principles of Official Statistics have universal acceptance and should be adhered to by all nations and societies. Suggestions are made to ensure that the Fundamental Principles continue to be adhered to.

Keywords: Fundamental Principles of Official Statistics, democratic society, functional independence

Introducción

En la actualidad los ciudadanos de la mayoría de los países, si no de todos, esperan tener acceso a información estadística actual, válida y fiable sobre la sociedad y el mundo en general. La información válida y fiable es esencial para la gestión de una sociedad democrática cuyos objetivos son el bienestar general y la prosperidad. Es importante que los usuarios y grupos de personas con interés en las estadísticas oficiales y los ciudadanos en general tengan confianza absoluta en las estadísticas. Para producir estadísticas válidas y fiables, es necesario que los gobiernos establezcan el marco legal y provean los recursos necesarios a los sistemas estadísticos de sus países para permitir que los estadísticos produzcan la información estadística necesaria, sin interferencia alguna, utilizando la mejor metodología y las mejores técnicas disponibles a partir de las fuentes más adecuadas de información. Los proveedores de datos, sean estos individuos, empresas u organizaciones, deberían proporcionar la información requerida con veracidad y lo más completa posible. Las leyes nacionales sobre el sistema estadístico deberían garantizar que la información personal o individual sea utilizada únicamente para fines estadísticos. Además, los resultados de los trabajos estadísticos deberían estar accesibles a todos los usuarios sin distinción.


Esta apreciación de la información estadística y de la forma en que se produce es de origen reciente. Hace unos veinticinco años la situación era muy distinta en muchos países.

Las estadísticas: su naturaleza y desarrollo inicial

Históricamente, todas las sociedades de cierto nivel de complejidad requirieron información que les permitiera regular los asuntos de la sociedad. Hay evidencia, física y literaria, de que las sociedades históricas clásicas, tales como Babilonia y Egipto, tenían sistemas matemáticos bien desarrollados que se utilizaron para preparar los registros de población, de los tamaños de las unidades agrícolas y de la producción, de las relaciones comerciales y de las transacciones comerciales y financieras. Hay evidencia histórica de que el hombre sabía contar ¡antes de leer y escribir! Esto está ejemplificado por el imperio Inca en América del Sur, que no conocía la escritura, pero tenía un sistema de contabilidad bien desarrollado basado en el quipu. Este era un recurso mnemotécnico consistente en “un conjunto de hileras con nudos en el cual el color de la hilera representaba la característica a medir, y la forma y complejidad de los nudos los valores numéricos” (Hemming, 1972, p. 61). Después de la conquista del Perú, hasta cerca de 1600, las autoridades españolas reconocieron los quipus como registros válidos en los procesos judiciales y permitieron a los empleados indígenas que trabajaban en la administración colonial que los utilizaran como instrumentos de recolección de datos (Loza, 1998).

En la Europa medieval los administradores de ciudades-estado, ducados y reinos mantenían registros, de manera más o menos sistemática, sobre asuntos que eran de interés para los gobernantes, principalmente con fines de impuestos y defensa (hombres capaces de portar armas) [i]. Estos eran de mayor importancia después de la conquista de nuevas tierras, y probablemente la recopilación de datos más exacta y completa realizada en la historia fue la compilación de información para el llamado “Domesday Book” [ii] (“el libro del día del juicio final”), que fue un inventario exhaustivo de la población, sus tierras y los bienes de las comunidades de gran parte de Inglaterra y de Gales después de la conquista normanda de 1066. Las autoridades eclesiásticas mantuvieron registros detallados de sus feligreses, de los nacimientos, defunciones y matrimonios.

Desde la época de la Ilustración las personas interesadas en el avance del conocimiento, la ciencia y la sociedad establecieron “sociedades científicas” en las cuales se debatieron temas de interés científico y social.


Investigadores particulares llevaban a cabo estudios numéricos sobre una amplia gama de fenómenos de población, sociales, económicos y de salud, que dieron lugar al desarrollo de la “aritmética política”, un término introducido por William Petty en Inglaterra (Meitzen y Falkner, 1891, p. 30 y Stigler, 2005, p. 223). Estos estudios fueron una forma temprana de la estadística descriptiva, o contabilidad de la nación, considerados de interés nacional y que contribuían al poder y prestigio de la nación. La aritmética política era un instrumento para el mejoramiento de la administración pública de una nación. El autor más conocido en este campo es John Graunt (Glass, 1964), considerado como el fundador de la demografía, cuyos estudios sobre las causas de la mortalidad (Observations upon the Bills of Mortality) mostraron una metodología sólida y la capacidad para trabajar con datos incompletos, inconsistentes y plagados de errores, que resisten el escrutinio moderno.

Los hallazgos de estos estudios no siempre fueron del agrado de los gobernantes, como se ejemplifica en el caso de Johan Heinrich Waser, un “burgués” (ciudadano) y científico de la ciudad de Zúrich, Suiza, cuyo estudio sobre los edificios y los riesgos de incendio se consideró un acto de traición. Fue decapitado el 27 de marzo de 1780 (Graber, 1980). Los Padres Fundadores de los Estados Unidos de América fueron probablemente los primeros en establecer un vínculo directo entre la democracia, la gobernabilidad y las estadísticas cuando formularon la Constitución. En el artículo 1, sección 2, inciso 3 [iii] establecieron que el número de representantes y los impuestos directos de los Estados deberían basarse en el número de residentes en cada Estado, el que se establecería mediante una enumeración decenal.

Paralelamente a la recopilación y el análisis de información numérica de las naciones, se desarrollaron técnicas y métodos estadísticos y, en matemáticas, las teorías de las probabilidades y de los errores, que proporcionarían las bases teóricas de la emergente ciencia moderna de la estadística. Dada la fragmentación política y cultural de Europa, las primeras compilaciones estadísticas utilizaron una amplia variedad de métodos, y los estudios sobre el mismo tema muy a menudo no eran comparables.

Estandarización, su impacto, convergencia y divergencia

Los requerimientos de datos estadísticos de la Francia napoleónica y post-napoleónica y de Prusia tuvieron consecuencias importantes para el desarrollo de las estadísticas públicas (oficiales). El siglo XIX fue una época de cambios políticos, económicos y sociales rápidos y profundos, que tuvieron consecuencias importantes para el desarrollo de las estadísticas. El establecimiento de sociedades de estadística (la Sociedad de Estadística de París en 1803, la Real Sociedad de Estadística de Londres en 1834 y la Asociación Americana de Estadística en 1839) promovió y contribuyó a la estandarización de los métodos y procedimientos estadísticos.


Estos esfuerzos culminaron en una serie de conferencias internacionales de estadística entre 1853 y 1885, organizadas originalmente bajo el dinámico liderazgo de Adolphe Quetelet, que en 1885 condujeron a la creación del Instituto Internacional de Estadística (ISI) (De Neumann Spallart, 1886). Los resultados de las encuestas estadísticas realizadas por investigadores particulares influyeron en el crecimiento de la conciencia nacional (Suiza) y contribuyeron a la creación de los estados nacionales (Italia y Alemania). En varios países europeos la creación de Comisiones Nacionales de Estadística tuvo un impacto positivo en la aceptación de la información estadística y su utilidad para el gobierno. Varios países comenzaron a utilizar estadísticas en su administración pública, mediante la creación de unidades estadísticas en los ministerios o, más tarde, mediante el establecimiento de oficinas estadísticas nacionales y subnacionales. Esto confirmó la importancia de la información estadística para el desarrollo político y de gobierno. Sin embargo, al mismo tiempo esto cambiaba el estado y la posición de los estadísticos, que hasta entonces eran en su mayoría académicos y científicos independientes. Los estadísticos en la administración pública se convirtieron en servidores públicos sujetos a las normas y reglamentos de la administración pública, lo que podría poner en peligro su independencia científica. La transformación de los regímenes autoritarios en democráticos liberales cambió la naturaleza de las estadísticas oficiales. Las estadísticas no sólo eran necesarias para la formulación y el desarrollo de políticas, sino que también se convirtieron en medios de verificación, disponibles para el Parlamento, el electorado y, en última instancia, la población en general, del cumplimiento del gobierno con las políticas establecidas.

El nacimiento de los Principios Fundamentales

En el siglo XX se realizaron importantes avances en la estadística teórica y aplicada. El desarrollo estadístico, tanto en las “estadísticas administrativas como en las científicas”, fue promovido por el Instituto Internacional de Estadística (ISI). A principios del siglo XX la Liga de las Naciones fue un foro para que los directores de todas las oficinas nacionales de estadística discutieran cuestiones relativas a las normas estadísticas, que se hicieron especialmente relevantes durante la primera Conferencia Internacional de Estadísticas Económicas en 1928. El diálogo técnico fue continuado por la Conferencia de Estadísticos Europeos (CES) [iv], establecida en 1953. Hasta los años sesenta los principales actores en el desarrollo de las estadísticas eran europeos, norteamericanos y algunos miembros de la Commonwealth británica, con aportes limitados de los latinoamericanos y asiáticos.


La creación de la Unión Soviética a principios del siglo XX y el establecimiento del “bloque oriental” después de la Segunda Guerra Mundial crearon dos grandes bloques políticos antagónicos: las “economías de libre mercado”, por una parte, y las “economías socialistas de planificación centralizada”, por otra. Estos bloques tenían diferentes visiones políticas y prácticas en el uso de la ciencia en general y de la estadística en particular. Después de un período de antagonismo desarrollaron un modus vivendi, en el que la divergencia en las ideologías y en los objetivos y aplicación de la estadística se respetaron mutuamente. Las actividades de la Conferencia de Estadísticos Europeos (CES, por las siglas de su nombre en inglés, Conference of European Statisticians) se concentraron en temas técnicos de interés común, mientras que las cuestiones políticas e ideológicas que subyacían a los dos sistemas y la forma en que las estadísticas eran recogidas y utilizadas fueron evitadas. Era un secreto a voces que en los países socialistas las estadísticas servían principalmente al interés del gobierno (el partido), que la información podía ser falsificada, distorsionada o suprimida y que sólo información seleccionada estaba a disposición de la comunidad científica y del público en general (Anderson et al., 1994). Se suponía que las oficinas de estadística de los países democráticos se guiaban por los que luego se convertirían en los principios fundamentales de las estadísticas oficiales. Sin embargo, temas similares u otros que afectaban igualmente a sus oficinas de estadística tampoco podían ser discutidos dentro de las actividades de la CES. Por lo tanto, las cuestiones relativas a la independencia de la estadística oficial y de los estadísticos simplemente no fueron tratadas. Como resultado, el debate público sobre el rol de la estadística, los estadísticos y la estadística oficial no tuvo lugar.

La caída del Muro de Berlín el 9 de noviembre de 1989 alteró profundamente la estructura geopolítica del mundo; engendró el optimismo público sobre futuros desarrollos basados en los principios de la democracia, incluidas la transparencia y la responsabilidad, y creó un cambio fundamental en la mentalidad de los miembros de la CES. Tres meses después de la caída del Muro de Berlín, en una reunión extraordinaria de la CES, sus miembros iniciaron el debate sobre las consecuencias de los cambios geopolíticos para la cooperación estadística internacional. Las consecuencias del cambio fueron consideradas tanto para los antiguos países socialistas como para los países de economía de libre mercado. A partir de entonces, los estadísticos discutieron y llegaron a un consenso preliminar sobre la naturaleza de la información estadística como bien público y sobre su papel para el gobierno y para la democratización de la sociedad. Más importante aún, fueron los estadísticos quienes atrajeron la atención de los políticos y del público en general sobre la importancia de las estadísticas para los gobiernos.


28

ESTADÍSTICA (2015), 67, 188 y 189, pp. 21-32

para la profesión dentro de la nueva estructura geopolítica. Fue la delegación polaca, a través del Sr. Jozef Olenski, quien, para sorpresa de algunos de las delegaciones de los países democráticos, solicitó que el CES desarrollara y proclamara un tratado internacional de las estadísticas oficiales. En los debates que siguieron la idea atrajo apoyo considerable, pero se decidió que lo que se necesitaba sería una Carta de los principios fundamentales, que serían aplicables a todos los países, no sólo a los países en transición del sistema socialista a una economía de libre mercado. La Mesa de la CES pidió a la Oficina Nacional de Estadísticas de Polonia que desarrollara un borrador de los principios fundamentales de las estadísticas oficiales para la siguiente reunión anual (1990). La formulación de los Principios Fundamentales implicó considerable reflexión y discusión, como en el CES había dos corrientes opuestas de pensamiento: por un lado, los que favorecían una formalización de las funciones, deberes y privilegios de los estadísticos y las oficinas de estadística y los que estaban a favor de un enfoque de política práctica más flexible. Los miembros de la CES y el personal de la secretaría de la Comisión Económica de las Naciones Unidas para Europa, (CEPE) pasaron muchas horas discutiendo propuestas y contrapropuestas. Se logró un compromiso entre estos dos enfoques y sobre el lenguaje a utilizar en el documento; y en la reunión de la Conferencia de Estadísticos Europeos de 1991 se aprobaron los Principios Fundamentales de las Estadísticas Oficiales. Los estadísticos habían completado la tarea de establecer principios comunes para la profesión y cómo actuar en la sociedad. Los gobiernos debían proporcionar el marco legal y los recursos para el sistema estadístico de sus países para permitir a los estadísticos producir la información estadística necesaria sin interferencia, utilizando la mejor metodología y técnicas disponibles de las fuentes más adecuadas de información. Los informantes, ya sean individuos, empresas u organizaciones, tienen que proporcionar la información requerida con veracidad y en la forma más completa posible. Las oficinas estadísticas tienen que garantizar que dicha información individual será utilizada únicamente con fines estadísticos. Además, los resultados de encuestas estadísticas han de ponerse a disposición de todos los usuarios sin distinción. Concernía a los líderes políticos actuar su parte, primero en Europa y luego globalmente para establecer la aplicabilidad universal de esos principios. Esto se logró en enero de 2014 v. Conclusiones Tras la adopción de los Principios Fundamentales por la Comisión Económica de las Naciones Unidas para Europa (CEPE) en 1992, la Comisión de Estadística de


MALAGUERRA et al.: Las estadísticas como instrumento para sociedades… `

`

29

las Naciones Unidas las aprobó en 1994, dándoles validez universal. Aunque los Principios Fundamentales fueron originalmente concebidos para ayudar a las antiguas economías socialistas a modernizar sus sistemas estadísticos nacionales, atrajo también a los estadísticos de los países en desarrollo. Muchos de estos países sólo habían alcanzado la independencia nacional después de la Segunda Guerra Mundial y habían heredado servicios estadísticos que eran a veces autoritarios y anticuados, que se basaban en el interés del poder colonial y no fueron orientados hacia el desarrollo nacional. En consecuencia, muchos países en desarrollo adoptaron los Principios Fundamentales como directrices para la organización de sus sistemas estadísticos nacionales y la ejecución de sus programas estadísticos. En 2009 la Unión Africana aprobó la Carta Africana de Estadística que, según el artículo 3 incorpora plenamente los Principios Fundamentales. En 2013 el Consejo Social Económico (ECOSOC) respaldó los Principios Fundamentales para las Estadísticas Oficiales (E/2013/21) y el 29 de enero 2014 fueron promovidos por la Asamblea General de las Naciones Unidas dándoles así la aplicabilidad universal Estadísticas actualizadas, válidas y fiables son esenciales para la gestión de una sociedad democrática. En las sociedades pre- y no - democráticas la información, y por lo tanto las estadísticas, eran parte de la base de poder político de los gobernantes, los reyes, los príncipes, los dictadores o las potencias colonizadoras. Los miembros de la sociedad no tenían acceso a la información estadística. La democratización de los regímenes cambió el papel de las estadísticas oficiales. Ahora eran consideradas bienes públicos, que deberían ser compartidos con la población y que podían servir como medios de verificación del cumplimiento de las políticas nacionales. La información es poder: quien posee la información posee el poder. Es muy recomendable que los líderes políticos, incluso los parlamentarios, los estadísticos y el público en general reflexionen sobre el enunciado de un estadístico norteamericano sobre la contribución de las estadísticas a la gobernabilidad y la paz mundial. El rol de las estadísticas en promover la gobernabilidad y la paz mundial fue aludido por el Secretario Adjunto y Estadístico de la Fundación Carnegie para la Paz Internacional Dr. S.N D. Norte en la sesión conmemorativa del septuagésimo quinto aniversario de la Asociación Americana de Estadística (ASA) en 1918, quien declaró: “La estadística es la hermana gemela del derecho internacional, multiplica las formas y métodos de ayuda mutua, cooperación y entendimiento entre las naciones. Ambas ciencias suministran enlaces indispensables para el fomento de la unidad del mundo.”(Koren, 1918) vi Para asegurar que los Principios Fundamentales continúan siendo aceptados y respetados se repite aquí una sugerencia hecha durante la sesión de la Comisión de Estadística de las Naciones Unidas del año 2014 (Malaguerra, 3 marzo 2014): las


30

ESTADÍSTICA (2015), 67, 188 y 189, pp. 21-32

organizaciones internacionales vinculadas a la estadística deberían discutir regularmente los Principios Fundamentales y su papel en el proceso democrático, y proporcionar a los estadísticos jefes de oficinas nacionales la posibilidad de exponer sus puntos de vista al respeto. Una segunda sugerencia es que, de acuerdo con el principio de transparencia y responsabilidad, se cree un inventario del cumplimiento, por parte de las oficinas nacionales de estadística, con respecto a las disposiciones de los Principios Fundamentales, basado en datos de dominio público. Este ejercicio debería ser realizado por un instituto de investigación independiente o fundación científica y los Estados Miembro del CES deberían ser los primeros en tratarlo ya que los Principios Fundamentales se desarrollaron por su iniciativa. Bibliografía ANDERSON, B.A.; KALEV, K. and SILVER, B.D. (1994).“Development and prospects for population statistics in countries of the former Soviet Union”.Population Index.60 (1): 4 – 20. CONFERENCE OF EUROPEAN STATISTICIANS.(2014). Members’ Guidebook. “http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/bur/2014/October/ CES_Members_Guidebook.pdf” DE NEUMANN SPALLART, M.F.X. (1886). “La fondation de l’Institute Internationale de la Statistique: Aperçu historique”. Bulletin de l’Institut Internationale de la Statistique. I (1-2): 1 – 34. GLASS, D.V. (1964). “John Graunt and His Natural and Political Observations”. Notes and Records of the Royal Society of London.19 (1): 63-10. Jun., 1964. Stable URL: http://www.jstor.org/stable/3519862. Accessed: 26-03-2015 17:00 UTC. GRABER, R. (1980). “Der Wasser-Handel”.Revue Suisse d’histoire. 39: 321 – 356. HEMMING, J. (1972). The Conquest of the Incas.First ABACUS edition.Sphere Books Ltd. London. KOREN, J. (1918). The history of statistics: Their development and progress in many countries; In Memoirs to commemorate the Seventy Fifth anniversary of the


MALAGUERRA et al.: Las estadísticas como instrumento para sociedades… `

`

31

American Statistical Association. The Macmillan Company of New York. New York. LOZA, C.B. (1998). “Du bonusagedesquipus face à l’administration colonial Espagnole (1550 – 1600)”. Population (French edition: Institut National d´Ètudes Démographiques). 53 (1-2 Population et Histoire): 139 – 159. MALAGUERRA C. (2012). Keynote speech at the Conference European Statisticians.Paris. France. http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/2012/Malaguerra_k eynote_address.pdf MALAGUERRA, C. (2014). Video presentation at United Nations Statistical Commission, High Level Forum on Official Statistics UN Fundamental Principles of Official Statistics. 3 March 2014. New York. “http://unstats.un.org/unsd/statcom/statcom_2014/seminars/High_Level_Forum/def ault.html”. MEITZEN, A. and FALKNER, R. P. (1891).“History, Theory, and Technique of Statistics. Part First: History of Statistics”. The Annals of the American Academy of Political and Social Science.1 (2 pt 1): 1–100. Retrieved from "http://www.jstor.org/stable/1008943" Accessed: 27-03-2015 20:07 STIGLER, S.M. (2005). “Statistics and the Wealth of Nations”.International StatisticalReview.73(2): 223 – 225.


32

ESTADÍSTICA (2015), 67, 188 y 189, pp. 21-32

Notas i

Una reseña excelente del desarrollo de las estadísticas hasta el final del siglo 19 se puede encontrar en: Meitzen, A. and Falkner, R.A., Science, History, Theory, and Technique of Statistics. Part First: History of Statistics, in: Annals of the American Academy of Political and Social Science, Vol. 1, Supplement 2, Part 1 (Mar., 1891), pp. 1+3-100 (Stable URL: “http://www.jstor.org/stable/1008943” Accessed: 27-03-2015 20:07 UTC). ii

Los documentos originales están en el Archivo Nacional de Inglaterra en Londres, y son accesibles on line en http://www.nationalarchives.gov.uk/domesday/

iii

Ver http://constitution.findlaw.com/articles.html. Este artículo fue modificado por las enmiendas 13º (1865) y 14º (1868) Ver “http://www.senate.gov/civics/constitution_item/constitution.htm" \l "amendments “ iv

A pesar del nombre la Conferencia está abierta a los directores de la Oficina Nacional de Estadística de todos los países a condición de que participen regularmente en las actividades de la Conferencia. Actualmente los directores de la Oficina Nacional de Estadística de los Estados Miembro de la Comisión Económica para Europa ( CEPE ), que incluyen Canadá y los EE.UU., los países miembros de la OCDE y otros países fuera de la región, como Brasil, la China, Colombia y Mongolia los miembros regulares. Ver: Conference of European Statisticians. Members’ Guidebook a http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/bur/2014/October/CES_Me mbers_Guidebook.pdf v

Esta sección se basa en las notas de la presentación de Carlo Malaguerra en la Conference de Estadísticos Europeos, Paris, Francia; Ver:“http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/2012/Malaguerra_key note_address.pdf”

vi

.Ver:“http://www.au.int/en/sites/default/files/AFRICAN_CHARTER_ON_STATISTICS.pdf”

Artículo Invitado Recibido Octubre 2015 Revisado Diciembre 2015


ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74 © Instituto Interamericano de Estadística

LIKELIHOOD BASED INFERENCE FOR QUANTILE REGRESSION IN NONLINEAR MIXED EFFECTS MODELS CHRISTIAN E. GALARZA Escuela Superior Politécnica del Litoral, Guayaquil, Ecuador. cgalarza88@gmail.com; (+593) 4 2210505 VICTOR H. LACHOS Departamento de Estatı́stica, IMECC, Universidade Estadual de Campinas, Campinas, São Paulo, Brazil. hlachos@ime.unicamp.br; (+55) 19 35216078 ABSTRACT

Longitudinal data are frequently analyzed using normal mixed effects models. Moreover, the traditional estimation methods are based on mean regression, which leads to non-robust parameter estimation for non-normal error distributions. Compared to the conventional mean regression approach, quantile regression (QR) can characterize the entire conditional distribution of the outcome variable and is more robust to the presence of outliers and misspecification of the error distribution. This paper develops a likelihoodbased approach for analyzing QR models for correlated continuous longitudinal data via the asymmetric Laplace distribution (ALD). Exploiting the nice hierarchical representation of the ALD, our classical approach follows the Stochastic Approximation of the EM (SAEM) algorithm for deriving exact maximum likelihood estimates of the fixed-effects and variance components in nonlinear mixed effects models (NLMMs). We evaluate the finite sample performance of the algorithm and the asymptotic properties of the ML estimates through empirical experiments and applications to two real life datasets. The proposed SAEM algorithm is implemented in the R package qrNLMM. Keywords Asymmetric Laplace distribution, Nonlinear mixed effects models, Quantile regression, SAEM algorithm, Stochastic Approximations.


34

ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

RESUMEN

Los datos longitudinales son frecuentemente analizados usando modelos de efectos mixtos normales. Por otra parte, los métodos de estimación tradicionales son basados en regresión en media, lo cual conduce a estimaciones no robustas de los parámetros cuando los errores no se distribuyen normalmente. Comparada con el enfoque de la regresión en media tradicional, la regresión cuantı́lica (RC) puede caracterizar completamente la distribución condicional de la variable de respuesta y es más robusta ante la presencia de valores atı́picos y especificaciones erróneas de la distribución del error. Este artı́culo usa un enfoque basado en verosimilitud para analizar modelos de RC para datos continuos longitudinales correlacionados usando la distribución Laplace asimétrica (DLA). Haciendo uso de la representación estocástica de la DLA, nuestro enfoque clásico utiliza una Aproximación Estocástica del algoritmo EM (SAEM) para conseguir estimativas de máxima verosimilitud (MV) exactas para los efectos fijos y los componentes de varianza en modelos no lineales de efectos mixtos. Evaluamos el desempeño del algoritmo en muestras finitas y las propiedades asintóticas de las estimativas de MV a través de experimentos empı́ricos y aplicaciones para dos conjuntos de datos reales. El algoritmo SAEM propuesto se encuentra implementado en el paquete de R qrNLMM. Palabras clave

Distribución Laplace asimétrica, Modelos no lineales de Efectos Mixtos, Regresión cuantı́lica, Algoritmo SAEM, Aproximaciones Estocásticas. 1. Introduction

Nonlinear mixed-effects models (NLMMs) are frequently used to analyze grouped, clustered, longitudinal and multilevel data because of their potential to handle, on one hand, nonlinearities in the relationship between the observed response and the covariates and random effects, and on the other hand, to take into account within and between-subject correlations presented in this type of data (Pinheiro & Bates, 2000; Davidian & Giltinan, 2003; Wu, 2010). Moreover, NLMMs are also flexible and often mechanistic, based on biological, chemical, physics mechanisms, among others, leading to a natural modelling using a known family of nonlinear functions providing desirable characteristics such as asymptotes, a unique maximum


GALARZA et. al.: Likelihood based inference for quantile regression...

35

value, monotonicity, positive range, etc. Majority of these NLMMs estimate covariate effects on the response through a mean regression, controlling for between-cluster heterogeneity via normally-distributed clusterspecific random effects and random errors. However, this centrality-based inferential framework is often inadequate when the conditional distribution of the response (conditional on the random terms) is skewed, multimodal, or affected by atypical observations. In contrast, conditional quantile regression (QR) methods (Koenker, 2004, 2005) quantifying the entire conditional distribution of the outcome variable were developed that can provide assessment of covariate effects at any arbitrary quantiles of the outcome. In addition, QR methods do not impose any distribution assumption on the error, except requiring that the error term has a zero conditional quantile such as the ALD. Because of its popularity and the flexibility it provides, standard QR methods are implementable via available software packages, for example, the R package quantreg(). Although QR was initially developed under a univariate framework, the abundance of clustered data in recent times lead to its extensions into mixed modeling framework via either the distribution-free route Lipsitz et al. (1997); Galvao & Montes-Rojas (2010); Galvao Jr (2011); Fu & Wang (2012), or the traditional likelihood-based route mostly using the ALD Geraci & Bottai (2007); Yuan & Yin (2010); Geraci & Bottai (2014). Among the ALD-based models, Geraci & Bottai (2007) proposed a Monte Carlo EM (MCEM)-based conditional QR model for continuous responses with a subject-specific random (univariate) intercept to account for within-subject dependence in the context of longitudinal data. However, due to the limitations of a simple random intercept model to account for the between-cluster heterogeneity, Geraci & Bottai (2014) extended their previous Geraci & Bottai (2007) model to a general linear quantile mixed effects regression model (QR-LMM) with multiple random effects (both intercepts and slopes). However, instead of going the MCEM route, the estimation of the fixed effects and the covariance components were implemented using an efficient combination of Gaussian quadrature approximations and non-smooth optimization algorithms. Yuan & Yin (2010) applied the version of QR of Geraci & Bottai (2007) to linear mixed effects models for longitudinal measurements with missing data. Wang (2012) considered QR-NLMMs from a Bayesian perspective and shown that QR-NLMMs may be a better measure of centrality for skewed or multimodal data and more robust against nonnormality of the distribution of random errors than the mean regression


36

ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

estimator. Although some results on QR-NLMMs have recently appeared in the literature, to the best of our knowledge, there seem to be no studies on exact inference for QR-NLMMs from a likelihood based perspective. In this paper, we proceed to achieve that via a robust parametric ALDbased QR-NLMMs, where the full likelihood-based implementation follows a stochastic version of the EM algorithm (SAEM), proposed by Delyon et al. (1999), for maximum likelihood (ML) estimation in contrast to the approximations proposed by Geraci & Bottai (2014) for QR-LMMs. The SAEM algorithm has been proved to be more computationally efficient than the classical MCEM algorithm due to the recycling of simulations from one iteration to the next in the smoothing phase of the algorithm. Moreover, as pointed out by Meza et al. (2012) the SAEM algorithm, unlike the MCEM, converges even in a typically small simulation size. Recently, Kuhn & Lavielle (2005) showed that the SAEM algorithm is very efficient in computing the ML estimates in mixed effects models. Our empirical results shows that the ML estimates based on the SAEM algorithm do provide good asymptotic properties. Furthermore, application of our method to two longitudinal datasets is illustrated via the R package qrNLMM(). The rest of the paper proceeds as follows. Section 2 presents some preliminaries, in particular the connection between QR and ALD and an outline of the EM and SAEM algorithms. Section 3 develops the MCEM and the SAEM algorithms for a general NLMM, while Section 4 outlines the likelihood estimation and standard errors. Section 5 presents some simulation studies. Application of the SAEM method to two longitudinal datasets, one examining the Soybean genotypes data and the other on a HIV viral load study are presented in Section 6. Finally, Section 7 concludes, sketching some future research directions. 2. Preliminaries 2.1. Connection between QR and ALD

Following Yu & Moyeed (2001), a random variable Y is distributed as an ALD with location parameter µ, scale parameter σ > 0 and skewness pa-


GALARZA et. al.: Likelihood based inference for quantile regression...

rameter p ∈ (0, 1), if its probability density function (pdf) is given by yâˆ’Âľ p(1 − p) exp âˆ’Ď p f (y|Âľ, Ďƒ , p) = , Ďƒ Ďƒ

37

(1)

where Ď p (.) is the check (or loss) function defined by Ď p (u) = u(p − I{u < 0}), with I{.} the usual indicator function. This distribution is denoted follows an expoby ALD(Âľ, Ďƒ , p). It is easy to see that W = Ď p Y âˆ’Âľ Ďƒ nential(1) distribution. Figure 1 plots the ALD illustrating how the the skewness changes with altering choices for p. For example, when p = 0.1, most of the mass is concentrated around the right tail, while for p = 0.5, both tails of the ALD have equal mass and the distribution resemble the more common double exponential distribution. In contrast to the normal distribution with a quadratic term in the exponent, the ALD is linear in the exponent. This results in a more peaked mode for the ALD together with thicker tails. On the contrary, the normal distribution has heavier shoulders compared to the ALD.

0.30

Figure 1. Standard asymmetric Laplace density

0.15 0.00

0.05

0.10

density

0.20

0.25

ALD(0,1,p=0.1) ALD(0,1,p=0.3) ALD(0,1,p=0.5) ALD(0,1,p=0.7)

−4

−2

0

2

4

ALD abides by the following stochastic representation (Kotz et al., 2001; Kuzobowski & Podgorski, 2000). Let U âˆź exp(Ďƒ ) and Z âˆź N(0, 1) be two


ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

38

independent random variables. Then, Y ∼ ALD(µ, σ , p) can be represented as √ d (2) Y = µ + ϑ pU + τ p σUZ, d

1−2p 2 where ϑ p = p(1−p) and τ p2 = p(1−p) , and = denotes equality in distribution. This representation is useful in obtaining the moment generating function (mgf), and formulating the estimation algorithm. From (2), the hierarchical representation of the ALD is given as

Y |U = u ∼ N(µ + ϑ p u, τ p2 σ u), U ∼ exp(σ ).

(3)

This representation will be useful for the implementation of the EM algorithm. Moreover, since Y |U = u ∼ N(µ + ϑ p u, τ p2 σ u), one can easily derive the pdf of Y , given by δ (y) 1 1 exp A(y), f (y|µ, σ , p) = √ 3 γ 2π τ p σ 2

(4)

1/2 τ √ , γ = √p and A(y) = 2 δ (y) K1/2 (δ (y)γ), with where δ (y) = τ|y−µ| γ 2 σ p σ Kν (.), the modified Bessel function of the third kind. It easy to see that that the conditional distribution of U, given Y = y, is U|(Y = y) ∼ GIG( 12 , δ , γ), where GIG(ν, a, b) represents the Generalized Inverse Gaussian (GIG) distribution (Barndorff-Nielsen & Shephard, 2001) with the pdf n 1 o (b/a)ν ν−1 2 2 h(u|ν, a, b) = u exp − a /u+b u , u > 0, ν ∈ R, a, b > 0. 2Kν (ab) 2 The moments of U can be expressed as E[U k ] =

ν+k (ab)

a k K b

Kν (ab)

,k ∈ R

(5)

Some useful properties of the Bessel function of the third kind Kλ (u) are: (i) Kν (u) = K−ν (u); (ii) Kν+1 (u) = 2ν u Kν (u)+Kν−1 (u); (iii) for non-negative q −k π integer r, Kr+1/2 (u) = 2u exp(−u) ∑rk=0 (r+k)!(2u) (r−k)!k! . A special case is q π K1/2 (u) = 2u exp(−u).


GALARZA et. al.: Likelihood based inference for quantile regression...

39

2.2. The EM and SAEM algorithms

In models with missing data, the EM algorithm (Dempster et al., 1977) has established itself as the most popular tool for obtaining the ML estimates of the model parameters. This iterative algorithm maximizes the complete log-likelihood function `c (θθ ; ycom ) at each step, converging quickly to a stationary point of the observed likelihood (`(θθ ; yobs )) under mild regularity conditions (Wu, 1983; Vaida, 2005). The EM algorithm proceeds in two simple steps: E-Step: Replace the observed likelihood by the complete likelihood and (k) (k) b b compute its conditional expectation Q(θθ |θ ) = E `c (θθ ; ycom )|θ , yobs , (k) where θb is the estimate of θ at the k-th iteration; (k) (k+1) . M-Step: Maximize Q(θ |θb ) with respect to θ obtaining θb

However, in some applications of the EM algorithm, the E-step cannot be obtained analytically and has to be calculated using simulations. Wei & Tanner (1990) proposed the Monte Carlo EM (MCEM) algorithm in which the E-step is replaced by a Monte Carlo approximation based on a large number of independent simulations of the missing data. This simple solution is infact computationally expensive, given the need to generate a large number of independent simulations of the missing data for a good approximation. Thus, in order to reduce the amount of required simulations compared to the MCEM algorithm, the SAEM algorithm proposed by Delyon et al. (1999) replaces the E-step of the EM algorithm by a stochastic approximation procedure, while the Maximization step remains unchanged. Besides having good theoretical properties, the SAEM estimates the population parameters accurately, converging to the global maxima of the ML estimates under quite general conditions (Allassonnière et al., 2010; Delyon et al., 1999; Kuhn & Lavielle, 2004). At each iteration, the SAEM algorithm successively simulates missing data with the conditional distribution, and updates the unknown parameters of the model. Thus, at iteration k, the SAEM algorithm proceeds as follows: E-Step: • Simulation: Draw (q(`,k) ), ` = 1, . . . , m from the conditional distribu-


ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

40

tion f (q|θ (k−1) , yi ). • Stochastic Approximation: Update the Q(θ |θb(k) ) function as Q(θ |θb(k) ) ≈ Q(θ |θb(k−1) ) " # 1 m + δk ∑ `c(θ ; yobs, q(`,k))|θb(k), yobs−Q(θ |θb(k−1)) m `=1 M-Step: • Maximization: Update θb(k) as θb(k+1) = arg max Q(θ |θb(k) ), θ

where δk is a smoothness parameter (Kuhn & Lavielle, 2004), i.e., a de∞ 2 creasing sequence of positive numbers such that ∑∞ k=1 δk = ∞ and ∑k=1 δk < ∞. Note that, for the SAEM algorithm, the E-Step coincides with the MCEM algorithm, however a small number of simulations m (suggested to be m ≤ 20) is necessary. This is possible because unlike the traditional EM algorithm and its variants, the SAEM algorithm uses not only the current simulation of the missing data at the iteration k denoted by (q(`,k) ), ` = 1, . . . , m but some or all previous simulations, where this ‘memory’ property is set by the smoothing parameter δk . Note, in equation (2.2), if the smoothing parameter δk is equal to 1 for all k, the SAEM algorithm will have ‘no memory’, and will be equivalent to the MCEM algorithm. The SAEM with no memory will converge quickly (convergence in distribution) to a solution neighbourhood, however when the algorithm with memory will converge slowly (almost sure convergence) to the ML solution. We suggested the following choice of the smoothing parameter given as ( 1, for 1 ≤ k ≤ cW (6) δk = 1 for cW + 1 ≤ k ≤ W k−cW , where W is the maximum number of iterations, and c a cut point (0 ≤ c ≤ 1) which determines the percentage of initial iterations with no memory. For example, if c = 0 the algorithm will have memory for all iterations, and hence will converge slowly to the ML estimates. If c = 1, the algorithm will have no memory, and so will converge quickly to a solution neighbourhood. For the first case, W would need to be large in order to achieve the


GALARZA et. al.: Likelihood based inference for quantile regression...

41

ML estimates. For the second, the algorithm will output a Markov Chain where after applying a burn in and thin, the mean of the chain observations can be a reasonable estimate. A number between 0 and 1 (0 < c < 1) will assure an initial convergence in distribution to a solution neighbourhood for the first cW iterations and an almost sure convergence for the rest of the iterations. Hence, this combination will leads us to a fast algorithm with good estimates. To implement SAEM, the user must fix several constants matching the number of total iterations W and the cut point c that defines the starting of the smoothing step of the SAEM algorithm, however those parameters will vary depending of the model and the data. To determinate those constants, a graphical approach is recommended to monitor the convergence of the estimates for all the parameters, and, if possible, to monitor the difference (relative difference) between two successive evaluations of the log-likelihood `(θθ |yobs ), given by ||`(θθ (k+1) |yobs ) − `(θθ (k) |yobs )|| or ||`(θθ (k+1) |yobs )/`(θθ (k) |yobs ) − 1||, respectively. 3. QR for nonlinear mixed models and algorithms

We proposed the following general mixed-effects model. Let yi = (yi1 , ..., yini )> denote the continuous response for subject i and let η = (η(φi , xi1 ), ..., η(φi , xini ))> represents a nonlinear differentiable function of vector-valued mixed-effects random parameters φi of dimension r and a matrix of covariates xi of dimensions ni × r. We define the NLMM as yi = η(φφ i , xi ) + ε i ,

φi = Ai β p + Bi bi ,

(7)

where Ai and Bi are design matrices of dimensions r × d and r × q, respectively, possibly depending on elements of xi and incorporating time varying covariates in fixed or random effects, β p is the regression coefficient corresponding to the pth quantile, bi is a q-dimensional random effects vector associated to the i-th subject and and εi the independent and identically distributed vector of random errors. We define pth quantile function of the response yi j as Q p (yi j |xi j , bi ) = η(φi , xi j ) = η(Ai β p + Bi bi , xi j ).

(8)

where Q p denotes the inverse of the unknown distribution function F, the iid random effects bi are distributed as bi ∼ Nq (0, Ψ ), where the dispersion


42

ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

α ) depends on unknown and reduced parameters α , and matrix Ψ = Ψ (α iid the errors are distributed as εi j ∼ ALD(0, σ ) and both uncorrelated. Then, yi j |bi independently follows as ALD with the density given by ( !) yi j − η(Ai β p + Bi bi , xi j ) p(1 − p) β p , bi , σ ) = exp −ρ p . f (yi j |β σ σ (9) 3.1. A MCEM algorithm

First, we develop a MCEM algorithm for ML estimation of the parameters in the QR-NLMM. The model exhibits a flexible hierarchical representation, which is useful in deriving the theoretical properties. From (3), the QR-NLMM defined in (8)-(9), can be represented in a hierarchical form as: yi |bi , ui ∼ Nni η (Ai β p + Bi bi , xi ) + ϑ p ui , σ τ p2 Di , bi ∼ Nq (0, Ψ ), ni

ui ∼

∏ exp(σ ),

(10)

j=1

for i = 1, . . . , n, where ϑ p and τ p2 are as in (2); Di represents a diagonal matrix that contains the vector of missing values ui = (ui1 , . . . , uini )> and exp(σ ) denotes the exponential distribution with mean σ . Let yic = > (y>i , b>i , u>i )> , with yi = (yi1 , . . . , yini )> , bi = bi1 , . . . , biq , ui = (ui1 , . . . , uini )> β p(k)> , σ (k) , α (k)> )> , the estimate of θ at the k-th iteration. and let θ (k) = (β Since bi and ui are independent for all i = 1, . . . , n, it follows from (3) that the complete-data log-likelihood function is of the form n

`c (θθ ; yc ) = ∑ `c (θθ ; yic ), i=1

where

1 3 1 1 > −1 `c (θθ ; yic ) = constant− ni logσ − log Ψ − b> i Ψ bi − ui 1ni 2 2 2 σ 1 η (Ai β p + Bi bi , xi )−ϑ p ui )> D−1 − (yi −η i 2σ τ p2 η (Ai β p + Bi bi , xi )−ϑ p ui ). (yi −η

(11)


GALARZA et. al.: Likelihood based inference for quantile regression...

43

Since Ai , Bi and xi are known matrices, we will simplify the notation by β p , bi ) to represent η (φφ i , xi ) = η (Ai β p + Bi bi , xi ). Given the writing η (β current estimate θ = θ (k) , the E-step calculates the function (k) (k) Q(θθ |θb ) = ∑ni=1 Qi (θθ |θb ),

where

Qi (θθ |θb

(k)

n o ) = E `c (θθ ; yic )|θθ (k) , y

(12)

o (k)

1 n 1 3 −1 \ >

∝ − ni logσ − log Ψ − tr (bb )i Ψ 2 2 2 h (k) τ p4 (k)> 1 (k) >d > −1 −1 [ − yi Di yi − 2ϑ p yi 1ni + ubi 1ni − 2y> i (D η )i 2 2σ τ p 4 i (k) \ −1 bi (k) + η > + 2ϑ p 1> ni η i Di η i where η i = η (Ai β p + Bi bi , xi ) for simplicity, tr(A) indicates the trace of matrix A and 1 p is the vector of ones of dimension p. The calculation of these function requires expressions for (k) ηbi = E η i |θθ (k) , yi , ubi (k) = E ui |θθ (k) , yi , (k) (k) > (k) −1 d >) \ θ , yi , θ (k) , yi , (bb D = E D−1 i = E bi bi |θ i |θ i n o > −1 (k) (k) (k) (k) −1 [ \ > −1 θ θ (k) , yi , (D η )i = E D−1 η |θ , y i , (η D η )i = E η i Di η i |θ i i which do not have closed forms. Since the joint distribution of the missing data (bi(k) , u(k) i ) is unknown and the conditional expectations cannot be computed analytically, for any function g(.), the MCEM algorithm approximates the conditional expectations above by their Monte Carlo approximations 1 m (`,k) (13) E[ g (bi , ui ) |θθ (k) , yi ] ≈ ∑ g(b(`,k) i , ui ), m `=1 which depend of the simulations of the two latent (missing) variables b(k) i (k) θ and u(k) from the conditional joint density f (b , u |θ , y ). Using known i i i i properties of conditional expectations, the expected value in (13) can be


44

ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

more accurately approximated as Ebi ,ui [ g(bi , ui )|θθ (k) , yi ] = Ebi [ Eui [ g(bi , ui )|θθ (k) , bi , yi ]|yi ] ≈

1 m θ (k) , b(`,k) Eui [ g(b(`,k) ∑ i , ui )|θ i , yi ], m `=1

(14)

where b(`,k) is a sample from the conditional density f (bi |θθ (k) , yi ). Note that (14) is a more accurate approximation once it only depends of one MC approximation, instead two as needed in (13). Now, for drawing random samples from the full conditional distribution f (ui |yi , bi ), first note that the vector ui |y i , bi can be written as ui |yi , bi = [ ui1 |yi1 , bi · · · uini |yini , bi ]> , since ui j yi j , bi is independent of uik | yik , bi , for all j, k = 1, 2, . . . , ni and j 6= k. Thus, the distribution of f (ui j |yi j , bi ) is proportional to

β p , bi ) + ϑ p ui j , σ τ p2 ui j ) × exp(σ ), f (ui j |yi j , bi ) ∝ φ (yi j ηi j (β which, from Subsection 2.1, leads to ui j |yi j , bi ∼ GIG( 12 , χi j , ψ), where χi j and ψ are given by χi j =

|yi j −ηi j (β p ,bi )|

√ τp σ

and

τp ψ= √ 2 σ

(15)

From (5), and after generating samples from f (bi |θθ (k) , yi ) (see Subsection 4.2), the conditional expectation Eui [·|θθ , bi , yi ] in (14) can be computed analytically. Finally, the proposed MCEM algorithm for estimating the parameters of the QR-NLMM can be summarized as follows: MC E-step: Given θ = θ (k) , for i = 1, . . . , n; • Simulation Step: For ` = 1, . . . , m, draw b(`,k) from f (bi |θθ (k) , yi ), as i described later in Subsection 4.2. • Monte Carlo approximation: Using (5) and the simulated sample above, evaluate E[ g (bi , ui ) |θθ (k) , yi ] ≈

1 m θ (k) , b(`,k) ∑ Eui [ g(b(`,k) i , ui )|θ i , yi ]. m `=1


GALARZA et. al.: Likelihood based inference for quantile regression...

45

(k) (k) M-step: Update θb by maximizing Q(θθ |θb ) ≈ m1 ∑m ∑n `c (θ ; yi , bi(l,k) , ui ) (k) over θb , which leads to the following estimates: (k+1)

c β p

)#−1 m > 1 (k) (k) −1 (`,k) c + × =β ∑ ∑ Ji E (Di ) Ji p i=1 m `=1 " ( )# n (k) 1 m (k)> (`,k) (`,k) , yi − η (βcp , bi ) − ϑ p E (ui )(`,k) ∑ m ∑ 2Ji E (D−1 i ) i=1 `=1 (k)

"

(

(

(k+1) (k+1) 1 m > −1 (`,k) b η (βcp σ (yi −η , b(`,k) (yi η (βcp , b(`,k) ∑ i )) E (D ) i )) m `=1 #) 4 (k+1) τ p (`,k) (`,k)> > 1ni and −2ϑ p (yi η (βcp , bi )) 1ni + E (ui ) 4 " # n m (k+1) > 1 1 (`,k) (`,k) b Ψ = ∑ , ∑ bi bi n i=1 m `=1 (k+1)

1 n = ∑ 3Nτ p2 i=1

n

β p , bi )/∂ β p> , N = ∑ni=1 ni and expressions E (ui )(`,k) and where Ji = ∂ η (β (`,k) E (D−1 are defined in Appendix B. Note that for the MC E-step, we i ) (`,k) need to draw samples bi , ` = 1, . . . , m, from f (bi |θθ (k) , yi ), where m is the number of Monte Carlo simulations to be used, a number suggested to be large enough. A simulation method to draw samples from f (bi |θθ (k) , yi ), is described in Subsection 4.2. 3.2. A SAEM algorithm

As mentioned in Subsection 2.2, the SAEM circumvents the cumbersome problem of simulating a large number of missing values at every iteration, leading to a faster and efficient solution than the MCEM. In summary, the SAEM algorithm proceeds as follows: E-step: Given θ = θ (k) for i = 1, . . . , n; • Stochastic approximation: Update the MC approximations for the conditional expectations by their stochastic approximations, given by


46

ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

(k) S1,i

(k−1) = S1,i + δk

(k) S2,i

(k−1) = S2,i + δk

"

# 1 m (k)> (k−1) −1 (`,k) (k) ∑ Ji E (Di ) Ji − S1,i , m `=1

"

# h ii (k) 1 m h (k)> (k−1) (`,k) −1 (`,k) (`,k) c − S2,i , ∑ 2Ji E (Di ) yi − η (β p , bi ) − ϑ p E (ui ) m `=1

"

(k+1) (k+1) 1 m h > −1 (`,k) η (βcp , b(`,k) ∑ (yi − η (βcp , b(`,k) i )) E (D ) (yi −η i )) m `=1 # # (k+1) τ p4 (k−1) (`,k) > (`,k)> c 1ni − S3,i , bi )) 1ni + E (ui ) −2ϑ p (yi − η (β p 4 " # 1 m (`,k) (`,k)> (k) (k−1) (k−1) S4,i = S4,i + δk ∑ [bi bi ] − S4,i . m `=1

(k)

(k−1)

S3,i = S3,i

+ δk

(k) (k) (k) M-step: Update θb by maximizing Q(θθ |θb ) over θb , which leads to the following expressions: " #−1 n n (k+1) (k) (k) (k) βc = βc + S S , p

p

i=1

b (k+1) = σ

1,i

2,i

i=1

1 n (k) ∑ S3,i , 3Nτ p2 i=1

b (k+1) = 1 Ψ n

n

(k)

∑ S4,i .

(16)

i=1

(0) Given a set of suitable initial values θb (as detailed Appendix A), the SAEM iterates till convergence at iteration k if ) ( (k+1) (k) |θbi − θbi | max < δ2 (17) (k) i |θb | + δ1 i

is satisfied for three consecutive times where δ1 and δ2 are some small values pre established. The consecutive evalution of (17) avoids a fake convergence produced by an unlucky Monte Carlo simulation. Based on (Searle et al., 1992) pag. 269, we use δ1 = 0.001 and δ2 = 0.0001 as suggested by several researchers. The proposed criterion above will need an extreme large number of iterations (more than usual) in order to detect convergence for parameters that are close to the boundary of the parametric space. In this case for variance components, a parameter value close to zero will inflate the ratio in (17) and the convergence will not be attained even though


GALARZA et. al.: Likelihood based inference for quantile regression...

47

the likelihood was maximized with few iterations. As proposed by (Booth & Hobert, 1999) we use also a second convergence criteria besides to the first one, defined by    |θb(k+1) − θb(k) |  i < δ2 , (18) max q i   i (k) c i ) + δ1 var(θ where (18) evaluates the parameter estimates changes relative to their standard errors leading to a convergence detection even for bounded parameters. Also the values δ1 and δ2 are some small values pre established and not necessarily equal to the one for (17). Based on simulation we suggest to fix δ1 = 0.0001 and to test different values for δ2 between 0.0001 and 0.0005 when smaller means more accuracy. We use δ1 = 0.0001 and δ2 = 0.0002 by default which assures us a high accuracy. This stopping criteria is similar to the one proposed by (Bates & Watts, 1981) for Non linear Least Squares. 3.3. Missing data simulation method

In order to draw samples from f (bi |yi , θ ), we utilize the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970), a MCMC algorithm for obtaining a sequence of random samples from a probability distribution for which direct sampling is not possible. The MH algorithm proceeds as follows: Given θ = θ (k) , for i = 1, . . . , n; (0,k)

1. Start with an initial value bi

.

2. Draw b∗i ∼ h(b∗i |bi(`−1,k) ) from a proposal distribution with the same support as the objective distribution f (bi |θθ (k) , yi ). 3. Generate U ∼ U(0, 1). ( ) (k) (0,k) f b∗i |θ ,yi h bi |b∗i 4. If U > min 1 , (0,k) (k) ∗ (0,k) , return to the step 2, else f bi |θ ,yi h bi |bi (`,k)

bi

= b∗i

(2,k) (m,k) 5. Repeat steps 2-4 until m samples (b(1,k) ) are drawn i , bi , . . . , bi (k) from bi |θθ , yi .


48

ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

Note that the marginal distribution f (bi |yi , θ ) (omitting θ ) can be represented as f (bi |yi ) ∝ f (yi |bi ) × f (bi ) , ni Ψ where bi ∼ Nq (0, ) and f (yi |bi ) = ∏ j=1 f (yi j |bi ), with yi j |bi ∼ ALD x> ijβp +zi j bi , σ , p . Since the objective function is a product of two distributions (with both support lying in R), a suitable choice for the proposal density is a multivariate normal distribution with the mean and variancecovariance matrix that are the stochastic approximations of the conditional expectation E(b(k−1) |yi ) and the conditional variance Var(b(k−1) |yi ) respeci i tively, obtained from the last iteration of the SAEM algorithm. This candidate (with possible information about the shape of the target distribution) leads to better acceptance rate, and consequently a faster algorithm. The re(1,k) (2,k) (m,k) sulting chain bi , bi , . . . , bi is a MCMC sample from the marginal (k) conditional distribution f (bi |θθ , yi ). Due the dependent nature of these MCMC samples, at least 10 MC simulations are suggested.

4. Estimation 4.1. Likelihood Estimation

Given the observed data, the likelihood function `o(θ|y) of the model defined in (8)-(9) is given by n

n

`o (θθ |y) = ∑ log f (yi |θθ )) = ∑ log i=1

i=1

Z Rq

f (yi |bi ; θ ) f (bi ; θ ) dbi ,

(19)

where the integral can be expressed as an expectation with respect to bi , i.e., Ebi [ f (yi |bi ; θ )]. The evaluation of this integral is not available analytically and is often replaced by its MC approximation involving a large number of simulations. However, alternative importance sampling (IS) procedures might require a smaller number of simulations than the typical MC procedure. Following (Meza et al., 2012), we can compute this integral using an IS scheme for any continuous distribution fb(bi ; θ ) of bi having the same support as f (bi ; θ ). Re-writing (21) as n

`o (θθ |y) = ∑ log i=1

Z Rq

f (yi |bi ; θ )

f (bi ; θ ) b f (bi ; θ ) dbi . fb(bi ; θ )


GALARZA et. al.: Likelihood based inference for quantile regression...

49

we can express it as an expectation with respect to b∗i , where b∗i ∼ fb(b∗i ; θ ). Thus, the likelihood function can now be expressed as " #) ( ∗(`) n f (b ; θ ) 1 m ni `o (θθ |y) ≈ ∑ log ∑ ∏ [ f (yi j |b∗i (`); θ )] fb(b∗i (`); θ ) , (20) m i=1 `=1 j=1 i

∗(`) ∗(`) where {bi }, l = 1, . . . , m, is a MC sample from fb(b∗i ; θ ), and f (yi |bi ; θ ) ∗(`) i is expressed as ∏nj=1 f (yi j |bi ; θ ) due to independence. An efficient choice for fb(bi∗(`) ; θ ) is f (bi |yi ). Therefore, we use the same proposal distribu∗(`) b b ), b bi , Σ tion discussed in Subsection 4.2, and generate samples bi ∼ Nq (µ i (w) b b bi = E(b(w) where µ |y ) and Σ = Var(b |y ), which are estimated empirii i bi i i cally during the last few iterations of the SAEM at convergence.

4.2. Missing data simulation method

In order to draw samples from f (bi |yi , θ ), we utilize the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970), a MCMC algorithm for obtaining a sequence of random samples from a probability distribution for which direct sampling is not possible. The MH algorithm proceeds as follows: Given θ = θ (k) , for i = 1, . . . , n; (0,k)

1. Start with an initial value bi

.

2. Draw b∗i ∼ h(b∗i |bi(`−1,k) ) from a proposal distribution with the same support as the objective distribution f (bi |θθ (k) , yi ). 3. Generate U ∼ U(0, 1). ( ) (k) (0,k) f b∗i |θ ,yi h bi |b∗i 4. If U > min 1 , (0,k) (k) ∗ (0,k) , return to the step 2, else f bi |θ ,yi h bi |bi (`,k)

bi

= b∗i

(2,k) (m,k) 5. Repeat steps 2-4 until m samples (b(1,k) ) are drawn i , bi , . . . , bi (k) from bi |θθ , yi .


ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

50

Note that the marginal distribution f (bi |yi , θ ) (omitting θ ) can be represented as f (bi |yi ) ∝ f (yi |bi ) × f (bi ) ,

i f (yi j |bi ), with yi j |bi ∼ ALD η(Ai β p where bi ∼ Nq (0, Ψ ) and f (yi |bi ) = ∏nj=1 +Bi bi , xi j ), σ , p . Since the objective function is a product of two distributions (with both support lying in R), a suitable choice for the proposal density is a multivariate normal distribution with the mean and variancecovariance matrix that are the stochastic approximations of the conditional expectation E(b(k−1) |yi ) and the conditional variance Var(b(k−1) |yi ) respeci i tively, obtained from the last iteration of the SAEM algorithm. This candidate (with possible information about the shape of the target distribution) leads to better acceptance rate, and consequently a faster algorithm. The re(1,k) (2,k) (m,k) sulting chain bi , bi , . . . , bi is a MCMC sample from the marginal (k) conditional distribution f (bi |θ , yi ). Due the dependent nature of these MCMC samples, at least 10 MC simulations are suggested.

5. Estimation 5.1. Likelihood Estimation

Given the Abserved Aata, the likelihood Aunction ` o(θ|y) Af the model de-fined in (8)-(9) is Aiven by n

n

`o (θθ |y) = ∑ log f (yi |θθ )) = ∑ log i=1

i=1

Z Rq

f (yi |bi ; θ ) f (bi ; θ ) dbi ,

(21)

where the integral can be expressed as an expectation with respect to bi , i.e., Ebi [ f (yi |bi ; θ )]. The evaluation of this integral is not available analytically and is often replaced by its MC approximation involving a large number of simulations. However, alternative importance sampling (IS) procedures might require a smaller number of simulations than the typical MC procedure. Following Meza et al. (2012), we can compute this integral using an IS scheme for any continuous distribution fb(bi ; θ ) of bi , having the same support as f (bi ; θ ). Re-writing (21) as n

`o (θθ |y) = ∑ log i=1

Z Rq

f (yi |bi ; θ )

f (bi ; θ ) b f (bi ; θ ) dbi . fb(bi ; θ )


GALARZA et. al.: Likelihood based inference for quantile regression...

51

we can express it as an expectation with respect to b∗i , where b∗i ∼ fb(b∗i ; θ ). Thus, the likelihood function can now be expressed as " #) ( ∗(`) n f (b ; θ ) 1 m ni `o (θθ |y) ≈ ∑ log ∑ ∏ [ f (yi j |b∗i (`); θ )] fb(b∗i (`); θ ) , (22) m i=1 `=1 j=1 i

∗(`) ∗(`) where {bi }, l = 1, . . . , m, is a MC sample from fb(b∗i ; θ ), and f (yi |bi ; θ ) ∗(`) i is expressed as ∏nj=1 f (yi j |bi ; θ ) due to independence. An efficient choice for fb(bi∗(`) ; θ ) is f (bi |yi ). Therefore, we use the same proposal distribution ∗(`) b b ), b bi , Σ discussed in Subsection 4.2, and generate samples bi ∼ Nq (µ i b b bi = E(b(w) where µ |y ) and Σ = Var(b |y ), which are estimated empiri i i bi i ically during the last few iterations of the SAEM at convergence.

5.2. Standard error approximation

Louis’ missing information principle (Louis, 1982) relates the score function of the incomplete data log-likelihood with the complete data log-likelihood ∇c (θθ ; Ycom |Yobs )], where through the conditional expectation ∇o (θθ ) = Eθ [∇ ∇ o (θ ) = ∂ `o (θθ ; Yobs )/∂ θ and ∇ c (θθ ) = ∂ `c (θ ; Ycom )/∂ θ are the score functions for the incomplete and complete data, respectively. As defined in Meilijson (1989), the empirical information matrix can be computed as n

1 Ie (θθ |y) = ∑ s(yi |θθ ) s> (yi |θb ) − S(y|θθ ) S> (y|θθ ), n i=1

(23)

where S(y|θθ ) = ∑ni=1 s(yi |θθ ) and s(yi |θθ ) is the empirical score function for the i-th individual. Replacing θ by its ML estimator θ̂θ and considering ∇ o (θ̂θ ) = 0, equation (23) takes the simple form n

Ie (θb |y) = ∑ s(yi |θb ) s> (yi |θb ).

(24)

i=1

At the kth iteration, the empirical score function for the i-th subject can be computed as " # m 1 s(yi |θθ )(k) = s(yi |θθ )(k−1) + δk ∑ s(yi, q(k,`); θ (k)) − s(yi|θθ )(k−1) , m `=1 (25)


52

ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

where q(`,k) , ` = 1, . . . , m, are the simulated missing values drawn from the conditional distribution f (·|θ (k−1) , yi ). Thus, at iteration k, the observed information matrix can be approximated as Ie (θθ |y)(k) = ∑ni=1 s(yi |θθ )(k) s> (yi |θθ )(k) , b θ |y)| b )−1 is an estimate of the such that at convergence, I−1 e (θ |y) = (Ie (θ θ =θ covariance matrix of the parameter estimates. Expressions for the elements of the score vector with respect to θ are given in Appendix A. 6. Simulated data

In order to examine the performance of the proposed method, here we present some simulation studies. The first simulation study shows that the ML estimates based on the SAEM algorithm do provide good asymptotic properties. The second study investigates the consequences for population inferences when the normality assumption is inappropriate. We used heavy tailed distribution for the random error term in order to test the robustness of the proposed method in terms of parameter recovery. Figure 2. Illustration of the effect of including the random effect b1i in the first parameter of the nonlinear growth-curve logistic model. 20

Inclusion of b2

20

Inclusion of b1

10

Leaf weight (g)

15

b2=− 6 b2=− 4 b2=− 2 b2=0 b2=2 b2=4 b2=6

0

5

10 0

5

Leaf weight (g)

15

b1=− 3 b1=− 2 b1=− 1 b1=0 b1=1 b1=2 b1=3

20

30

40

50

60

70

20

30

40

50

60

70

Time since planting (days)

Time since planting (days)

Figure 4: Result of the s in soybean plants hypothetic

10 8 6 4

10 8 6

b2=− 0.45 b2=− 0.3 b2=− 0.15 b2=0 b2=0.15 b2=0.3 b2=0.45

2

4

6

1

4

1

2

8

1

Theophylline concentration (mg/L)

10

1 1

2

Theophylline concentration (mg/L)

1

Theophylline concentration (mg/L)

b1=− 1.2

As in Pinheiro & Bates (1995), we performedb the =− 0.8 first simulation study with b =− 0.4 b =0 the following three parameter nonlinear growth-curve logistic model: b =0.4 b =0.8 β1 + b1i b =1.2 yi j = + εi j , i = 1, . . . , n, j = 1, . . . , 10, (26) 1 + exp (−[ti j − β2 ]/β3 )

12

Inclusion of b2 12

Inclusion of b1

12

6.1. Asymptotic properties


GALARZA et. al.: Likelihood based inference for quantile regression...

53

where ti j = 100, 267, 433, 600, 767, 933, 1100, 1267, 1433, 1600 for all i. The goal is to estimate the fixed effects parameters β ’s for a grid of percentiles p = {0.50, 0.75, 0.95}. A random effects b1i was added to the first growth parameter β1 and its effect over the growth-curve is shown in Figure 4. Parameters interpretation for this model is going to be discussed in the Application Section. The random effects b1i and the error ε i = (εi1 . . . , εi10 )> iid iid are non-correlated been b1i ∼ N(0, σb2 ) and εi j ∼ ALD(0, σe , p). We set β p = (β1 , β2 , β3 )> = (200, 700, 350)> , σe = 0.5, σb2 = 10. Using the notation in (7) the matrices Ai and Bi are given by I3 and (1, 0, 0)> respectively. For varying sample sizes of n = 25, 50, 100 and 200, we generate 100 data samples for each scenario. In addition, we also choose m = 20, c = 0.25 and W = 500 for the SAEM convergence parameters. For all scenarios, we compute the square root of the mean square error (RMSE), the bias (Bias) and the Monte carlo standard deviation (MC-Sd) for each parameter over the 100 replicates. They are defined as v u 100 u1 2 ( j) and Bias(θbi ) = θbi − θi (27) MC-Sd(θbi ) = t ∑ θbi − θbi 99 j=1 q b where RMSE(θi ) = MC-Sd2 (θbi ) + Bias2 (θbi ), the Monte carlo mean θbi = 1 100 b( j) (MC Mean) and θi ( j) is the estimate of θi from the j-th sam100 ∑ j=1 θi ple, j = 1 . . . 100. Based on Figure 3, for the bias we can see a patterns of convergence to zero when n increases for both parameters. The values of MC-Sd and RMSE decrease monotonically when n is increased where it is evident that for extreme quantiles estimating, the standard deviation is much higher while for quantiles q = 50 and q = 75 are asymptotically equal. The worst scenario seems to happen while estimating extreme quantiles and maybe a sample size greater than 200 is needed to obtain a reasonably reduction of bias and SD. However, as a general rule, we can say that bias and MSE tend to approach to zero when the sample size is increasing, indicating that the approximates ML estimates based on the proposed SAEM algorithm do provide good asymptotic properties. The parameter β1 has been discarded in the graphical analysis because it varies along quantiles so its bias too as seen in Table 1. This parameter represents the asymptotic growth so this parameter is highly susceptible to the quantile to be estimated, however it also provides good asymptotic properties for its standard deviation. Table 1 also show an excellent recovery for the nuisance parameter σe , small standard deviations and good asymptotic


ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

54

properties in terms of bias and SD. Table 1. Results based on 100 simulated samples. Monte carlo mean and standard deviation (MC Mean and MC-Sd) for the fixed effects β1 . β2 . β3 and the nuisance parameter σe . obtained after fitting the QR-NLMM model under different settings of quantiles and sample sizes. Quantile (%) 50

75

95

n 25 50 100 200 25 50 100 200 25 50 100 200

β1 MC Mean 199.75 199.79 200.16 200.03 203.77 203.90 204.20 204.34 201.15 201.77 201.94 202.11

β2 MC-Sd MC Mean (2.35) 700.19 (1.69) 700.09 (1.15) 700.08 (0.75) 699.96 (2.50) 700.18 (1.81) 700.20 (1.31) 699.83 (0.92) 700.00 (2.79) 700.26 (2.15) 700.53 (1.56) 700.18 (1.08) 700.06

β3 MC-Sd MC Mean (2.00) 350.13 (1.29) 350.03 (0.92) 350.06 (0.64) 349.98 (2.07) 350.15 (1.60) 350.16 (1.08) 349.88 (0.70) 350.01 (6.52) 350.14 (4.84) 349.74 (3.55) 349.73 (2.60) 349.98

σe MC-Sd MC Mean (1.35) 0.503 (0.86) 0.498 (0.72) 0.497 (0.50) 0.499 (1.56) 0.499 (1.11) 0.495 (0.74) 0.499 (0.49) 0.498 (3.92) 0.506 (2.83) 0.508 (2.32) 0.505 (1.54) 0.502

MC-Sd (0.035) (0.021) (0.017) (0.012) (0.035) (0.025) (0.017) (0.011) (0.035) (0.024) (0.015) (0.012)

100

200

5 4 3 0

1

1 0 25

2

RMSE( β2)

4 2

3

SD( β2)

0.3 0.1 −0.1

BIAS( β2)

5

6

6

0.5

Figure 3. Bias, Standard Deviation and RMSE for β1 (upper panel) and β2 (lower panel) for varying sample sizes over the quantiles p = 0.50, 0.90, 0.95.

25

100

200

25

n

100

200

n

3 1 0

0 25

100 n

200

2

RMSE( β3)

1

2

SD( β3)

0.0 −0.2 −0.1

BIAS( β3)

3

0.1

4

4

n

25

100 n

200

25

100 n

200


55

GALARZA et. al.: Likelihood based inference for quantile regression...

6.2. Robustness study Figure 4. Illustration of 50 simulated curves from the growth-curve logistic model using different distributions for the random effect term. From left to right panel, the random effects has been generated from a Normal, a Student t4 and a Contaminated Normal(ν1 = 0.1,ν2 = 0.1), all with location parameter µ = 0 and scale parameter σb2 = 10. 500

1000

1500 Contaminated Normal

50 100 150 200

Student-t

0

growth (cm)

Normal

500

1000

1500

time (da ys)

500

1000

1500

The goal of this simulation study is to asses the robustness or bias incurred when one assumes a normal distribution for random effects and the Table 2. Results based on 100 simulated samples. MC Mean, Bias, MC-Sd and RMSE for the fixed effects β1 , β2 , β3 and the nuisance parameter σe obtained after fitting the QR-NLMM for quantiles 0.50 and 0.75 using four different distribution settings for the random effects. Fit

Student-t4

MC Mean Bias MC-Sd RMSE

Contamination 10% MC Mean Bias MC-Sd RMSE 20% MC Mean Bias MC-Sd RMSE 30% MC Mean Bias MC-Sd RMSE

Quantile 50% β1 β2 β3 (200) (700) (350) 200.22 700.00 349.99 0.22 0.00 -0.01 (1.98) (1.28) (0.98) 1.99 1.28 0.98 199.87 -0.13 (1.90) 1.90 200.05 0.05 (1.96) 1.96 200.16 0.16 (2.10) 2.11

700.10 0.10 (1.26) 1.27 699.91 -0.09 (1.28) 1.28 700.06 0.06 (1.05) 1.05

σe (0.5) 0.501 0.001 (0.024) 0.024

349.9 0.499 -0.1 -0.001 (0.88) (0.024) 0.88 0.024 350.08 0.497 0.08 -0.003 (0.90) (0.024) 0.90 0.024 350.07 0.496 0.07 -0.004 (0.93) (0.024) 0.93 0.024

Quantile 75% β1 β2 β3 (200) (700) (350) 204.43 700.39 350.18 4.43 0.39 0.18 (2.17) (1.69) (1.09) 4.93 1.74 1.11

σe (0.5) 0.501 0.001 (0.024) 0.024

205.02 700.18 350.05 0.501 5.02 0.18 0.05 0.001 (1.92) (1.80) (1.16) (0.024) 5.38 1.81 1.16 0.024 205.35 700.20 350.11 0.496 5.35 0.20 0.11 -0.004 (2.00) (1.55) (1.19) (0.023) 5.71 1.56 1.20 0.023 206.63 699.91 350.01 0.497 6.63 -0.09 0.01 -0.003 (2.60) (1.60) (1.06) (0.022) 7.13 1.60 1.06 0.023


56

ESTADÍSTICA (2015), 67, 188 y 189, pp. 33-74

actual distribution belongs to a heavy tailed distributions. The use of heavy tailed distributions for the random effects will let us to simulate the presence of outliers leading us to test adequately the performance of the proposed method in terms of robustness. The design of this simulation study is as in the previous subsection but for a set of quantiles {0.50, 0.75} and a fixed sample size n = 50 we are going to simulate 100 Monte Carlo samples generating the random effect term from a Student-t distribution with ν = 4 degrees of freedom and from a Normal Contaminated distribution (ν1 = 0.1, ν2 = {0.1, 0.2, 0, 3}), i.e., with three scenarios of contamination, 10%, 20% and 30%. All simulations are created by using the same values of β p = (200, 700, 350)> , nuisance parameter σe = 0.5 and scale parameter σb2 = 10 for the respectively random effect distribution. From Table 2 we can see that the proposed model is really robust even for worst scenarios of contamination. The parameter recovery is highly accurate even for the non-centered quantile 0.75. For quantile 0.75, the β1 parameter tends to increase for higher levels of contamination. As expected, the MC-Sd and consequently the RMSE increase in presence of outliers. As a general rule, we can conclude that the proposed model is robust in presence of outliers or misspecification of the random effect distribution. 7. Illustrative examples

7. Illustrative examples

In this section, we illustrate the application of our method on two interesting longitudinal datasets from the literature.

7.1. Growth curve: Soybean data

For the first application, we consider the Soybean genotypes data analyzed by Davidian & Giltinan (1995) and Pinheiro & Bates (2000), a longitudinal experiment in which the leaf weight (in g) was measured over time as a measure of growth for two soybean genotypes to be compared: a commercial variety, Forrest (F), and an experimental strain, Plant Introduction #416937 (P). Samples were taken approximately weekly over 8 to 10 weeks. For three consecutive years (1988, 1989 and 1990) the plants were planted in 16 plots (8 per genotype), and the mean leaf weight of six randomly selected plants per plot was measured. We use the three-parameter logistic model in (26), introducing a random effect for each parameter and a dichotomous covariate, as follows.



Figure 5. Soybean data: (a) leaf weight profiles versus time; (b) leaf weight profiles versus time by genotype; (c) ten randomly selected leaf weight profiles versus time, five per genotype. [Axes: time (weeks) versus average leaf weight (gr).]

$$y_{ij} = \frac{\varphi_{1i}}{1 + \exp\{-(t_{ij} - \varphi_{2i})/\varphi_{3i}\}} + \varepsilon_{ij}, \quad i = 1, \ldots, 48, \; j = 1, \ldots, n_i, \qquad (28)$$

where
$$\varphi_{1i} = \beta_1 + \beta_4\,\mathrm{gen}_i + b_{1i}, \qquad \varphi_{2i} = \beta_2 + b_{2i}, \qquad \varphi_{3i} = \beta_3 + b_{3i}.$$
The observed value y_ij represents the mean weight of the leaves (in g) of six randomly selected soybean plants in the ith plot, t_ij days after being planted; gen_i is a dichotomous variable for the genotype of plot i (0 = Forrest, 1 = Plant Introduction) and ε_ij is the measurement error. Let β_p = (β1, β2, β3, β4)ᵀ and b_i = (b_{1i}, b_{2i}, b_{3i})ᵀ be the fixed and random effects vectors, respectively. Then the matrices A_i and B_i are defined as

$$A_i = \begin{pmatrix} 1 & 0 & 0 & \mathrm{gen}_i \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \quad \text{and} \quad B_i = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (29)$$

The three parameters are interpreted as the asymptotic leaf weight, the time at which the leaf reaches half of its asymptotic weight, and the time elapsed between the leaf reaching half and 1/(1 + e⁻¹) ≈ 0.7311 of its asymptotic weight, respectively.
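As a minimal sketch (ours, not code from the paper), the mean function in (28)-(29) can be written in the x/fixed/random/covar form that the qrNLMM package prints in Appendix E; the layout of the genotype term is our reading of (29).

logist3 <- function(x, fixed, random, covar = NA) {
  phi1 <- fixed[1] + fixed[4] * covar + random[1]  # asymptotic leaf weight
  phi2 <- fixed[2] + random[2]                     # time to half the asymptote
  phi3 <- fixed[3] + random[3]                     # growth time-scale
  phi1 / (1 + exp(-(x - phi2) / phi3))
}

# One time-scale unit past phi2, the curve reaches 1/(1 + exp(-1)) = 0.7311
# of its asymptote, matching the interpretation given above:
logist3(58.5, fixed = c(20, 50, 8.5, 0), random = c(0, 0, 0), covar = 0) / 20
# [1] 0.7310586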


Figure 6. Fitted quantile regression curves for several quantiles for the Soybean data, by genotype. [Panels: Forrest, Plant Introduction; axes: time (weeks) versus average leaf weight (gr).]
Given the goal of comparing the final (asymptotic) growth of the two kinds of soybean, the dichotomous covariate gen_i was incorporated in the first component of the growth function, so the fourth fixed effect β4 represents the difference (in g) in asymptotic leaf weight between the Plant Introduction genotype and the Forrest (control) one. As seen in the middle and right panels of Figure 5, there appears to be a significant difference between the experimental and control soybeans, so we expect a positive, non-zero β4 estimate for most quantiles. Figure 6 shows the fitted regression curves for quantiles 0.10, 0.25, 0.50, 0.75 and 0.90 by genotype. From this figure we can see clearly how the extreme quantile functions capture the full variability of the data and reveal some atypical observations, especially for the Plant Introduction genotype. For a given quantile level, the fitted quantile functions look quite different across genotypes, owing to the significance of β4 in the model, as seen in Figure 7. After fitting the quantile regression over the grid p = {0.05, 0.10, ..., 0.95}, we show a graphical summary of the results in Figure 7. We assessed the convergence of the fixed effect estimates, the variance components of the random effects and the nuisance parameter using graphical criteria, as shown in Figure 11 in Appendix D. Figure 7 shows 95% confidence bands for the fixed effect parameters β1, β2, β3, β4 and for the nuisance parameter σ, where the solid lines are the Q0.025 and Q0.975 percentiles obtained from the standard errors based on the empirical information matrix.


Figure 7. Point estimates (center solid line) and 95% confidence intervals for the model parameters after fitting the QR to the Soybean data across various quantiles. The interpolated curves are spline-smoothed. [Panels: β1, β2, β3, β4 and σ versus quantiles.]

The effect of the genotype is significant across the entire quantile profile, and the difference varies with the conditional quantile, being more pronounced for the lower quantiles. This can be corroborated in Figure 6, where the difference between the estimated 0.10 quantile functions of the two genotypes is greater than for the other quantiles. Using the information provided by the 95th percentile, we infer that the soybean plants that grew the most have a mean leaf weight of around 19.35 grams for the Forrest genotype and 23.25 grams for the Plant Introduction one, so the asymptotic difference between the two genotypes is around 4 grams.


The estimate of the nuisance parameter σ behaves symmetrically with respect to p = 0.50, taking its maximum value and variability there, with both decreasing towards the extreme quantiles. This happens because the within-subject variance depends on the quantile being estimated, being proportional to the asymmetry of the error term, so for extreme quantiles the nuisance parameter is reduced.

7.2. HIV viral load study

The dataset comes from a clinical trial (ACTG 315) studied previously by Wu (2002) and Lachos et al. (2013). In this study, we analyze the HIV viral load of 46 HIV-1 infected patients under antiretroviral treatment (protease inhibitor and reverse transcriptase inhibitor drugs). The viral load and some other covariates were measured several times after the start of treatment, with a minimum of 4 and a maximum of 10 measurements per patient. Wu (2002) found that the only significant covariate for modelling the viral load was the CD4 cell count, so the other covariates, although they could be incorporated into the model, are discarded here. Figure 8 shows the profiles of viral load in log10 scale and of CD4 cell count (in cells/100 mm³) versus time (in days/100) for six randomly selected patients. There appears to be a relationship between the viral load and the CD4 cell count, and it seems to be inversely proportional: a high CD4 cell count is associated with lower levels of viral load. This is because the CD4 cells (also called T-cells) alert the immune system to invasion by viruses and/or bacteria, so a lower CD4 count means a weaker immune system. Normal CD4 counts range from 500 to 1000 cells per cubic millimeter, whereas counts below 200 cells/mm³ are a strong criterion for diagnosing AIDS. This can be seen in the right panel of Figure 8, where the three patients with fewer than 200 CD4 cells/mm³ (delimited by the horizontal dashed line at 0.02) are the ones with the highest levels of viral load. In order to fit the nonlinear data we use the nonlinear model proposed by Wu (2002) and also used by Lachos et al. (2013). The proposed bi-exponential NLME model is given by

$$y_{ij} = \log_{10}\!\left( e^{\varphi_{1i} - \varphi_{2i} t_{ij}} + e^{\varphi_{3i} - \varphi_{4ij} t_{ij}} \right) + \varepsilon_{ij}, \quad i = 1, \ldots, 46, \; j = 1, \ldots, n_i, \qquad (30)$$
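A minimal sketch (ours) of the bi-exponential mean function in (30), in the same x/fixed/random/covar form as before; here covar carries the CD4 count at each measurement time.

biexp <- function(x, fixed, random, covar) {
  phi1 <- fixed[1] + random[1]                     # first-phase level
  phi2 <- fixed[2] + random[2]                     # first-phase decay rate
  phi3 <- fixed[3] + random[3]                     # second-phase level
  phi4 <- fixed[4] + fixed[5] * covar + random[4]  # second-phase decay via CD4
  log10(exp(phi1 - phi2 * x) + exp(phi3 - phi4 * x))
}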


Figure 8. ACTG 315 data: profiles of viral load (response) in log10 scale and CD4 cell count (in cells/100 mm³) for ten randomly selected patients. [Axes: time since infection versus log10 HIV RNA (left panel) and CD4 count (right panel).]

with
$$\varphi_{1i} = \beta_1 + b_{1i}, \quad \varphi_{2i} = \beta_2 + b_{2i}, \quad \varphi_{3i} = \beta_3 + b_{3i}, \quad \varphi_{4ij} = \beta_4 + \beta_5\,\mathrm{CD4}_{ij} + b_{4i},$$
where the observed value y_ij represents the log10 transformation of the viral load for the ith patient at time j, CD4_ij is the CD4 cell count (in cells/100 mm³) for the ith patient at time j, and ε_ij is the measurement error. Let β_p = (β1, β2, β3, β4, β5)ᵀ and b_i = (b_{1i}, b_{2i}, b_{3i}, b_{4i})ᵀ be the fixed and random effects vectors, respectively, and CD4_i = (CD4_{i1}, ..., CD4_{in_i})ᵀ. Then the matrices A_i and B_i are defined as

$$A_i = \begin{pmatrix} I_3 & 0 & 0 \\ 0 & 1_{n_i} & \mathrm{CD4}_i \end{pmatrix} \quad \text{and} \quad B_i = \begin{pmatrix} I_3 & 0 \\ 0 & 1_{n_i} \end{pmatrix}. \qquad (31)$$

The parameters φ2i and φ4ij are the two-phase viral decay rates, which represent the minimum turnover rates of productively infected cells and of latently or long-lived infected cells, respectively, if therapy is successful. For more details about the model in (30) see Grossman et al. (1999) and Perelson et al. (1997). Figure 9 shows the fitted regression curves for quantiles 0.10, 0.25, 0.50, 0.75 and 0.90 for the HIV data. In order to plot them, we first fixed the CD4 covariate at the sequence predicted by a linear regression (including a quadratic term) explaining the CD4 cell count as a function of time.
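As a sketch of that plotting device (the data-frame and variable names hiv, cd4 and time are hypothetical):

fit_cd4 <- lm(cd4 ~ time + I(time^2), data = hiv)    # quadratic trend in time
grid     <- data.frame(time = seq(0, 2, length.out = 100))
grid$cd4 <- predict(fit_cd4, newdata = grid)          # CD4 values fixed for plotting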


Figure 9. ACTG 315 data: fitted quantile regression functions overlaid on the HIV data. [Axes: days since infection versus log10 HIV RNA.]

We can see that the estimated quantile functions follow the behaviour of the data satisfactorily, making it straightforward to estimate a specific viral load quantile at any time of the experiment. The extreme quantile functions bound most of the observed profiles and reveal possible influential observations. The results after fitting the QR over the grid of quantiles p = {0.05, 0.10, ..., 0.95} are shown in Figure 10. The convergence of the estimates of all parameters was also assessed using the graphical criteria in Figure 12 in Appendix D. We found that the first-phase viral decay rate is positive and its effect tends to increase across quantiles. The second-phase viral decay rate is positively correlated with the CD4 count, and therefore with therapy time; thus, more days of treatment imply a higher CD4 cell count and consequently a higher second-phase viral decay. The CD4 cell process in this model behaves differently than in the expansion phase (Huang & Dagne, 2011). The significance of the CD4 covariate increases with the quantile (up to approximately p = 0.60) and then its effect remains constant for higher quantiles. The estimate of the nuisance parameter σ behaves as in the first application.

8. Conclusions

In this paper, we investigate quantile regression for nonlinear mixed effects models from a likelihood-based perspective.


Figure 10. ACTG 315 data: point estimates (center solid line) and 95% confidence intervals for the model parameters after fitting the QR-NLMM to the HIV data across various quantiles. The interpolated curves are spline-smoothed. [Panels: β1, β2, β3, β4, β5 and σ versus quantiles.]

The asymmetric Laplace distribution and the SAEM algorithm are combined to propose an exact ML estimation method, in contrast to the approximate method proposed by Geraci & Bottai (2014). We evaluate the robustness of the estimates, as well as the finite-sample performance of the algorithm and the asymptotic properties of the ML estimates, through empirical experiments and applications to two real datasets. We believe this paper is the first attempt at exact ML estimation in the context of QR-NLMMs. The methods developed can be readily used in R through the package qrNLMM.


There are a number of possible extensions of the current work. For modelling both skewness and long tails in the random effects, the scale mixtures of skew-normal (SMSN) distributions (Lachos et al., 2010) are a feasible choice. Also, HIV viral load studies include covariates (viz. CD4 cell counts) that often come with substantial measurement error (Wu, 2002). How to incorporate measurement error in covariates within our robust framework can also be part of future research. An in-depth investigation of such extensions is beyond the scope of the present paper, but it is certainly an interesting topic for future research.

Acknowledgements

The research of V. H. Lachos was supported by Grant 305054/2011-2 from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq-Brazil) and by Grant 2014/02938-9 from Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP-Brazil).

Appendix A

Specification of initial values

It is well known that a smart choice of initial values for the ML estimates can ensure fast convergence of the algorithm to the global maximum. Setting the random effects term to zero, i.e., b_i = 0, let y_i ∼ ALD(η(β_p, 0), σ, p). Then, considering the ML estimates of β_p and σ for this model as defined in Yu & Zhang (2005), we follow the steps below for the QR-NLMM implementation:

1. Compute an initial value β̂_p^(0) as
$$\widehat{\beta}_p^{(0)} = \arg\min_{\beta_p \in \mathbb{R}^k} \sum_{i=1}^{n} \rho_p\big(y_i - \eta(\beta_p, 0)\big).$$

2. Using the initial value β̂_p^(0) obtained above, compute σ̂^(0) as
$$\widehat{\sigma}^{(0)} = \frac{1}{n} \sum_{i=1}^{n} \rho_p\big(y_i - \eta(\widehat{\beta}_p^{(0)}, 0)\big).$$

3. Use the q × q identity matrix I_q as the initial value Ψ^(0).
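A minimal R sketch (ours) of these three steps, assuming a user-supplied nonlinear mean function eta(t, beta), a rough starting point start, and q random effects:

rho_p <- function(r, p) r * (p - (r < 0))      # check (loss) function

init_qrnlmm <- function(y, t, eta, start, q, p = 0.5) {
  obj   <- function(beta) sum(rho_p(y - eta(t, beta), p))
  beta0 <- optim(start, obj)$par               # step 1: beta_p^(0)
  list(beta  = beta0,
       sigma = obj(beta0) / length(y),         # step 2: sigma^(0)
       Psi   = diag(q))                        # step 3: Psi^(0) = I_q
}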



Appendix B

Computing the conditional expectations

Since u_ij | y_ij, b_i and u_ik | y_ik, b_i are independent for all j, k = 1, 2, ..., n_i with j ≠ k, we can write u_i | y_i, b_i = (u_{i1} | y_{i1}, b_i, u_{i2} | y_{i2}, b_i, ..., u_{in_i} | y_{in_i}, b_i)ᵀ. Using this fact, we can compute the conditional expectations E(u_i) and E(D_i^{-1}) as follows. By standard matrix expectation properties,

$$\mathrm{E}(u_i) = \big(\mathrm{E}(u_{i1})\;\; \mathrm{E}(u_{i2})\;\; \cdots\;\; \mathrm{E}(u_{in_i})\big)^{\top} \qquad (B.1)$$

and

$$\mathrm{E}(D_i^{-1}) = \mathrm{diag}\big(\mathrm{E}(u_{i1}^{-1}), \mathrm{E}(u_{i2}^{-1}), \ldots, \mathrm{E}(u_{in_i}^{-1})\big). \qquad (B.2)$$

We already have u_ij | y_ij, b_i ∼ GIG(½, χ_ij, ψ), where χ_ij and ψ are defined in (15). Then, using (5), the moments involved in the equations above are
$$\mathrm{E}(u_{ij}) = \sqrt{\frac{\chi_{ij}}{\psi}}\left(1 + \frac{1}{\sqrt{\chi_{ij}\,\psi}}\right) \quad \text{and} \quad \mathrm{E}(u_{ij}^{-1}) = \sqrt{\frac{\psi}{\chi_{ij}}}.$$

Thus, at iteration k of the algorithm and for the ℓth Monte Carlo realization, we can compute E(u_i)^{(ℓ,k)} and E(D_i^{-1})^{(ℓ,k)} using equations (B.1)-(B.2), where
$$\mathrm{E}(u_{ij})^{(\ell,k)} = \frac{2\,\big|y_{ij} - \eta_{ij}(\beta_p^{(k)}, b_i^{(\ell,k)})\big| + 4\sigma^{(k)}}{\tau_p^2} \quad \text{and} \quad \mathrm{E}(u_{ij}^{-1})^{(\ell,k)} = \frac{\tau_p^2}{2\,\big|y_{ij} - \eta_{ij}(\beta_p^{(k)}, b_i^{(\ell,k)})\big|}.$$
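In code these moments are immediate; a small R helper (ours, not from the package), taking τ_p² = 2/(p(1 − p)) as in the usual ALD mixture representation:

gig_moments <- function(y, eta, sigma, p) {
  tau2 <- 2 / (p * (1 - p))                 # tau_p^2 of the ALD representation
  r <- abs(y - eta)                         # |y_ij - eta_ij(beta_p, b_i)|
  list(Eu    = (2 * r + 4 * sigma) / tau2,  # E(u_ij | y_ij, b_i)
       Euinv = tau2 / (2 * r))              # E(u_ij^{-1} | y_ij, b_i)
}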

Appendix C

The empirical information matrix

In light of (11), the complete log-likelihood function can be rewritten as
$$\ell_{ci}(\theta) = -\frac{3 n_i}{2}\log\sigma - \frac{1}{2\sigma\tau_p^2}\,\zeta_i^{\top} D_i^{-1} \zeta_i - \frac{1}{2}\log|\Psi| - \frac{1}{2}\, b_i^{\top} \Psi^{-1} b_i - \frac{1}{\sigma}\, u_i^{\top} 1_{n_i}, \qquad (C.1)$$
where ζ_i = y_i − η(β_p, b_i) − ϑ_p u_i and θ = (β_pᵀ, σ, αᵀ)ᵀ. Differentiating with respect to θ, we have the following score functions:
$$\frac{\partial \ell_{ci}(\theta)}{\partial \beta_p} = \frac{\partial \eta}{\partial \beta_p} \frac{\partial \zeta_i}{\partial \eta} \frac{\partial \ell_{ci}(\theta)}{\partial \zeta_i} = \frac{1}{\sigma \tau_p^2}\, J_i^{\top} D_i^{-1} \zeta_i,$$


with J_i defined in Section 3.2, and
$$\frac{\partial \ell_{ci}(\theta)}{\partial \sigma} = -\frac{3 n_i}{2\sigma} + \frac{1}{2\sigma^2 \tau_p^2}\, \zeta_i^{\top} D_i^{-1} \zeta_i + \frac{1}{\sigma^2}\, u_i^{\top} 1_{n_i}.$$
Let α be the vector of reduced parameters from Ψ, the dispersion matrix of b_i. Using properties of the trace and differentiating the complete log-likelihood function, we have

$$\frac{\partial \ell_{ci}(\theta)}{\partial \Psi} = -\frac{1}{2}\frac{\partial}{\partial \Psi}\log|\Psi| - \frac{1}{2}\frac{\partial}{\partial \Psi}\,\mathrm{tr}\{\Psi^{-1} b_i b_i^{\top}\} = -\frac{1}{2}\Psi^{-1} + \frac{1}{2}\Psi^{-1} b_i b_i^{\top} \Psi^{-1} = \frac{1}{2}\Psi^{-1}\big(b_i b_i^{\top} - \Psi\big)\Psi^{-1}.$$

Next, taking derivatives with respect to a specific element α_j of α and using the chain rule, we have
$$\frac{\partial \ell_{ci}(\theta)}{\partial \alpha_j} = \mathrm{tr}\left\{ \frac{\partial \ell_{ci}(\theta)}{\partial \Psi}\, \frac{\partial \Psi}{\partial \alpha_j} \right\} = \frac{1}{2}\,\mathrm{tr}\left\{ \Psi^{-1}\big(b_i b_i^{\top} - \Psi\big)\Psi^{-1}\, \frac{\partial \Psi}{\partial \alpha_j} \right\}, \qquad (C.2)$$
where, using the fact that tr{ABCD} = (vec(Aᵀ))ᵀ (Dᵀ ⊗ B) vec(C), (C.2) can be rewritten as
$$\frac{\partial \ell_{ci}(\theta)}{\partial \alpha_j} = \frac{1}{2}\left(\mathrm{vec}\Big(\frac{\partial \Psi}{\partial \alpha_j}\Big)\right)^{\top} \big(\Psi^{-1} \otimes \Psi^{-1}\big)\, \mathrm{vec}\big(b_i b_i^{\top} - \Psi\big). \qquad (C.3)$$

Let D_q be the elimination matrix (Lavielle, 2014) that transforms the vectorized Ψ (written as vec(Ψ)) into its half-vectorized form vech(Ψ), such that D_q vec(Ψ) = vech(Ψ). Using the fact that, for each j = 1, ..., ½q(q + 1), the vector (vec(∂Ψ/∂α_j))ᵀ corresponds to the jth row of the elimination matrix D_q, we can generalize the derivative in (C.3) to the full vector of parameters α as
$$\frac{\partial \ell_{ci}(\theta)}{\partial \alpha} = \frac{1}{2}\, D_q \big(\Psi^{-1} \otimes \Psi^{-1}\big)\, \mathrm{vec}\big(b_i b_i^{\top} - \Psi\big).$$

Finally, at each iteration we can compute the empirical information matrix (24) by approximating the score of the observed log-likelihood by the stochastic approximation given in (25).
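The vectorization identity invoked above is easy to verify numerically in base R (a sanity check, not part of the algorithm):

set.seed(1)
q <- 3
A <- matrix(rnorm(q^2), q); B <- matrix(rnorm(q^2), q)
C <- matrix(rnorm(q^2), q); D <- matrix(rnorm(q^2), q)
lhs <- sum(diag(A %*% B %*% C %*% D))      # tr{ABCD}
rhs <- c(t(A)) %*% (t(D) %x% B) %*% c(C)   # c(M) = vec(M); %x% = Kronecker product
all.equal(lhs, as.numeric(rhs))            # TRUE up to rounding error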


Appendix D

Figures

Figure 11. Graphical summary of the convergence of the fixed effect estimates, the variance components of the random effects, and the nuisance parameter when performing a median regression for the Soybean data. The vertical dashed line marks the beginning of the almost-sure convergence phase, as defined by the cut-point parameter c = 0.25. [Panels: β1, ..., β4, σ, ψ1, ..., ψ6 versus iteration.]


Figure 12. Graphical summary of the convergence of the fixed effect estimates, the variance components of the random effects, and the nuisance parameter when performing a median regression for the HIV data. The vertical dashed line marks the beginning of the almost-sure convergence phase, as defined by the cut-point parameter c = 0.25. [Panels: β1, ..., β5, σ, ψ1, ..., ψ10 versus iteration.]


Appendix E

Output from the R package qrNLMM

---------------------------------------------------
Quantile Regression for Nonlinear Mixed Model
---------------------------------------------------
Quantile = 0.5
Subjects = 48 ; Observations = 412

- Nonlinear function
function(x, fixed, random, covar = NA){
  resp = (fixed[1] + random[1])/(1 + exp(((fixed[2] + random[2]) - x)/(fixed[3] + random[3])))
  return(resp)
}

-----------
Estimates
-----------
- Fixed effects
         Estimate  Std. Error    z value  Pr(>|z|)
beta 1   18.80029     0.53098   35.40704         0
beta 2   54.47930     0.29571  184.23015         0
beta 3    8.25797     0.09198   89.78489         0

sigma = 0.31569

Random effects Variance-Covariance Matrix
         b1        b2       b3
b1  24.36687  12.27297  3.24721
b2  12.27297  15.15890  3.09129
b3   3.24721   3.09129  0.67193

------------------------
Model selection criteria
------------------------
         Loglik       AIC       BIC        HQ
Value  -622.899  1265.798  1306.008  1281.703

-------
Details
-------
Convergence reached? = FALSE
Iterations = 300 / 300
Criteria = 0.00058
MC sample = 20
Cut point = 0.25
Processing time = 22.83885 mins
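For reference, a hedged sketch of a call that could produce output of this form; the function name QRNLMM and its arguments follow the CRAN package documentation as we understand it and should be checked against ?QRNLMM in the installed version.

# install.packages("qrNLMM")
library(qrNLMM)
data(Soybean, package = "nlme")   # columns Plot, Variety, Year, Time, weight

# Nonlinear mean in terms of x, fixed, random (and optionally covar);
# the package accepts it as an expression or character string.
nl <- expression((fixed[1] + random[1]) /
        (1 + exp(((fixed[2] + random[2]) - x) / (fixed[3] + random[3]))))

fit <- QRNLMM(y = Soybean$weight, x = Soybean$Time, groups = Soybean$Plot,
              initial = c(20, 55, 8),     # rough starting values for beta_p
              exprNL = nl, p = 0.5)       # median regression, as in the output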



References

ALLASSONNIERE, S.; KUHN, E. AND TROUVE, A. (2010). "Construction of Bayesian deformable models via a stochastic approximation algorithm: a convergence study". Bernoulli. 16(3): 641–678.

BARNDORFF-NIELSEN, O. AND SHEPHARD, N. (2001). "Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics". Journal of the Royal Statistical Society: Series B (Statistical Methodology). 63(2): 167–241.

BATES, D. AND WATTS, D. (1981). "A relative offset orthogonality convergence criterion for nonlinear least squares". Technometrics. 23(2): 179–183.

BOOTH, J. AND HOBERT, J. (1999). "Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm". Journal of the Royal Statistical Society: Series B (Statistical Methodology). 61(1): 265–285.

DAVIDIAN, M. AND GILTINAN, D. (1995). "Nonlinear Models for Repeated Measurement Data". CRC Press. Volume 62.

DAVIDIAN, M. AND GILTINAN, D. (2003). "Nonlinear models for repeated measurement data: an overview and update". Journal of Agricultural, Biological and Environmental Statistics. 8(4): 387–419.

DELYON, B.; LAVIELLE, M. AND MOULINES, E. (1999). "Convergence of a stochastic approximation version of the EM algorithm". Annals of Statistics. 27(1): 94–128.

DEMPSTER, A.; LAIRD, N. AND RUBIN, D. (1977). "Maximum likelihood from incomplete data via the EM algorithm". Journal of the Royal Statistical Society, Series B. 39: 1–38.

FU, L. AND WANG, Y. (2012). "Quantile regression for longitudinal data with a working correlation model". Computational Statistics & Data Analysis. 56(8): 2526–2538.

GALVAO, A. F. AND MONTES-ROJAS, G. (2010). "Penalized quantile regression for dynamic panel data". Journal of Statistical Planning and Inference. 140(11): 3476–3497.

GALVAO, A. (2011). "Quantile regression for dynamic panel data with fixed effects". Journal of Econometrics. 164(1): 142–157.

GERACI, M. AND BOTTAI, M. (2007). "Quantile regression for longitudinal data using the asymmetric Laplace distribution". Biostatistics. 8(1): 140–154.

GERACI, M. AND BOTTAI, M. (2014). "Linear quantile mixed models". Statistics and Computing. 24(3): 461–479.

GROSSMAN, Z.; POLIS, M.; FEINBERG, M.; GROSSMAN, Z.; LEVI, I.; JANKELEVICH, S.; YARCHOAN, R.; BOON, J.; DE WOLF, F.; LANGE, J. AND OTHERS (1999). "Ongoing HIV dissemination during HAART". Nature Medicine. 5(10): 1099–1104.

HASTINGS, W. (1970). "Monte Carlo sampling methods using Markov chains and their applications". Biometrika. 57(1): 97–109.

HUANG, Y. AND DAGNE, G. (2011). "A Bayesian approach to joint mixed-effects models with a skew-normal distribution and measurement errors in covariates". Biometrics. 67(1): 260–269.

KOENKER, R. (2004). "Quantile regression for longitudinal data". Journal of Multivariate Analysis. 91(1): 74–89.

KOENKER, R. (2005). "Quantile Regression". Cambridge University Press, New York, NY.

KOTZ, S.; KOZUBOWSKI, T. AND PODGORSKI, K. (2001). "The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering and Finance". Birkhäuser.

KOZUBOWSKI, T. AND PODGORSKI, K. (2000). "A multivariate and asymmetric generalization of the Laplace distribution". Computational Statistics. 15(4): 531–540.

KUHN, E. AND LAVIELLE, M. (2004). "Coupling a stochastic approximation version of EM with an MCMC procedure". ESAIM: Probability and Statistics. 8: 115–131.

KUHN, E. AND LAVIELLE, M. (2005). "Maximum likelihood estimation in nonlinear mixed effects models". Computational Statistics & Data Analysis. 49(4): 1020–1038.

LACHOS, V.; GHOSH, P. AND ARELLANO-VALLE, R. (2010). "Likelihood based inference for skew-normal independent linear mixed models". Statistica Sinica. 20(1): 303–322.

LACHOS, V.; CASTRO, L. AND DEY, D. (2013). "Bayesian inference in nonlinear mixed-effects models using normal independent distributions". Computational Statistics & Data Analysis. 64: 237–252.

LAVIELLE, M. (2014). "Mixed Effects Models for the Population Approach". Chapman and Hall/CRC, Boca Raton, FL.

LIPSITZ, S.; FITZMAURICE, G.; MOLENBERGHS, G. AND ZHAO, L. (1997). "Quantile regression methods for longitudinal data with dropouts: application to CD4 cell counts of patients infected with the human immunodeficiency virus". Journal of the Royal Statistical Society: Series C (Applied Statistics). 46(4): 463–476.

LOUIS, T. (1982). "Finding the observed information matrix when using the EM algorithm". Journal of the Royal Statistical Society, Series B (Methodological). 44(2): 226–233.

MEILIJSON, I. (1989). "A fast improvement to the EM algorithm on its own terms". Journal of the Royal Statistical Society, Series B (Methodological). 51(1): 127–138.

METROPOLIS, N.; ROSENBLUTH, A.; ROSENBLUTH, M.; TELLER, A. AND TELLER, E. (1953). "Equation of state calculations by fast computing machines". Journal of Chemical Physics. 21: 1087–1092.

MEZA, C.; OSORIO, F. AND DE LA CRUZ, R. (2012). "Estimation in nonlinear mixed-effects models using heavy-tailed distributions". Statistics and Computing. 22: 121–139.

PERELSON, A.; ESSUNGER, P.; CAO, Y.; VESANEN, M.; HURLEY, A.; SAKSELA, K.; MARKOWITZ, M. AND HO, D. (1997). "Decay characteristics of HIV-1-infected compartments during combination therapy". Nature. 387(6629): 188–191.

PINHEIRO, J.C. AND BATES, D.M. (1995). "Approximations to the log-likelihood function in the nonlinear mixed-effects model". Journal of Computational and Graphical Statistics. 4(1): 12–35.

PINHEIRO, J. AND BATES, D. (2000). "Mixed-Effects Models in S and S-PLUS". Springer, New York, NY.

SEARLE, S.; CASELLA, G. AND MCCULLOCH, C. (1992). "Variance Components". John Wiley & Sons. Vol. 391.

VAIDA, F. (2005). "Parameter convergence for EM and MM algorithms". Statistica Sinica. 15(3): 831–840.

WANG, J. (2012). "Bayesian quantile regression for parametric nonlinear mixed effects models". Statistical Methods and Applications. 21: 279–295.

WEI, G. AND TANNER, M. (1990). "A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms". Journal of the American Statistical Association. 85(411): 699–704.

WU, C. (1983). "On the convergence properties of the EM algorithm". The Annals of Statistics. 11(1): 95–103.

WU, L. (2002). "A joint model for nonlinear mixed-effects models with censoring and covariates measured with error, with application to AIDS studies". Journal of the American Statistical Association. 97(460): 955–964.

WU, L. (2010). "Mixed Effects Models for Complex Data". Chapman & Hall/CRC, Boca Raton, FL.

YU, K. AND MOYEED, R. (2001). "Bayesian quantile regression". Statistics & Probability Letters. 54(4): 437–447.

YU, K. AND ZHANG, J. (2005). "A three-parameter asymmetric Laplace distribution and its extension". Communications in Statistics - Theory and Methods. 34(9-10): 1867–1879.

YUAN, Y. AND YIN, G. (2010). "Bayesian quantile regression for longitudinal studies with nonignorable missing data". Biometrics. 66(1): 105–114.

Received August 2015
Revised December 2015

Winning paper of the IASI Award for Excellence, extraordinary contest for the 75th anniversary of IASI.


ESTADÍSTICA (2015), 67, 188 y 189, pp. 75-88 © Instituto Interamericano de Estadística

DEMOCRACY AND STATISTICS IN MEXICO

EDMUNDO F. BERUMEN TORRES
General Director, Berumen y Asociados
dirección@berumen.com.mx, Tel. (52-55) 5093 8600

ABSTRACT

This paper gives a brief account of the growing interrelation that, in recent years and presidential elections, has been established between the electoral processes and procedures laid down by the laws, regulations and institutions that govern Mexico's democracy and the increasing use of statistical techniques and methods, which have supported the cleaning of the basic infrastructure on which every election rests (the voter registries and eligible-voter lists), as well as the statistical techniques used to produce early estimates of election results through different types of surveys and quick counts.

Keywords
Voter Registry, Eligible Voters, Surveys, Poll Estimates, Official Results

The Editorial Office thanks Lic. Delia Keller for her collaboration in editing this article.



Background

The history of countries that define themselves as democratic recounts the events that allowed them to evolve until democracy was attained, but we could say that no two histories are alike: each country evolved in its own way. There is, however, a common denominator: the achievement has involved an electoral process that is used periodically to elect the authorities of the executive and legislative branches. It is precisely these processes that join democracy and statistics. This paper deals with some events that took place in Mexico during the last two decades along the long road that Democracy and Statistics have travelled hand in hand, "Mexican style".

Arithmetic. One does not arrive at democracy, or at statistics, without first passing through arithmetic and, even earlier, through the simple activity of enumerating things and cases. These are the essence of the electoral instruments for counting the votes of citizens entitled to vote; the valid options that can be voted for; the sums that accumulate into a result; and the official certification of the final result once challenges and objections have been resolved.

Voter registries and eligible-voter lists. In Mexico, over several 6-year periods (the term served in the Executive by the candidate who wins the presidential election), political discussion was single-themed: according to the opposition parties, the greatest villain lay in the quality of the Voter Registry (Padrón Electoral, PE) and the subsequent eligible-voter list (Lista Nominal, LN). In them one could find deceased persons never removed, migrants outside the country (without the right to vote at the time) and clones of voters who guaranteed votes for the ruling party. There can be differences between the number of citizens in the registry and those included in the eligible-voter lists of a particular election, since the former completed the registration procedure while only the latter subsequently took the trouble to pick up their voter credential once it was processed; hence the PE figures tend to be larger than those of the LN. To put an end to those discussions, in the mid-1990s more than 90 "technical audits" of the PE were carried out at the national, regional and local levels, each of which provided clues for cleaning the PE and the LN. As a result, the 1994 elections (President, senators and local deputies) were held with the most thoroughly cleaned LN to date (and perhaps since, despite new technical audits to update it). Moreover, an important achievement that endures to this day is that the vote-rigging tricks attributed to the quality of the PE and LN ceased to be the center of political discussion.



It should be emphasized that statistics played the leading role in these exercises. Strictly probabilistic samples were drawn from the PE and LN and the sampled citizens were sought at their registered addresses, and vice versa: strictly probabilistic samples of persons aged 18 or over, registered or not according to their own statement, were selected and then looked up in the PE and LN. In that process the Federal Electoral Institute (then IFE, now INE: Instituto Nacional Electoral), the constitutional body responsible for organizing Mexico's federal elections, hired three companies, for the first time, to carry out quick-count (conteo rápido, CR) exercises on August 21, 1994. The counts of the three companies were executed on the same methodological basis, that is, using a strictly probabilistic national sample of the same size: 500 electoral sections (SE), 100 per circumscription. All were selected under the same probabilistic design: stratified by circumscription and, within each, according to whether the SE was classified as urban, rural or mixed, with allocation proportional to the number of citizens registered in the LN and with equal probability within each stratum. This methodology was adopted so that the various actors interested in the outcome of the process would have a reference against which to compare the many other "technical audits" that might be carried out. The statistical process of the three companies was identical: (a) field collection of the results recorded in the tally sheets of each polling station in each sampled SE; (b) transcription onto a field form; (c) transmission to the capture center; and (d) validation of each company's data, with security protocols to minimize the risk of intruders intending to plant false data. It is worth stressing that in this exercise no informant of any kind is interviewed; it is restricted to transcribing the data recorded in the tally sheets of each polling station in the sampled SEs, the same data displayed on the posters expressly designed for that purpose, which are made public by posting them outside each polling station once the tally sheets for each election have been completed. It was expected that the convergence of the three companies' results would contribute to public confidence in the estimate. And so it happened.
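By way of illustration only, with an entirely hypothetical sampling frame, the design just described (stratification by circumscription and by urban/rural/mixed type, allocation proportional to registered voters, equal probability within stratum) can be sketched in R as:

set.seed(1994)
frame <- data.frame(                 # hypothetical frame of electoral sections
  circ = sample(1:5, 60000, replace = TRUE),
  tipo = sample(c("urbana", "rural", "mixta"), 60000, replace = TRUE),
  ln   = rpois(60000, 1200)          # registered voters (LN) per section
)

muestra <- do.call(rbind, lapply(split(frame, frame$circ), function(fc) {
  tot <- tapply(fc$ln, fc$tipo, sum)
  n_h <- round(100 * tot / sum(tot)) # 100 sections per circumscription,
                                     # allocated proportionally to the LN
  do.call(rbind, lapply(names(n_h), function(h) {
    sh <- fc[fc$tipo == h, ]
    sh[sample(nrow(sh), n_h[h]), ]   # equal-probability SRS within stratum
  }))
}))
nrow(muestra)                        # approximately 500 sections in total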



Although those estimates were first known to the IFE Council, the public release of results was initiated by third parties, whose estimates were consistent with the data of the contracted companies. That is, the Council let the "outsiders" take the spotlight and then, when it deemed it prudent, "called" the estimates of the companies hired by the IFE itself, and then confirmed that the district counts would begin the following Wednesday. It was an election with a smooth night and a smooth dawn, which gave citizens confidence and reassurance that their vote had been counted and had counted in the result of the election. In addition, some exit polls (encuestas de salida, ES) were released; being the only surveys that interview "voters" right after they have cast their ballot, they are considered the instrument whose design makes it possible to know the profile of the voters in an election. Surveys conducted before election day interview citizens holding a valid voter credential, who may or may not become "voters" on election day; this can be a procedure for estimating the possible result of the election. The estimates arising from ES are available as soon as the last polling stations close, and are therefore more timely than the estimates coming from the quick counts (CR), though not necessarily more precise. In 1994 the ES that were released did not produce results divergent from the later estimates of the various CRs. As recorded in the IFE's historical annals, it was precisely in 1994 that the approved electoral reform instituted the figure of "Citizen Councillors", public figures proposed by the party factions in the Chamber of Deputies and elected by the vote of two thirds of its members, regardless of their profession or degree. The political parties, for their part, kept a representative with voice but without vote in the decisions of the General Council. That year the IFE General Council was organized as follows: a President of the General Council (the Secretary of the Interior), six citizen councillors, four councillors from the legislative branch, and representatives of the registered political parties. The year 1994 was also the IFE's first presidential election, in which the Program of Preliminary Electoral Results (PREP) was established for the first time, implemented by the IFE's Directorate General with the specific purpose of capturing the results of as many polling stations as possible, at the pace at which they arrived at the offices of the corresponding District Councils. The PREP was based on the results recorded on the first copy of the tally sheet of each polling station, prepared by the polling station officials in the presence of the representatives of the political parties. The copy of the tally sheet was placed separately in an envelope known as the "PREP envelope", which the president of each polling station's board delivered to the District Council.



The PREP's general coordination designed a transmission network with 300 Data Collection and Transmission Centers (CEDAT), which were installed in each electoral district. In these centers the data were transmitted by telephone. Two National Centers for the Reception of Preliminary Electoral Results (CENARREP) were installed, one primary and one backup. The information was conveyed to the Institute's General Council in various formats, such as computer terminals, television screens, magnetic media and print. The Program closed its operations after four days (96 hours), having tallied approximately 92.27% of the polling stations. It should be made clear that the PREP, unlike the ES and CR, is not an exercise in estimating the final result; it simply reports the accumulated sum of votes as the data from each polling station that operated during the election are transmitted. Prior to all of the above, opinion surveys on voting intentions for the next presidential election were already being conducted well before the August 1994 election, but their use and knowledge were restricted to an elite of top-level officials and politicians, as well as some members of the business leadership. Citizens were not actors who deserved to know, much less to have an opinion. It is interesting to read some of the accounts of how citizens gained access to the information and how the results of electoral surveys were forced out of the closet. Miguel Basañez wrote: "… There were two elements that explained my enthusiasm and participation in the project. First, finding an excellent journal specialized in surveys, which happened at the WAPOR (World Association for Public Opinion Research) meeting in Toronto, in May 1988. From that moment on I dreamed of the possibility that one day Public Opinion would be published in Mexico. That was my inspiration and, in fact, my initial proposal to the founding group of Este País. The second element was the success of the survey for the 1988 election which, through Federico Reyes Heroles, La Jornada commissioned from me. It opened up the possibility, which several of us cherished, of contributing to the democratization of the country via surveys. They would become numerical hammer blows to crack open the authoritarian shell. Poison darts for the old dinosaur." Shortly afterwards the magazine Este País was born with precisely that purpose; an excellent publication that continues to champion the subject, now enriched with many others, including prospective exercises, thematic essays of national relevance, a variety of indicators, and space for disseminating culture.



Once the closet had been opened, with the IFE's actions to clean the Voter Registry (PE) and the eligible-voter list (LN) and the good performance of the quick counts (CR) hired by the IFE for the 1994 election, electoral surveys never again went back to being for elite consumption only. Citizens took ownership of the subject.

The boom

One characteristic of our democracy is that very few citizens enjoy taking part in the government actions and programs that contribute to the general welfare and harm no one. To mention a trivial, though not for that reason irrelevant, example: not littering, and picking up other people's litter to put it where it belongs, let alone sorting it. What fascinates us all, perhaps after a centuries-long fast, is the "voting game". No sooner does the newly elected President take office and the members of the cabinet become known than 87 million pairs of eyes scrutinize faces and names to begin the game of guessing which of them will be the next President. Is he not on the podium, about to enter in the first reshuffle; might it be one of the Governors invited to the ceremony? Before the first year of government is over, the surveys begin, probing for political actors eager to be "the next anointed one", who burn out at the very start; discreet ones in whom hidden virtues are discovered in the first couple of years; signs of special cordiality from the President toward this or that person, and so on. Then begin the series on whom the citizenry views favorably and whom unfavorably, the eagerly awaited cabinet "ranking" and, of course, not to be left out, that of the Governors. Colorful charts in different media periodically report on all this, the flowery pens of various analysts align themselves with one side or another, and the poverty and hunger of millions are forgotten while the horse races, whether of spirited or skinny horses, keep us entertained. A pollster who has no sponsoring media outlet, or Minister who fancies his chances, or National Executive Committee of some party, or the Office of the Presidency, or Governor with resources, or businessman with interests, or power circle, or the Ministry of the Interior, etc., is almost a pariah in his guild.



How the President and his Government are doing, likewise measured and displayed by multiple surveys, are collateral anecdotes that are immediately correlated with whom they favor, diminish or outright destroy in their presidential aspirations.

Methods

As in the apothecary shops of old (today's pharmacies), there is plenty to choose from. And it is good that there is, for almost from its origin it has been recognized that the most interesting surveys, those that measure facts, opinions and perceptions of any society, are as much a science as an art. Thus we come across quota samplings conducted at high-traffic locations; others at dwellings, with various substitution procedures to fill the quotas; others through telephone interviews of dwellings with a residential landline; others through online surveys; and others based on strictly probabilistic samples without substitution at any of their selection stages. That is as far as data collection goes; the same happens with how the data are processed to arrive at final results. Again there is everything, from those who ignore the design the data come from to those who use every incidence of it (in sample selection, in the field, in data from external sources, etc.) to derive the weights arising from the sample design, adjustments of various kinds according to the field incidences, and sources external to the survey where justified. There are clients for every version, from the cheapest and promptest to the most expensive and orthodox but less timely. As the dates approach for registering candidates for a new election, for pre-campaigns and campaigns, and once each party's candidates have been named, the strictly probabilistic designs become the preferred ones.

Intrinsic difficulty of surveys prior to election day. The estimates from pre-election surveys, like those conducted on election day itself, are by their origin subject to uncertainty vis-à-vis the official vote counts, which are subject to precise rules. They are science and art versus future factual events (near or far). The populations of the one and the other are different.

Surveys prior to the election:

LN = Voters + Non-voters, indistinguishable in the surveys.   (1)

Exit polls and quick counts:

Voters = Valid votes + Annulled votes, well defined in the counts.   (2)

Program of Timely Preliminary Results (PREP), not a survey:

LN = Valid votes + Annulled votes + Non-voters.   (3)

When strictly probabilistic samples are drawn of citizens holding a valid voter credential, registered at least in the municipality where their sampled dwelling is located (de facto an area sampling frame for the LN), the challenge, by no means easy to overcome, is to distinguish which of the interviewees will become "Voters" on election day, which will cast "Valid votes", which "Annulled" votes, and who will not turn out to vote, the "Non-voters", despite being on the LN. To complete the picture, there are those who, at the voting-intention question, decide not to answer it; how should they be treated? And the challenge does not end there, for the pollster must ask the same questions about the members of the target population who were selected by the strictly probabilistic design and ended up in one of the many variants of total non-response to the survey. Unfortunately, the widespread use of non-probabilistic quota surveys hides and ignores this growing problem. Then comes the challenge of communicating to the client the resulting estimates, their limitations and their virtues. It is proven that we pollsters are not good communicators; and for those who are, their clients, the owners of the information, take it upon themselves to mis-disseminate the few results they select according to strategy or whim. As long as they do not distort, or outright lie about, what is disseminated, that is entirely their right, they are the owners; otherwise we pollsters have the right and the obligation to come out, almost in real time, to point out the blunder or the crude deception; examples abound. Among the common communication errors is one of the favorite labels: "the undecided". Who are they? Those who were interviewed and refused to answer the voting-intention question, learned analysts proclaim. False: perhaps the majority within this group decided long ago whom to vote for and simply choose not to share that decision, for whatever reason. Only a panel-type survey, which interviews the same sample periodically, can approximate an answer by contrasting what the informant answers in one measurement versus another.



Those who change frequently may be the undecided, though again we do not know whether they will later become voters and cast a valid vote. So many obstacles and problems to overcome: impossible, into the trash with electoral surveys! Well, no. It is a virtue, not a deficiency, to acknowledge that our activity measures with uncertainty, and doing so on the basis of strictly probabilistic samples has the additional virtue of allowing us to measure that uncertainty from the sample data at hand, for each of the priority estimates, at whatever confidence level is desired; and that is a good thing. One task, among many, is to learn to communicate better that pre-campaign and campaign surveys are estimation exercises (which interview electors, not voters) totally different from exit polls (the only ones that interview voters), from quick counts (which interview no one), and from the PREPs, which interview no one and are not estimation exercises, but simply accumulate and sum data until it is decided to close them and await the official result of the district counts, a result that the presiding councillor of the IFE is obliged to "call", without adjectives or judgments of any kind, and then wait for the Federal Electoral Tribunal (TRIFE) to rule on the official result of the election once any challenges lodged have been resolved. Of course, as in any good salad, appetizing dressings appear, contributed by professionals in qualitative research techniques, "which give the broth its flavor". But that is a topic for another article.

Timing

The years of the six-year term pass nonchalantly, with sporadic jolts of spurious enthusiasm or real disappointment, until the calendar marks approaching fateful dates that revitalize the proliferation of haphazard (not probabilistic) samplings and surveys of every kind: first to "sound out" and explore the possibilities of aspirants and potential candidates, presenting different scenarios of the type "if XXX were the candidate of PPP for the upcoming election of …, NNN that of BBB and RRR that of ZZZ, which of them would you vote for", with every imaginable variant; and then, once each party's candidates have been named, to measure the preferences of citizens holding a valid Federal Electoral Registry (RFE) credential if the election were held on the day they are interviewed. Multiple and varied versions of "horse races" are disseminated, some with the intention of influencing voting on election day (to date there is no evidence that this happens); other results are not disseminated, since they are for the internal consumption of the strategists of the various candidate-party teams, who use them to propose adjustments to messages, speeches, image, advertising, slogans, and so on.



Anecdotes abound of results that are "suspicious" for their great similarity, to the point of being almost identical, a statistical rarity, or that diverge as to who would be winning on the date of the survey, with every intermediate variant (trends of series that cross each other over time, sometimes exactly on election day, a favorite), all of which are inputs eagerly awaited by specialized analysts and columnists of one camp or the other for endless speculation. Naturally, the goddess of fertility is awakened to immediately give birth to all kinds of acronyms of supposed polling firms never before known for any research on this or any other topic and yet, oh wonder, with abundant resources to pay for and publish their results on full pages (sometimes two) of national newspapers. Once the electoral process in question is over, an unknown deadly virus suddenly attacks them all and they die on the spot, with no one coming to offer condolences to the survivors (who are, incidentally, very hard to locate). The goddess goes back to sleep, only to emerge from her lethargy with Swiss-watch precision at each new electoral period and again give birth to similar creatures with the same fate, which nevertheless do great damage to the guild of pollsters with long and well-known reputations for professionalism. Interwoven with the above, the IFE carries out some survey-based studies that allow a light touch-up of the PE and LN, cleaning up the most serious and conspicuous of the outdated records that arise naturally. It should be said in passing that such outdated records are due to the irresponsibility of citizens who fail to register with the IFE changes in their situation (for example, a change of address) and in their identification data. Some errors detected in these exercises are not classified as serious, in the sense that they do not prevent the citizen from exercising the right to vote on election day; illustrative examples are: changes of address within the same SE, since the person still votes at the same place; a misrecorded age, since the same error appears in the LN against which the credential data are checked; and even a misrecorded sex, which is likewise replicated in the credential and the LN. Others, however, are serious and are an obstacle to the right to vote, for example: a change of address to one outside the SE of origin which, if far away, forces the person to travel back to the "old" polling station where their name is registered in the LN in order to vote.

Election day

Inexorable time leads us inevitably to the date of each election; the gestation period ends and gives birth to two statistical instruments and one arithmetic one whose best destiny is to be born and die on the same day, after successfully fulfilling their raison d'être: the exit polls (ES), the quick counts (CR) and the program of preliminary electoral results (PREP), already discussed above.



Stumbles

After the history of successes during and after the 1994 presidential elections, and despite a political earthquake in the 2000 presidential elections that put our democracy to the test, some surveys during the campaign, a minority and erratically, gave estimates in which the party that had held power for more than seven decades did not come out the winner. "Statistical aberrations" drained liters of ink and fueled discussion panels, until election day arrived and the exit polls anticipated possible changes in the mood of the citizens, reflected in a differentiated vote that gave the Executive to an opposition party, the Partido Acción Nacional (PAN), but not control of Congress, where no party obtained an absolute majority. Consequence: the credibility of pre-election surveys suffered. Not so the statistical CR exercises, for which the IFE again hired three companies; once the presiding councillor (José Woldenberg) publicly "called" what each company reported as its estimate, seconds later, on national television, the sitting President (of the party in power for more than seven decades) announced that, according to those estimates, the opposition had for the first time won the presidential election. Days later (riots, challenges and sit-ins in between), the TRIFE ratified the official result in favor of the opposition.

The 2006 presidential election. Against the background of the 2000 election, the ill-named "war of the polls" intensified and the pollsters' guild suffered further wear; in passing, the ES, the CR and even the PREP itself were tarnished, since the closeness of the result did not allow the probable winner to be identified, and this caused collateral damage, with immediate high-risk actions (massive and frequent demonstrations, closures of main avenues, legal challenges, demands for recounts "vote by vote, polling station by polling station", etc.) and latent aftereffects of no great consequence, which surface from time to time even today and will do so for some of the immediate future. But that is another juicy story for another occasion.

The 2012 presidential election. The war of the polls intensified; a national journalist-newspaper duo, like a fighting cock, crowed daily the measurement of the polling firm hired for the purpose and challenged everyone to settle scores on election day. The horse race showed the early leader ahead from the start, gaining points until the lead exceeded a full length, reaching double digits close to 20 percentage points. A couple of measurements dared to publish that no, the lead was only single-digit, with time for it to narrow even further. The mocking epithet "aberrant" surfaced again; perhaps not even the runner-up believed them, since he made no visible adjustment to close the gap… and election day arrived. The aberrant ones turned out to be all the rest: the front-runner won by a comfortable margin, but a single-digit one. It was the end of the world for the polling houses, all of them questioned, a discredit affecting the whole guild, even those who do not engage in this type of measurement. The TRIFE confirmed the official result, with a lead of the order of magnitude given by the two "aberrant" pollsters. The crowing journalist-newspaper did not stop writing his daily weekday column, though he did fire the pollster.



de más de un cuerpo, llega a ser de dos dígitos cercanos a 20 puntos porcentuales. Un par de mediciones se atreven a publicar que no, que la ventaja es de sólo un dígito y con tiempo para cerrarse aún más. Surge de nuevo el calificativo burlón de “aberrantes”, quizá ni quien va en segundo lugar lo cree pues no se nota que haga algún ajuste para achicar distancia… y llega el día de la elección. Los aberrantes eran el resto, gana el puntero con cómoda distancia, pero de un solo dígito. El acabose para las casas encuestadoras, todas cuestionadas, descrédito que afecta a todo el gremio, aún a quienes no se dedican a este tipo de mediciones. TRIFE confirma resultado oficial con ventaja del orden de magnitud entre los dos aberrantes. Periodista-periódico kikirikero no deja de escribir su columna diaria entre semana, eso sí, corre a encuestador. Culpables. ¡Encuestadoras! Claman los actores y partidos políticos participantes. ¡Encuestadoras! Clama el círculo rojo. ¡Encuestadoras! Claman los medios. ¡Encuestadoras! Clama el resto. Falso digo yo. La realidad reside en la dificultad de distinguir lo expuesto en las tres sencillas expresiones (1), (2) y (3) expuestas al tratar los métodos. Intentos vía la ruta de “votantes probables” hay muchos, pero todos se quedan cortos al no preocuparse por aplicar esfuerzos en la creciente no-respuesta total a la encuesta. Fácil exponerlo, difícil resolverlo, sobre todo porque las aproximaciones con más expectativas de aproximación a algo mejor conducen inevitablemente a encuestas panel, estrictamente probabilísticas, con varias revisitas para ubicar y convencer al seleccionado a responder la encuesta; posibles pero resultan costosas y hasta ahora no hay quien esté dispuesto a pagarlas. Voces calificadas y sensatas como la de José Woldenberg en su nota editorial semanal en el periódico Reforma de fecha 18 de julio de 2013 (sin desperdicio leerla completa), termina afirmando en último párrafo: “A pesar de ello, las encuestas se siguieron realizando en serio y en serie. Pero, dado el escándalo que se produjo en 2012, cuando un puñado de importantes encuestadoras estuvo dando a lo largo del proceso un posible escenario que resultó mucho más estrecho el día de la elección, ahora también han menguado de manera considerable las encuestas que se hacen públicas sobre las intenciones del voto. Total: que el mecanismo que tan buenos resultados dio a lo largo de un periodo, parece que –por miedo–se empieza a desmantelar.” Terrible advertencia de que puede darse un regreso al origen, para nuevamente encerrarlas en el closet al que sólo tienen acceso élites de siempre.


BERUMEN TORRES, EDMUNDO F.: Democracia y estadística en México

87

Epílogo Elecciones Locales del 7 de julio de 2013. Tan reciente la experiencia y continuada permanencia en medios que seguro los detalles siguen en la mente de muchos. Resumo uno que se dio en la elección que más atención concentró, la de Gobernador en Baja California: ansias de novillero en reconocidos toreros de larga y exitosa trayectoria política los llevó, contra consejas del más alto nivel, a salir a declararse ganadores, según encuestas de salida y conteos rápidos por ellos conocidos (nunca nombraron las casas encuestadoras ni los resultados estimados para cada partido-coalición-candidato), para pocas horas después, el mismo día de la jornada electoral recular en público y hacer un llamado a la prudencia para esperar el resultado oficial. La cereza en el pastel fue que el PREP que, como ya dijimos se restringe a recibir, acumular y sumar los resultados conforme los van reportando de los distintos Distritos Electorales, sumaba mal. Vuelta a la aritmética y el simple ejercicio de enumeración. Los errores reconocidos, eran pequeños y no afectaban el resultado que con ellos se especulaba (que no es una estimación estadística), pero bastó para que el Instituto Electoral y de Participación Ciudadana de Baja California (IEPCBC) descalificara el PREP y a la empresa contratada por el propio IEPC-BC. Los novilleros por su lado vociferaron la frase menos afortunada al reclamar un recuento: “voto por voto, casilla por casilla”. Reclamo en total contradicción a la reseña victoriosa que momentos antes dieron citando frases imputadas a Luis Donaldo Colosio cuando reconoció el triunfo del PAN en 1989. Pendientes. Es momento de convencer a los clientes (y algunos colegas) a abandonar el “muestreo de cuotas” dentro de las manzanas de la muestra y continuar con esquemas estrictamente probabilísticos, aunque esto incremente los costos del trabajo de campo de manera significativa, pues no permite ningún esquema de “sustitución”, por sofisticado que sea, ante cualquier tipo de norespuesta; que implica varias visitas en diferentes horas y días a los hogares seleccionados para intentar encontrar y lograr entrevistar al miembro específico que resulte seleccionado mediante un esquema estrictamente probabilístico; y donde la experiencia de cada encuestador le dirá qué cantidad de sobre-muestra deberá seleccionar en el origen para que, al final del trabajo de campo, se cuente con un número “cercano” al deseado de entrevistas completas. Urge un compromiso de transparencia entre medios de comunicación y agencias encuestadoras. Es fundamental que la encuesta que sea pagada y publicada por un medio de comunicación tenga el entero reconocimiento del grupo editorial: la casa encuestadora y el medio deben asumir la responsabilidad de los datos que arrojen


88

ESTADÍSTICA (2015), 67, 188 y 189, pp. 75-88

sus mediciones. Incluye el concertar a-priori el formato y contenido de la difusión y/o publicación de algunos de los resultados y así evitar sorpresas a-posterior. Incluye el examinar si debemos arribar a un convenio-contrato básico que usemos toda la industria (o al menos los agremiados en la Asociación Mexicana de Agencias de Investigación de Mercados y Opinión Pública, AMAI, donde se estipulen cláusulas preventivas de excesos, abusos dolosos o incluso groseras manipulaciones en la difusión (difusión a la que por cierto tienen derecho al ser los dueños de los resultados). Necesitamos acercarnos aún más a los medios, sus conductores y plumas especializadas en el tema, para, de manera conjunta, aprender unos de otros a comunicar mejor todo lo anterior así como los resultados sin demérito de hacerlo en un contexto noticioso. Debemos diversificar los temas que medimos. Durante la elección presidencial no medimos temas específicos relacionados con las propuestas de los candidatos. Simplemente nos enfocamos a medir la carrera de caballos para saber quien encabezaba las preferencias electorales, pero dejamos a un lado lo que los mexicanos pensaban sobre temas fundamentales coyunturales o estructurales, ejemplos abundan.

Artículo Invitado Recibido julio 2013 Revisado diciembre 2015


ESTADÍSTICA (2015), 67, 188 y 189, pp. 89-115 © Instituto Interamericano de Estadística

INFERENCE WITH MISSING DATA USING LATENT GROWTH CURVES: AN APPLICATION TO REAL DATA DELFINO VARGAS-CHANES Programa Universitario de Estudios del Desarrollo. Universidad Nacional Autónoma de México, UNAM. Edificio de la Unidad de Posgrado, 2º Piso, cubículo 2. Costado sur de la Torre II de Humanidades, Ciudad Universitaria, Ciudad de México, CP. 04510México City dvchanes@unam.mx FREDERICK O LORENZ Dept of Statistics and Psychology at Iowa State University, Ames IA, 50010 folorenz@iastate.edu

ABSTRACT In this paper we investigate the efficiency of three data imputation methods – Expectation Maximization (EM), full information maximum likelihood (FIML), multiple imputations (MI) –as they apply to three patterns of missing data: missing completely at random (MCAR), missing at random (MAR), and nonignorable (NI) missing. The results showed that, compared to the population model, estimates obtained using the EM algorithm, FIML, and MI were relatively unbiased and had small standard errors when the data were MCAR or MAR. Both FIML and Multiple Imputation yielded the small bias. All three imputation methods (EM, FIML, and MI) yielded larger biases under NI conditions but still did better than the listwise deletion. The findings indicate that imputation methods are recommended over listwise deletion whenever the amount of missing cases exceeds 10 percent. Key Words Latent Growth Curves, multiple imputation, missing data


90

ESTADÍSTICA (2015), 67, 188 y 189, pp. 89-115

RESUMEN En este trabajo investigamos la eficacia de tres métodos de imputación de datos – Esperanza Maximización (EM), máxima verosimilitud con información completa (FIML) e imputaciones múltiples (MI)-que se aplican a tres patrones de datos faltantes: completamente al azar (MCAR), al azar (MAR), y no ignorable (NI). Al comparar cada uno de estos patrones con el modelo poblacional, los resultados mostraron que las estimaciones obtenidas utilizando el algoritmo EM, FIML y MI fueron relativamente insesgadas y con errores estándar pequeños bajo MCAR o MAR. Además, tanto el método de estimación FIML y MI obtuvieron un sesgo pequeño comparado con los parámetros poblacionales. Por otro lado, los tres métodos de imputación (EM, FIML y MI) produjeron sesgos considerables cuando los datos faltantes son NI, pero aun así la estimación fue mejor que cuando se eliminan los datos faltantes. Los resultados indican que los métodos de imputación se recomiendan siempre que el porcentaje de datos faltantes supera el 10%. Palabras Clave Curvas latentes de crecimiento, imputación múltiple, datos faltantes Introduction Researchers in many disciplines (e.g., sociology, psychology, and epidemiology) face the inevitable issue of nonresponse bias whenever respondents miss items or drop out of a study. On some occasions, missing data can occur by design according to a randomization plan under the researcher’s control. In many instances, however, missing data arises because respondents ignore sections of a questionnaire or skip specific items for no apparent reason (item-missing values). In panel data studies, some respondents drop out of the panel at certain point, often without providing a reason (panel attrition), or new respondents are added to the study. These sources of missing data may lead to bias or lead to insufficient statistical power. Although new respondents are sometimes added to on-going panels to reduce the problem of declining statistical power, it does not solve the problem of missing data, and we need statistically defensible imputation methods to analyze missing data. There are two potential consequences of missing data. The first is the decrease in precision (parameter estimates have wider confidence intervals) and loss of power caused by the reduction in data available. The second is the potential for bias in the estimation (Bell and Fairclough, 2014). Methods of imputing data have been used with success. For example, multiple imputation has been used in Census Data for recalibrating the categories of


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

91

industries of the 1980 data using previous census information (Treiman, Bielby, and Cheng, 1988) and to estimate income in cross sectional data (Martin, Little, Samuel, and Triest, 1986). Imputation methods have also been used in modeling contexts, such as in estimating structural equation models with incomplete data (McArdle, 1994), in estimating latent growth curves for developmental data (McArdle and Epstein, 1987), and in fitting multilevel models (MuthÊn, Kaplan, and Hollis, 1987).Bell and Fairclough (2014) have used imputation methods for studying patient’s quality of life outcomes in clinical settings. García-Laencina et.al (2015), have used multiple imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values. Young and Jhonson (2014) have used imputation methods using proportional hazard models and compares the model with and without imputations. Despite these applications of imputation methods to missing data, we know little about the efficiency of imputation methods, especially when applied to panel data. The objective of this paper is to assess the efficiency of three imputation methods: Expectation Maximization (EM) algorithm, Full Information Maximum Likelihood (FIML), and Multiple Imputations (MI). These three methods are applied to growth curve models where the population parameters are presumed to be known and where there is relatively low and high rates of missing data. The results are compared with those obtained using listwise deletion Methods of Imputation Basic Concepts The ability to impute missing data depends fundamentally on the extent to which researches know the reason data are missing. Data are said to be ignorably missing if they are missing completely at random (MCAR) or they are missing at random (MAR). In general, missing completely at random (MCAR) implies that there are no particular reasons for data to be missing, so that the mechanism giving rise to the missing is not related either to the observed or missing data (Rubin, 1976). This condition is equivalent to extracting a random subsample from a population in which each observation has an equal probability of being selected. For example, a respondent may simply forgets to answer a question when completing a questionnaire; or simply because there is no data available for an unknown reason. Missing at random is more restrictive. Data are MAR if the probability of recording (or not recording) the response to a particular question is related to (or can be predicted by) other variables in the study. For example, when we record alcohol use among adolescents, and the amount of alcohol they consume is related to their


92

ESTADĂ?STICA (2015), 67, 188 y 189, pp. 89-115

gender, then the gender of the respondent is a known variable that can be used to estimate the respondent’s alcohol use. In other words, the probability that an observation is missing depends on what we know about the respondent (i.e., the mechanism that produces missing values is available to the researcher). In contrast, data are said to be noningnorably missing (NI) when the mechanism generating the missing data depends on the variable itself or other unobserved variables. For example, if adolescents who drink a lot refuse to report heavy drinking —e.g. the probability of the value to be missing increases in the same proportion as the amount of drinking— then the mechanism generating the nonresponse is related to unobserved variables (i.e., the amount of drinking). It is nonignorable missing because the mechanism explaining the incomplete observations depends the amount of the variable itself or the mechanism explaining is not observed and it is inaccessible. This pattern of missingness makes it very difficult to recover missing data because the predictors of the missing values are themselves not accessible to the researcher. Formally speaking suppose we have N persons observed in n occasions and đ?‘›đ?‘— < đ?‘› is the number of values observed for person j-th. Let’s denote đ?‘…đ?‘–đ?‘— = 1if we observe the outcome đ?‘Œđ?‘–đ?‘— for the person j-th measured at time i, and X represent the covariates. The vector of outcomes can be expressed in terms of observed and missing values đ?‘Œ = (đ?‘Œđ?‘œđ?‘?đ?‘ , đ?‘Œđ?‘šđ?‘–đ?‘ đ?‘ ), where đ?‘Œđ?‘œđ?‘?đ?‘ denotes the observed values and đ?‘Œđ?‘šđ?‘–đ?‘ đ?‘ denotes the missing values. Then đ?‘€đ??śđ??´đ?‘…: đ?‘ƒďż˝đ?‘…đ?‘–đ?‘— ďż˝đ?‘Œđ?‘–đ?‘— , đ?‘Œđ?‘–−1đ?‘— , ‌ , đ?‘Œ1đ?‘— , đ?‘‹ďż˝ = đ?‘ƒďż˝đ?‘…đ?‘–đ?‘— ďż˝đ?‘‹ďż˝ = đ?‘ƒ(đ?‘…|đ?‘‹)

đ?‘€đ??´đ?‘…: đ?‘ƒďż˝đ?‘…đ?‘–đ?‘— ďż˝đ?‘Œđ?‘–đ?‘— , đ?‘Œđ?‘–−1đ?‘— , ‌ , đ?‘Œ1đ?‘— , đ?‘‹ďż˝ = đ?‘ƒďż˝đ?‘…đ?‘–đ?‘— ďż˝đ?‘Œđ?‘–−1đ?‘— , ‌ , đ?‘Œ1đ?‘— , đ?‘‹ďż˝ = đ?‘ƒ(đ?‘…|đ?‘Œđ?‘œđ?‘?đ?‘ , đ?‘‹)

đ?‘ đ??ź: đ?‘ƒďż˝đ?‘…đ?‘–đ?‘— ďż˝đ?‘Œđ?‘–đ?‘— , đ?‘Œđ?‘–−1đ?‘— , ‌ , đ?‘Œ1đ?‘— , đ?‘‹ďż˝ = đ?‘ƒďż˝đ?‘…đ?‘–đ?‘— ďż˝đ?‘Œđ?‘–đ?‘— , đ?‘Œđ?‘–−1đ?‘— , ‌ , đ?‘Œ1đ?‘— , đ?‘‹ďż˝ = đ?‘ƒ(đ?‘…|đ?‘Œđ?‘œđ?‘?đ?‘ , đ?‘Œđ?‘šđ?‘–đ?‘ đ?‘ , đ?‘‹).

Imputation Methods Three methods for handling missing data have emerged from contemporary literature as preferred alternatives to listwise deletion. They are the Expectation Maximization (EM) algorithm, Full Information Maximum Likelihood (FIML), and Multiple Imputations (MI). Demptser and his colleagues provided the first step toward the development of defensible imputation methods for data imputation in their seminal work in the 70s (Dempster, Laird, and Rubin, 1977). This method


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

93

provided the theory of Expectation Maximization (EM) and offered a new perspective to maximum likelihood methods for dealing with missing data. The EM-algorithm. The main idea of the EM algorithm consists in distinguishing two steps in a sequence: the E-step and the M-step (Dempster et al., 1977). First, the E-step computes the expected values of the incomplete observations given the observed data and existing parameter estimates. Second, the maximization (M) step replaces missing data with the expected conditional values computed using methods of maximum likelihood. Both E- and M-steps are repeated iteratively until no further changes in estimates can be made (a criterion of convergence is met). Rubin extended this work by proposing a stochastic approach to estimating missing data that included Monte Carlo Markov Chain (MCMC) techniques that promised to improve the efficiency of estimators (Rubin, 1987). This approach is distinguished by generating multiple estimates of the missing cases, and is labeled “multiple imputation� (MI). MCMC methods include several simulation techniques like Gibbs sampling, Metropolis algorithm, data augmentation, and sampling importance resampling (SIR) among others (Rubin, 1987; Schafer and Olsen, 1998; Tanner, 1993). The main difference between MI and EM-algorithm is that, whereas EM-algorithm provides a single data set with imputed data by estimating observations that are missing, MI augments data by simulating a possible set of values providing several data sets with complete information. The MI method. There are three steps in MI: The first is the imputation step, MI simulates data sets where data are missing and generates complete data sets by imputing missing data, a procedure similar to EM-algorithm. The second step use MCMC methods, one of these methods is data augmentation algorithm that in turn has two steps, the imputation step (called I-step) generates initial estimates of missing values given the conditional distribution of the observed values and posterior step (called P-step). The I-step generates initial estimates of missing values given the conditional distribution of the observed values and initial parameter estimates of the distribution. The P-step generates starting values of the parameters give the joint distribution of the observed and the initial imputations of the missing values from the previous step. Together, the I- and P-steps generate a stochastic Markov chain that converges in distribution to a certain value and produces multiple estimates of the missing data. The goal of the MI method is to generate enough complete data sets to capture variability of the simulated parameter estimates (Rubin 1987, Chapter 3). The rationale of MI is that one datum does not represent the original variation of the respondent; meanwhile multiple observations based on simulated data could be more representative of the variability in the possible outcome. The third step is the pooling phase that takes


94

ESTADĂ?STICA (2015), 67, 188 y 189, pp. 89-115

different forms depending on the analytic model of interest. Some rules need to be applied to obtain the parameter estimates of the pooled model to account for the variability of the imputations, as explained in Rubin (1987), and Enders (2010). Figure 1 represents graphically these three steps. Figure 1. Schematic representation of the MI method for longitudinal data

Source: Authors elaboration adapted from Lee and Simpson (2014)

FIML. MuthĂŠn and colleagues suggested a regression approach for data imputation in the context of structural equations (MuthĂŠn et al., 1987). They propose using a regression model to predict missing data from available information. Building from this idea, another method is proposed in the context of structural equations, Full Information Maximum Likelihood (FIML) (Arbuckle, 1996). FIML estimation for data imputation is an approach that first uses maximum likelihood estimation for data subsets with complete values and then generates several covariance matrices with their corresponding likelihood functions. A combined likelihood function that incorporates all possible subsets of likelihood functions based on subsets of complete data is generated. With FIML, there is no actual data points that are imputed. Instead, a maximum likelihood function estimates the parameters with the available data, preserving all cases.


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

95

FIML computes many covariance matrices depending on the number of complete patterns in the data set. Each pattern is complete if it has a subset of variables from the original data set with no missing cases (see Figure 2a). A final maximum likelihood estimation procedure is constructed over all possible covariance matrices and generates a unique set of parameter estimates for the model (see Figure 2b). Figure 2. Full Information Maximum Likelihood

(a)

(b) Source: Authors elaboration based from Arbuckle (1996)


96

ESTADĂ?STICA (2015), 67, 188 y 189, pp. 89-115

The Analytic Model Many researchers in different disciplines use latent growth curves (LGC) to study trajectories of particular outcomes over time. For example, Lin, Hsieh and Chen (2015) and Wimmers and Lee (2015) study trajectories of student’s school performance. In sociology Knight et al. (2009)study two dimensions, acculturation and enculturation processes, among Hispanic adolescents using LGC as a parallel process; Vargas and CortÊs (2014)fit trajectories of marginalization of 2454 municipalities in MÊxico using data from 1990 to 2010. In psychology Yoon, Brown, Bowers, Sharkey and Horn (2015), study the effect of depression using four waves of data collection on a depression scale using LGC with a zero inflated Poisson model. In medicine Fairclough (2002) studies quality of life patient outcomes using longitudinal data. LGC models are framed within the context of structural equations modeling (SEM) techniques of second generation (Muthen, 2002). Researchers applying SEM to panel data have found that growth curves provide a useful way to summarize change across time (Bollen and Curran, 2006). Complete descriptions of this approach have been provided by numerous scholars (Kaplan, 2009; Ragosa, Brandt, and Zimowski, 1982; Willett and Sayer, 1994). LGC can be considered as a special cases of hierarchical linear models, multilevel models, or random coefficients models (McArdle and Epstein, 1987; MuthÊn, 1991; Verbeke and Molenberghs, 2000; Willett and Sayer, 1994). The LGC model estimates a random intercept and slope, which describe the initial value and the rate of change of Y. The LGC model can be stated as

Yi j = π 0 j + π 1 j Νi + ξ ij

Ď€ 0 j = Îą0 + Îś 0 j

(1)

Ď€ 1 j = Îą1 + Îś 1 j Fori=1, 2, 3, 4, measurement occasions and j=1,2,‌,đ?‘›đ?‘– individuals. In this model, đ?‘Œđ?‘–đ?‘— is a response variable at time i for individual j. The random intercept and slope are đ?œ‹0đ?‘— , đ?œ‹1đ?‘— , respectively; đ?œ†đ?‘– denotes time; and đ?œ€đ?‘–đ?‘— is the error for the measurement termiat time i for individual j. The random effects, đ?œ‹0đ?‘— and đ?œ‹1đ?‘— , can be specified with second level equations with their corresponding term for the intercepts, known as means đ?›ź0 and đ?›ź1 , and the error terms ,đ?œ 0đ?‘— and đ?œ 1đ?‘— associated with the intercepts and slopes, respectively. Each one of these terms has associated its respective variance, đ?‘‰đ?‘Žđ?‘&#x;(đ?œ‹0 ) = đ?œ?00 (often called initial variability for the intercept), and


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

97

đ?‘‰đ?‘Žđ?‘&#x;(đ?œ‹1 ) = đ?œ?11 (or individual variability of slopes), and the covariance between intercepts and slopes đ??śđ?‘œđ?‘Ł(đ?œ‹0 , đ?œ‹1 ) = đ?œ?01 . One criterion for assessing the model fit is the đ?œ’ 2 statistic, which measure the extent to which covariances implied by the model match the observed covariances. Additionally, in order to identify the model parameters of equation (1) we need specify the variances of the observed variables as follows đ?‘‰đ?‘Žđ?‘&#x;(đ?‘Œđ?‘– ) = đ?‘‰đ?‘Žđ?‘&#x;(đ?œ‹0 ) + đ?œ†2đ?‘– đ?‘‰đ?‘Žđ?‘&#x;(đ?œ‹1 ) + 2đ?œ†đ?‘– đ??śđ?‘œđ?‘Ł(đ?œ‹0 , đ?œ‹1 ) + đ?‘‰đ?‘Žđ?‘&#x;(đ?œ€đ?‘– ) = đ?œ?00 +

đ?œ†2đ?‘– đ?œ?11

(2)

+ 2đ?œ†đ?‘– đ?œ?01 + đ?œŽđ?œ€

and lagged covariances are

đ??śđ?‘œđ?‘Ł(đ?‘Œđ?‘– , đ?‘Œđ?‘–−đ?‘ ) = đ?‘‰đ?‘Žđ?‘&#x;(đ?œ‹0 ) + đ?œ†đ?‘– đ?œ†đ?‘–−đ?‘ đ?‘‰đ?‘Žđ?‘&#x;(đ?œ‹1 ) + (đ?œ†đ?‘– + đ?œ†đ?‘–−đ?‘ )đ??śđ?‘œđ?‘Ł(đ?œ‹0 , đ?œ‹1 )

(3)

= đ?œ?00 + đ?œ†đ?‘– đ?œ†đ?‘–−đ?‘ đ?œ?11 + (đ?œ†đ?‘– + đ?œ†đ?‘–−đ?‘ )đ?œ?01

Figure 3 is a schematic representation of Equation (1), where the observed variables of each individual are đ?‘Ś1 , đ?‘Ś2 , đ?‘Ś3 andđ?‘Ś4 measured four times. This figure represents a linear model where the intercepts and slopes are latent variables with a random component. The loadings associated to the intercept are đ?œ†11 , đ?œ†21 , đ?œ†31 , đ?œ†41 , and are fixed to the unity, for the initial level (the intercept). The loadings associated with the slope are đ?œ†12 , đ?œ†22 , đ?œ†32 , đ?œ†42 and are fixed to 0, 1, 2, and 3 respectively to fit a linear growth. The error terms đ?œ€1 , đ?œ€2 , đ?œ€3 , đ?œ€4 are fixed to zero. ii This modeling technique is not limited to linear models, but also can include quadratic growth, depending on the data structure on measurement of the latent variable iii (McArdle, 1986). For each one of the scenarios proposed in this investigation we will estimate the corresponding LGC parameters as suggested in expression (1). In order to compare the relative closeness of the parameter estimates with respect to the population parameters we define a percent of bias.


98

ESTADĂ?STICA (2015), 67, 188 y 189, pp. 89-115

Figure 3. Schematic representation of a latent growth curve with a linear trend

To compare the models we obtain the bias. The percent of bias of the parameters is calculated by the quotient of the difference between the parameter estimate (đ?›˝Ě‚), and the population parameter (β ) divided by the population parameter (β ).

bias i =

βˆi − β i Ă— 100 βi

(4)

The higher the bias indicates the parameter estimate (đ?›˝Ě‚) is distant from the population parameter (β ) The Sample and Measures The analyses use four waves of information (1989, 1990, 1991, and 1992) collected from the 451 families with complete data and 82 with incomplete data who participated in the Iowa Youth and Families Project (Conger and Elder, 1994). Interviews conducted with the two biological parents, the target child (a seventh grader in 1989), and a near-age sibling in each family. The families were sampled from public and private schools in communities with less than 6,500 inhabitants an eight-county area in north central Iowa. The participation rate for families meeting


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

99

criteria for inclusion in terms of family structure and geographic location was 78.8 percent and the retention rate was 95 percent during the four years of data collection. For purposes of this study we used the 369 families for whom we have complete data as our population data. It is from these 369 families –called the population sample–that we delete cases according with the missing data mechanisms (MCAR, MAR and NI), the dependent variable is target alcohol use and problems measured in four consecutive waves (1989, 1990, 1991 and 1992) as described in Table 1. Ten covariates were selected (measured at baseline, in 1989) based on past research for generating random subsamples with patterns of incomplete data. The ten covariates include father drinking problems, sibling’s alcohol use and problems, plus tree measures of parenting harsh discipline, nurturant/warm, and management, target’s deviant behavior, peer alcohol use, peer delinquency and target’s gender (Table 1). Table 1. Descriptive Statistics of the Variables Using Complete Cases (population). Variables Target alcohol use and problems in 1989

Label Min. TAP 1989 .00

Max. 23.00

Mean 1.003

Std.Dev 2.708

Target alcohol use and problems in 1990

TAP 1990 .00

44.66

1.471

4.159

Target alcohol use and problems in 1991

TAP 1991 .00

36.99

2.808

5.438

Target alcohol use and problems in 1992

TAP 1992 .00

33.64

4.224

6.183

Father Alcohol Problems

FALCPR .00

12.65

1.130

1.958

Sibling Alcohol Use

SALCUS .00

13.50

1.412

2.501

Sibling Alcohol Problems

SALCPR .00

20.33

1.292

3.110

Harsh/Inconsistent Parenting

HARSH

22.50

9.556

3.463

Nurturant/Involved Parenting

NURTU 11.00 42.25

24.344

5.123

Management-Parenting

MGMT

10.00 27.00

19.891

3.056

Target's Deviant Values

TDEV

6.00

18.00

7.659

1.680

Peers Alcohol Use

PALCUS .00

3.00

.510

.647

Peers Delinquency

PDEL

.00

14.00

2.314

2.312

Gender

TSEX

0

1

-

-

Variables measured at baseline (1989)

4.00

Note: TAP, indicates target adolescent alcohol use and problems.


100

ESTADĂ?STICA (2015), 67, 188 y 189, pp. 89-115

Patterns of Incomplete Data Missing data patterns are generated in accordance with three significant data scenarios: MCAR, MAR and NI as depicted in Table 2. In the MCAR pattern, the units to be missing (đ?‘Œđ?‘šđ?‘–đ?‘ ) are selected without replacement using a uniform distribution from the population (n=369). In this case every individual observation at a given time has the same probability of being incomplete; with this sampling strategy we mimic situations in which there is no particular reason why an observation is missing. We selected a single random subsample without replacement to generate subsamples with 10, 20, and 30 percent of incomplete data, calling them 10 MCAR, 20 MCAR, and 30 MCAR, respectively. Table 2. Simulated MCAR, MAR and NI patterns Pattern of missing data MCAR

MAR

NI

Description Ymis has been deleted by choosing observations to be missing at random without replacement at any wave of data collection. Data were missing at 10, 20 and 30 percent. No covariates had any influence for selecting subjects with missing values. Observations were grouped into three clusters using 10 covariates as described in Table 1. The cluster that scored high on FALCPR, SALCUS, SALCPR, HARSH, NURTU, TDEV, PALCUS, PDEL and low on MGMT was selected to have higher probability of missing values than the other two groups. Ymis was generated by selecting subjects of having missing values with low probability at wave 1 and higher probability at wave 4. Simulating attrition data over time. The amount of missing data is about 30%. Observations were deleted for every subject with drinking levels beyond the median at year 1991 (e.g. 9th grade). Every subject in this category was eligible to be missing at waves 3 or 4. No covariates had influence for selecting subjects with missing values. The amount of missing data is 30%.

In the MAR pattern the mechanism of missing data is hypothetically known. For this case we used 10 covariates (Table 1) to form three clusters of observed cases. The clusters were formed using Wards’ method and the Euclidian distance as a metric (Ward, 1963). These covariates are based on the average of the first two years of the study (1989 and 1990), when adolescents were in 7th and 8th grades. The MAR mechanism was recreated by deleting measures from the most disadvantaged cluster (e.g., non authoritative parenting, a family with drinking history, high score in target delinquency, high scores on negative peer influences)


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

101

and fewer from the lower disadvantaged clusters. The probabilities of having missing values were small at waves 2 and 3 and higher at wave 4, simulating high attrition rates as waves of data collection progresses. About 30% of the measures were deleted using this mechanism. The variables used in the imputation model for recovering missing data are the same variables used in the cluster step. The NI pattern used a completely different scheme for deleting observations and was obtained by deleting about 30% of the observations focusing on the outcome only. Thus, the NI pattern was created by deleting some adolescents with drinking levels beyond the median at year 1991 (i.e. when adolescents were in 9th grade) without any consideration of covariates. A random sample of adolescents with a history of heavy drinking is deleted. The new sample size, using listwise deletion (LD) is 257. This scheme for deleting observations simulates a scenario whereby adolescents refuse to report the amount of drinking because it is heavy or they leave the panel study. This simulation creates a NI sample and reproduces situations where missing data occurs because the level of the response variable i.e. the higher the amount of alcohol intake the higher the probability of missing data. Overall, we have five scenarios of missing data: 10% MCAR, 20% MCAR, 30% MCAR, 30% MAR; and 30% NI (Table 2). Once data were deleted from the complete (n=369) data set according to the three different mechanisms of missing data (i.e. MCAR, MAR and NI), they were again imputed using three imputation methods (EM-algorithm, multiple imputation, and FIML). A linear growth curve was fit to the complete data set, to the data set without imputation (listwise deletion), and the three imputed data sets. Software There are several programs that perform multiple imputation analysis. For example the program SAS contains the procedures PROC MI and PROC MIANALYZE that perform this analysis (SAS Institute Inc, 2013), this procedure is explained in detail in Vargas, Decker, Schroeder and Offord (2003). We used this program for the EM step and produce a single data. Another program is MPLUS that also performs multiple imputation and fits LGC models, includes a procedure that combines all models into one (Muthén and Muthén, 2013); we used this program to perform FIML estimation. Other programs that perform multiple imputation are STATA, that includes the routines “mi”. The program SPSS/AMOS includes also routines for conducting multiple imputation and fits LGC models and imputation. Additionally R-project,


102

ESTADÍSTICA (2015), 67, 188 y 189, pp. 89-115

includes the routines “AMELIA II”, “mice”, “missForest”, “Hmisc” and “mi” that perform multiple imputation that can be combined with OpenMx that can perform LGC models (Shiyko, Ram, and Grim, 2012). Results Six models were fitted for each scenario. The population model includes all 369 families will be the basis of comparison of all other models and will be used to compare the results to other patterns based on different conditions tested (e.g. MCAR, MAR and NI). The covariance matrices are shown in Table 3 display a typical linear growth, with both means (underneath each matrix) and variances (the diagonal terms) increasing from time 1 to time 4. The pattern of the covariance matrix for the population shows the matrix for the complete data set. For example, for population the variances at time 1, 2, 3 and 4 are 7.33, 17.30, 29.58, and 38.23, respectively. A similar increasing trend is observed on the means, i.e. 1.00, 1.47, 2.81, and 4.22. Although the covariance between two consecutive time points indicates a certain level of dependence, the covariance decreases substantially at two non-consecutive points, e.g. the lowest covariance is at time 1 and 4, or 5.35. This pattern provides good possibilities for a linear growth fit. The covariance matrix and means for the values in 10 MCAR, 20 MCAR, and 30 MCAR data sets are very similar to the corresponding values in the population. However, as the sample size decreases with increasing missing data, the standard deviation increases. The covariance matrix for the MAR sample shows a different story when compared to the population. The means and standard deviations are lower at time 1 and 2, compared to the population, which is consistent with our expectation that more subjects are missing from higher disadvantaged families than from lower clusters. In addition, the texture of the covariance matrix preserved some similarity with respect to the population only for the terms (1,i), and (2,s) for i =1,2,3,4 and s=2,3,4 i.e., the terms in the diagonal and the lower triangle decreased at times 1 and 2 as compared to the population. This pattern would suggest a poor fit at the initial level (see Table 3). The covariance matrix follow the structure of the lagged covariances as shown in equation (3).


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

103

Table 3: Covariance matrices and descriptive statistics using listwise deletion Population (N=369) (1) (2) (1) TAP1 1989 7.33 (2) TAP 1990 5.98 17.30 (3) TAP 1991 7.15 12.05 (4) TAP 1992 5.35 10.27 Means 1.00 1.47 Standard Deviations 2.71 4.16 20 MCAR (n=298) (1) (2) (1) TAP 1989 6.95 (2) TAP 1990 6.43 20.43 (3) TAP 1991 5.66 12.88 (4) TAP 1992 5.55 11.37 Means 1.04 1.58 Standard Deviations 2.64 4.52 MAR (N=256) (1) (2) (1) TAP 1989 3.66 (2) TAP 1990 3.00 7.78 (3) TAP 1991 3.24 6.88 (4) TAP 1992 2.82 5.92 Means 0.66 1.02 Standard Deviations 1.91 2.79

(3)

(4)

29.58 21.23 38.23 2.81 4.22 5.44 6.18 (3)

(4)

28.39 21.87 39.36 2.83 4.38 5.33 6.27 (3)

(4)

20.52 19.15 35.65 2.28 3.90 4.53 5.97

(1) 7.02 4.40 6.71 4.46 1.01 2.65

10 MCAR (n=331) (2) (3) (4)

12.79 10.43 30.50 7.88 21.17 37.74 1.42 2.84 4.33 3.58 5.52 6.14 30 MCAR (n=253) (1) (2) (3) (4) 8.76 7.04 19.48 8.69 12.42 30.50 5.88 10.93 20.87 39.97 1.05 1.59 3.09 4.47 2.96 4.41 5.52 6.32 NI (n=257) (1) (2) (3) (4) 4.48 2.03 5.18 4.66 4.50 10.34 3.42 2.04 4.90 14.91 0.67 0.57 0.70 2.00 2.12 2.28 3.22 3.86

Note: TAP, indicates target adolescent alcohol use and problems.

Finally, the covariance matrix for the NI missingness pattern shows some potential problems. First, the means and standard deviations depart from the observed trend as observed in the population. The variance terms on the diagonal do not increase as they do in the covariance from the population. Second the remaining offdiagonal covariance terms departs from the pattern for the population; although, the variances and means increase over time, they are different from the population, suggesting potential misleading results under NI missingness pattern. The ML estimates for the LGC model for the population with a linear trend depicted in Equation (1) are shown in Table 4 numbers in parenthesis indicate the standard error. From this table the equation for the population is

yij = 0.973 + 0.979 Νi for i=1, 2, 3,4 and j =1,‌, 369.

(5)


104

ESTADÍSTICA (2015), 67, 188 y 189, pp. 89-115

The point estimates from model (5) will be used as reference to compare the efficiency of imputation methods under different schemes of missing data. The last column of this table indicates the actual size of the sample analyzed in each scenario. For example, for the LD 10%MCAR there are N=331 cases analyzed to estimate the current parameters of the model, whereas for EM 10%MCAR, there are N=369 cases included for the estimation of the model parameters. Table 4: ML parameter estimates (standard errors) for LGC model comparing LD, EM, MI, and FIML Intercept, πo slope, π1 Cov (πo,π1) Mean Mean RMSEA Variance, τ00 Variance, τ11 τ01 Population 0.973 (0.140) 6.514 (0.813) 0.979 (0.097) 2.398 (.296) -0.18 (0.37) 0.11 LD 10 MCAR 0.911 (0.143) 4.868 (0.754) 1.009 (0.104) 2.223 (0.320) 0.07 (0.38) 0.16 20 MCAR 1.016 (0.153) 6.208 (0.905) 1.013 (0.110) 1.013 (0.110) -0.20 (0.43) 0.12 30 MCAR 1.034 (0.186) 8.157 (1.157) 1.057 (0.117) 2.345 (0.368) -0.46 (0.52) 0.10 MAR 0.642 (0.111) 3.381 (0.472) 0.869 (0.102) 2.386 (0.272) -0.19 (0.27) 0.21 NI 0.493 (0.118) 1.214 (0.387) 0.219 (0.056) -0.410 (0.122) 1.18 (0.16) 0.22 EM 10 MCAR 0.977 (0.140) 6.683 (0.805) 0.968 (0.096) 2.407 (0.288) -0.26 (0.37) 0.12 20 MCAR 0.950 (0.141) 5.885 (0.783) 0.931 (0.094) 2.281 (0.280) -0.18 (0.36) 0.12 30 MCAR 0.960 (0.141) 6.189 (0.777) 0.943 (0.096) 2.269 (0.280) 0.06 (0.36) 0.10 MAR 0.920 (0.140) 5.772 (0.713) 1.000 (0.096) 2.455 (0.275) -.019 (0.35) 0.21 NI 0.857 (0.136) 3.739 (0.688) 0.834 (0.092) 1.376 (0.280) 0.43 (0.35) 0.12 FIML 10 MCAR 0.976 (0.141) 6.610 (0.817) 0.976 (0.098) 2.413 (0.304) -0.22 (0.38) 0.12 20 MCAR 0.961 (0.141) 6.030 (0.823) 0.949 (0.097) 2.278 (0.302) -0.21 (0.39) 0.10 30 MCAR 0.970 (0.141) 6.265 (0.845) 0.977 (0.100) 2.216 (0.322) 0.03 (0.41) 0.09 MAR 0.978 (0.141) 6.692 (0.798) 0.960 (0.102) 2.654 (0.306) -0.48 (0.39) 0.16 NI 0.854 (0.137) 3.616 (0.615) 0.688 (0.079) 0.788 (0.201) 1.22 (0.27) 0.17 MI 10 MCAR 0.963 (0.141) 6.147 (0.827) 0.943 (0.098) 2.358 (0.311) 0.12 (1.28) 0.14 20 MCAR 0.978 (0.141) 6.423 (0.851) 0.924 (0.096) 2.546 (0.334) -0.33 (0.59) 0.09 30 MCAR 0.960 (0.141) 6.010 (0.808) 1.066 (0.092) 1.676 (0.303) -0.003 (0.46) 0.09 MAR 0.900 (0.146) 4.735 (0.792) 1.012 (0.099) 2.107 (0.316) -0.34 (0.55) 0.15 NI 0.896 (0.175) 2.839 (1.400) 0.527 (0.116) 0.934 (0.960) 1.27 (1.32) 0.09 Note: The RMSEA indicates the degree of fit of the model. Sample

N 369 331 298 253 256 257 369 369 369 369 369 369 369 369 369 369 369 369 369 369 369


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

105

LD results Listwise deletion ignores observations with missing data and in some cases this can lead to seriously biased parameter estimates. We have simulated five scenarios of missing data, three under MCAR (at 10, 20 and 30 percent), and two more under MAR, and NI conditions. Thus we can assess the degree to which different amounts of missing data could dampen statistical inferences. Parameters estimates form the population will be used as a reference point and compared to model (5). As shown in Table 4, the slope under 10, 20 and 30 percent under the MCAR pattern are 1.009, 1.013, and 1.057, respectively; the percent of bias for standard errors grows as the amount of missing data increases from 10 to 30 percent i.e. (7.2%, 13.4%, and 20.6%). A most obvious reason is that standard errors increase because less amount of data is available as the percentage of missing data increases. A similar clear tendency of increasing bias in point estimates of slopes or size of variance can be found in the remaining parameter estimates. On the other hand, when data are deleted using MAR or NI conditions the listwise deletion method generates parameter estimates that are biased downwards. For example, the bias for the intercept estimates, and slope under NI are –49.3% and –77.6, respectively (see table 5 for the LD part). The same conclusion can be obtained for the remaining estimates under the MAR and NI condition. Overall, when data is MCAR and estimates are obtained using the LD method, the amount of bias for the estimates increases and standard errors are inflated as the proportion of missing data increases. When data are missing under the MAR or NI patterns, concerns about bias in estimation based on listwise deletion are warranted.


106

ESTADÍSTICA (2015), 67, 188 y 189, pp. 89-115

Table 5: Parameter estimates and standard errors bias for LD, EM, MI, and FIML compared to the Population model

Sample LD 10 MCAR 20 MCAR 30 MCAR MAR NI EM 10 MCAR 20 MCAR 30 MCAR MAR NI FIML 10 MCAR 20 MCAR 30 MCAR MAR NI MI 10 MCAR 20 MCAR 30 MCAR MAR NI

ParameterBias π0

π1

Std. ErrorBias σ (π0) σ(π1)

-6.4% 4.4% 6.3% -34.0% -49.3%

3.1% 3.5% 8.0% -11.2% -77.6%

2.1% 9.3% 32.9% -20.7% -15.7%

7.2% 13.4% 20.6% 5.2% -42.3%

0.4% -2.4% -1.3% -5.4% -11.9%

-1.1% -4.9% -3.7% 2.1% -14.8%

0.0% 0.7% 0.7% 0.0% -2.9%

-1.0% -3.1% -1.0% -1.0% -5.2%

0.3% -1.2% -0.3% 0.5% -12.2%

-0.3% -3.1% -0.2% -1.9% -29.7%

0.7% 0.7% 0.7% 0.7% -2.1%

1.0% 0.0% 3.1% 5.2% -18.6%

-1.0% 0.5% -1.3% -7.5% -7.9%

-3.7% -5.6% 8.9% 3.4% -46.2%

0.7% 0.7% 0.7% 4.3% 25.0%

1.0% -1.0% -5.2% 2.1% 19.6%

EM results Having observed that LD methods rises concerns about biased estimation, we apply the Expectation-Maximization algorithm to see if it does better. It is remarkable that parameter estimates using EM algorithm for 10 MCAR (i.e. 0.977±0.140 and 0.968±0.096) are practically the same as the population, the percent of bias is negligible (i.e. 0.4% and –1.1%; table 5). However, estimates under the MCAR condition are slightly biased downwards as the percentage of missing data increases to 30 percent, point estimates for the intercept and slope are relatively small and not likely to be of concern. The percent of bias from the population of standard errors for the slope at 10, 20 and 30 MCAR are negligible (i.e. -1.0%, -3.1%,


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

107

and -1.0%, respectively; table 5). The same argument holds for the slope intercepts (e.g. mean and variance). The 95%CI of the parameter estimates for π0 and π1, contain the population, when the EM algorithm is used to impute data under the MAR assumption and standard errors are remarkably similar to the population. However, problems are again found when applying the EM algorithm to impute missing values under NI condition. Despite the stable standard errors for the intercept (0.140), which are practically the same as to the standard errors for the population (0.140), the estimate for the intercept (0.857) is biased downwards compared to the population (0.973). The same observation holds for the remaining parameters (e.g. slope, means and variance). The estimate for the covariance between π0 and π1, is positive (0.43) whereas the corresponding parameter for the population is negative (-0.18), as indicated in Table 4. FIML results One potential disadvantage of FIML methods is that they do not input specific data points; instead, they create parameter estimates for variances and covariances that use all information available. This disadvantage is of little concern when doing structural equation modeling (SEM), because SEM models usually set up with covariance matrices as input. But how FIML performs under these four different conditions? The parameters estimates for the LGC model are shown under three different percentages of incomplete data under MCAR (Table 4). A clear picture emerges from this table when data is MCAR at 10, 20 or 30 percent. Under these three conditions FIML estimates are relatively unbiased compared to population. For example, the bias for the slope estimates for 10, 20 and 30 MCAR are -0.3, 3.1, and -0.2%, respectively compared to the population. The bias for standard errors for the estimates is negligible. Bias for FIML estimates under MAR condition is negligible compared to the population (i.e. the percent of bias for the slope is –2.1% compared to the population). Similar results were obtained under the NI condition is reproduced when using FIML, and parameters estimates are biased downwards (i.e. the percent of bias for the intercept and slope are –12.2% and –29.7%, respectively). FIML obtained biased estimates for the LGC model when missing data is NI. MI results Multiple imputations method was applied to the same scenarios as previously described in this manuscript. In each case five data sets were generated and


108

ESTADÍSTICA (2015), 67, 188 y 189, pp. 89-115

Rubin’s Rules were applied to calculate a single estimate, shown in Table 4. The results show that under MCAR and MAR conditions parameter estimates were unbiased as compared to the parameters from the population. For example, the percent of bias for the slope standard error under 10, 20, 30 MCAR and MAR scenarios were -3.7%, -5.6%, 8.9%, and 3.4%,respectively (see Table 5). The estimates for the mean intercept and standard error are unbiased as well, for MCAR and MAR patterns. Similarly standard errors for all these four scenarios are unbiased with respect to the population. Notice that estimates for standard errors are inflated due to variability induced by the imputation process when the Rubin’s Rules are applied to correct the standard errors (Little and Rubin, 2002). However, the current MI method used does not recover missing data effectively under NI missingness. Estimates for the intercept and slope are biased downwards. For example, the percent of bias for the slope estimate is –46% compared to the population. A plausible explanation of this difference can be attributable to the fact that most of the missing values are found by the 3rd and 4th year of data collection. Thus, the imputed values have a detrimental impact on the slope rather than on the intercept. Overall, under NI missingness pattern parameter estimates are biased downwards for all imputation methods, compared to the population. Discussion In real situations data are incomplete for a variety of reasons, and researchers have to make plausible assumptions to guarantee that inferences are not threatened. In this manuscript we had created different scenarios to reflect real conditions by which of missing data are generated. We learned from them how inferences can be affected and how imputation methods could help. We started these experiments by assuming that the population of 369 observations from which we had four years of complete data. Three different missing data conditions were simulated including MCAR, MAR and NI patterns (Table 2). Our tasks were to recover incomplete data using EM-algorithm, FIML, and MI methods and compare the bias of the parameter estimates with the known population and with results obtained using listwise deletion. Our model was a four-wave linear growth curve of adolescents drinking behavior. The experiments clarified some aspects of data imputation methods and suggested new questions. First, the mean intercepts and slopes when using LD method —the most common approach where missing data are deleted from analyses—were not affected for the three MCAR conditions. However, the standard error of the estimates increased as the number of incomplete cases increased due to the fact that less data was available. These experiments suggested that at 10 percent or less of


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

109

missing data the LD method provided good estimates, but the standard errors increased as the percentage of missing data increased. Although, the MCAR condition occurs least frequently in the “real world,� it can be generated as a result of a planned design where measures are taken at random (Graham, Taylor, and Cumsile, 2001). In longitudinal studies data are often lost for specific reasons that are related to the characteristics of the study. For example, adolescents who drink at a high level probably do not want to admit it. They skip the questions on drinking, or they simply drop from the study because of multiple conduct problems. Then the MAR assumption could be tenable assuming that specific covariates needed for recovering missing data are available. The MAR assumption is more plausible in most of real situations than MCAR. Results from data imputation using the MAR assumption showed negligible bias from population estimates. MI methods recover missing data efficiently, and similar results were observed when using EM or FIML approaches. The ML estimates for the LGC model using the EM and FIML methods for recovering missing data were relatively unbiased within error expected at random for the population parameters for practically all imputed data sets except for the NI condition. Since EM and FIML are maximum likelihood based methods, it is likely that EM provides good quality imputations, particularly when using covariates and FIML uses the information available efficiently. It is possible that the EMalgorithm uses the monotonic pattern property at wave 1 there are no missing data, as data collection progresses in future waves the amount of missing data increases, creating a stair step pattern that helps improve good quality imputations for longitudinal data as compared to MI (Gelman, et al 1995). The monotonic pattern property is critical in longitudinal data when using the EM-algorithm; in cross sectional data, the EM method would produce good quality imputations whenever the monotone pattern property could hold. Problems arose when longitudinal data was missing because the NI mechanism (i.e. the missing data is related to the variable itself). The NI sample did not have any covariates in the imputation model and all methods used provided poor results. Poor quality of imputations under NI missigness pattern was associated with the inefficiency of the methods used. There are other possible approaches to tackle the NI situations: (1) For example, incorporating autoregressive error terms into the model, adding a logistic regression model into the missing data mechanism and providing adequate priors, could make available better imputations for non-ignorable patterns (Congdon,


110

ESTADĂ?STICA (2015), 67, 188 y 189, pp. 89-115

2003). Another alternative method is using pattern mixture model, where the joint density of R —a matrix with binary coding that indicates weather a datum is missing or present— and đ?‘Œ = (đ?‘Œđ?‘œđ?‘?đ?‘ , đ?‘Œđ?‘šđ?‘–đ?‘ đ?‘ ) can be expressed as the product of the marginal density of R and the conditional density Y|R (Little, 1993, 1994). In this case there are several pattern of missing data and the portion of models specifying the missing data mechanism does not depend on the missing values (Fairclough, 2002). Further, Enders (2011) proposes a pattern mixture model approach to tackle this problem for longitudinal data. However, these approaches are beyond the scope of this investigation. For further Reading on this topic consult (Enders, 2010; McKnight, McKnight, Sidani, and Figueredo, 2007). In real life situations, the MAR mechanism seems to be the most common and we observed in these experiments that using adequate covariates for the imputation model provided good quality imputations. The adequate selection of predictor variables for the MAR model has an effect on the quality of imputations. Good quality imputations might be generated when the analyst uses adequate covariates in the imputation model (Graham and Schafer, 1999; Sinharay, Stern and Russell 2001; Schafer, 1997). When data have a non-ignorable mechanism is better to assume that data are MAR and use multiple imputations. This suggests that even in a wrong assumption, the MI method provides better quality imputations than using LD data. In previous experiments using MI, five data sets are enough to provide unbiased estimates (Little and Rubin, 2002). The number of imputed data sets needed when using MI varies between 5 to 10 data sets to provide reliable estimates (Schafer, 1997). In previous examples presented in this manuscript we found that five imputations was enough to get adequate estimates, and going beyond that did not yield significant improvement (results not shown). However, when the amount of missing data is close to 50 percent, it is quite possible that more than five data sets and more covariates in the imputation model are needed to produce good quality imputations. Overall, in these experiments all imputation methods performed well —except when applied to the NI mechanism— in particular MI and FIML produced good quality imputations under MCAR, and MAR conditions. The advantage of MI is that with this approach we have a better control of the imputation model needed to obtain good quality imputations. An additional note is that MI methods are not designed for making a forecast of missing data at the individual level. Instead MI methods are intended for producing simulated data that preserves the structure from the available data, based on its prior distribution, and its covariance matrix. Caution is needed when imputing discrete variables and special routines should be


VARGAS-CHANES et al.: Inference with missing data using latent growth curves...

111

used (see MIX routines in Schafer, 1997). From these experiments the authors encourage the use of imputation methods and has a tremendous advantage in increasing the power and obtaining unbiased estimates. However, some caution needs to be exercised when conducting imputations in general. The imputation methods presented so far are not able to recover data from a bad study design. If there is a concern that missing data happened because of a bad study design, imputation methods are not warranted for successful recovery of incomplete information. References ARBUCKLE, J. L. (1996). "Full information estimation in the presence of incomplete data". Advanced structural equation modeling: Issues and techniques. G. A. Marcoulides and R. E. Schumacker (Eds.) (pp. 243-277). Mahwah, New Jersey: Lawrence Erlbaum Associates, Inc. BELL, M. L. and FAIRCLOUGH, D. L. (2014). "Practical and statistical issues in missing data for longitudinal patient-reported outcomes." Statistical Methods in Medical Research. 23(5): 440-459. BOLLEN, K. A. and CURRAN, P. J. (2006). Latent curve models: A structural equation perspective. Wiley Series in Probability and Statistics. New York. CONGDON, P. (2003). Applied bayesian modelling. John Wiley & Sons Ltd. West Sussex, England. CONGER, R. D. and ELDER, G. H. J. (1994). Families in troubled times: Adapting change in rural America. Aldine de Gruyter. Hawthorne N.Y. DEMPSTER, A. P., LAIRD, N. M. and RUBIN, D. B. (1977). "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society. B. 39: 1-38. ENDERS, C. K. (2010). Applied Missing Data Analysis. The Guilford Press. New York. ENDERS, C. K. (2011). "Missing not at random models for latent growth curve analyses." Psychological Methods.16(1): 1-16.


112

ESTADÍSTICA (2015), 67, 188 y 189, pp. 89-115

FAIRCLOUGH, D. L. (2002). Design and analysis of quality of life studies in clinical trials. Chapman & Hall/CRC. Boca Raton. Florida. GARCÍA-LAENCINA, P. J., HENRIQUES ABREU, P., HENRIQUES ABREU, M. and AFONOSO, N. (2015). "Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values." Computers in Biology and Medicine.59: 125-133. GELMAN, A., CARLIN, J. B., STERN , H. S. and RUBIN, D. B. (1995). Bayesian data analysis. Chapman & Hall. New York. GRAHAM, J. W. and SCHAFER, J. L. (1999). "On the performance of multiple imputation for multivariate data with small sample". Statistical strategies for small sample research. R. H. Hoyle (Ed.) (pp. 1-29). SAGE publications, Inc. CA. GRAHAM, J. W., TAYLOR, B. J. and CUMSILE, P. E. (2001). "Planned missing data designs in the analysis of change ". New methods for the analysis of change. L. M. Collins and A. G. Sayer (Eds.) (pp. 335-353). American Psychological Association. Washington, D.C. KAPLAN, D. (2009). Structural equation modeling: Foundations and extensions (2nd ed.). Sage Publications. Newbury Park, CA. KNIGHT, G. P., VARGAS-CHANES, D., LOSOYA, S. H., COTA-ROBLES, S., CHASSIN, L. and LEE, J. M. (2009). "Acculturation and enculturation trajectories among Mexican American adolescents." Journal of Research in Adolescence. 19(4): 625-653. LEE, K. J. and SIMPSON, J. (2014). "Introduction to multiple imputation for dealing with missing data." Respirology.19: 162-167. doi: 10.1111/resp.12226. LIN, C.-Y., HSIEH, Y.-H. and CHEN, C.-H. (2015). "Use of latent growth curve modeling for assessing the effects of summer and after-school learning on adolescent students’ achievement gap." Asia Pacific Educ. Rev.16: 49-61. LITTLE, R. J. A. (1993). "Pattern-mixture models for multivariate incomplete data." Journal of American Statistical Association.88: 125-134. LITTLE, R. J. A. (1994). "A class of pattern mixture models for multivariate incomplete data." Biometrika. 81: 471-483.



LITTLE, R. J. A. and RUBIN, D. B. (2002). Statistical analysis with missing data (Second ed.). John Wiley & Sons, Inc. Hoboken, New Jersey.

MARTIN, D., LITTLE, R. J. A., SAMUEL, M. E. and TRIEST, R. K. (1986). "Alternative methods for CPS income imputation." Journal of the American Statistical Association. 81(393): 29-41.

MCARDLE, J. J. (1986). "Dynamic but structural equation modeling with repeated measures data". Handbook of multivariate experimental psychology. J. R. Nesselroade and R. B. Cattell (Eds.). 2: 551-614. Plenum. New York.

MCARDLE, J. J. (1994). "Structural factor analysis experiments with incomplete data." Multivariate Behavioral Research. 29(4): 409-454.

MCARDLE, J. J. and EPSTEIN, D. (1987). "Latent growth curves within developmental structural equation models." Child Development. 58: 110-133.

MCKNIGHT, P. E., MCKNIGHT, K. M., SIDANI, S. and FIGUEREDO, A. J. (2007). Missing data: A gentle introduction. The Guilford Press. New York.

MUTHÉN, B. O. (1991). "Analysis of longitudinal data using latent variable models with varying parameters". Best methods for the analysis of change: Recent advances, unanswered questions, future directions. L. Collins and J. Horn (Eds.) (pp. 1-17). American Psychological Association. Washington, D.C.

MUTHÉN, B. O. (2002). "Beyond SEM: General latent variable modeling." Behaviormetrika. 29(1): 81-117.

MUTHÉN, B. O., KAPLAN, D. and HOLLIS, M. (1987). "On structural equation modeling with data that are not missing completely at random." Psychometrika. 52(3): 431-462.

MUTHÉN, B. O. and MUTHÉN, L. K. (2013). Mplus Version 7.11 statistical analysis with latent variables: User's Guide (Fourth ed.). Muthén & Muthén. Los Angeles, CA.

ROGOSA, D., BRANDT, D. and ZIMOWSKI, M. (1982). "A growth curve approach to the measurement of change." Psychological Bulletin. 92: 726-748.

RUBIN, D. B. (1976). "Inference and missing data." Biometrika. 63(3): 581-592.



RUBIN, D. B. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons. New York.

SAS INSTITUTE INC. (2013). SAS/STAT Software: Changes and Enhancements, Release 9. SAS Institute Inc. Cary, NC.

SCHAFER, J. L. (1997). Analysis of incomplete multivariate data. Chapman & Hall. New York.

SCHAFER, J. L. and OLSEN, M. K. (1998). "Multiple imputation for multivariate missing-data problems: A data analyst's perspective." Multivariate Behavioral Research. 33(4): 545-571.

SHIYKO, M. P., RAM, N. and GRIMM, K. J. (2012). "An overview of growth mixture modeling". Handbook of structural equation modeling. R. H. Hoyle (Ed.) (pp. 532-546). The Guilford Press. New York.

SINHARAY, S., STERN, H. S. and RUSSELL, D. (2001). "The use of multiple imputation for the analysis of missing data." Psychological Methods. 6(4): 317-329.

TANNER, M. (1993). Tools for statistical inference: Methods for the exploration of posterior distributions and likelihood functions. Springer-Verlag. New York.

TREIMAN, D. J., BIELBY, W. T. and CHENG, M. T. (1988). "Evaluating a multiple-imputation method for recalibrating 1970 U.S. Census detailed industry codes to the 1980 standard." Sociological Methodology. 18: 309-345.

VARGAS, D. and CORTÉS, F. (2014). "Análisis de las trayectorias de la marginación municipal en México de 1990 a 2010." Estudios Sociológicos, El Colegio de México. 32(95): 261-294.

VARGAS, D., DECKER, P., SCHROEDER, D. and OFFORD, K. P. (2003). Introduction to multiple imputation methods: Handling missing data with SAS V8.2. Mayo Clinic Technical Report No. 67. Mayo Clinic Foundation. Rochester, MN.

VERBEKE, G. and MOLENBERGHS, G. (2000). Linear mixed models for longitudinal data. Springer-Verlag. New York.

WARD, J. (1963). "Hierarchical grouping to optimize an objective function." Journal of the American Statistical Association. 58: 236-244.



WILLETT, J. B. and SAYER, A. G. (1994). "Using covariance structure analysis to detect correlates and predictors of individual change over time." Psychological Bulletin. 116: 363-381.

WIMMERS, P. F. and LEE, M. (2015). "Identifying longitudinal growth trajectories of learning domains in problem-based learning: a latent growth curve modeling approach using SEM." Advances in Health Sciences Education. 20: 467-478.

YOON, J. Y., BROWN, R. L., BOWERS, B. J., SHARKEY, S. S. and HORN, S. D. (2015). "Longitudinal psychological outcomes of the small-scale nursing home model: a latent growth curve zero-inflated Poisson model." International Psychogeriatrics. 27(6): 1009-1016. doi: 10.1017/S1041610214002865.

YOUNG, R. and JOHNSON, D. R. (2014). "Handling missing values in longitudinal panel data with multiple imputation." Journal of Marriage and Family. 77(February): 277-294. doi: 10.1111/jomf.12144.

Received February 2014. Revised December 2015.

i. The measurement term is defined in structural equation models as the error associated with imperfect measurements such as latent variables. In the case of mixed linear models this is called the error term.

ii. These restrictions are necessary to identify the LGC model (Bollen and Curran, 2006).

iii. If we assume a quadratic model, then we need a third latent variable $\pi_2$, named the quadratic slope, and the corresponding loadings $\lambda_{13}, \lambda_{23}, \ldots, \lambda_{43}$; these terms need to be fixed to 1, 4, 9 and 16, respectively.
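As a concrete reading of endnote iii, the display below sketches the measurement equation of a four-wave quadratic LGC model under those constraints. The quadratic loadings 1, 4, 9 and 16 are taken from the note itself; the column of ones for the intercept $\pi_0$ and the linear time scores 1, 2, 3, 4 for $\pi_1$ are conventional identification choices assumed here, since the note fixes only the quadratic terms.

% Minimal sketch: measurement part of a four-wave quadratic LGC model.
% Column 1 (intercept) and column 2 (linear slope) are assumed conventions;
% column 3 (quadratic slope) is fixed to 1, 4, 9, 16 as in endnote iii.
\[
\begin{pmatrix} y_{1} \\ y_{2} \\ y_{3} \\ y_{4} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 \\
1 & 2 & 4 \\
1 & 3 & 9 \\
1 & 4 & 16
\end{pmatrix}
\begin{pmatrix} \pi_{0} \\ \pi_{1} \\ \pi_{2} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{1} \\ \varepsilon_{2} \\ \varepsilon_{3} \\ \varepsilon_{4} \end{pmatrix}
\]

With all loadings fixed in this way, only the means and (co)variances of $\pi_0$, $\pi_1$ and $\pi_2$, together with the residual variances, remain to be estimated, which is what identifies the growth model.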



ESTADÍSTICA (2015), 67, 188 y 189, pp. 117-120 © Instituto Interamericano de Estadística

LAS BODAS DE PLATINO DEL INSTITUTO INTERAMERICANO DE ESTADÍSTICA

La revista del Instituto Internacional de Estadística (ISI) publicó en 1959, bajo el título THE INTER AMERICAN STATISTICAL INSTITUTE AT AGE NINETEEN, un artículo de Stuart A. Rice, en ese entonces Presidente Honorario del ISI. Stuart A. Rice, precisamente uno de los fundadores del IASI, en 1940, nos deja una imagen de las circunstancias de la creación de nuestro Instituto, cuando nos dice (en traducción libre al español):

“Esta organización única surgió en 1940, en un momento de profunda incertidumbre acerca del futuro del mundo europeo. Después de meses de correspondencia preparatoria y negociación, la entrega real del bebé se produjo durante las mismas horas en que los ejércitos nazis entraban en París. El destino del Instituto Internacional de Estadística era desconocido en el lado occidental del Atlántico. De hecho, al continuar la Segunda Guerra Mundial, se asumía ampliamente entre los miembros americanos que ese organismo mayor había perecido.

“IASI era un hijo legítimo del venerable ISI, a pesar de que puedan aparecer como difíciles los fenómenos biológicos indicados. El ISI fue representado, en la concepción, por miembros de Argentina, Brasil y México, y un número de miembros de Canadá y los Estados Unidos, que consultaron entre ellos bajo los auspicios de la Comisión de Estados Unidos para la 25ª Sesión del ISI, originalmente programada para reunirse en Washington en 1939. Para facilitar esta consulta, un acogedor Departamento de Estado confió a la Comisión la organización de una Sesión de Estadística del Octavo Congreso Científico Americano, que se reuniría en Washington en mayo de 1940. Una porción del fondo contribuido por fuentes privadas para preparar la Sesión de 1939 fue desviada, con el permiso de los contribuyentes, para la organización del nuevo Instituto.

“La idea general en la mente de los fundadores era proporcionar un medio para colaboración profesional entre los estadísticos de las naciones de las Américas y así llevar adelante el trabajo tradicional del ISI en la relativamente tranquila área de



América del Norte y del Sur. Por otra parte, la existencia del nuevo Instituto, se pensaba, podría servir para asegurar la continuidad en la vida de la organización mayor. Si el árbol era cortado por la guerra, podría crecer nuevamente de las raíces echadas en el Hemisferio Occidental.

“A pesar de la herencia y de las aspiraciones de los padres, sin embargo, el nuevo Instituto comenzó su vida con una personalidad distintiva resultante de influencias ambientales durante el desarrollo prenatal. En 1940, el desarrollo estadístico en las naciones de América, excepto Canadá y los Estados Unidos, era irregular y a menudo elemental. Los cursos de estadística en las universidades eran pocos y enfocados en lo teórico y filosófico, más que en lo realista y práctico. Las estadísticas gubernamentales estaban mucho menos desarrolladas excepto, posiblemente, en el campo del comercio exterior. Los datos censales, cuando disponibles, eran generalmente obsoletos. El número de estadísticos reconocidos no era impresionante y la evaluación social de su profesión daba poca seguridad a sus posiciones. Por otra parte, las distancias entre los países de las Américas son mucho mayores, en promedio, que en el relativamente compacto continente europeo, en el que el Instituto Internacional de Estadística tuvo su origen. Por todas estas razones el Comité Organizador estaba bajo las limitaciones que se aplicarían a cualquier organización internacional de estadísticos en el Hemisferio Occidental”.

Lo que no queda reflejado en este extracto es la vitalidad con que empezó el hijo del ISI a cumplir con su propósito de promover el desarrollo estadístico en la región americana. Después de un período de organización extendido entre el 12 de mayo de 1940, fecha de fundación del Instituto, y el 30 de junio de 1942, el IASI inició una serie de actividades de destacable importancia. En apretado resumen podemos registrar aquí las actividades más relevantes desarrolladas por el IASI en sus 75 años de existencia:

1. Revista Estadística. El primer número de esta revista científica del IASI fue publicado en 1943, y continúa publicándose en el presente.

2. Congresos Interamericanos de Estadística. El IASI organizó el Primer Congreso Interamericano de Estadística, realizado en Washington, D.C. en 1947, y el Segundo Congreso Interamericano de Estadística, llevado a cabo en Bogotá en 1950. Posteriormente, la Organización de los Estados Americanos (OEA) se responsabilizó de la serie, con las Conferencias Interamericanas de Estadística, que organizó con el apoyo del IASI, institución con la que había establecido en 1950 un Acuerdo de



cooperación, renovado en tres oportunidades, cuya vigencia fue extendida hasta 1996.

3. Programa del Censo de las Américas. Este programa, en cuya organización trabajó el IASI en su primera década de existencia, tuvo una influencia muy importante en la organización y levantamiento de censos de población y vivienda en los países de la región en las rondas de los años 1950, 1960, 1970 y 1980.

4. Comisión de Mejoramiento de las Estadísticas Nacionales. El IASI estableció en 1950 esta Comisión, que operó con una serie de Subcomisiones sobre áreas específicas. La última Sesión Plenaria de la Comisión se realizó en 1981. Las funciones que desarrollaba fueron posteriormente asumidas parcialmente por la OEA mediante la Conferencia Interamericana de Estadística. Actualmente se lleva a cabo un programa renovado a través de la Conferencia Estadística de las Américas de la CEPAL.

5. Actividades de capacitación. Desde los primeros años de su existencia, el IASI ha llevado a cabo numerosas actividades de capacitación, entre las que merecen destacarse: (a) actividades de capacitación para los censos de 1950; (b) el Centro Interamericano de Enseñanza de Estadística Económica y Financiera (CIEF), en Santiago de Chile (1953-1961); (c) cursos de 10 meses para la capacitación de funcionarios de las Oficinas Nacionales de Estadística de los países de Centroamérica y el Caribe, en El Salvador (1954-1956), República Dominicana (1957-1958), Costa Rica (1957-1958) y Panamá (1959-1961); y (d) el Centro Interamericano de Enseñanza de Estadística (CIENES), que funcionó en Santiago de Chile (1962-1997).

6. Seminarios de Estadística Aplicada. Este programa se inició en 1987 y hasta el presente se han realizado 12 seminarios en varios países (Argentina, Brasil, Colombia, Costa Rica, Chile, Ecuador, México, Panamá).

7. Reuniones sobre Estadística Pública. Se han realizado 12 reuniones desde 1998. Las sedes de las mismas han sido Argentina, Brasil, Chile, México, Paraguay, Perú y Uruguay.

8. Cursos cortos de capacitación. Este programa se inició en 2001, con cursos de alcance nacional, de dos a tres días, sobre diversos temas, en cooperación con instituciones de diversos países. Hasta el presente se han realizado estos cursos en Argentina, Colombia, Chile, Ecuador, Panamá, Paraguay y Perú.



9. Relaciones con instituciones nacionales e internacionales. El Instituto mantiene permanentes relaciones de cooperación con asociaciones nacionales de estadística, universidades, institutos nacionales de estadística, el ISI, la División de Estadística de las Naciones Unidas, la División de Estadística de la CEPAL, el BID, el Banco Mundial y otras organizaciones internacionales.

El IASI tiene especial interés en recibir en su seno a todos quienes comparten nuestro objetivo de promover el desarrollo de la estadística en nuestra región. Las puertas están abiertas para el ingreso de miembros afiliados (institucionales), titulares (personales) y estudiantes.

Diciembre 2015
Evelio O. Fabbroni
Director Ejecutivo del IASI


ESTADÍSTICA (2015), 67, 188 y 189, pp. 121-123 © Instituto Interamericano de Estadística

THE PLATINUM JUBILEE OF THE INTERAMERICAN STATISTICAL INSTITUTE

The Review of the International Statistical Institute (ISI) published in 1959, under the title THE INTER AMERICAN STATISTICAL INSTITUTE AT AGE NINETEEN, an article by Stuart A. Rice, then Honorary President of the ISI. Stuart A. Rice, precisely one of the founders of IASI in 1940, leaves us a picture of the circumstances of the creation of our Institute, when he says:

"This unique organization came into being in 1940, at a time of profound uncertainty concerning the future of the European world. After months of preparatory correspondence and negotiation, the actual delivery of the infant occurred during the very hours in which Nazi armies were entering Paris. The fate of the International Statistical Institute was unknown on the western side of the Atlantic. Indeed, as the Second World War continued, it was widely assumed among American members that that older body had perished.

"IASI was a legitimate child of the venerable ISI, however difficult the biological phenomena indicated may appear. The ISI was represented, in the conception, by members from Argentina, Brazil and Mexico, and a number of members from Canada and the United States, consulting together under the aegis of the United States Committee for the 25th ISI Session, originally scheduled to meet in Washington in 1939. To facilitate this consultation the Committee was entrusted by a friendly Department of State with the organization of a Statistics Session of the Eighth American Scientific Congress, meeting in Washington in May, 1940. A portion of the fund contributed from private sources to prepare for the 1939 session was diverted, with permission from the contributors, to the organization of the new Institute.

"The general idea in the minds of the founders was to provide a medium for professional collaboration among the statisticians of the American nations and thus to carry forward the traditional work of the ISI within the comparatively peaceful area of North and South America. Moreover, the existence of the new Institute, it was thought, might serve to assure continuity in the life of the older organization. If cut down by the war, the tree might grow again from roots that had spread to the Western Hemisphere.



"Despite heredity and parental aspirations. However, the new Institute began life with a distinctive personality resulting from environmental influences during prenatal growth. In 1940, statistical development within the American nations, excepting Canada and the United States, was spotty and often elementary. Statistics courses in universities were few and tended to the theoretical and philosophical rather than realistic and practical. Governmental statistics were greatly underdeveloped except, possibly, in the field of foreign trade. Census data, to the extent available, were generally out of date. The number of recognized statisticians was unimpressive and the social evaluation of their profession gave little security to their positions. Moreover, distances between countries of the Americas are much greater, on average, than in the relatively compact European continent in which the International Statistical Institute had its origin.For all of these reasons the organizing committee was under constraints that would apply to any international organization of statisticians in the Western Hemisphere." Which is not reflected in this summary is the vitality that the son of the ISI began to fulfill its purpose of promoting statistical development in the Americas region. After a period of organization extended between May 12, 1940, date of foundation of the Institute, until June 30, 1942, IASI began a series of activities of remarkable importance. In tight summary we can post here the most relevant activities carried out by the IASI in its 75 years of existence: 1. Journal Estadística. The first issue of this journal of IASI was published in 1943. And continues to be published today. 2. Inter-American Statistical Congresses. IASI organized the First Inter-American Congress of Statistics, held in Washington, D.C., in 1947, and the Second InterAmerican Congress of Statistics, carried out in Bogotá in 1950. Subsequently, the Organization of American States (OAS) was responsible for the series, with the Inter-American Statistical Conferences, organized with the support of IASI, institution with which a cooperation agreement had been established in 1950, and renewed on three occasions, whose validity was extended until 1996. 3. The Census of the Americas Program. This Program, in whose organization worked IASI in its first decade of existence, had a very important influence in the organization and taking of censuses of population and housing in the countries of the region in the years 1950, 1960, 1970 and 1980 rounds.



4. Committee on Improvement of National Statistics. In 1950, IASI established this Committee, which operated with a series of subcommittees on specific areas. The last plenary session of the Committee was held in 1981. The functions it performed were later partially assumed by the OAS through the Inter-American Statistical Conference. At present, a renewed program is carried out by the Statistical Conference of the Americas of ECLAC.

5. Training activities. Since the early years of its existence, IASI has carried out numerous training activities, among which the following are worth mentioning: (a) training activities for the 1950 censuses; (b) the Inter-American Training Center for Economic and Financial Statistics (CIEF), in Santiago, Chile (1953-1961); (c) 10-month courses for training officials of the national statistical offices of the countries of Central America and the Caribbean, in El Salvador (1954-1956), the Dominican Republic (1957-1958), Costa Rica (1957-1958) and Panama (1959-1961); and (d) the Inter-American Statistical Training Center (CIENES), which operated in Santiago, Chile (1962-1997).

6. Seminars on Applied Statistics. This program was launched in 1987, and 12 seminars have been held to date in several countries (Argentina, Brazil, Colombia, Costa Rica, Chile, Ecuador, Mexico, Panama).

7. Meetings on Public Statistics. There have been 12 meetings since 1998, hosted in Argentina, Brazil, Chile, Mexico, Paraguay, Peru and Uruguay.

8. Short Training Courses. This program was launched in 2001, with two- to three-day courses of national scope on various topics, in cooperation with institutions from different countries. So far these courses have been given in Argentina, Colombia, Chile, Ecuador, Panama, Paraguay, and Peru.

9. Relations with national institutions and international organizations. The Institute maintains permanent relations of cooperation with national statistical associations, universities, national statistical institutes, the ISI, the United Nations Statistics Division, the Statistics Division of ECLAC, the IDB, the World Bank and other international organizations.

IASI is especially interested in welcoming into its membership all who share our goal of promoting the development of statistics in our region. The doors are open for the entry of affiliated (institutional), regular (personal) and student members.

December 2015
Evelio O. Fabbroni
Executive Director of IASI



ESTADÍSTICA (2015), 67, 188 y 189, pp. 125-128 © Instituto Interamericano de Estadística

GUIA PARA EL AUTOR

ESTADISTICA es la revista científica del Instituto Interamericano de Estadística (IASI). Tiene como propósito la publicación de contribuciones en temas estadísticos teóricos y aplicados, dando énfasis a las aplicaciones originales y a la solución de problemas de interés amplio para los Estadísticos y Científicos. Los artículos sobre aplicaciones deben incluir un análisis cuidadoso del problema que traten, tener una presentación clara para contribuir a la divulgación de la metodología y buena práctica estadística, y contener una adecuada interpretación de los resultados. Los artículos sobre aplicaciones pueden también estar orientados a contribuir a un mejor entendimiento del alcance y limitaciones de los métodos considerados. Estos artículos pueden encarar problemas en cualquier área de interés, incluyendo estadística pública, salud, educación, industria, finanzas, etc. Las contribuciones teóricas sin una aplicación correspondiente serán publicadas si presentan un avance significativo en el conocimiento de la disciplina a escala internacional y tienen una clara indicación de cómo pueden los métodos desarrollados ser útiles para aplicaciones relevantes.

Esta publicación está registrada en los siguientes repertorios: el Current Index to Statistics (CIS) de la American Statistical Association (ASA) y el Institute of Mathematical Statistics (IMS), Zentralblatt-Math y el Sistema Regional de Información en Línea para Revistas Científicas de América Latina, el Caribe, España y Portugal (LATINDEX). Su cuerpo editorial es de carácter internacional y está integrado por destacados estadísticos.

Para presentar un artículo tendrá que enviar por e-mail a la Editora dos copias del mismo, una de ellas anónima. El procedimiento editorial es doblemente anónimo, por lo que el nombre y dirección del autor a quien deberá dirigirse la correspondencia deben aparecer sólo en una de las copias. Se aceptarán trabajos en Word, en LaTeX o en Scientific WorkPlace.

Durante el proceso de arbitraje se evalúan distintos aspectos del artículo, a saber, si se lo considera (a) importante; (b) interesante; (c) correcto; (d) original; y (e) adecuado según el perfil de “Estadística”. Un artículo será publicado en esta revista cuando satisfaga simultáneamente estos cinco requisitos.

REQUERIMIENTOS

1. IDIOMAS
Los artículos podrán presentarse en español, portugués, inglés o francés.

2. SOFTWARE
Se aceptarán trabajos en Word, en LaTeX o en Scientific WorkPlace.

3. TAMAÑO DEL PAPEL Y MÁRGENES
• El tamaño de papel deberá ser A4: 21.0 x 29.7 cm (8.26” x 11.69”).
• Use los siguientes márgenes (superior, inferior, izquierdo y derecho): 2.5 cm.



4. FUENTE
Los artículos en Word deberán estar escritos en Times New Roman 11 y los presentados en LaTeX en Roman 12 pt (CMR12).

5. JUSTIFICACIÓN
Excepto para el título, la información de autor y la palabra resumen (que deberán estar centrados), el artículo deberá estar justificado a izquierda y derecha. Los títulos de las secciones y los subtítulos deberán estar justificados a izquierda.

6. ESPACIADO
• El espaciado será simple en todo el artículo, incluyendo el título, la información del autor y el resumen.
• Deberá haber exactamente una línea en blanco antes de los nombres de los autores, Palabras clave, los títulos de las secciones, los subtítulos, Agradecimientos, Notas, Referencias y Apéndices.
• Deberá dejar exactamente dos líneas en blanco antes del resumen.
• Deberá haber exactamente una línea en blanco antes y después de las tablas y las figuras.
• Deberá dejar exactamente una línea en blanco entre párrafos.

7. ÉNFASIS
Use solamente itálicas (no subrayado, no negritas) para dar énfasis al texto. En LaTeX use Text Italic 12 pt (CMTI12).

8. SANGRÍAS
NO DEBE utilizar sangrías.

9. NUMERACIÓN DE PÁGINAS
En la versión final las páginas NO DEBERÁN estar numeradas.

10. ENCABEZADO, PIE DE PÁGINA O NOTAS AL PIE
• En el texto deberá evitarse la utilización de pies de página, encabezados y notas al pie.
• Si fuera absolutamente necesaria la utilización de notas al pie, deberán identificarse con supraíndices numéricos en el orden en que aparezcan en el texto.
• Las notas al pie de página se deberán escribir todas juntas al final del artículo, después de las Referencias.

11. AUTORES
• Centrar los nombres de los autores escritos en MAYÚSCULAS.
• Centrar la afiliación institucional de los autores en minúscula itálica y los datos para su contacto (incluyendo e-mail, teléfono y fax) en minúscula simple.
• Deberá dejar una línea en blanco entre el título y la información de los autores.

12. TÍTULO Y SUBTÍTULOS
• En Word, el título deberá estar centrado y en MAYÚSCULA NEGRITA Times New Roman 13.



• En Word, los subtítulos deberán estar ajustados a izquierda y en minúscula negrita, por ejemplo: Títulos de sección, Agradecimientos, Notas, Referencias, Apéndices, etc.
• En LaTeX, deberá definir los títulos y subtítulos como sección y subsección, respectivamente.

13. RESUMEN Y ABSTRACT
• Dejar 2 renglones en blanco a continuación de los datos de los autores.
• Escribir la palabra ABSTRACT, RESUMEN, RÉSUMÉ o RESUMO (de acuerdo al idioma en el que esté escrito el artículo) en mayúsculas negrita, centrada.
• Dejando un renglón, escribir el texto del resumen, que será un párrafo de a lo sumo 150 palabras en el idioma que corresponda.
• Este texto deberá describir brevemente los principales contenidos del artículo y evitar las citas bibliográficas.
• Dejar 2 renglones en blanco a continuación del texto del resumen.
• Escribir la palabra ABSTRACT (si el artículo está escrito en español, francés o portugués) o RESUMEN (si el artículo está escrito en inglés) en mayúsculas negrita, centrada.
• Dejando un renglón, escribir la traducción del RESUMEN, RÉSUMÉ o RESUMO al inglés en el primer caso, o la traducción al español del ABSTRACT de más arriba.
• Si el artículo está escrito en inglés, se deberá presentar el RESUMEN en español.

14. PALABRAS CLAVE
Después del RESUMEN y del ABSTRACT, dejando un renglón, deberá escribirse respectivamente Palabras clave y Keywords en negrita itálica y, dejando un renglón, deberá escribir una lista de tres a seis palabras que se utilizarán para clasificar el artículo.

15. GRÁFICOS Y TABLAS
• Todas las tablas y los gráficos deberán tener un título y estar numerados correlativamente.
• Los títulos deberán escribirse en la parte superior izquierda de las tablas y los gráficos, en Times New Roman 10 (Word) o CMR10 (LaTeX).
• Los gráficos deberán presentarse en su forma definitiva para publicación; se recomienda no utilizar color sino matices de grises o distintas tramas. La resolución óptima para impresión es de 300 dpi. El tamaño de la imagen deberá ser un 20% mayor al que tendrá en la publicación.
• Si los gráficos o las tablas no se incluyen como parte del documento, deberán ser enviados en archivo aparte en formato Excel para Word o EPS para LaTeX.

Los títulos deberán estar en concordancia con el siguiente estilo:
• Figura 2. Perfil de la función de verosimilitud.
• Tabla 1. Distribuciones posteriores marginales.



16. ECUACIONES
Las ecuaciones deberán estar numeradas. La numeración deberá colocarse a la derecha de la ecuación.

17. CITAS DE REFERENCIAS EN EL TEXTO
Para citar un artículo en el texto, se indicará autor y año de publicación, como en los siguientes ejemplos:
• ...... the model proposed by Barnett (1969)
• The theoretical treatment provided by Fuller (1987, cap. 4)
• Bold et al. (1995) also find....

18. REFERENCIAS
• Las referencias deberán disponerse al final del artículo, en orden alfabético según apellido del autor y, para un mismo autor, en orden cronológico.
• Las partes que deberá contener una referencia son las siguientes: autor(es), año de publicación, título e información sobre la publicación.

Las referencias deberán estar en concordancia con el siguiente estilo:

THEOBALD, C. M. and MALLISON, J. R. (1978). "Comparative Calibration, Linear Structural Relationship and Congeneric Measurements". Biometrics. 34: 39-45.

FULLER, W. A. (1987). Measurement Error Models. Wiley, New York.

LINDLEY, D. V. and SMITH, A. F. M. (1972). "Bayes Estimates for the Linear Model" (with discussion). Journal of the Royal Statistical Society, Series B. 34: 1-41.


ESTADÍSTICA (2015), 67, 188 y 189, pp. 129-132 © Instituto Interamericano de Estadística

GUIDELINES FOR THE AUTHOR

ESTADISTICA is the scientific journal of the Inter-American Statistical Institute (IASI). It aims to publish contributions on themes in theoretical and applied Statistics, giving emphasis to original applications and the solution of problems of wide interest to Statisticians and Scientists. Applications papers should include a careful analysis of the problem at hand, have a clear presentation in order to contribute to the dissemination of methodology and good statistical practice, and contain an adequate interpretation of the outcomes. Applications papers may also aim to contribute to a better understanding of the scope and limitations of the methods considered. Applications papers may tackle problems in any area of interest, including public statistics, health, education, industry, finance, etc. Theoretical contributions without a corresponding application will be published if they represent a significant advance in the knowledge of the discipline at the international scale and contain a clear indication of how the methods developed may be useful for relevant applications.

This publication is registered by the following repertories: the Current Index to Statistics (CIS) of the American Statistical Association (ASA) and the Institute of Mathematical Statistics (IMS), Zentralblatt-Math, and the “Sistema Regional de Información en Línea para Revistas Científicas de América Latina, el Caribe, España y Portugal (LATINDEX)” (Regional system of online information for scientific journals of Latin America, the Caribbean, Spain and Portugal). The editorial board of Estadística is of international scope, and is composed of outstanding statisticians.

If you wish to submit a paper, please send two copies to the editor by e-mail, one of them anonymous. The editorial process is double-blind, so the name and full postal address of the author to whom further correspondence is to be sent must appear only on one of the copies. Papers will be accepted in Word, in LaTeX, or in Scientific WorkPlace.

During the refereeing process several aspects of the paper are evaluated, namely, whether or not it is considered: (a) important; (b) interesting; (c) correct; (d) original; and (e) adequate according to the profile of “Estadística”. A paper will be published in this journal when it simultaneously satisfies these five requisites.

REQUIREMENTS

1. LANGUAGES
Papers can be presented in English, Spanish, French or Portuguese.

2. SOFTWARE
Papers will be accepted in Word, in LaTeX, or in Scientific WorkPlace.

3. SIZE OF THE PAPER AND MARGINS
• Use A4 paper: 21.0 x 29.7 cm (8.26” x 11.69”).
• Use margins (upper, lower, left and right) of 2.5 cm (1.0").



4. FONT
Papers in Word shall be written in Times New Roman 11, while those presented in LaTeX shall use Roman 12 pt (CMR12).

5. JUSTIFICATION
Except for the main title, the authors’ identification, and the word abstract, which shall be centered, the paper shall be left and right justified. The secondary titles, as well as the subtitles, shall be left justified.

6. SPACING
• The spacing shall be single throughout the paper, including the main title, the authors’ identification and the abstract.
• Exactly one blank line shall be left before the authors’ identification, Key words, section titles, sub-titles, Acknowledgements, Notes, References, and Appendices.
• Exactly two blank lines shall be left before the abstract.
• Exactly one blank line shall be left before and after tables and figures.
• Exactly one blank line shall be left between paragraphs.

7. EMPHASIS
Use only italics (not underline nor bold) to highlight parts of the text. In LaTeX use Text Italic 12 pt (CMTI12).

8. INDENTATIONS
DO NOT USE indentations.

9. PAGE NUMBERING
The pages SHALL NOT be numbered in the final version.

10. HEADING, FOOTER AND FOOTNOTES
• The use of footers, headings, and footnotes shall be avoided in the text.
• In case the use of footnotes is absolutely necessary, they shall be identified with numeric supra-indices in the order they appear in the text.
• The footnotes shall be written together, after the References.

11. AUTHORS
• The names of the authors shall be centered and written in CAPITAL LETTERS.
• The institutional affiliations of the authors, in italic lower case letters, and the contact information, in regular lower case letters, shall also be centered.
• A blank line shall be left between the title and the authors’ names.

12. TITLES AND SUB-TITLES
• In Word, the title shall be centered, written in Times New Roman 13 BOLD CAPITAL LETTERS.
• In Word, the sub-titles shall be left justified and written in bold lower case letters, for instance: Section titles, Acknowledgements, Notes, References, Appendices, etc.
• In LaTeX, the titles and sub-titles shall be defined as section and sub-section, respectively.



13. ABSTRACT AND RESUMEN
• Leave 2 blank lines following the authors’ identification.
• The word ABSTRACT, RESUMEN, RÉSUMÉ or RESUMO (according to the language in which the paper is written) shall be centered, written in bold capital letters.
• After leaving a blank line, the text in the corresponding language shall be included. This shall be a paragraph of at most 150 words.
• This text shall briefly describe the main contents of the paper, avoiding the use of bibliographic references.
• Leave 2 blank lines following the text.
• The word ABSTRACT (if the paper is written in Spanish, French or Portuguese) or RESUMEN (if the paper is written in English) shall be centered, written in bold capital letters.
• After leaving a blank line, the translation of the RESUMEN, RÉSUMÉ or RESUMO into English in the first case, or the translation of the above ABSTRACT into Spanish, shall be included.

14. KEY WORDS
After the ABSTRACT and RESUMEN, leaving a blank line, respectively write Key words and Palabras clave (in bold italics). Then, leaving a blank line, write a list of three to six words that will be used to classify the paper.

15. GRAPHS AND TABLES
• All tables and graphs shall have a title and be sequentially numbered.
• Titles shall be written in the upper left part of tables and graphs, in Times New Roman 10 (Word) or CMR10 (LaTeX).
• The graphs shall be presented in their final form for publication. It is recommended not to use colors but different gray shades or different patterns. The optimal resolution for printing is 300 dpi. The size of the image shall be 20% larger than the size for the final publication.
• In case the graphs or tables are not included as a part of the document, they shall be sent in a separate file in Excel format for Word or EPS for LaTeX.

Titles shall be in accordance with the following style:
• Figure 2. Profile of the likelihood function.
• Table 1. Posterior marginal distributions.

16. EQUATIONS
Equations shall be numbered. The number shall be written to the right of the equation.

17. REFERENCES IN THE TEXT
To refer to a paper in the text, the author and year of publication shall be indicated, as in the following examples:
• ...... the model proposed by Barnett (1969)
• The theoretical treatment provided by Fuller (1987, chap. 4)
• Bold et al. (1995) also find....

18. REFERENCES
• The references shall be placed at the end of the paper, in alphabetical order by the names of the authors and, for the same author, in chronological order.
• References shall include the following: author(s), year of publication, title, and information on the publication.

References shall be presented in accordance with the following style:

THEOBALD, C. M. and MALLISON, J. R. (1978). "Comparative Calibration, Linear Structural Relationship and Congeneric Measurements". Biometrics. 34: 39-45.

FULLER, W. A. (1987). Measurement Error Models. Wiley, New York.

LINDLEY, D. V. and SMITH, A. F. M. (1972). "Bayes Estimates for the Linear Model" (with discussion). Journal of the Royal Statistical Society, Series B. 34: 1-41.


ESTADÍSTICA (2015), 67, 188 y 189, pp. 133 © Instituto Interamericano de Estadística

MIEMBROS AFILIADOS DEL IASI
AFFILIATED MEMBERS OF IASI

Argentina
Instituto Nacional de Estadística y Censos (INDEC)
Universidad Nacional de Tres de Febrero

Brasil
Instituto Brasileiro de Geografia e Estatística (IBGE)

Canada
Statistics Canada

Costa Rica
Instituto Nacional de Estadística y Censos (INEC)

Chile
Instituto Nacional de Estadísticas (INE)
Instituto de Estadística, Universidad Austral de Chile (UACH)

Jamaica
Statistical Institute of Jamaica

México
Instituto Nacional de Estadística y Geografía (INEGI)

Panamá
Instituto Nacional de Estadística y Censo (INEC), Contraloría General de la República
Caja de Seguro Social

Perú
Instituto Nacional de Estadística e Informática (INEI)

United States
Bureau of the Census
Minnesota Population Center (MPC), University of Minnesota

Uruguay
Instituto Nacional de Estadística (INE)




SUBSCRIPCIONES: Pueden solicitarse a la Secretaría del Instituto Interamericano de Estadística (IASI), INEC – Contraloría General de la República, Apartado 0816-01521, Panamá, Rep. de Panamá, enviando cheque en dólares sobre un banco de los Estados Unidos o de Panamá, pagadero al Instituto Interamericano de Estadística.

Precios de las subscripciones (en US$):
Individual ............... $30.00
Institucional ........... $60.00

Las agencias de subscripciones pueden consultar por descuentos especiales.

SUBSCRIPTIONS: Orders shall be sent to the Secretariat of the Inter-American Statistical Institute (IASI), INEC – Contraloría General de la República, Apartado 0816-01521, Panama, Rep. of Panama, together with a cheque in dollars drawn on any bank of the United States or Panama, payable to the Inter-American Statistical Institute.

Subscription rates (in US$):
Individual ............... $30.00
Institutional ........... $60.00

Subscription agencies may ask for special discounts.


