Training Module B
Module B Glossary
ANOVA: ANOVA stands for analysis-of-variance. It is a collection of statistical models, and their associated procedures, in which the observed variance is partitioned into components due to different explanatory variables. In its simplest form ANOVA provides a statistical test of whether or not the means of several groups are likely to be equal.
Chi-square tests: A statistical hypothesis test in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough.
Cleaning database: A process to increase the accuracy of the data and streamline the database, by removing/correcting duplicate and wrong data in the database.
Codebook: A document used for implementing a code. It reports dictionary information such as variable names, variable labels, value labels, and missing values.
Coefficient of variation (CV): A normalized measure of dispersion of a probability distribution.
Cohort Survival Rate (CSR): The percentage of enrollees at the beginning grade or year in a given school year who reached the final grade.
Correlation: A single number that describes the degree of relationship between two variables. Correlations are useful because they can indicate predictive, and possibly causal or mechanistic, relationships.
Coverage: The extent or degree to which the entire study area is observed, analyzed, and reported by the survey.
Cross tabulation (Crosstab): A display of the joint distribution of two or more variables, usually presented as a contingency table in a matrix format. Whereas a frequency distribution provides the distribution of one variable, a contingency table describes the distribution of two or more variables simultaneously.
Data collection: A process of preparing and collecting data to keep on record, to make decisions about important issues, and to pass information on to others.
Data preparation: A process of coding, checking, editing, and organizing collected data so that it is ready for analysis.
Factor analysis: A statistical method used to describe variability among observed variables in terms of a potentially lower number of unobserved variables called factors.
Data Validation: A process of ensuring that a program operates on clean, correct and useful data.
Frequency: The number of times a value or event occurs in a set of data. A frequency analysis provides statistics and graphical displays that are useful for describing different types of variables.
Descriptive statistics: Statistics that describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.
Disaggregation: A process of breaking down and analyzing an indicator by detailed sub-categories. It also helps in understanding the degree of accuracy and the limitations of the survey.
Educational attainment: A term commonly used by statisticians to refer to the highest degree of education an individual has completed.
Estimation: Any of numerous procedures used to calculate the value of some property of a population from observations of a sample drawn from the population.
Gender Parity Index (GPI): A socioeconomic index usually designed to measure the relative access to education of males and females. In its simplest form, it is calculated as the quotient of the number of females by the number of males enrolled in a given stage of education.
Household: A basic residential unit in which economic production, consumption, inheritance, child rearing, and shelter are organized and carried out. Household is broader than family, which is a group of people related by blood or marriage such as parents and their children only.
Household survey: A process of data collection and analysis for understanding general situation and exploring specific characteristics of households or household population.
Imputation: To substitute some value for a missing data point or a missing component of a data point.
Kurtosis: A measure of the "peakedness" of the probability distribution of a real-valued random variable. Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.
Linear regression: An approach to modeling the relationship between a dependent variable denoted y and one or more explanatory variables denoted X, such that the model depends linearly on the unknown parameters to be estimated from the data.
Mean: The expected value of a random variable. For a data set, the mean is the sum of the observations divided by the number of observations.
Missing value: This occurs when no data value is stored for the variable in the current observation. Missing values are a common occurrence, and statistical methods have been developed to deal with this problem.
Nonparametric test: A statistic (a function on a sample) whose interpretation does not depend on the population fitting any parameterized distributions. Statistics based on the ranks of observations are one example of such statistics and these play a central role in many non-parametric approaches.
OLAP cube: A multidimensional database that calculates summary statistics for summary variables within categories of one or more grouping variables. The cube allows different views of the data to be quickly displayed.
Outlier identification: To identify an observation that is numerically distant from the rest of the data.
Pivot table: A data summarization tool to create output table formats. Pivot-table tools can automatically sort, count, and total the data stored in one table or spreadsheet and create a second table.
Population census: A procedure of systematically acquiring and recording information about the members of a given population. It includes information on household members, which is useful for policy making, planning, monitoring and evaluation.
Sample design: To determine what kind of people and how many people you need to interview to collect data. A decision about sample size can be made based on factors such as time available, budget and the necessary degree of precision.
Sampling: A part of statistical practice concerned with the selection of an unbiased or random subset of individual observations within a population of individuals intended to yield some knowledge about the population of concern, especially for the purposes of making predictions based on statistical inference. It applies to the design of any information-gathering exercise where variation is present.
Skewness: A measure of the asymmetry of the probability distribution of a real-valued random variable.
Standard deviation: A statistic that tells how tightly all the various examples are clustered around the mean in a set of data. In other words, it is a measure of variability.
Structured Query Language (SQL): A standard programming language used for accessing and maintaining a database. The key feature of SQL is an interactive approach for getting information from and updating a database.
Syntax: A set of rules that define the combinations of symbols that are considered to be correctly structured programs in the programming language.
T-test: A statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is true. It is commonly used to test whether the means of two groups differ significantly.
Validation rule: A criterion used in the process of data validation, which is carried out after the data has been encoded onto an input medium and involves a data vet or validation program.
Variable: A symbol that stands for a value that may vary. For instance, a variable can be used to designate a value occurring in a hypothesis of the discussion.
Visual Binning: To perform automatic creation of new variables based on grouping contiguous values of existing variables into a limited number of distinct categories. This can create categorical variables from continuous scale variables.
Wealth index: A composite measure of a household's cumulative living standard, typically calculated from data on ownership of household assets and amenities, housing characteristics, and access to services such as water and sanitation. Households are often grouped into wealth quintiles.
Weighting: A process which involves emphasizing some aspects of a phenomenon, or of a set of data.
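Two of the measures defined in this glossary, the coefficient of variation and the Gender Parity Index in its simplest form, are plain ratios. The following is a minimal sketch in Python of how they might be computed. The function names and all figures are made up for illustration, and the CV is taken here as the population standard deviation divided by the mean, which is one common convention.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed convention)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m


def gender_parity_index(female_enrolled, male_enrolled):
    """Simplest form of the GPI: females enrolled divided by males enrolled."""
    if male_enrolled == 0:
        raise ValueError("GPI is undefined when no males are enrolled")
    return female_enrolled / male_enrolled


# Made-up figures, for illustration only.
class_sizes = [30, 35, 40, 45, 50]
cv = coefficient_of_variation(class_sizes)
gpi = gender_parity_index(female_enrolled=4600, male_enrolled=5000)

print(f"CV of class sizes: {cv:.3f}")
print(f"GPI at this level of education: {gpi:.2f}")
```

A GPI below 1 in this simple form means fewer females than males are enrolled; a value near 1 suggests parity.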
Exploring Household Surveys for EFA Monitoring

Purposes and learning outcomes
• To gain a better understanding of common household surveys
• To understand the reasons for the limited use of household survey data in education planning and EFA monitoring
• To explore the added value and benefits of household survey data for education policies
• To recognize the questions in common household surveys that are directly or indirectly useful in exploring access, quality and management of education, and their determinants
• To know the key points to be aware of in analyzing data from household surveys

Contents
1. Understanding Household Surveys
   1.1 Introduction to Household Surveys
   1.2 Education Related Questions (or Modules) in Household Surveys
   1.3 Inputs from Household Surveys for Aligning Education Policies
2. Brief Information on Common Household Surveys
   2.1 Background and Objectives of Selected Surveys
   2.2 Structure and Contents of the “Survey Questionnaire”
   2.3 Consideration on Sample Design
   2.4 Understanding Survey Data Files and Availability of Education Related Data
3. Gathering Survey Data and Getting Ready for Analysis
   3.1 Data Sources and Contact Points for Obtaining Census and Survey Data
   3.2 Common Obstacles and Approaches in Gathering Population Census and Household Survey Data
   3.3 Quality Issues, Challenges and Recommendations in Using Survey Data
   3.4 Use of Survey Data along with EMIS Data/Indicators for Policy Analysis
4. Exercises and Further Studies
   4.1 Self-evaluation
   4.2 Exercises
   4.3 Further Studies
Annexes
   Annex 1: Population and Housing Census
   Annex 2: Education Related Questionnaires from Selected Household Survey
   Annex 3: Education Related Variables in the Selected Datasets
   Annex 4: List of Key EFA Indicators
1. UNDERSTANDING HOUSEHOLD SURVEYS
Most education indicators, especially school-based ones, can be derived from the annual school census or EMIS data collection system. However, EFA monitoring requires additional indicators that measure "reaching the unreached", which school data generally cannot provide. Some essential EFA indicators, such as those based on ethnic minority, disabled or illiterate populations and out-of-school children, can be derived only from household surveys.
1.1 Introduction to Household Surveys
A “household” is defined as a basic residential unit in which economic production, consumption, inheritance, child rearing, and shelter are organized and carried out. A household is broader than a family, since a family refers only to a group of people related by blood or marriage, such as parents and their children. A “household survey” is a process of data collection and analysis for understanding the general situation and exploring specific characteristics of households or the household population. The fieldwork of a household survey investigates and records the facts, observations and experience of sample households, which represent all households in the study area. Tools for data collection include a series of questions, observation checklists and records of discussions. Nowadays household surveys are conducted in almost every country and territory, either ad hoc or periodically (annually, biennially, or once every three or five years, etc.). There are different types of surveys (see Section 2).
1.2 Education Related Questions (or Modules) in Household Surveys
Two main components of a household survey
A household survey generally uses two different questionnaires: a household roster and at least one detailed or individual questionnaire.
Household roster: this lists all household members and their characteristics, such as age, sex and relationship to the head of household for every member; education and literacy status for persons aged 5 and above; schooling status for those aged 5-24 (or 6-14, 6-19, etc.); and marital status for all adults aged 15 and above.
Detailed or individual questionnaire: this explores the main theme of the study and is sometimes aimed only at specific respondents such as the head of household, married couples, mothers of children under 5, ever-married women, out-of-school children, disadvantaged children, etc.
The fieldwork (data collection) of a household survey is followed by coding, checking and editing, data entry, data verification, data analysis and drafting of the report. The majority of household surveys use SPSS (renamed as PASW) for data analysis and for the creation of tables, graphs and charts. As such, although a survey may enter data using different programs such as dBase, MS Access, MS Excel, CSPro, IMPS, etc., the final data files analyzed are available in SPSS data format.
Household survey and population census
The datasets created from household surveys and population censuses¹ normally include information on household members that is useful for policy making, planning, monitoring and evaluation in education, such as: (i) population by age and sex (and urban/rural residence in larger surveys), and by special characteristics such as ethnic minority or disability; (ii) literacy status of respondents (self-reporting) and other family members (proxy reporting); (iii) highest educational attainment of the respondent and of the population under study; and (iv) schooling status (currently attending, dropped out or never attended) of children of school-going age. Apart from the above, several household surveys can provide the migration status of household members and socio-economic characteristics of the household, such as: (v) birth place and/or place of residence five or ten years ago; (vi) number of income earners in the household; (vii) household income and expenditure (in some cases, separate health and education expenditures); (viii) possession of household amenities or durables; and (ix) food security. As such, data from household surveys and population censuses can complement school-based data² by providing information on aspects of children's background that may influence household schooling decisions and the school participation of children (such as enrollment and/or school attendance). Household surveys provide a broader variety of information, while the population census provides more accurate data on the age and sex structure and on the education and literacy attainment of the entire population.
¹ Population census is a type of household survey with broader coverage. By international agreement, a census consists of an enumeration of the entire population in the specified area, carried out regularly at a set time interval.
² The Ministry of Education, through EMIS (Education Management Information System), regularly collects school-based data and normally processes and provides limited information on the individual characteristics of pupils, such as age, sex, grade and performance (flow rates), and little information on the characteristics of their households.
1.3 Inputs from Household Surveys for Aligning Education Policies
Household surveys and population censuses can also provide data on adult educational attainment and reported literacy skill (that is, reported by the respondent) by household characteristics, such as whether the household is rich or poor, whether it is located in an urban, rural or remote area, and whether it is far from or near to a school.
Key education indicators possible to derive from surveys
The following common education indicators, which are essential in formulating and aligning education policies and in preparing, monitoring and evaluating education development programmes and projects, can be derived from common household surveys and population censuses.³
1) Adult Literacy Rate (for population aged 15 and above);
2) Youth Literacy Rate (for population aged 15-24);
3) Illiteracy rates for different population groups, especially for vulnerable groups such as females, ethnic minorities, disabled persons, and those from poor families and remote areas;
4) Educational attainment, measured by the number of years of school attended, the highest level of schooling, or the proportion of the adult population who completed primary or secondary school (adult primary and secondary school completion rates);
5) Gross and net intake rates for primary Grade 1;
6) Gross and net enrolment rates by education level or by age;
7) Transition rates (from primary to lower secondary, and lower to upper secondary level);
8) Student flow rates (promotion, repetition and dropout rates); and
9) Out of School Children.
Moreover, some other measures such as the gender parity index, cohort survival rate and measures of internal efficiency can be derived from the above indicators. One important benefit of constructing education indicators from household surveys is the “ability to compare the indicators among different population groups”, such as:
a. males vs. females;
b. ethnic minorities vs. other ethnic groups;
c. disabled persons vs. general population;
d. those living in remote areas vs. urban/rural areas;
e. families with different wealth levels (measured by quintiles of household expenditure per capita or ownership of household amenities).
Such information cannot be obtained from regular school-based data collection, and it is important in measuring the achievement of education policies and in aligning education policies for the future.
Utilization of household survey data in education
All this information is very valuable for education policy makers and planners. However, it is not fully utilized, for several reasons:
• lack of awareness of the existence and accessibility of survey data, even within the same ministry, due to bureaucratic procedures, cost, and not knowing where to find or how to request such data;
• little information on education and literacy is presented in the main report: general household survey reports contain only a few paragraphs or just a section on education;
• additional analysis of education and literacy status is very rare; and
• lack of knowledge and skill on how to capitalize on education and literacy data from surveys, particularly to facilitate evidence-based policy formulation, implementation and monitoring.
As a result, only a few researchers and consultants from international agencies use the education and literacy data from surveys, and they undertake only a few additional studies. Moreover, most such studies are academically oriented or aimed at serving the specific project purposes set by the international organization, and they seldom provide the information needed for policy recommendations. It is crucial to build the capacity of staff from the Ministry of Education and line ministries to analyze survey data, so that survey findings are reflected in and incorporated into policy formulation, programme implementation, monitoring and evaluation, including those for achieving EFA goals.⁴
³ See “Guide to the Analysis and Use of Household Survey and Census Education Data” (UIS, 2004, pp. 13-21) for a detailed framework for analysis and further discussion.
⁴ See Annex 4 for the list of key EFA indicators.
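The group comparisons described in Section 1.3 amount to computing the same indicator separately for each population group. The sketch below shows one way this might be done in Python for an adult literacy rate, assuming each adult record carries a sampling weight. The field names (`sex`, `area`, `literate`, `weight`) and all records are hypothetical and are not taken from any actual survey file.

```python
from collections import defaultdict


def weighted_rate_by_group(records, group_key, flag_key, weight_key="weight"):
    """Return {group: weighted share of records for which flag_key is True}."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for record in records:
        group = record[group_key]
        denominator[group] += record[weight_key]
        if record[flag_key]:
            numerator[group] += record[weight_key]
    return {group: numerator[group] / denominator[group] for group in denominator}


# Made-up records for adults aged 15 and above, for illustration only.
adults = [
    {"sex": "female", "area": "rural", "literate": True, "weight": 1.0},
    {"sex": "female", "area": "rural", "literate": False, "weight": 1.0},
    {"sex": "female", "area": "urban", "literate": True, "weight": 2.0},
    {"sex": "male", "area": "rural", "literate": True, "weight": 1.0},
    {"sex": "male", "area": "rural", "literate": False, "weight": 1.0},
    {"sex": "male", "area": "urban", "literate": True, "weight": 2.0},
    {"sex": "male", "area": "urban", "literate": True, "weight": 4.0},
]

by_sex = weighted_rate_by_group(adults, "sex", "literate")
by_area = weighted_rate_by_group(adults, "area", "literate")
literacy_gpi = by_sex["female"] / by_sex["male"]

for group, rate in sorted(by_sex.items()):
    print(f"Adult literacy rate, {group}: {rate:.1%}")
for group, rate in sorted(by_area.items()):
    print(f"Adult literacy rate, {group}: {rate:.1%}")
print(f"Female-to-male ratio of literacy rates: {literacy_gpi:.2f}")
```

The same function could be reused with any other grouping field, such as a wealth quintile or a disability flag, provided the survey was designed to support that level of disaggregation.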
2. BRIEF INFORMATION ON COMMON HOUSEHOLD SURVEYS
Every year, different types of household surveys are conducted for different purposes in almost every country. The three most common household surveys in this region, namely the Multiple Indicator Cluster Survey (MICS), the Demographic and Health Survey (MEASURE DHS), and the Living Standards Measurement Study (LSMS), are discussed in this section together with the population census.
2.1 Background and Objectives of Selected Surveys
Multiple Indicator Cluster Survey (MICS)
The Multiple Indicator Cluster Survey is a household survey developed by UNICEF to assist countries in filling data gaps for monitoring the situation of children and women. It is capable of producing statistically sound, internationally comparable estimates of these indicators. MICS was originally developed in response to the World Summit for Children to measure progress towards an internationally agreed set of mid-decade goals. The first round of MICS was conducted around 1995 in more than 60 countries, and the second round was conducted in 2000 (around 65 surveys). The third round of MICS was carried out from 2005 onwards (in more than 50 countries). It focused on providing a monitoring tool for the World Fit for Children, the Millennium Development Goals (MDGs), and other major international commitments, such as the United Nations General Assembly Special Session (UNGASS) on HIV/AIDS and the Abuja targets for malaria. At least 21 MDG indicators can be collected in the current round of MICS, offering the largest single source of data for MDG monitoring. Results from the surveys, including national reports, standard sets of tabulations and micro-level datasets, are available at UNICEF's web site www.childinfo.org.
Demographic and Health Survey (MEASURE DHS)
Since 1984, the Demographic and Health Survey (DHS) Project has provided technical assistance to more than 200 demographic and health surveys in 75 countries, advancing global understanding of health and population trends in developing countries. In 1997, DHS became one of four components of the “Monitoring and Evaluation to Assess and Use Results” (MEASURE) Program.⁵ The MEASURE DHS Project has gained a worldwide reputation for collecting and disseminating accurate, nationally representative data on health and population in developing countries. The project is implemented by Macro International, Inc. and is funded by the United States Agency for International Development (USAID), with contributions from other donors such as UNICEF, UNFPA, WHO and UNAIDS. Since October 2003, Macro International has been partnering with four internationally experienced organizations to expand access to and use of the DHS data: The Johns Hopkins Bloomberg School of Public Health/Center for Communication Programs; Program for Appropriate Technology in Health (PATH); Blue Raster; and The Futures Institute.
⁵ MEASURE Program: Together, the four MEASURE partners (MEASURE DHS; MEASURE Evaluation; MEASURE U.S. Census Bureau - Survey and Census Information, Leadership, and Self Sufficiency (SCILS); and MEASURE Centers for Disease Control and Prevention - Division of Reproductive Health (CDC/DRH)) provide a full range of related services, which include promoting the demand for quality data; providing technical assistance, training, systems development, data collection and analysis, and capacity-building services; and disseminating information and facilitating its use in decision-making. (See http://www.measureprogram.org/)
The DHS surveys collect information on fertility, reproductive health, maternal health, child health, immunization and survival, HIV/AIDS, maternal mortality, child mortality, malaria, and nutrition among women and children, including child stunting. The strategic objective of MEASURE DHS is to improve and institutionalize the collection and use of data by host countries for program monitoring and evaluation and for policy development decisions.
LSMS – Living Standards Measurement Study
LSMS was established by the Development Economics Research Group (DECRG) of the World Bank to explore ways of improving the type and quality of household data collected by statistical offices in developing countries. LSMS is a research project that was initiated in 1980 and has carried out several rounds of surveys in more than 30 countries. The program is designed to assist policy makers in their efforts to identify how policies could be designed and improved to positively affect outcomes in health, education, economic activities, housing and utilities, etc. Objectives of LSMS include: to improve the quality of household survey data; to increase the capacity of statistical institutes to perform household surveys; to improve the ability of statistical institutes to analyze household survey data for policy needs; and to provide policy makers with data that can be used to understand the determinants of observed social and economic outcomes. LSMS provides users with actual household survey data for analysis, as well as links to reports and research done using LSMS data.
Population Census
The oldest type of household survey with broader coverage is the “population census”. By international agreement, a census consists of an enumeration of the entire population in the specified area, carried out regularly at a set time interval. While enumerating the population, questions may be asked concerning certain characteristics of each person, such as age, sex, marital status, education, employment status, and more.
Therefore, a census basically provides data on the number and composition of the entire population at a given time, and on selected socio-economic and educational characteristics of the household population in the country. Since it is based on the complete enumeration of all households in the country, a census can provide valuable information for policies and the planning of socio-economic development from the national to the lowest administrative levels. Moreover, the census is the source for constructing sampling frames for selecting households and population for other surveys. Population censuses are carried out once every 10 years in most countries, or once every 5 years in some economically advanced countries. As such, the census is the most comprehensive source of demographic and socio-economic data for several countries. Although the main objective of a census is to get reliable population data, the latest United Nations guidelines⁶ for preparing population censuses emphasize collecting data on literacy, school attendance, educational attainment, field of study and educational qualifications.
⁶ “Principles and Recommendations for Population and Housing Censuses”, United Nations Statistical Office, 1998.
2.2 Structure and Contents of the “Survey Questionnaire”
2.2.1 Questionnaire Used in Multiple Indicator Cluster Survey (MICS)
MICS uses three main questionnaires in every survey: (i) household questionnaire, (ii) questionnaire for women aged 15-49, and (iii) questionnaire for children under the age of 5. The Household Questionnaire comprises household characteristics, household listing, education, child labor, water and sanitation, salt iodization, insecticide-treated mosquito nets (ITNs), and support to children orphaned and made vulnerable by HIV/AIDS, with optional modules for disability, child discipline, security of tenure and durability of housing, source and cost of supplies for ITNs, and maternal mortality.
A. Household Identification
B. Household Listing Form
C. Education Module
2.2.2 Questionnaire used in MEASURE DHS
Although DHS surveys aim to collect data to understand fertility; reproductive, maternal and child health; immunization, survival and nutrition; maternal and child mortality; HIV/AIDS; and malaria, the key household questionnaire includes several questions on education and its differentials. The following are extracts from the DHS Model Household Questionnaire.
A. Household Identification
B. Listing of all Household Members - 1
C. Listing of all Household Members - 2
2.2.3 Questionnaire used in Living Standards Measurement Survey (LSMS)
LSMS is a comprehensive survey. Its questionnaire set contains (i) household, (ii) community and (iii) price questionnaires. The household questionnaire extends over 100 pages, covering 15 sections including education. The education section of the LSMS questionnaire has three parts on four pages, as follows:
LSMS Working Paper 130 "Model Living Standards Measurement Study Survey Questionnaire for the Countries of the Former Soviet Union" by Raylynn Oliver.
2.2.4 Population and Housing Censuses
As mentioned above, a census covers each and every person in the country, and is the most reliable source of population data. The household roster used in censuses contains basic information on all household members, such as age, sex, marital status, education and literacy status, together with household characteristics such as location and type of residence, and availability of services. The Viet Nam 2009 Population and Housing Census questionnaire includes the following questions on the education and literacy status of the entire population. Combined with the age, sex, residence, migration and disability status recorded in other questions, literacy, educational attainment, and participation in and access to education can be analyzed for different population groups.
For a further case study, please refer to Annex 1.
2.3 Consideration on Sample Design
A census is based on all households in the study area (a region, a territory or a country). Therefore, the entire household population is included in data collection. During the census-taking process there might be some non-responding households, but these are comparatively few and generally negligible. Since it is a complete enumeration, a census does not require a sample design, and the data and indicators derived from the census are actual values, not estimates. On the other hand, a household survey collects data from selected households in the area and provides estimates (of the characteristics or indicators) for the entire household population in the area, based on the experience of the sample households. That is, not all the households in the study area are selected in a survey. The quality (accuracy of the estimates) and the usefulness of a household survey depend on the following points.
i) Sampling method (how the sample households are selected): common sampling methods include SRS (Simple Random Sampling), PPS (Probability Proportional to Size), cluster sampling, multi-stage sampling, and purposive sampling.
ii) Coverage (whether the entire study area is covered by the survey): to represent the entire area, sample households must be selected from all households in the area (country or region) using a random sampling method. Some household surveys select only from households with specific characteristics (e.g. poultry farmers) or only from pre-assigned parts of the area (e.g. households beyond a 3-mile radius from a school).
iii) Sample size (how many households are selected) and allocation of samples (how the sample households are allocated to different parts of the area); and
iv) Data analysis: how to get estimates (values) of the key indicators, perceived standard errors of estimates, and the pre-determined level of disaggregation (e.g. by age, sex, grade, region, socio-economic status, etc.).
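Of the sampling methods named in point (i), PPS is perhaps the least self-explanatory. The sketch below illustrates one common variant, systematic PPS selection of clusters, in Python. It is an illustration under stated assumptions and not the procedure of any particular survey: the function name, the village names and the household counts are all hypothetical.

```python
import random


def pps_systematic_sample(sizes, n, u=None):
    """Select n units with probability proportional to size (systematic PPS).

    sizes: dict mapping unit name -> measure of size (e.g. number of households)
    n:     number of selections to make
    u:     random start as a fraction in [0, 1); drawn at random if omitted

    A unit larger than the sampling interval can be selected more than once.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if u is None:
        u = random.random()
    total = sum(sizes.values())
    interval = total / n
    # Selection points are spaced one interval apart after a random start.
    points = [(u + k) * interval for k in range(n)]

    selected = []
    cumulative = 0
    index = 0
    for name, size in sizes.items():
        cumulative += size
        # Every selection point falling inside this unit's range selects it.
        while index < n and points[index] < cumulative:
            selected.append(name)
            index += 1
    return selected


# Made-up household counts per village, for illustration only.
villages = {"Village A": 100, "Village B": 300, "Village C": 200, "Village D": 400}
selected = pps_systematic_sample(villages, n=2, u=0.5)
print(selected)
```

Larger villages cover a wider stretch of the cumulative total, so a selection point is more likely to fall inside them. This is what makes the selection probability proportional to size.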
The sample design of a household survey covers the four points listed above (sampling method, coverage, sample size and allocation, and data analysis), and it is generally part of the survey report. For data users (secondary analysts) it is important to know the sampling method and sample size of the study before making any analysis. Estimates will be less accurate if they are not calculated in line with the sampling method of the survey. Similarly, the survey method and the way the sample households were allocated are essential in deciding whether and which weights should be applied in data analysis. Moreover, the actual coverage of the survey, the sample size and the set level of disaggregation help the data user to understand the limitations of the survey, including whether the desired disaggregation is appropriate at the required degree of accuracy. Example: in a survey designed to give reliable estimates down to the provincial level by sex, estimates of the adult illiteracy rate computed by district and sex for adults living in remote areas with the lowest socio-economic status (lowest quintile) will not be reliable. On the other hand, some surveys are designed to capture specific and rare events. In such a survey, the sample size is large and thus sufficient to estimate common education indicators at lower levels with acceptable accuracy. The data analyst should first check the sample design through the accompanying documents, such as the survey report or service contract, and/or through contact persons at the survey organization.
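The point of the example above, that estimates for very small subgroups are unreliable, can be made concrete with a rough calculation. The sketch below uses the textbook standard error of a proportion, inflated by a design effect to allow for cluster sampling. The design effect of 2.0, the illiteracy rate of 20% and both sample sizes are assumptions chosen purely for illustration; a real analysis should use the survey's own design information.

```python
import math


def proportion_standard_error(p, n, design_effect=1.0):
    """Approximate standard error of an estimated proportion p based on n cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(design_effect * p * (1.0 - p) / n)


def confidence_interval_95(p, n, design_effect=1.0):
    """Approximate 95% confidence interval, clipped to the range [0, 1]."""
    se = proportion_standard_error(p, n, design_effect)
    return max(0.0, p - 1.96 * se), min(1.0, p + 1.96 * se)


# Assumed figures: an adult illiteracy rate of 20%, estimated once from a
# provincial sample and once from a very small district-level subgroup.
rate = 0.20
cells = {
    "province, by sex": 1600,
    "district, remote, lowest quintile, by sex": 25,
}

for label, n_cases in cells.items():
    low, high = confidence_interval_95(rate, n_cases, design_effect=2.0)
    print(f"{label}: n = {n_cases}, 95% CI roughly {low:.1%} to {high:.1%}")
```

With only a few dozen cases the interval becomes so wide that the estimate says very little, which is why the set level of disaggregation in the sample design matters.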
2.4 Understanding Survey Data Files and Availability of Education Related Data
This section highlights the education related variables in the main datasets of three common household surveys and sample outputs on selected variables.
Education Related Variables in MICS Sample Dataset
In the MICS sample dataset, four SPSS data files are generated for: (i) households, (ii) individual household members, (iii) women aged 15-49, and (iv) children under 5. MICS datasets are shared with a wide range of users. The second data file, for all individual household members (the household listing, hl.sav), contains the education and literacy status of the population, including school-age children. The sample “hl.sav” data file contains 183 variables for 29,560 cases (persons), and the following 21 variables are useful for analyzing education and literacy.
Variable   Label
HH1        Cluster number
HH2        Household number
HL1        Line number
HL3        Relationship to the head
HL4        Sex
HL5        Age
HL6        Area (urban / rural)
ED2        Ever attended school
ED3A       Highest level of sch. attended
ED3B       Highest grade at level
ED4        Currently attending school (2004-05)
ED5        Days attended school in last week
ED6A       Level of education attended
ED6B       Grade of education attended
ED7        Attended school last year (2003-04)
ED8A       Level of education attended last year
ED8B       Grade of education attended last year
melevel    Mother's education
helevel    Education of HH head
hhweight   Household sample weight
wlthind5   Wealth index quintiles
The following tables, which are useful in analyzing the schooling status of children aged 5-14, are derived from the sample data file “hl.sav”.
Please see Annex 1 for more case studies.
3. GATHERING SURVEY DATA AND GETTING READY FOR ANALYSIS
Although population census and household survey datasets are rich in information, they are difficult to obtain and sometimes hard to understand. This section discusses the contact points and gives some tips on how to get quality data from different sources.
3.1 Data Sources and Contact Points for Obtaining Census and Survey Data
Population Census: Censuses are conducted regularly, every five or ten years, and cover the entire country. Complete census databases are confidential and are not shared with the public or third-party users. However, government education departments can request subsets of those databases once the census reports have been fully published. Census databases are normally maintained by the Census Bureau, Census Department or Central (or General) Statistical Office of the country. Alternatively, if the Ministry of Education identifies the required population and education-related data in tabular form and requests them through higher-level authorities (ministerial level), the census authorities will generate and provide the requested tables.
The major drawback of census data is the long lag time. It takes over a year to complete clean census databases, and the census reports are published two to three years after the census. As such, the Ministry of Education can obtain education-related datasets at least two years after the census. There may also be a long delay in providing requested database subsets or tables. Therefore, few education ministries use census databases; most request only population data, especially projections of the school-age population.
Household surveys: They are available more frequently than population censuses. Moreover, the conducting agencies are willing to share their datasets upon a simple formal request.
With a smaller workload, conducting agencies can create survey databases faster, and most reports are available within twelve months after completion of the fieldwork (data collection). Access to datasets varies by survey and from country to country. All major household surveys conducted or sponsored by international organizations have their own websites. Please refer to “Further Studies” for more information.
3.2 Common Obstacles and Approaches in Gathering Population Census and Household Survey Data
As mentioned above, population censuses and household surveys contain useful data for EFA monitoring. However, there are limitations.
Common obstacles in gathering population census databases
i) It is difficult to locate the person (or department) who has the authority to provide census datasets to third-party users.
ii) The census questionnaire is often developed without coordination with other ministries and departments, including the education ministry, so the questionnaire items may not be directly useful for constructing education indicators.
iii) A census is normally conducted once every 10 years, and the census data may be obtained only 2 to 3 years after completion of the census. Thus, census data are more useful for reviewing historic trends than for revealing the current situation and status.
iv) Census data are collected during the school holiday. The census date rarely coincides with the beginning of the school year, which is the reference date for calculating common education indicators. As such, there may be minor discrepancies between the indicators calculated from the census and those from regularly collected service statistics.
Tips: How to get census data faster and more smoothly for analysis?
i) When seeking census data, it is better to make contact at the ministerial level. An approach to the census department or agency by a lower-level education planner may lead to a long wait, day after day, without ever receiving a proper response.
ii) Limit the number of variables in the requested dataset. By requesting only the data needed to meet the minimum requirements, education planners may get a faster response and can conduct analyses more easily. Census datasets are very large, and subsetting or analyzing them takes time if many unused variables are included.
In many countries, very few household survey questionnaires are developed by education-related ministries and agencies. The survey questionnaires are set by the conducting agency and merely distributed to the education ministry for comment or for information. Compared to population census data, however, household survey data are easier for education ministries to obtain.
Main barriers in using household survey data for EFA monitoring [7]
i) Variation in measures of educational participation
Survey questions on educational attainment and current school attendance are phrased quite differently from survey to survey. In many cases, assumptions have to be made in calculating common education indicators. For example, a survey asks (1) the highest grade completed by household members, and (2) whether the person is currently attending school. To calculate the net enrollment rate (NER) or gross enrollment rate (GER) from these questions, an assumption is required about the level/grade currently attended by the household member: if a child has completed Grade 4 and currently attends school, the child is assumed to be attending Grade 5.
ii) Timing and duration of survey fieldwork
[7] This portion is extracted from: “Guide to the Analysis and Use of Household Survey and Census Education Data (UIS, 2004)”.
When considering education data from household surveys, it is important to check the timing (when the survey started, or the date to which it refers) and the duration (how long the survey took to complete data collection). If a survey started just before the end of the school year and took over a month, the grade completed or attended may differ from household to household depending on whether the interview was conducted in the early or later days of the survey. This may not be a problem for surveys that set a clear reference date, as population censuses do.
iii) Sample size and sampling method
A household survey is designed to provide facts on, or characteristics of, the population at a certain period through a representative sample of households. The representativeness of the sample depends on the survey design, which is influenced by three factors: the sampling method used, the level of accuracy sought in the estimates for various indicators, and the level of data disaggregation. Some surveys, especially rapid assessments and case-control studies, do not use probability sampling techniques, and thus the findings may not represent the entire population under study. Surveys aiming to estimate common characteristics with moderate accuracy require a smaller sample size, while estimating a rare characteristic (or event) with higher accuracy requires a larger sample size. Similarly, estimating at the national (and provincial) level only requires a smaller sample size, while finer sub-stratification (such as district or lower level) needs a larger sample size. Therefore, it is important to check which sampling method was used in the survey under study, and whether the sample size is sufficient for the particular education indicators at the desired level of disaggregation. EFA monitoring indicators generally aim to explore the differences among population groups, such as the general population and disadvantaged groups.
The sample size of a particular household survey may or may not be sufficient to compute indicators for the disadvantaged group living in a certain area, depending on the definition of “disadvantaged population” and the level of disaggregation. If the sample size is not sufficient for the required disaggregation, it is recommended either to reduce the level of disaggregation, or to compute the required indicators at the desired disaggregation level and present the results with a clear note on their limited reliability.
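The assumption described under barrier i) can be sketched as follows: when a survey records only the highest grade completed and whether the person currently attends school, the grade currently attended is taken to be the completed grade plus one. The field names, the official primary age range (6-10), the primary grades (1-5) and the sample records are all hypothetical; a real analysis would take them from the survey codebook and the national education structure.

```python
PRIMARY_AGES = range(6, 11)    # assumed official primary school ages: 6 to 10
PRIMARY_GRADES = range(1, 6)   # assumed primary grades: 1 to 5


def assumed_current_grade(grade_completed, attending):
    """Return the grade assumed to be attended now, or None if not attending."""
    if not attending:
        return None
    return grade_completed + 1


def net_enrolment_rate(persons):
    """Weighted primary NER (%) among persons of official primary age."""
    in_age = [p for p in persons if p["age"] in PRIMARY_AGES]
    total_weight = sum(p["weight"] for p in in_age)
    enrolled_weight = sum(
        p["weight"]
        for p in in_age
        if assumed_current_grade(p["grade_completed"], p["attending"]) in PRIMARY_GRADES
    )
    return 100 * enrolled_weight / total_weight


# Hypothetical household member records
sample = [
    {"age": 7, "grade_completed": 1, "attending": True, "weight": 10},
    {"age": 9, "grade_completed": 4, "attending": True, "weight": 10},   # assumed to be in Grade 5
    {"age": 10, "grade_completed": 5, "attending": True, "weight": 12},  # assumed Grade 6, beyond primary
    {"age": 8, "grade_completed": 0, "attending": False, "weight": 12},  # not attending
    {"age": 13, "grade_completed": 5, "attending": True, "weight": 10},  # outside primary age
]

print(f"Primary NER: {net_enrolment_rate(sample):.1f}%")
```

The plus-one rule misclassifies repeaters, so results based on it are best reported together with the assumption that was made.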
3.3 Quality Issues, Challenges and Recommendations in Using Survey Data
Generally speaking, data files made available for analysis should be “cleaned”. These files will have been checked for structural and range errors and edited for internal consistency. Provisions that compensate for non-response should also be incorporated into the files and fully explained in the accompanying documentation. The first step after acquiring a dataset is to become familiar with its structure and the nature of its variables, the circumstances of data collection, and any limitations on the use of the dataset. The documentation for a census or household survey, such as reports and a codebook, will provide important background information, such as sample size and data quality indicators. Data manipulation and analysis can be demanding and complex. The following discussion does not provide a comprehensive set of guidelines for the use of datasets; instead, it reviews some key issues to be considered in analyzing survey data.
(1) Become familiar with the structure of the dataset and explore appropriate ways to analyze it
First, find out whether records within the data files are at the household or individual level, and second, whether household or individual weights should be used in estimation procedures. Since sample surveys do not collect data from the entire population (all households or all individuals) in an area, weighting factors are required to reconstitute the characteristics of the entire population from the sample. For example, if a survey selects 5 households from each of two enumeration areas (EAs) of 50 and 60 households respectively, the household weight for each of the 5 sample households is 10 in the first EA (50/5) and 12 in the second EA (60/5). The weights are calculated while planning the survey, and are provided in the dataset. [8]
(2) Study the variables in the datasets before analysis
It is important to refer to the original questionnaires to understand the variables better and to decide how to analyze the data.
For example, to analyze the literacy status of the population, one should know the nature of the variable, such as: its codes (for example, '1=literate', '2=illiterate'); restrictions (whether the question was asked of all ages, or of those aged 5+ or 15+); relationship to other questions/variables (whether it was asked of everybody, or only of those persons who answered 'no education' or 'incomplete primary' in the question on “highest education level”); and missing values (code '9') and non-response (code '8' for the variable “literacy status”). Only then can the data analyst determine which variables to select and how to handle them to produce the required indicator estimates efficiently.
(3) Replicate published results before proceeding with additional calculations
If there are reports of results from the data collection activity, try to replicate these results before calculating any new indicators. Sorting out difficulties with calculations that have already been done will bolster confidence in producing new results.
(4) Consider the issue of missing values
Non-response in a survey or census can happen in one of two ways. In the first, the entire record representing an individual or household is missing because the individual or household refused to answer, was not available, could not be contacted, etc.; this is called “total non-response”. The second type arises when variables within a record are missing, and is termed “item non-response”. Item non-response is common for variables representing questions that were not asked of, or not known for, all household members, such as whether a child attends school during the current school year.
[8] For detailed explanations of weighting, see C-E Särndal et al. (1992) “Model Assisted Survey Sampling”, Springer-Verlag; and W.G. Cochran (1977) “Sampling Techniques”, John Wiley & Sons.
A technique called “imputation” is often used to compensate for missing values in the case of item non-response. Imputation replaces missing values with the most suitable ones based on other cases in the same dataset. The resulting file, complete or “square”, allows better estimates when constructing new indicators. Therefore, the data analyst must know how item missing values were treated in the dataset. In the case of total non-response, a weight adjustment method is often used: non-response records are omitted from the dataset and the weights are recalculated. In this case, the dataset contains two sets of weights, the “sample weight” and the “adjusted/final weight”, and users must employ the final weight in calculating indicators.
(5) Calculate measures of accuracy (coefficient of variation) of the basic estimates to gauge the reliability of the estimated indicators
Depending on the overall sample size of the survey, some tabulations may yield cells with very small numbers of cases. Indicators estimated from those tables may not be reliable. For this reason, it is essential to calculate some measure of accuracy and to disseminate it alongside the basic estimate, so that the reliability of all estimates produced can be gauged. A good rule of thumb in this regard is to use the coefficient of variation (CV), defined as the square root of the variance of the estimate divided by the estimate itself and multiplied by 100, that is, expressed as a percentage. National statistical offices often advocate basic quality guidelines under which estimates with CVs greater than 35% should not be used to draw statistical inferences and should not be released to the public. Be sure to account properly for complex survey designs in analysis, particularly when calculating variances. In general, national population censuses collect data on all households and individuals in the population, and thus sample design and weighting are not at issue.
The only exception is when a different questionnaire with more detailed questions is presented to a sampled fraction of the population. Even then, complex survey designs are not an explicit issue, since simple and self-weighting designs (such as Stratified Simple Random Sampling or Systematic Sampling) are generally used. In the case of complex survey designs, forming the estimate itself (for example, the primary school net enrolment rate (NER)) is not an issue, since the design is easily taken into account by applying the survey weights in the estimator. However, there may be critical issues in variance estimation and thus in CV estimation. [9]
[9] See “Guide to the Analysis and Use of Household Survey and Census Education Data (UIS, 2004, pp 36-37)” for further discussion on issues concerning weighting and calculation of CV in complex sample designs.
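The weighting example in (1) and the CV definition in (5) can be sketched together as follows. The weights reproduce the two enumeration areas from the text (50 and 60 households, 5 sampled in each); the attendance counts and the variance value are hypothetical, and for a real complex design the variance would have to come from a design-based estimation method rather than being supplied by hand.

```python
import math


def design_weight(households_in_ea, households_sampled):
    """Number of households each sampled household represents."""
    return households_in_ea / households_sampled


def coefficient_of_variation(estimate, variance):
    """CV in percent: square root of the variance over the estimate, times 100."""
    return 100 * math.sqrt(variance) / estimate


# Two EAs of 50 and 60 households, 5 households sampled in each (from the text)
weights = {"EA1": design_weight(50, 5), "EA2": design_weight(60, 5)}

# Hypothetical counts: sampled households with a child attending school
sampled = {"EA1": 5, "EA2": 5}
attending = {"EA1": 4, "EA2": 3}

weighted_attending = sum(weights[ea] * attending[ea] for ea in weights)
weighted_total = sum(weights[ea] * sampled[ea] for ea in weights)
estimate = 100 * weighted_attending / weighted_total  # weighted percentage

# Hypothetical variance of the estimate, in squared percentage points
assumed_variance = 16.0
cv = coefficient_of_variation(estimate, assumed_variance)

print(f"weights: {weights}")
print(f"estimate: {estimate:.1f}%, CV: {cv:.1f}%")
print("usable under the 35% guideline:", cv <= 35)
```

The weighted total equals the 110 households in the two areas, which is a quick check that the weights reconstitute the population.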
3.4 Use of Survey Data along with EMIS Data/Indicators for Policy Analysis
Administrative and household survey data sources measure educational participation in different ways. Administrative data are based on school reporting at the beginning of the school year, and in some cases also at the middle or end of the school year. Enrolment rates are based on the numbers of children enrolled in school and the school-age population estimated from national censuses and/or vital statistics. Household surveys, ideally, collect data on enrolment and/or school attendance from a representative sample of children. Questions concerning children's school participation are typically asked of the head of household. The timing varies from one survey to another and is unrelated to the school year. Some surveys may even span two different school years.
Limitation of data
Estimates of educational participation from these two sources may differ for a number of reasons. One major factor is that the question asked in household surveys about children's school attendance differs from that answered by school censuses: attending school may differ slightly from being enrolled in school. Children may be recorded in school enrolment records and not actually attend school. Thus, the enrolment rates from the census and surveys may be slightly lower than those from the administrative data. The different rates of participation can also be attributed to the timing of data collection relative to the school year. A school census conducted at the beginning of the school year and a household survey collecting data at the end of the school year will likely find different rates of participation, since some children will have enrolled in school without ever actually attending, and other children will have dropped out during the school year.
In addition, the accuracy of the population estimate and the completeness of school-level data can affect the calculation of participation rates from administrative data. Similarly, the completeness of the census enumeration and the sample design of the household survey may affect the accuracy of estimates produced by censuses and surveys. In short, many factors may contribute to variations in the estimates of school participation rates from administrative data and household surveys. Further research is needed to explore the reasons for similarities or differences between the measures of participation from these two sources. However, when the school-age population estimates are not accurate and annual school censuses do not cover several aspects essential for planning and monitoring, only the population census and household surveys can provide reasonable indicators for planning and EFA monitoring. For example, school administrative data cannot provide enrolment rates by socio-economic status of the household or for disadvantaged groups, and cannot provide reasons for non-participation (not enrolled) or dropping out. As such, it is important to use both school administrative data and secondary data from censuses and surveys for policy analysis, especially for EFA monitoring aimed at reaching the unreached.
4. EXERCISES AND FURTHER STUDIES
4.1 Self-evaluation
How well do you understand why household survey data are essential in EFA monitoring and evaluation?
   Very well / Somewhat well / Not so much / Almost None
Do you know which common household surveys are conducted in your country?
   Very well / Somewhat well / Not so much / Almost None
Do you agree that the selected questions in three common household surveys are directly or indirectly useful in exploring access, quality and management of education, and their determinants?
   Strongly agree / Agree / Not so much / Disagree
Are you able to explain the factors to be aware of in analyzing household survey data to someone who wants to analyze survey data?
   Very well / Somewhat well / Not so much / Almost None
Are you confident that you could explore a household survey questionnaire and extract key questions which are useful to supplement the regular data collection system for EFA monitoring and evaluation?
   Confident / Somewhat confident / Not so much / Not at all
4.2 Exercises
i) When was the last population census conducted in your country?
   a. Get the census report or tables which may be useful for EFA monitoring.
   b. Provide pros and cons of using data from census report(s) for EFA monitoring.
   c. Get the census questionnaire and extract the items on education and those related to education.
   d. Is it possible to get raw data on education and related fields from the Census Department? Why or why not?
ii) What is the most recent household survey conducted in your region (or country)? Describe the following briefly.
   a. When was it conducted?
   b. Which sampling method was applied?
   c. What was the sample size?
   d. Explain briefly the survey findings on education and literacy provided in the report.
   e. Is the data file (dataset) from that household survey available to you?
iii) Connect to the internet and find the MICS website for your country, then:
   a. Collect the questionnaire set for the most recent MICS survey in your country (or in a neighboring country).
   b. Download datasets in SPSS format from the most recent MICS survey for your country (or for a neighboring country).
   c. Study the variables, and compile a list of variables which you think are useful for constructing education indicators, especially for EFA monitoring.
iv) From the DHS website, find a recent report (if possible for your country) and prepare an abstract which is useful for education planners.
v) If you had a chance to discuss it, what would you want to add to or delete from the LSMS survey questionnaire, and why?
4.3 Further Studies
International Household Survey Network (See http://www.internationalsurveynetwork.org )
Luxembourg Income Study (See http://www.lisproject.org/)
MEASURE DHS (Demographic and Health Surveys): Quality information to plan and improve population, health, and nutrition programs (See http://www.measuredhs.com/)
Rand Family Life Survey ( See http://www.rand.org/labor/FLS/ )
UNESCO Institute for Statistics (UIS). 2004. Guide to the Analysis and Use of Household Survey and Census Education Data (Can be downloaded at http://www.uis.unesco.org/template/pdf/educgeneral/HHSGuideEN.pdf )
UNICEF. Childinfo: Monitoring the Situation of Children and Women (Multiple Indicator Cluster Survey) ( See http://www.childinfo.org/)
United Nations Department of Economic and Social Affairs. 2008. Principles and Recommendations for Population and Housing Censuses, Revision 2. (See http://unstats.un.org/unsd/publication/SeriesM/Seriesm_67rev2e.pdf )
United Nations Population Fund. Collecting and using data: population and housing data (See http://www.unfpa.org/data/census.cfm )
United Nations Statistics Division (See http://unstats.un.org/unsd/default.htm )
USAID's DHS EdData Activity website (See http://www.dhseddata.com/ )
World Bank. Living Standards Measurement Study (LSMS) (See http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/EXTLSMS/0,,menuPK:3359053~pagePK:64168427~piPK:64168435~theSitePK:3358997,00.html )
Other organizations with links to education data sources
The William Davidson Institute http://www.wdi.bus.umich.edu/
The Development Gateway http://www.ids.ac.uk/eldis/health/health.htm
University of California http://biko.sscnet.ucla.edu/dev_data/
Country case studies
Nepal Living Standards Survey 2002/03 (See http://siteresources.worldbank.org/INTLSMS/Resources/3358986-1181743055198/3877319-1181925143929/nlss2_urban.pdf)
General Population Census of Cambodia 2008 (See http://www.nis.gov.kh/nis/uploadFile/pdf/EnumeratorManual.pdf) (Household questionnaire refer to p65)
Vietnam 2009 Population and Housing Census (See http://www.gso.gov.vn )
2005 Population and Housing Census of Korea (See http://kostat.go.kr )
Tanzania poverty monitoring ( See http://www.povertymonitoring.go.tz/index.asp )
5. ANNEXES
Annex 1: Population and Housing Census
A1.1 2005 Population and Housing Census of Korea:
This includes just two education items in one question. Even from such limited data, the education and literacy status of the population and the schooling status of children can be studied by age, sex, residence, etc.
General Population Census of Cambodia 2008:
This contains the following literacy, education and disability status in the main questionnaire.
Therefore, it is apparent that all population censuses include from a limited number to several questions on education and literacy status of entire population.
Annex 2: Education Related Questionnaires from Selected Household Survey A2.1
Household questionnaire of the Nepal Living Standards Survey [10] 2002/03:
This contains a section on education covering (i) literacy, (ii) past enrolment and (iii) current enrolment, as follows:
[10] NLSS is an alternative name for the LSMS.
Annex 3: Education Related Variables in the Selected Datasets A3.1
Nepal’s 2006 DHS Dataset
The dataset from the 2006 Nepal DHS contains seven SPSS data files: (i) Births Recode, (ii) Couples' Recode, (iii) Household Recode, (iv) Individual Recode, (v) Children's Recode, (vi) Male Recode, and (vii) Household Member Recode. The last data file, NPPR51FL.SAV (for the individual household members; 44,057 persons x 258 variables), contains all necessary information except for one important differential of access to and attainment of education, the “wealth index” (households grouped into five quintiles based on wealth). The wealth index can be obtained from the third data file, for the households. The selected variables from NPPR51FL.SAV are:
Variable   Label
HV001      Cluster number
HV002      Household number
HV003      Respondent's line number
HV005      Sample weight
HV024      Region
HV025      Type of place of residence
HV026      Place of residence
HV104      Sex of household member
HV105      Age of household members
HV106      Highest educational level
HV107      Highest year of education
HV108      Education in single years
HV109      Educational attainment
HV121      Member attended school during current school-year
HV122      Educational level during current school-year
HV123      Grade of education during current school-year
HV124      Education in single years - current school-year
HV125      Member attended school during previous school-year
HV126      Educational level during previous school-year
HV127      Grade of education during previous school-year
HV128      Education in single years - previous school-year
HV129      School attendance status
From the above variables, the following frequency tables could be constructed for the children aged 5-14.
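A minimal sketch of how such a weighted table could be produced from the listed variables is given below, using plain Python on a few made-up records. DHS sample weights are conventionally stored as integers with six implied decimal places, so HV005 is divided by 1,000,000 before use. The code values (HV104: 1 = male, 2 = female; HV121: 1 = attending) are assumptions for illustration and should be checked against the dataset's codebook; the records themselves are hypothetical.

```python
from collections import defaultdict


def weighted_attendance_by_sex(records, min_age=5, max_age=14):
    """Weighted percentage attending school, by sex, for a given age range."""
    totals = defaultdict(float)
    attending = defaultdict(float)
    for record in records:
        if not (min_age <= record["HV105"] <= max_age):
            continue  # keep only children in the target age range
        weight = record["HV005"] / 1_000_000  # rescale the stored integer weight
        totals[record["HV104"]] += weight
        if record["HV121"] == 1:  # assumed code for "attended during current school-year"
            attending[record["HV104"]] += weight
    return {sex: 100 * attending[sex] / totals[sex] for sex in totals}


# Hypothetical household member records
records = [
    {"HV104": 1, "HV105": 6, "HV121": 1, "HV005": 1_500_000},
    {"HV104": 1, "HV105": 12, "HV121": 0, "HV005": 500_000},
    {"HV104": 2, "HV105": 9, "HV121": 1, "HV005": 1_000_000},
    {"HV104": 2, "HV105": 14, "HV121": 1, "HV005": 1_000_000},
    {"HV104": 2, "HV105": 30, "HV121": 0, "HV005": 1_000_000},  # adult, excluded
]

table = weighted_attendance_by_sex(records)
for sex, percent in sorted(table.items()):
    print(f"sex code {sex}: {percent:.1f}% attending")
```

The same pattern extends to other breakdowns, such as region (HV024) or place of residence (HV025), by changing the grouping key.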
Albania's 2005 LSMS Dataset
The 2005 Albania LSMS covered 3,638 households comprising 17,302 persons. The survey datasets are available on the LSMS website. Since the LSMS questionnaire covers several topics and items, the datasets are split into several files. The datasets directly concerned with education are educationa_cl.sav (for preschool education), educationb_cl.sav (for general education and literacy), and household_rostera_cl.sav (for age and sex).
The selected variables from those datasets are:
Variable   Label
hhid       household identifier
m2b_q00    ID code
m1a_q02    Sex
m1a_q5y    Age - Years
m2b_q01    Can read newspaper
m2b_q02    Can write personal letter
m2b_q04    Highest level
m2b_q05    Highest Grade
m2b_q07    Years of preschool
m2b_q09    Currently attending school
m2b_q10    Reason for not attending
m2b_q14    Intends to return to school
m2b_q16    Current level
m2b_q17    Current Grade
m2b_q18    Public - Private
m2b_q20    Distance from dwelling
m2b_q22    Hours to travel
m2b_q23    Minutes to travel
m2b_q24    Transport to school
m2b_q49    Absent from school
m2b_q50    Days missed
m2b_q51    Reason missed school
From the above variables, literacy (read and write) and schooling status for the children aged 7-14 could be analyzed as seen in the following tables:
Annex 4: List of Key EFA Indicators Goal 1: ECCE
(H) (S) S (S) (S) (S) (H) (S)
Goal 2: UPE
H H H H (H) (H) (H) (H) (H) (H) (H)
Goal 3: Lifelong learning
(S) S S S S S S S S S S S S S S S S S
H H (H)
S (S) (S) (S)
1. Gross Enrolment Ratio (GER) in ECCE programmes
2. Percentage of new entrants to primary Grade 1 who have attended some form of organized ECCE programme
3. Enrolment in private ECCE centres as a percentage of total enrolment in ECCE programmes
4. Percentage of trained teachers in ECCE programmes
5. Public expenditure on ECCE programmes as a percentage of total public expenditure on education
6. Net Enrolment Ratio (NER) in ECCE programmes including pre-primary education
7. Pupil/Teacher Ratio (PTR) (child-caregiver ratio)
8. Gross Intake Rate (GIR)
9. Net Intake Rate (NIR)
10. Gross Enrolment Ratio (GER)
11. Net Enrolment Ratio (NER)
12. Percentage of repeaters
13. Repetition Rate (RR) by grade
14. Promotion Rate (PR) by grade
15. Dropout Rate (DR) by grade
16. (Cohort) Survival Rate to Grade 5
17. Primary Cohort Completion Rate
18. Transition Rate (TR) from primary to secondary education
19. Percentage of trained teachers in primary education
20. Pupil/Teacher Ratio (PTR) in primary education
21. Public expenditure on primary education as a percentage of total public expenditure on education
22. Percentage of schools offering complete primary education
23. Percentage of primary schools offering instruction in the mother tongue
24. Percentage distribution of primary school students by duration of travel between home and school
25. Number and percentage distribution of the adult population by educational attainment
26. Number and percentage distribution of young people aged 15-24 years by educational attainment
27. Gross Enrolment Ratio (GER) for technical and vocational education and training
28. Number and percentage distribution of lifelong learning/continuing education centres and programmes for young people and adults
29. Number and percentage distribution of young people and adults enrolled in lifelong learning/continuing education programmes
30. Number and percentage distribution of teachers/facilitators in lifelong learning/continuing education programmes for young people and adults
Note: H: Household surveys (H): If collected by Household surveys
S: School records and school censuses (S): If collected from ECCE centers and NFE centers
Goal 4: Adult literacy
Goal 5: Gender equality
Goal 6: Quality of Education
31. Adult literacy rate (15 years old and above) 32. Youth literacy rate (15-24 years old) (S) 33. Public expenditure on adult literacy and continuing education as a percentage of total public expenditure on education (S) 34. Number and percentage distribution of adult literacy and basic continuing education programmes (S) 35. Number and percentage distribution of facilitators of adult literacy and basic continuing education programmes (S) 36. Number and percentage distribution of learners participating in adult literacy and basic continuing education programmes (S) 37. Completion rate in adult literacy and basic continuing education programmes (S) 38. Number and percentage of persons who passed the basic literacy test (S) 39. Ratio of private (non-governmental) to public expenditure on adult literacy and basic continuing education programmes H S 40. Female enrolled as percentage of total enrolment S 41. Female teachers as percentage of total number of teachers S 42. Percentage of female school managers/district education officers 43. Gender Parity Index for: (H) a. Adult literacy rate (15 years old and above) (H) b. Youth literacy rate (15-24 years old) (H) (S) c. GER in ECCE H S d. GIR in primary education H S e. NIR in primary education H S f. GER in primary education H S g. NER in primary education H S h. Survival rate to Grade 5 H S i. Transition Rate from primary to secondary education H S j. GER in secondary education H S k. NER in secondary education S l. Percentage of teachers with pre-service teacher training S m. Percentage of teachers with in-service teacher training S 44. Percentage of primary school teachers having the required academic qualifications S 45. Percentage of school teachers who are certified to teach according to national standards S 46. Pupil/Teacher Ratio (PTR) S 47. Pupil/Class Ratio (PCR) S 48. Textbook/Pupil Ratio (TPR) S 49. Public expenditure on education as a percentage of total government expenditure S 50. 
Percentage of schools with improved water sources S 51. Percentage of schools with improved sanitation facilities S 52. Percentage of pupils who have mastered nationally defined basic learning competencies S 53. School life expectancy S 54. Instructional hours
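Several of the indicators listed above are simple ratios. As a minimal sketch, assuming the usual definitions (the Gender Parity Index as the female value of an indicator divided by the male value, and the Pupil/Teacher Ratio as pupils per teacher), they could be computed as follows. The function names and the figures are hypothetical and are used for illustration only.

```python
def gender_parity_index(female_value, male_value):
    """Ratio of the female value to the male value of an indicator."""
    if male_value == 0:
        raise ValueError("male value must be non-zero")
    return female_value / male_value


def pupil_teacher_ratio(pupils, teachers):
    """Average number of pupils per teacher."""
    if teachers == 0:
        raise ValueError("number of teachers must be non-zero")
    return pupils / teachers


# Hypothetical figures, not taken from any survey.
gpi_for_ner = gender_parity_index(84.0, 90.0)   # female NER 84%, male NER 90%
ptr = pupil_teacher_ratio(1200, 40)             # 1,200 pupils, 40 teachers

print(round(gpi_for_ner, 2), ptr)
```

A GPI below 1 indicates a disparity in favour of males, and a GPI above 1 a disparity in favour of females.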
Introduction to PASW Statistics (SPSS for Windows)
Purpose and Learning Outcomes:
To provide background on popular statistical analysis software packages
To understand why SPSS/PASW is chosen as the statistical software for assisting EFA monitoring
To practice installing PASW
To explore the basic features and components of PASW
To understand how to import data from other sources into PASW
Contents:
1. Selecting Example Software for Analyzing Household Survey Data to Assist EFA Monitoring
1.1 CSPro (Census and Survey Processing System)
1.2 EPI Info
1.3 Microsoft EXCEL (with VBA Programming)
1.4 PSPP
1.5 SAS (Statistical Analysis System)
1.6 Stata
1.7 SPSS (Statistical Package for Social Sciences)
2. Introduction to PASW Statistics
2.1 What is SPSS/PASW Statistics?
2.2 Step-by-Step Procedure for PASW Statistics Installation
2.3 Running PASW and Its User Interface
3. Basic Components of PASW Statistics
3.1 Output Viewers
3.2 Pivot Tables
3.3 Charts
3.4 Saving/Exporting Outputs
3.5 Online Help
4. Using Data from Other Sources
4.1 Importing Data from Microsoft Excel
4.2 Importing Data from Delimited ASCII Text Files
4.3 Importing Data from Fixed Width Text Files
4.4 Importing Data from Microsoft Access Databases
5. Tips and Exercises
5.1 Tips: Do and Don’t
5.2 Self-evaluation
5.3 Questions and Hands-on Exercises
1. SELECTING EXAMPLE SOFTWARE FOR ANALYZING HOUSEHOLD SURVEY DATA TO ASSIST EFA MONITORING
More than 100 statistical software packages can be found on the web. Some of these packages can be run only online; some are free or in the public domain while the rest are proprietary; some are dedicated to a special field while others are general purpose. It is impossible to review all packages, and difficult to select example software for this module. Therefore, this section reviews seven of the most widely used packages.
1.1 CSPro (Census and Survey Processing System)
CSPro is a public domain statistical package which can be used for entering, editing, tabulating, and mapping census and survey data. It is widely used by statistical agencies in developing countries, especially for data entry (fixed-width text file format). It was designed and implemented through a joint effort among the developers of the Integrated Microcomputer Processing System (IMPS) and the Integrated System for Survey Analysis (ISSA): the United States Census Bureau, Macro International, and Serpro S.A. CSPro was designed to replace both IMPS and ISSA. The current version of CSPro is 4.0.003, released on 20 October 2009.
There are four key applications (together with several useful utilities) in the CSPro 4.0 application package:
1) A Data Entry Application contains a set of forms (screens) and logic that a data entry operator uses to key data into a file; it can be used to add new data or to modify existing data. Users can create an unlimited number of forms (screens) for data entry, normally as part of the data entry application.
2) A Batch Edit Application can be used to gather information about a data file, together with several run-time features including: writing editing rules for checking validity (values in a variable) and consistency (between variables/cases) and for modifying data values; making imputations and generating imputation statistics; generating edit reports automatically or creating a customized report; and creating additional variables.
3) A Tabulation Application contains a set of table specifications (structure) and a data dictionary (an existing or newly defined one) describing the data file to be tabulated. This application can cross-tabulate variables and produce map results by geographical area (if applicable) using both existing variables and new variables created "on the fly". Output tables can contain selected statistics, from simple counts and percents to mean, median, mode, standard deviation, variance, n-tiles, proportions, minimum, and maximum. Tabulations can be made on the values as they are in the data file or by applying weights.
4) A Data Dictionary describes the overall organization of a data file, that is, it provides a description of how data are stored in the data file. The data dictionary is at the heart of CSPro applications. It must be created for each file being used.
One of the excellent features of CSPro is that it requires only very simple and minimal hardware resources to run. The minimum configuration includes (i) a 33MHz 486 processor, (ii) 16MB of RAM, (iii) a VGA monitor, and (iv) Microsoft Windows 98SE (the program runs only on the Microsoft Windows family of operating systems). It is public domain software and can be downloaded at no cost.
All in all, CSPro is the most suitable software for conducting data entry and initial analyses for general surveys and population censuses. It is widely used in current DHS surveys. However, every data file must have a data dictionary, even for simple data analysis such as constructing frequency tables for selected variables.
Therefore, it is not suitable to analyze a dataset created in other software (or datasets without predefined data dictionary).
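The role of a data dictionary, and of validity and consistency rules, can be illustrated outside CSPro. The following is a minimal sketch in Python, not CSPro syntax: the variable names, column positions, allowed values and the consistency rule are all hypothetical and are chosen only to show the idea of reading a fixed-width record through a dictionary and then checking it.

```python
# Hypothetical data dictionary: variable name -> (start column, width).
# Columns are counted from 1, as is usual in fixed-width layouts.
DATA_DICTIONARY = {
    "hh_id": (1, 4),
    "age": (5, 2),
    "sex": (7, 1),
    "attending_school": (8, 1),
}

# Hypothetical validity rules: the set of allowed codes for each variable.
VALID_VALUES = {"sex": {1, 2}, "attending_school": {1, 2}}


def parse_record(line, dictionary=DATA_DICTIONARY):
    """Cut one fixed-width line into named integer fields (blank -> None)."""
    record = {}
    for name, (start, width) in dictionary.items():
        field = line[start - 1:start - 1 + width].strip()
        record[name] = int(field) if field else None
    return record


def check_record(record):
    """Return a list of validity and consistency problems for one record."""
    problems = []
    # Validity: each coded variable must hold one of its allowed values.
    for name, allowed in VALID_VALUES.items():
        if record[name] not in allowed:
            problems.append(f"{name}: invalid value {record[name]!r}")
    # Consistency (hypothetical rule): a child under 3 should not be
    # reported as attending school (code 1).
    if (record["age"] is not None and record["age"] < 3
            and record["attending_school"] == 1):
        problems.append("age/attending_school: inconsistent")
    return problems


for line in ["00011221", "00020211", "00034592"]:
    print(line, check_record(parse_record(line)))
```

Without the dictionary, the line "00011221" is only a string of digits. This is why a dataset that arrives without a predefined data dictionary cannot be analyzed directly in this way.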
1.2 EPI Info
“Epi Info” is public domain statistical software for epidemiology, developed since 1985 by the Centers for Disease Control and Prevention (CDC) in Atlanta, Georgia (USA). It is designed for the global community of public health practitioners and researchers. The first version, Epi Info 1, was an MS-DOS batch file on 5.25" floppy disks released in 1985. It was developed on the MS-DOS platform until Epi Info 2000, the first Windows-based version. Starting from Epi Info 2000, data were stored in the Microsoft Access database format, rather than the text file format used in the MS-DOS versions. In recent years, Windows Vista was supported in version 3.5.1, released on August 13, 2008, and an open source version, Epi Info 7, was released on November 13, 2008, with its source code available for download. The current versions provide easy form and database construction, data entry, and analysis with epidemiologic statistics, maps, and graphs. The primary applications within Epi Info are:
MakeView: to create forms and questionnaires, which automatically creates a database;
Enter: to enter data into the database through the forms and questionnaires created in MakeView;
Analysis: to produce statistical analyses of data, report output and graphs;
EpiMap: to develop GIS maps overlaid with survey data; and
Epi Report: to combine analysis output, entered data and any data contained in Access or SQL Server, and present it in a professional format. The generated reports can be saved as HTML files for easy distribution or web publishing.
Although “Epi Info” is a CDC trademark, the programs, documentation, and teaching materials are in the public domain and may be freely copied, distributed, and translated. A 2003 analysis documented 1,000,000 downloads from more than 180 countries, and the manual and/or programs have been translated from English into 13 additional languages. One of the most attractive features of Epi Info is that it supports all steps from developing a questionnaire to analyzing data and creating a tailor-made report. First, the user must develop a questionnaire with Epi Info's "MakeView". Based on that questionnaire, one can customize the data entry process, enter data into the database (which was created when the questionnaire was developed), and finally analyze the data. For epidemiological uses, such as outbreak investigations, being able to rapidly create an electronic data entry screen and then analyze the collected data immediately can save considerable amounts of time compared with paper surveys. As such, it is one of the best software packages for survey developers and researchers, especially for epidemiological research and surveys. However, it is not easy to analyze a dataset created in other software, which is the main theme of this module.
1.3 Microsoft EXCEL (with VBA Programming)
Microsoft Excel (full name Microsoft Office Excel), a component of Microsoft Office, is Microsoft's spreadsheet application for both the Windows and Mac OS X operating systems. Excel was first released in 1985 for Mac OS, and the first Windows version followed in November 1987. Microsoft Excel has been the most widely used spreadsheet application since the release of Version 5 in 1993. The most recent commercial versions are Microsoft Office Excel 2007 for Windows and 2008 for Mac.
Key features of Microsoft Excel include calculation, graphing tools, pivot tables (or OLAP cubes) and a macro programming language, Visual Basic for Applications (VBA). It can also carry out several database management functions, including support for SQL (Structured Query Language) and Network DDE (Dynamic Data Exchange), which allows spreadsheets on different computers to exchange data.
Since the 1993 version, Microsoft Excel has supported programming through Microsoft's Visual Basic for Applications (VBA). VBA is based on Visual Basic and adds the ability to automate tasks in Excel and to provide user-defined functions (UDF) for use in worksheets. Moreover, programming with VBA allows spreadsheet manipulation that is impossible with standard spreadsheet techniques. Programmers may write VBA code directly using the Visual Basic Editor (VBE). Alternatively, users can record VBA code that replicates their actions on the spreadsheet, allowing simple automation of regular tasks.
Through VBA, a programmer can access a database (or dataset) that is placed on a spreadsheet or stored in other files (created in non-Excel formats). Visual Basic modules can then be written for constructing frequency and crosstab tables, calculating different statistics, and carrying out transformation, sorting, selection and formatting. The results, intermediate or final, can be written back to a spreadsheet or saved in a separate file.
The most attractive feature of Microsoft Excel is its wide accessibility as a component of Microsoft Office. Microsoft Excel is one of the most frequently used software packages, since almost every computer-literate person can use it easily. On the other hand, only a few users are familiar with VBA, pivot tables and database functions, which are essential for analyzing household survey data for EFA monitoring. Nevertheless, Microsoft Excel is the most suitable software for adding the final touches to statistical output tables produced by other software, such as modifying a table format and adding graphs and charts.
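The kind of module described above, which builds frequency and crosstab tables from a dataset, can be sketched briefly. The example below is written in Python rather than VBA, purely as an illustration of the logic; the records, variable names and function names are hypothetical.

```python
from collections import Counter

# Hypothetical household survey records (one dictionary per person).
records = [
    {"sex": "female", "attending": "yes"},
    {"sex": "female", "attending": "no"},
    {"sex": "male", "attending": "yes"},
    {"sex": "male", "attending": "yes"},
    {"sex": "female", "attending": "yes"},
]


def frequency_table(rows, variable):
    """Count how often each value of one variable occurs."""
    return Counter(row[variable] for row in rows)


def crosstab(rows, row_var, col_var):
    """Count how often each pair of values of two variables occurs."""
    return Counter((row[row_var], row[col_var]) for row in rows)


print(frequency_table(records, "sex"))
print(crosstab(records, "sex", "attending"))
```

A frequency table describes one variable at a time, while the crosstab counts combinations of two variables, which is the same distinction a pivot table makes in Excel.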
1.4 PSPP
PSPP is a free, open-source alternative to the proprietary statistics package SPSS. It is an application for the analysis of sampled data, and it has both a graphical user interface and a conventional command line interface. It is written in C, uses the GNU Scientific Library for its mathematical routines, and uses "plotutils" for generating graphs. PSPP has been distributed since 1998, and the most recent version (0.6.2) was released on 11 October 2009.
PSPP provides basic, but very useful, statistical analyses such as constructing frequency and crosstab tables; conducting non-parametric tests, significance tests and reliability tests; fitting different linear regression models; factor analysis; and computing basic statistics. It also provides some database management features such as sorting and selecting cases, computing new variables, recoding into existing and new variables, and more. Users can select outputs (tables and graphics) in ASCII, pdf, postscript or html formats. Some graphs such as histograms, pie-charts and np-charts can also be generated.
PSPP can open SPSS data files and is able to import data from Gnumeric, OpenDocument and Microsoft Excel spreadsheets, databases, comma-separated text files and ASCII text files. It can save data files in the SPSS 'portable' file format (*.por), the SPSS 'system' file format (*.sav) and ASCII text file format. Some of the libraries used by PSPP can be accessed programmatically; PSPP-Perl provides an interface to the PSPP libraries.
The program file and manual can be downloaded from "http://www.gnu.org/software/pspp/". The program can be installed freely and used without limitations. However, its documentation and help system are not very useful for beginners.
1.5 SAS (Statistical Analysis System)
SAS is an integrated system of software products from "SAS Institute Inc.". SAS enables programmers (users) to perform many different kinds of analysis, data management and output generating functions, such as:
- data entry, retrieval, management, and mining
- report writing and graphics
- statistical analysis
- business planning, forecasting, and decision support
- operations research and project management
- quality improvement
- applications development
- data warehousing (extract, transform, load)
- platform independent and remote computing
In addition, SAS has many business solutions that enable large scale software solutions for areas such as IT management, human resource management, financial management, business intelligence, customer relationship management and more.
SAS is driven by SAS programs that define a sequence of operations to be performed on data stored as tables. SAS Library Engines and Remote Library Services allow access to data stored in external data structures and on remote computer platforms. SAS functions via application programming interfaces, in the form of statements and procedures. A SAS program is composed of three major parts, namely (a) the DATA step, (b) procedure steps, and (c) a macro language. The DATA step handles identifying the file structure, reading and writing records, and closing the file. All other tasks are accomplished by procedures in the procedure steps. Procedures are not restricted to the built-in ones; they allow extensive customization, controlled by mini-languages defined within the procedures. SAS also has an extensive SQL procedure, allowing SQL programmers to use the system with little additional knowledge. The macro programming extensions allow the use of "open code" macros or the interactive matrix language SAS/IML component. Macro code in a SAS program undergoes preprocessing.
At runtime, DATA steps are compiled and procedures are interpreted and run in the sequence in which they appear in the SAS program. A SAS program requires the SAS software to run. SAS consists of a number of components, which require separate licenses and installations. SAS runs on IBM mainframes, Unix machines, OpenVMS Alpha, and Microsoft Windows, and code can be moved almost transparently between these environments. SAS requires extensive programming knowledge, and it is the most expensive and comprehensive statistical analysis software.
1.6 Stata
The name "Stata" is formed from letters of the words "statistics" and "data". It is a general-purpose statistical software package with a full range of capabilities including data management, statistical analysis, graphics, simulations, and custom programming. It is used by many businesses and academic institutions around the world. Most of its users work in research, especially in the fields of economics, sociology, political science, and epidemiology. Stata was first commercialized in 1985 by StataCorp, and in recent years a new major version has been released roughly every two years. The most recent version is Stata 11, distributed on 27 July 2009. There are four major builds of each version of Stata:
- Stata/MP: for multiprocessor computers (including dual-core and multi-core processors)
- Stata/SE: for large databases
- Stata/IC: the standard version
- Small Stata: a smaller student version, for educational purchase only
Stata emphasizes a command-line interface to facilitate replicable analyses, although a graphical user interface (that is, menus and dialog boxes that facilitate access to built-in commands) has been available since Stata 8. It allows opening one dataset at a time for review and editing in spreadsheet format, but the dataset must be closed before other commands are executed. Stata holds the entire dataset in memory while working, which limits its use with extremely large datasets. The dataset is always rectangular in format, that is, all variables hold the same number of observations (although some entries may be missing values).
Stata's proprietary file formats are platform independent, so users of different operating systems can easily exchange datasets and programs. Stata's data format has changed over time, although not every Stata release includes a new dataset format. Every version of Stata can read all older dataset formats, and can write both the current and the most recent previous dataset format.
Thus, the current Stata release can always open datasets that were created with older versions, but older versions cannot read newer format datasets. Stata can read and write SAS XPORT format datasets natively, and it can import data from ASCII formats (CSV or fixed-width) and spreadsheet formats (including various Microsoft Excel formats). Only a few other econometric applications can directly import data in Stata file formats.
One advantage of using Stata is that both datasets and programs are independent of the operating system. Another advantage is that user-written commands can be used together with built-in commands. Several useful commands are available for download from the internet (these command files are called ado-files). Stata's version control system is designed to give a very high degree of backward compatibility, ensuring that code written for previous releases continues to work in newer versions.
One difficulty in using Stata is that it requires a thorough understanding of its command line interface and basic commands. It seems that only those with extensive programming experience could learn to use Stata on their own. That is, tailor-made training may be required for beginners before they can work effectively with Stata.
1.7 SPSS (Statistical Package for Social Sciences)
SPSS is one of the most popular data analysis software packages, offering various statistical methods and procedures. SPSS was first developed in 1968 at Stanford University for internal use only (see the brief history of SPSS/PASW Statistics in Section 2.1 of this module). Starting from March 2009, the name SPSS was changed to PASW Statistics (Predictive Analytics SoftWare)1.
The recent versions of SPSS/PASW Statistics can handle multiple datasets with an almost unlimited number of variables and cases. It allows importing and exporting of data and outputs in different formats, including Microsoft Excel and various text formats. Both a graphical interface driven by menus and dialog boxes and a command line (syntax) interface are available to users. It is the most user-friendly statistical software for beginners doing basic analysis. It offers excellent online help, complete users' manuals and self-learning tutorials. The package covers almost all statistical methods required, from basic to advanced analysis, along with good data management and data documentation. It has also been found that the vast majority of household surveys were analyzed with SPSS and/or that final survey datasets are available in SPSS (*.sav) format.
For these reasons, PASW Statistics is chosen as the example software to demonstrate household survey data analysis for EFA monitoring purposes in this module. At the same time, because it is more widely available and more familiar to the intended users of this module, Microsoft Excel is also selected as another example software, especially for finalizing outputs and for presentation purposes.
Disclaimer
UNESCO does not recommend using any particular software. PASW Statistics and Microsoft Excel are used only as the “example” software in this module. Software is just a tool to assist in exploring EFA monitoring indicators from household survey datasets, and users can choose any statistical software. The review and selection of the statistical software are based solely on the limited experience of the author of this module. They do not reflect UNESCO's view or perspective. Several facts were obtained from the user manuals of the software concerned, and from Wikipedia, the free web-based encyclopedia.
Recently, PASW Statistics has been changed to IBM SPSS Statistics after becoming part of IBM in late 2009.
2. INTRODUCTION TO PASW STATISTICS
Statistical Package for the Social Sciences (SPSS) was the first comprehensive data analysis software available on personal computers. Its original SPSS user's manual has been described as “Sociology's most influential book”.
2.1 What is SPSS/PASW Statistics?
Brief History
In 1968 at Stanford University, Norman H. Nie, a social scientist and doctoral candidate, C. Hadlai (Tex) Hull, who had just completed a master of business administration, and Dale H. Bent, a doctoral candidate in operations research, developed a software system based on the idea of using statistics to turn raw data into information essential to decision-making. This statistical software system was called SPSS, the Statistical Package for the Social Sciences, which is the root of present-day PASW, the Predictive Analytics Software.
Nie, Hull and Bent developed SPSS out of the need to quickly analyze volumes of social science data gathered through various methods of research. Nie represented the target audience and set the requirements; Bent had the analysis expertise and designed the SPSS system file structure; and Hull programmed. The initial work on SPSS was done at Stanford University with the intention of making it available only for local consumption. With the launch of the SPSS user's manual in 1970, demand for the SPSS software took off. Moreover, the original SPSS user's manual has been described as “Sociology's most influential book”2.
With growing demand and popularity since 1970, a commercial entity, SPSS Inc., was formed in 1975. Up to the mid-1980s SPSS was available only on mainframe computers. With the advance of personal computers in the early 1980s, SPSS/PC was introduced in 1984 as the first statistical package to appear on a PC running on the MS-DOS platform. Similarly, the first statistical product on the Microsoft Windows (version 3.1) operating system was again SPSS, which was released in 1992.
Versions of SPSS in Recent Years
SPSS is updated regularly to keep pace with, and to exploit the advanced features of, new operating systems, and to meet the growing needs of users.
SPSS 16.0.2 - April 2008
SPSS Statistics 17.0.1 - December 2008
PASW Statistics 17.0.2 - March 2009 (PASW = Predictive Analytics SoftWare)
PASW Statistics 18.0.1 (or) IBM SPSS Statistics 18.0.1 - August 2009
PASW is just an enhancement and renaming of SPSS; not even the version numbering was restarted.
SPSS Users
At the beginning, SPSS users were limited to academic researchers, mostly around large universities with mainframe computers. Because of its relatively high price, its use of tough security systems and its limited user-friendliness, SPSS did not have many users in the early days of SPSS/PC+. The use of SPSS has increased rapidly since the release of SPSS for Windows, which is user-friendly and more readily available (a fully functional evaluation version with a specified trial period can be downloaded easily).
2 Wellman, B.; Doing it ourselves, Pp 71-78 in Required Reading: Sociology's Most Influential Books. Edited by Dan Clawson, University of Massachusetts Press, 1998, ISBN 9781558491533
Moreover, the cost of obtaining an SPSS/PASW license is minimal for students, and it is within a reasonable range for members of corporations and organizations. Yet PASW Statistics is still expensive for general users. Nowadays, its users include market researchers, health researchers, survey companies, government, education researchers and marketing organizations.
Strengths of SPSS/PASW Statistics
In addition to superb statistical analysis, PASW offers good data management (case selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is stored with the data). PASW data files are portable (smaller in size compared to other database systems) and its program (PASW syntax) files are quite small.
Organization of PASW Statistics (SPSS) Software Package
PASW is organized as a base system plus optional components or modules. Most of the optional components are added on to the base system. However, some optional components, such as Data Entry, work independently. The base system, the main component for running PASW, has the following functions:
Data handling and manipulation: importing from and exporting to other data file formats, such as Excel, dBase, SQL and Access, and allowing sampling, sorting, ranking, subsetting, merging, and aggregating of data sets;
Basic statistics and summarization: Codebook, Frequencies, Descriptive statistics, Explore, Crosstabs, Ratio statistics, Tables, etc.;
Significance testing: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), and Nonparametric tests; and
Inferential statistics: Linear and non-linear regression; Factor, Cluster and Discriminant analysis.
Some of the optional components (add-on modules) available in version 17.0 are:
Data Preparation provides a quick visual snapshot of your data. It provides the ability to apply validation rules that identify invalid data values. You can create rules that flag out-of-range values, missing values, or blank values.
You can also save variables that record individual rule violations and the total number of rule violations per case. A limited set of predefined rules that you can copy or modify is provided.
Missing Values describes patterns of missing data, estimates means and other statistics, and imputes values for missing observations.
Complex Samples allows survey, market, health, and public opinion researchers, as well as social scientists who use sample survey methodology, to incorporate their complex sample designs into data analysis.
Regression provides techniques for analyzing data that do not fit traditional linear statistical models. It includes procedures for probit analysis, logistic regression, weight estimation, two-stage least-squares regression, and general nonlinear regression.
Advanced Statistics focuses on techniques often used in sophisticated experimental and biomedical research. It includes procedures for general linear models (GLM), linear mixed models, variance components analysis, loglinear analysis, ordinal regression, actuarial life tables, Kaplan-Meier survival analysis, and basic and extended Cox regression.
Custom Tables creates a variety of presentation-quality tabular reports, including complex stub-and-banner tables and displays of multiple response data.
Forecasting performs comprehensive forecasting and time series analyses with multiple curve-fitting models, smoothing models, and methods for estimating autoregressive functions.
Categories performs optimal scaling procedures, including correspondence analysis.
Conjoint provides a realistic way to measure how individual product attributes affect consumer and citizen preferences. With Conjoint, you can easily measure the trade-off effect of each product attribute in the context of a set of product attributes - as consumers do when making purchasing decisions.
Exact Tests calculates exact p values for statistical tests when small or very unevenly distributed samples could make the usual tests inaccurate. Available only on Windows OS.
Decision Trees creates a tree-based classification model. It classifies cases into groups or predicts values of a dependent (target) variable based on values of independent (predictor) variables. The procedure provides validation tools for exploratory and confirmatory classification analysis.
Neural Networks can be used to make business decisions by forecasting demand for a product as a function of price and other variables, or by categorizing customers based on buying habits and demographic characteristics. Neural networks are non-linear data modeling tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.
EZ RFM performs RFM (recency, frequency, monetary) analysis on transaction data files and customer data files.
Amos™ (analysis of moment structures) uses structural equation modeling to confirm and explain conceptual models that involve attitudes, perceptions, and other factors that drive behavior.
Another version of PASW, PASW Server, is also available. It is built on a client/server architecture and has some features not available in the normal version, such as scoring functions.
2.2 Step-by-Step Procedure for PASW Statistics Installation
First, the user must either have the PASW Statistics software package with an official license or install an evaluation version with a 21-day trial period. In this manual, the evaluation version of PASW Statistics 17.0 for Windows is used for demonstration. The system requirements for installing PASW Statistics 17.0 are:
Operating System: Microsoft Windows 7, Vista, XP or 2000
System Requirements: Intel Pentium-compatible processor, 256MB RAM, 700MB free disc space, VGA monitor, and Internet Explorer 6.0 or above
Follow these steps to install the evaluation version of PASW Statistics 17.0:
Step 1: Check Installed SPSS Versions
Make sure no older version is already installed. If a previous version exists, please uninstall it before starting the installation process.
Step 2: Insert Installation CD and Run “PASW_Statistics_1702_win_en.exe”
Insert the Installation CD and open the “PASW 17.0 for Windows” folder. Double-click the file named “PASW_Statistics_1702_win_en.exe”; the “PASW InstallShield Wizard” will then begin extracting the contents automatically.
Step 3: Follow the “InstallShield Wizard” until the Installation Is Successfully Completed
When asked to choose the license type, select “Single user license” and click Next to continue to the license agreement. Select I accept the terms in the license agreement and click Next to continue.
Immediately, a dialog window with additional information for the users will appear. Read the information and click Next to continue. Fill in “User Name” and “Organization” accordingly and click Next to continue.
Leave serial number blank to install evaluation version!
A window will pop up asking for the location (folder) in which to save the program files. It is strongly recommended to accept the default location and just click “Next” to proceed.
Locate where to install
The PASW InstallShield Wizard will again ask for confirmation before beginning the installation. Click Install to start the installation, or Back to review and change the installation settings. As soon as the “Install” button is clicked, the PASW installation begins. It takes just a few minutes. During installation, do not press a key or click the mouse buttons, since this may interrupt the work.
When installation is complete, the Wizard will ask you to register PASW.
1. Click OK to begin the registration process.
2. Select “Enable a temporary trial period” and click Next.
3. Click the browse button.
4. Select the trial license file “trial.txt” and click Open to get the trial license file.
5. Click Next to continue; the next window will confirm that the trial period has been enabled.
6. Click Finish to complete the installation of PASW Statistics 17.0 with a 21-day trial period.
At this point, the installation of PASW Statistics 17 is successfully completed.
2.3 Running PASW and Its User Interface
After successful installation, a program group called “PASW Statistics 17” will be placed under “SPSS Inc.” in the “Start Menu”. There will be at least two items in the menu: 1) PASW Statistics 17, and 2) PASW Statistics 17 License Authorization Wizard. More items may be displayed in the menu, depending on which optional components (add-on modules) have been installed.
2.3.1 Starting and Ending a PASW Session
To start PASW, click the “PASW Statistics 17” menu item as follows.
Alternatively, double-click any PASW (or SPSS) data or syntax file to start PASW Statistics. In this case, the double-clicked file will also be opened in the appropriate window.
To browse and open data file not in the list
When PASW is run for the first time, a dialog window is displayed on top of the Data Editor window. This window helps users begin an initial task such as opening a data, syntax or output file, running the tutorial for beginners, entering new data, or running an existing query or creating a new query to import data from another database file. Among these, opening an existing data file, from the list or by browsing, is the most common first task in PASW Statistics. By default, up to nine of the most recently used files are listed in both “Open an existing data source” and “Open another type of file”. Both lists are empty when PASW is run for the first time. To open one of the most recently used files, double-click its name in the list, or select it and click the OK button. A data file that is not listed can be opened by double-clicking the “More Files…” item and following the steps of a regular “open file” dialog box.
If the check box in this dialog window is ticked, only the Data Editor will appear when PASW Statistics is started in future sessions. To keep the dialog window appearing in coming sessions, it is recommended simply to click the “Cancel” button to close it. In this case, a blank Data Editor window will appear.
With the “evaluation” version, the following message appears every time the program is run. It will show 21 days when PASW is used for the first time after installation.
The count will drop to 20 days on the following day, and so on. Once the trial period has ended, the PASW processor will no longer work; that is, commands will not produce any results.
Tips: Save the syntax and output files frequently!
The active PASW session will end and exit automatically if the user closes the last active dataset (or data file). Whenever PASW exits, it asks whether to save all unsaved windows, including data, output and syntax windows. PASW does not have an automatic recovery feature and there is no “undo” for data transformations. Thus, it is important to save the syntax and output files frequently. After applying any transformation or deleting any variables, data files should be saved under a different name so that the original data files are not lost.
2.3.2 Data Editor and Data Views
In PASW (and in earlier versions of SPSS), data files are displayed in the “Data Editor”. In the Data Editor, if the mouse cursor rests on a variable name (a column heading), a more descriptive label is displayed for every variable that has been defined with a label. The Data Editor has two views: “Data View” and “Variable View”.
Data View: By default, the actual data values are displayed in the cells. The ‘case numbers’ are displayed as row captions (like the ‘row numbers’ in Microsoft Excel), and the variable names as column captions. Users can choose to display descriptive value labels in the cells (for example, “Male” and “Female” instead of the codes 1 and 2) by choosing View from the menus and then clicking Value Labels, as follows:
or simply click the Labels button. Value labels make it easier to interpret the responses in the household survey.
The following shows the individual household member dataset of the Bangladesh Demographic and Health Survey 2007 in the Data View, with value labels displayed.
Relationship to HH Head
The Data View shows the cases (or observations) in rows, and each column represents a variable (a characteristic that is being measured). In the above example, each individual ‘member of selected households’ is a case, and each ‘item in the questionnaire’ is a variable. For example, ‘relationship to head of household’, ‘age’ and ‘highest education level’ are variables. Each cell contains a single data value of a variable for a case. The cell is where the case and the variable intersect: for example, if the case represents the ‘head of household’ (row 13) and the variable is ‘sex’ (HV104), the cell holds the ‘sex of the head of household’. When the actual data values are displayed, the cell shows “2”; it shows “Female” if value labels are displayed. PASW data files are stored in a flat-file format, and data cells cannot store formulas.
Variable View: This displays the metadata dictionary, in which each row represents a variable and 10 columns show the attributes (or characteristics or properties) of that variable: 1) variable name; 2) type: numeric, comma, dot, scientific notation, date, dollar, custom currency, and string; 3) variable width, i.e. number of digits or characters; 4) number of decimal places; 5) variable label; 6) value labels; 7) codes for user-defined missing values; 8) column width in the Data View; 9) cell alignment, i.e., left, right or center when displayed in the Data View; and 10) level of measurement (scale, ordinal or nominal). All attributes are saved with the data values in the file.
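The relationship between cases, variables, cells and value labels in the Data View can also be sketched outside PASW. The following Python fragment is only an illustration: the two-case toy dataset and the `cell` helper are assumptions made for this example and are not part of PASW or of the survey file. Only the variable name HV104 and the codes 1 = Male, 2 = Female are taken from the text.

```python
# Toy flat-file dataset: each row is a case, each key is a variable.
# The values are invented for illustration only.
cases = [
    {"HV104": 1},  # case 1
    {"HV104": 2},  # case 2
]

# Value labels for HV104 (sex), as in the example in the text.
value_labels = {"HV104": {1: "Male", 2: "Female"}}


def cell(case_number, variable, show_labels=False):
    """Return the cell where a case (1-based row) and a variable intersect."""
    value = cases[case_number - 1][variable]
    if show_labels:
        # Fall back to the raw value when no label is defined for it.
        return value_labels.get(variable, {}).get(value, value)
    return value


print(cell(2, "HV104"))                    # raw data value
print(cell(2, "HV104", show_labels=True))  # value label
```

The same stored value is shown either as a code or as its label; switching the display does not change the data.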
The number of rows and columns (the size or dimension) of the data file is determined by the number of cases and variables in that file. Data can be entered in any cell, even one outside the boundaries of the defined dataset. In that case, the dimension of the Data View is extended to include all the rows and columns needed to cover the newly entered cell. Variable names for the undefined columns are assigned automatically as “VAR00001”, then “VAR00002”, and so on.
2 variables just created automatically
Value just typed-in
Cells in the newly expanded data range (in both rows and columns) for which no data were entered are filled with “.” (the system-missing value) for numeric variables and “ ” (a blank, which is a valid string value in PASW) for string variables. The type of each new variable is automatically defined as ‘numeric’, and PASW sets the default attributes for numeric variables. Users can change all attributes, including variable name and type, in the Variable View.
New properties typed-in / changed
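The way the data range grows when a value is typed outside the defined dataset can be illustrated with a short Python sketch. It is not PASW code: the function names, the use of `None` as a stand-in for the system-missing value, and the restriction to an initially empty dataset are all assumptions made for this example.

```python
SYSMIS = None  # stand-in for the system-missing value, shown as "." in PASW


def default_name(index):
    """Default name for the index-th undefined column (1-based)."""
    return f"VAR{index:05d}"


def enter_value(data, names, row, col, value):
    """Type `value` into cell (row, col), both 0-based, growing the grid."""
    # Add automatically named numeric variables until the column exists.
    while len(names) <= col:
        names.append(default_name(len(names) + 1))
    # Add rows until the target row exists.
    while len(data) <= row:
        data.append([])
    # Pad every row with system-missing values up to the new width.
    for r in data:
        r.extend([SYSMIS] * (len(names) - len(r)))
    data[row][col] = value


data, names = [], []
enter_value(data, names, row=2, col=1, value=5.0)
print(names)  # two automatically created variables
print(data)   # every other cell in the new range is system-missing
```

Typing one value into the third row of the second column creates two variables and three cases, with all other cells left as system-missing.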
Apart from typing them directly in the Variable View, the following two methods can be used to define variable properties:
Copy Data Properties: This wizard makes it possible to use an external data file, or another dataset that is available in the current session, as a template for defining file and variable properties in the active dataset. Similarly, variables in the active dataset can be used as templates for other variables in the same dataset. ‘Copy Data Properties’ is available on the ‘Data’ menu in the main SPSS window.
Define Variable Properties: This method, which is also available on the ‘Data’ menu, scans the data and lists all unique data values for any selected variables, identifies unlabeled values, and provides an auto-label feature. It is particularly useful for categorical variables that use numeric codes to represent categories, for example, 0 = Male, 1 = Female.
3. BASIC COMPONENTS OF PASW STATISTICS
Both the “Data Editor” and the “PASW Statistics Viewer” are opened automatically when a PASW Statistics session starts. A user-friendly Help system is available whenever the F1 key is pressed: the opening page “Getting Help” of the “Base System Help” is displayed when working in the Data Editor or the output Viewer, and the context-sensitive “PASW Command Syntax Guide” for the specific command is displayed when working in the syntax window.
3.1 Output Viewers
The outputs created by the program are displayed in the “PASW Statistics Viewer”. By default, all outputs, including the command syntax used during the analysis, output tables, charts, notes and the activity logs of the session, are recorded in the Viewer. Users can determine which output items are displayed or hidden in the Viewer. This is set through the “Viewer” tab of the “Options” sub-menu in the “Edit” menu.
Options for: Log; Warnings; Notes; Title; Page title; Pivot table; Chart; Text output; Tree model; Model viewer
If PASW is started by opening a data file, a Viewer (with the name Output1 [Document1]) is activated automatically and records the command syntax used to open the data file under the “Log” tag. If the user decides not to show the command syntax in future, “Log” can be set to be hidden initially, as shown in the above exhibit. Otherwise, the following log will be displayed when opening the data file “BDHR50FL.SAV”.
A typical PASW Viewer, after running the cross-tabulation (crosstab) of “Highest education level” by “sex”, can be seen in the following illustration. Six types of output are recorded in the Viewer: (i) Command Log; (ii) Title; (iii) Notes; (iv) Active Dataset; (v) Case Processing Summary; and (vi) the output table (Highest educational level * Sex of household member Cross-tabulation).
Notes are hidden! Double-click here to unhide
Click to select Double-click toggles hide / unhide Drag-and-drop to change location (order in output)
The PASW Statistics Viewer is useful for: browsing the results as in Windows Explorer; showing or hiding selected output items (notes, tables and charts); deleting selected output items; changing the display order of results; and moving items between the Viewer and other applications. In the Viewer, double-clicking the icon of a hidden item in the left pane unhides it, and double-clicking the icon of a visible item hides it. For example, notes are hidden by default in outputs, and double-clicking the notes icon will display them. Icons in the left pane can be dragged and dropped to change the location of any item (its order in the output pane). Click an icon to select the associated item, and press the “Delete” key to remove that item (and its icon) from the output.
Tips: If particular items from the output are to be used in other applications such as MS Excel or Word, a simple copy and paste can be used. Moreover, almost any object, such as a paragraph or a chart, can be pasted into the output view, as in other popular application programs.
3.2 Pivot Tables
A pivot table is a data summarization tool used to create output table formats. Pivot-table tools can automatically sort, count, and total the data stored in one table or spreadsheet and create a second table. For example, the user can move the variables displayed in rows to columns and vice versa. This ability to “rotate” a table is known as pivoting, and a table with this ability is called a “pivot table”. One of the significant features of the PASW Statistics Viewer is its ability to handle pivot tables. Most of the output tables in the PASW Viewer can be pivoted interactively. The user can set up and change the table structure by dragging and dropping the variables, or by selecting specific items of the layer variables so that the results represent either the entire dataset or just a subset of the data. Options for manipulating a pivot table include: transposing rows and columns; moving rows and columns; creating multidimensional layers; grouping and ungrouping rows and columns; showing and hiding rows, columns, and other information; rotating row and column labels; and finding definitions of terms.
The following illustrates how pivoting can be used in data analysis and presentation. First, run a cross-tabulation of “Educational Attainment” by “sex” by “type of place of residence” (click Analyze on the main menu and select Crosstabs under Descriptive Statistics; then select each variable and click the appropriate arrowhead to move the variable name to the row, column or layer box; finally click OK. See the next module for a detailed illustration).
The following are the main results obtained by the above cross-tabulation command.
Double-Click any place on this Table
Then, go through the following steps to pivot an output table:
1) Double-click the output table located in the right (results) pane to go into table editing mode;
2) The main menu will now contain a new item, “Pivot”;
3) Select the “Pivot” menu and click “Pivoting Trays”; and
4) In the pivoting tray, arrange the row, column and layer variables (including statistics) as necessary by dragging and dropping the variable names.
The following illustrates the use of pivoting on the crosstab table.
Drag and drop
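For readers who want to see what pivoting does to the numbers themselves, the idea can be sketched in a few lines of Python. This is not how PASW builds its pivot tables; the five records, the category names and the helper functions are hypothetical and serve only to show a crosstab, its transposed form, and a single layer.

```python
from collections import Counter

# Hypothetical records: (education, sex, type of place of residence).
records = [
    ("Primary", "Male", "Urban"),
    ("Primary", "Female", "Rural"),
    ("Secondary", "Female", "Urban"),
    ("Primary", "Male", "Rural"),
    ("Secondary", "Male", "Urban"),
]


def crosstab(rows, row_idx, col_idx):
    """Count records for every (row category, column category) pair."""
    counts = Counter((r[row_idx], r[col_idx]) for r in rows)
    row_cats = sorted({r[row_idx] for r in rows})
    col_cats = sorted({r[col_idx] for r in rows})
    table = [[counts[(rc, cc)] for cc in col_cats] for rc in row_cats]
    return row_cats, col_cats, table


def transpose(row_cats, col_cats, table):
    """Swap rows and columns, the simplest form of pivoting."""
    return col_cats, row_cats, [list(col) for col in zip(*table)]


# Education in rows, sex in columns, all records.
edu_by_sex = crosstab(records, 0, 1)
# The same counts with sex in rows and education in columns.
sex_by_edu = transpose(*edu_by_sex)
# One layer only: restrict the records to urban residents.
urban_only = crosstab([r for r in records if r[2] == "Urban"], 0, 1)

print(edu_by_sex)
print(sex_by_edu)
print(urban_only)
```

Transposing changes only the arrangement of the table, while selecting a layer changes which cases are counted.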
3.3 Charts
(a) Creating Charts while Analyzing Data
PASW provides high-resolution charts at a click from several procedures on the “Analyze” menu. For example, in the bottom-left area of the “Crosstab” command there is a check box, “Display clustered bar charts”, which creates useful graphs for the selected variables.
(b) Creating Charts through the Chart Builder
Different types of charts and plots can be produced with the “Chart Builder” item under the “Graphs” menu. The Chart Builder helps build charts from predefined gallery charts (templates/samples) or from individual parts (axes and bars). A chart is built by dragging and dropping the gallery charts or basic elements onto the canvas, which is the large area to the right of the Variables list in the Chart Builder dialog box. While a chart is being built, the canvas displays a preview of the chart with the defined variable labels and measurement levels. The preview does not reflect the actual data, since it uses randomly generated data to provide a rough sketch of how the chart will look. Using the gallery is the preferred method for new users. It is also possible to build a chart from basic elements, which is more complex since the user has to define the chart options explicitly.
Constructing a chart by using the gallery
First, click the “Chart Builder” item under the “Graphs” menu; the following Chart Builder window will appear with a warning superimposed on it. Click OK, since users can define temporary variable types while building charts.
Then, follow these steps to build a chart from the gallery:
1) Click the Gallery tab if it is not already displayed.
2) In the Choose From list, select a category of charts. Each category offers several types.
3) Select a suitable chart type by dragging the picture of the desired chart type onto the canvas, or by double-clicking it. If the canvas already displays a chart, the gallery chart replaces the axis set and graphic elements on the chart.
4) Drag variables from the Variables list and drop them into the axis drop zones and, if available, the grouping drop zone. If an axis drop zone already displays a statistic and it is the statistic desired, do not drag a variable into that drop zone. Add a variable to a zone only when the text in the zone is blue. If the text is black, the zone already contains a variable or statistic. Refer to Statistics and Parameters for information about the available statistics.
The measurement level of the variables is important when building charts. The Chart Builder sets defaults based on the measurement level, and the resulting chart may look different for different measurement levels. The user can temporarily change a variable's measurement level by right-clicking the variable and choosing an option.
5) If the statistics need to be changed or the attributes of the axes or legends (such as the scale range) need to be modified, click Element Properties. In the “Edit Properties Of” list, select the item to be changed, make the changes as needed, and then click Apply.
6) Click OK to create and display the chart in the Viewer.
Variable in grouping zone
Statistics in axis drop zone
Notes:
(a) If it is necessary to add more variables to the chart (for example, for clustering or paneling), click the Groups/Point ID tab in the Chart Builder dialog box and select one or more options. Then drag categorical variables to the new drop zones that appear on the canvas.
(b) To transpose the chart (for example, to make the bars horizontal), click the Basic Elements tab and then click Transpose.
(c) If many default settings for a specific chart have to be changed often, the current settings can be saved as a favourite and used later. Please refer to the PASW manuals for detailed instructions.
(d) The canvas is the area of the Chart Builder dialog box where the chart is built.
(e) An axis set defines one or more axes in a particular coordinate space (like 2-D rectangular or 1-D polar). Adding a gallery item to the canvas automatically creates an axis set. Each axis includes an axis drop zone for dragging and dropping variables. Blue text indicates that the zone still requires a variable. Every chart requires adding a variable to the x-axis drop zone.
(f) The graphic elements are the items in the chart that represent data. These are the bars, points, lines, and so on. In the illustration, the graphic element is a bar.
(g) The variable list displays the available variables. If a variable selected in this list is categorical, the category list shows the defined categories for the variable. A variable's measurement level can be changed temporarily by right-clicking its name and choosing the desired measurement level.
(h) Drop zones are the areas on the canvas onto which variables are dragged and dropped from the Variables list. The basic drop zone is the axis drop zone. Certain gallery charts (like clustered or stacked bar charts) include grouping drop zones. The illustration shows a grouping zone that contains Sex as the grouping variable.
After clicking on the OK button, the following chart will be placed in the Viewer.
To generate a bar chart of the “percentage of male and female head of household in each district”, first click the Element Properties button in the Chart Builder window and follow the steps below:
1) In the “Element Properties” window, change the desired statistic to “Percentage()”;
2) Click the Set Parameters button;
3) In the Set Parameters drop-down list, select “Total for Each X-Axis Category” as the denominator for computing the percentage;
4) Click Continue; and
5) Click Apply to activate the changes.
Finally, click the OK button in the Chart Builder window to get the following graph.
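The effect of choosing “Total for Each X-Axis Category” as the denominator can be checked by hand. The Python sketch below is an illustration only: the district names and counts are invented, and the function is a plain re-computation of the percentages, not PASW's own routine.

```python
from collections import Counter

# Hypothetical heads of household: (district, sex).
heads = [
    ("District A", "Male"), ("District A", "Male"),
    ("District A", "Female"), ("District A", "Male"),
    ("District B", "Female"), ("District B", "Male"),
]


def percent_within_category(pairs):
    """Percentage of each group, with the total of its x-axis category as denominator."""
    cell_counts = Counter(pairs)
    category_totals = Counter(category for category, _ in pairs)
    return {
        (category, group): 100.0 * n / category_totals[category]
        for (category, group), n in cell_counts.items()
    }


result = percent_within_category(heads)
for (district, sex), pct in sorted(result.items()):
    print(f"{district:<12}{sex:<8}{pct:5.1f}%")
```

With this denominator the male and female bars of each district add up to 100 percent, regardless of how many households each district contributes.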
(c) Using Graphboard Visualization to Create Customized Graphs
Creating a graph from the “Graphboard Template Chooser”
This is a new feature in PASW Statistics 17. Through this command (located in the “Graphs” menu), graphs can be created from ready-made templates called “Graphboard Visualizations”, which contain graphs, charts, and plots. PASW Statistics ships with built-in visualization templates covering 23 different types of graphs, which are sufficient for general users. Another product, PASW Viz Designer, is available for creating one's own visualization templates. To use the built-in templates, select “Graphboard Template Chooser” in the “Graphs” menu and follow these steps:
1) In the “Graphboard Template Chooser” window, click the Basic tab to start selecting the appropriate variable(s);
2) Click the name(s) of the variable(s) to be graphed, holding down the Control key from the second variable onwards. Here, PASW lists only the variable names, not the labels. As soon as a variable is selected, all graph types that are suitable for the selected variable are displayed in the right pane of the window. Similarly, if two variables are selected, the possible types for those two variables are displayed;
3) Double-click the icon of the preferred graph type among the displayed samples;
4) Optionally, click: (a) the Detailed tab to change the chart type, variables, etc.; (b) the Titles tab to set the chart title, sub-title and footnote; and (c) the Options tab to set the output label and other options;
5) Click OK to start creating the preferred graph.
It should be noted that creating graphs through the “Graphboard Template Chooser” requires more resources, such as processing time, a better processor, and more memory. Moreover, a graph created with this option is difficult to edit.
(d) Graphs through Legacy Dialogs
Graphs can also be created from the “Legacy Dialogs”.
Almost all graph types are available through this option, and the view (title, sub-title and so on) can be customized while creating the graph. The following exhibits show the types of graphs available under “Legacy Dialogs” and the population pyramid of the sample household population created through the legacy dialogs.
Different types of charts available in “Legacy Dialogs”
The following dialog shows how to generate a population pyramid of the sample household population by age and sex.
The pyramid produced by the above settings is as follows:
Drawing a Population Pyramid:
1) Select “Legacy Dialogs” in the “Graphs” menu;
2) Click “Population Pyramid”;
3) Drag “Age of household members” and drop it in the “Show Distribution over” box;
4) Drag “Sex of household members” and drop it in the “Split by” box;
5) Click the “Titles…” button;
6) Type “Population Pyramid of Sample Households” in Title Line 1;
7) Click “Continue”; and
8) Click OK in the “Define Population Pyramid” dialog.
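What the “Show Distribution over” and “Split by” boxes ask for can be made concrete with a small Python sketch that counts members per age group for each sex and prints a rough text pyramid. The seven members, the 10-year group width and the function name are assumptions for illustration; PASW chooses its own binning and draws a proper chart.

```python
from collections import Counter

AGE_GROUP_WIDTH = 10  # years per age group (an assumption for this sketch)

# Hypothetical household members: (age in years, sex).
members = [(3, "Male"), (7, "Female"), (12, "Male"), (25, "Female"),
           (31, "Male"), (34, "Female"), (68, "Female")]


def pyramid_counts(people, width=AGE_GROUP_WIDTH):
    """Count people per age group (distribution variable), split by sex."""
    counts = Counter(((age // width) * width, sex) for age, sex in people)
    starts = sorted({start for start, _ in counts})
    return [(start, counts[(start, "Male")], counts[(start, "Female")])
            for start in starts]


# Oldest group on top; males to the left of the label, females to the right.
for start, males, females in reversed(pyramid_counts(members)):
    label = f"{start}-{start + AGE_GROUP_WIDTH - 1}"
    print(f"{'#' * males:>5} | {label:^7} | {'#' * females}")
```

Age groups without any members are simply omitted in this sketch.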
3.4 Saving and Exporting Outputs
Starting from PASW Statistics 16, outputs are saved only in Viewer format (*.spv). The PASW Viewer no longer supports output files of earlier versions in the proprietary file format (*.spo). From the PASW Viewer, outputs can be selected, copied and pasted into any spreadsheet software, word processor or graphical presentation software. Outputs in the Viewer can also be exported to different formats such as: Excel (*.xls); HTML (*.htm); Portable Document Format (*.pdf); PowerPoint (*.ppt); different text formats (*.txt) such as plain text, UTF8 and UTF16; and Word/RTF (*.doc). Moreover, graphical outputs can be saved in formats such as: Bitmap (*.bmp); Enhanced Meta File (*.emf); Encapsulated Postscript (*.eps); JPEG file (*.jpg); Portable Network Graphic (*.png); and Tagged Image File (*.tif).
When exporting outputs, one can choose to export: i) all items, including hidden items, whether selected or not; ii) visible (non-hidden) items only; or iii) selected items only. To export multiple items, select them by clicking each item while holding down the Control key, and follow the steps described in the following example.
To export PASW outputs to MS Excel:
1) Select the item(s) to export in the left pane of the PASW Statistics Viewer;
Selected Output Tables
2) Click Export in File Menu and an “Export Output” window will appear;
In the “Export Output” window:
3) Check the “Selected” option button to export only the selected output items (tables, notes, summaries, …);
4) Select “Excel file (*.xls)” from the “File Type:” drop-down list;
5) Click the Browse button and select the location and name of the export file, or type in the file name with its full path, e.g., “C:\Documents and Settings\User\My Documents\SPSS Training\Sample\Test-exporting.xls”; and
6) Click OK to begin the export process.
At the end of exporting process, the exported file can be seen in the designated folder.
To export only the graphics, without any notes, tables, etc., select “None (Graphics only)” (the last item in the drop-down list) when choosing the Document Type in Step 4. The Graphics section of the “Export Output” window then becomes active and the Document section becomes inactive (that is, the user can no longer set any document options other than the document type). In this case, users can select the graphic format to be saved (together with the graphic options) and the root file name under which to save the graphics. If the root file name is “test.png” and there are 3 charts in the active Viewer, three graphic files will be created with the names “test1.png”, “test2.png”, and “test3.png”.
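The naming pattern for exported graphics (test1.png, test2.png, test3.png from a single root name) can be reproduced with a few lines of Python, for example to check in advance which files an export will create. The function below is a hypothetical helper that mirrors only the pattern described here; how PASW handles other situations, such as existing files, is not covered.

```python
import os


def graphic_file_names(root_file_name, chart_count):
    """Numbered file names derived from a root name such as 'test.png'."""
    stem, extension = os.path.splitext(root_file_name)
    return [f"{stem}{number}{extension}" for number in range(1, chart_count + 1)]


print(graphic_file_names("test.png", 3))
```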
3.5 Online Help
PASW Statistics provides a comprehensive help system, together with tutorials on every key aspect. Context-sensitive help topics in dialog boxes give guidance on each specific task. A help window pops up whenever the help key “F1” is pressed. It shows the base system help when working in the Data Editor or output Viewer, or the command syntax guide for the closest command when working in the syntax editor. Similarly, various types of PASW help can be accessed through the “Help” menu.
The first item under the Help menu, and the most important one for beginners, is “Topics”. It provides access to the basic PASW Help system with Contents, Index, and Search tabs, from which users can find the explanation of a specific topic or command procedure.
The second item, “Tutorial”, gives step-by-step instructions on how to use many of the basic features. Users can choose the topics they need, skip around, and view topics in any order. The index or table of contents can be used to find specific topics. “Case studies”, the third item, provides hands-on examples of how to create various types of statistical analyses and how to interpret the results. The sample data files used in the examples are provided in the PASW package. The table of contents of the tutorial can be seen in the following illustration.
The “Statistics Coach”, using a wizard-like approach, helps users find the commands or procedures they need. After the user makes a series of selections, the Statistics Coach opens the dialog box for the statistical, reporting, or charting procedure that meets the selected criteria. It provides access to most statistical and reporting procedures and several charting procedures in the Base system.
The above-mentioned help items are useful for all users, from beginners to advanced developers. Apart from those, further help topics such as “Command Syntax Reference” and “Statistical Algorithms” are available interactively for advanced users, and the “Developer Central” and “Technical Support Website” for on-line users. Like other modern software, PASW provides “Context-sensitive Help” in several places in the user interface:
1) Most dialog boxes have a Help button that leads directly to a Help topic for that dialog box. The Help topic provides general information and links to related topics.
2) Right-click terms in an activated Pivot Table in the Viewer and choose “What's This?” from the context menu to display definitions of the terms.
3) In a command syntax window, position the cursor anywhere within a syntax block of a command and press F1 on the keyboard. A complete command syntax chart for that command will be displayed. Complete command syntax documentation is available from the links in the list of related topics and from the Help Contents tab.
Select any place in the Command Line and Click <F1>
4. USING DATA FROM OTHER SOURCES
In general, PASW Statistics can read datasets created by almost all popular statistical software and databases. A PASW dataset can also be saved in several popular formats. Therefore, the PASW data format (*.sav) is a common format for sharing and distributing survey datasets. Generally, PASW Statistics can read data files created in: all versions of PASW Statistics (*.sav) and SPSS/PC+ (*.sys) formats; spreadsheets (Excel, Lotus and SYLK); database tables (dBase, MS Access, FoxPro, Oracle, SQL Server, etc.); statistical software (SAS, SYSTAT, and Stata); and different text formats (fixed width, comma delimited/CSV, tab or space delimited, etc.). Data files created by spreadsheets and other statistical software can be opened directly as PASW data files. Similarly, PASW can open dBase files, text data files and other files without converting them to an intermediate format or entering data definition information. On the other hand, complex database files such as MS Access, FoxPro and SQL databases are accessed through the database wizard or SQL queries.
Opening a data file makes it the active dataset. The active dataset is the one that PASW reads from and writes to during the session, unless a specific command switches to another dataset. Any other data files (or datasets) that are already open remain open and available for subsequent use in the session. Clicking anywhere in the Data Editor window of an open data file makes it the active dataset.
A PASW data file can be saved (or exported) to other file types. However, some file types can store only data values, whereas PASW keeps both the values and the data dictionary (or attributes). The data dictionary or attributes, such as variable labels, value labels, missing values, etc., will be lost if the file is saved in other formats, including Microsoft Excel format.
4.1 Importing Data from Microsoft Excel
Importing data from Microsoft Excel is the easiest among the data sources.
First, arrange the spreadsheet in tabular format, fulfilling the following six recommendations:
i) The names of the variables are on the first row of the data range;
ii) Variable names comply with the PASW Statistics naming rules (see footnote 3);
iii) For all numeric variables, there are no blanks in the second row of the data range;
iv) The data range is continuous, with no blank rows or columns;
v) The worksheet is clear of any graphs, labels, and extra text or data; and
vi) Unnecessary worksheets (which are not going to be imported) are deleted.
Footnote 3: Starting from Version 12.0, the following rules apply to variable names:
1. must be unique; duplication is not allowed, and names cannot contain spaces;
2. up to 64 characters in English;
3. must start with a letter or @, #, or $, followed by letters, numbers, periods (.), and non-punctuation characters;
4. a name starting with “#” is a scratch variable, which can be created only with command syntax;
5. a name starting with a $ sign is a system variable, which is not allowed for a user-defined variable;
6. the period, underscore, and the characters $, #, and @ can be used within variable names, e.g. “A._$@#1”;
7. must not end with a period or an underscore;
8. reserved keywords cannot be used: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, and WITH;
9. a mixture of uppercase and lowercase characters is allowed, and case is preserved for display purposes; and
10. long names are wrapped in output, breaking at underscores, periods, and changes from lower to upper case.
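Before importing a spreadsheet, the column headings can be checked against these naming rules with a short script. The Python sketch below is a rough, hypothetical checker written for this purpose and is not part of PASW. It covers only rules 1 to 3 and 5 to 8, handles ASCII letters only, treats names that differ only in case as duplicates (an assumption), and accepts names starting with “#” even though such scratch variables can be created only with command syntax.

```python
import re

RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

# First character: a letter, @ or # ($ is reserved for system variables).
# Remaining characters: letters, digits, period, underscore, $, # or @.
NAME_PATTERN = re.compile(r"[A-Za-z@#][A-Za-z0-9._$#@]*")


def name_problems(name, existing=()):
    """Return a list of rule violations for a proposed variable name."""
    if not name:
        return ["empty name"]
    problems = []
    if len(name) > 64:
        problems.append("longer than 64 characters")
    if name.startswith("$"):
        problems.append("starts with $ (system variable)")
    elif not NAME_PATTERN.fullmatch(name):
        problems.append("contains a space or another character that is not allowed")
    if name.endswith((".", "_")):
        problems.append("ends with a period or underscore")
    if name.upper() in RESERVED:
        problems.append("reserved keyword")
    # Assumption: names differing only in case count as duplicates.
    if name.upper() in {other.upper() for other in existing}:
        problems.append("duplicate name")
    return problems


for candidate in ["HV104", "A._$@#1", "age group", "total_", "by", "$date"]:
    print(candidate, name_problems(candidate))
```

An empty list means that none of the checked rules is violated; it does not guarantee that PASW will accept the name.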
If the data in the Excel file are spread over several worksheets, it is better to create a new Excel file with just one worksheet containing all the necessary data, including variable names. Then, follow these steps. On the main menu, click:
1. File;
2. Open;
3. Data.
An “Open Data” pop-up window will appear. In this window:
4. Change Files of type to “Excel (*.xls, *.xlsx, *.xlsm)”;
5. Select the folder containing the Excel data file from the Look-in box;
6. Select the correct Excel data file (in 97-2003 or 2007 format); and
7. Click Open.
The “Opening Excel Data Source” pop-up window will appear. In that window:
8. Clear the check box next to “Read variable names from the first row” if, and only if, the first row of the Excel data sheet does not contain variable names;
9. Select the worksheet containing the data, if the file has more than one worksheet;
10. Type in the range of data to be imported (for example, A1:V100 for the first 99 cases, that is, 100 rows including the row of variable names); and
11. Click OK.
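The arithmetic behind a range such as A1:V100 (100 rows, of which 99 are cases when the first row holds the variable names) can be verified with a small Python sketch. The function names are hypothetical and the code does not read any Excel file; it only interprets the range text.

```python
import re


def column_number(letters):
    """Convert an Excel column label (A, B, ..., Z, AA, ...) to a 1-based number."""
    number = 0
    for letter in letters.upper():
        number = number * 26 + (ord(letter) - ord("A") + 1)
    return number


def describe_range(cell_range, names_in_first_row=True):
    """Return (rows, columns, cases) for a range such as 'A1:V100'."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", cell_range)
    if match is None:
        raise ValueError(f"not a cell range: {cell_range!r}")
    first_col, first_row, last_col, last_row = match.groups()
    rows = int(last_row) - int(first_row) + 1
    columns = column_number(last_col) - column_number(first_col) + 1
    # One row is used up by the variable names, if they are present.
    cases = rows - 1 if names_in_first_row else rows
    return rows, columns, cases


print(describe_range("A1:V100"))
```

For A1:V100 this gives 100 rows, 22 columns (variables) and 99 cases.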
If the Excel data file was prepared according to the six recommendations mentioned above, steps 8 to 10 can be skipped, since there is only one sheet in the Excel file, the data range is continuous, and the sheet contains no cells or objects other than the data to be analyzed. At the end of this process, the data from the Excel file have been transferred into a PASW dataset. At this point, it is important to save the current dataset under an appropriate name in the designated location.
Data files in Excel or text format, and databases, do not have a data dictionary; that is, they carry no information on data attributes such as variable labels, value labels, missing values, etc. Therefore, it is important to define such attributes for all variables and save the data file again.
4.2 Importing Data from Delimited ASCII Text Files
When data are requested from other agencies and departments, they are sometimes provided in text (ASCII) file format. Normally, data in an ASCII file are arranged either in fixed-width format, in which a variable is placed in the same location for every case, or in delimited format, in which variables are separated by a specific character such as a tab, space, comma, semicolon or any other character that is used consistently throughout the file and does not occur in the data values. To import data from a delimited text file, first review the file in a text editor such as Notepad or Word and check which character is used as the delimiter (normally a tab, space, comma or semicolon). Then follow these steps. On the main menu, click:
1. File;
2. Read Text Data.
The “Open Data” pop-up window will appear with the text (*.txt) file type.
Note: Text data files sometimes have a file extension other than “.txt” or “.dat”, such as “.prn” or “.csv”. If the “Read Text Data” menu item is chosen, PASW displays only files with the extension “.txt” or “.dat”. To look for a text data file with another extension, choose “All Files (*.*)” in the “Files of type” field to display all files.
In Open Data window:3. Select the folder containing text data file from Look-in box; 4. Change “File of type” to “All Files (*.*)”; 5. Select the correct data file (*.txt, *.dat, *.csv, *.prn, etc.); and 6. Click Open; And, a “Text Import Wizard” will begin automatically and guide through the importing process. 1
The wizard contains the following 6 steps: Step 1/6: Click Next to forward Step 2 of 6; In the Step1 of the Wizard, one can apply a predefined format (previously saved from the Text Wizard) or follow the steps. Step 2/6: (i) The Wizard will sense and opt whether the data is arranged as “Delimited” or “Fixed width”, but check and identify correctly (in “Data.csv” file, the variables are separated by a comma “,”, and thus the file structure is delimited); and (ii) identify whether the variable names are included at the top (first line) of the data file or not (in this example, “Yes”), and click Next to forward to Step 3 of 6.
The first line contains the variable name!
Step 3/6: (i) Since the data file begins with variable names, the first case of data begins on line 2. Otherwise, the user should specify the line number on which the data begin. (ii) If each line represents a case (one person, for example), just click Next; otherwise, select the second option under “How are your cases represented?” and specify the number of variables per case before clicking Next.
Step 4/6: The Wizard will automatically identify the delimiter(s) between variables. However, it is important to check them and specify them correctly. Some software exports text in quotes, i.e. as “text” or ‘text’; in that case the text qualifier (quotation mark) must be specified with the radio buttons of the second question. Then click Next.
Sometimes the Wizard may identify the wrong delimiters. Users must check and specify the correct delimiter(s).
Step 5/6: In this step, variable names and data formats can be specified or changed from the default settings. Then click “Next” to continue or “Finish” to start importing the data.
Step 6/6: In this step, just click “Finish” to start importing; the task will be completed within a few minutes.
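The checks made in Steps 2 to 4 of the wizard (is there a header line, and which delimiter is used) can also be tried outside PASW before importing. The following is a minimal Python sketch, not part of PASW; the function names and the sample data are illustrative assumptions. It picks the candidate delimiter that splits every line into the same number of fields, and falls back to the default names V1, V2, … when there is no header line.

```python
import csv

# Candidate delimiters named in the text: comma, semicolon, tab, space.
CANDIDATES = [",", ";", "\t", " "]

def guess_delimiter(lines):
    """Return the candidate that splits every line into the same number (>1) of fields."""
    best = None
    for delim in CANDIDATES:
        counts = {len(next(csv.reader([line], delimiter=delim))) for line in lines}
        if len(counts) == 1:
            n = counts.pop()
            if n > 1 and (best is None or n > best[1]):
                best = (delim, n)
    if best is None:
        raise ValueError("no consistent delimiter found; inspect the file manually")
    return best[0]

def read_delimited(text, has_header=True):
    """Split delimited text into variable names and a list of cases (dicts)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    delim = guess_delimiter(lines)
    rows = list(csv.reader(lines, delimiter=delim))
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # Default names V1..Vn when the first line holds data, not names.
        names = [f"V{i}" for i in range(1, len(rows[0]) + 1)]
        data = rows
    return names, [dict(zip(names, row)) for row in data]

# Made-up sample: three variables, comma-delimited, names in the first line.
sample = "HV104,HV105,HV106\n1,34,2\n2,7,1\n"
names, cases = read_delimited(sample)
print(names)
print(cases[0])
```

As with the wizard, an automatic guess can be wrong (for example when text values contain spaces), so the result should still be checked against the file.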
4.3 Importing Data from Fixed Width Text Files
In some text data files, variables are aligned in fixed width columns; that is, a variable occupies the same column(s) throughout the data file. For example, sex of household member is located in column 33 of every line in the “Data(Fix).txt” file, which is extracted from the Bangladesh DHS 2007. To import data from a text file with a fixed width structure, it is important to have the data dictionary of the variables, that is, which variable is located in which column(s). After that, on the main menu click:
1. File;
2. Read Text Data;
Then the “Open Data” pop-up window will appear with the text (*.txt) file type selected, and:
3. Select the folder containing the text data file from the “Look in” box;
4. Select the correct data file (*.txt or *.dat); and
5. Click Open.
There is no need to change “Files of type”, since the file name extension is “.txt”.
Then the “Text Import Wizard” will begin automatically and guide the user through the importing process. The wizard consists of the following 6 steps:
Step 1/6: Simply click Next to go to Step 2 of 6.
Step 2/6: (i) The Wizard will detect and preselect whether the data are arranged as “Delimited” or “Fixed width”, but check that this is identified correctly (“Data(Fix).txt” contains no separation character, so the file structure is fixed width); (ii) indicate whether variable names are included in the first line of the data file (in this example, “No”); and click Next to go to Step 3.
First line does NOT contain variable names!
Step 3/6: Since there are no variable names in the first line, the first case of data begins on line number 1. Sometimes a case spans more than one line, and the user has to specify the number of lines per case. Otherwise, just click “Next” to continue.
Step 4/6: This is the most crucial step in importing a fixed width data file. Use the data dictionary to identify and split each case into variables accordingly. In this example, one line of data represents a case, and the locations of the variables are as follows:

No.  Column  Variable Name  Variable Label
1    1-8     HV005          Sample weight
2    9-10    HV009          Number of household members
3    11-12   HV024          Division
4    13      HV025          Type of place of residence
5    14      HV026          Place of residence
6    15-16   HV218          Line number of head of household
7    17      HV219          Sex of head of household
8    18-19   HV220          Age of head of household
9    20      HV270          Wealth index
10   21-28   HV271          Wealth index factor score (5 decimals)
11   29-30   HV101          Relationship to head
12   31      HV102          Usual resident
13   32      HV103          Slept last night
14   33      HV104          Sex of household member
15   34-35   HV105          Age of household members
16   36      HV106          Highest educational level
17   37-38   HV107          Highest year of education
18   39-40   HV108          Education in single years
19   41      HV109          Educational attainment
20   42      HV110          Member still in school
21   43      SH08           Marital status
22   44      SH15           Employment status
The Wizard will insert break lines (separation lines) wherever a boundary is evident (for example, if a column consistently contains blanks across the lines, the Wizard will insert a break line there). A break line can be inserted or deleted using the “Column number” input box below the data view. For example, to insert a break line at column 13, enter 13 in the “Column number” input box and press the “Insert Break” button. Similarly, to delete a break located at column 28, type in 28 and click the “Delete Break” button. In this step, the user has to check and set all break lines in order to import the data correctly. After defining the locations, click Next to proceed to Step 5.
Step 5/6: In this step, one can select “Finish” to start importing data with the default variable names (V1, V2, …, Vn) and data formats (all numbers will be numeric and the rest will be string). Alternatively, the user can enter variable names and formats individually.
Step 6/6: Simply click “Finish” to start importing the text data. In the Text Import Wizard, the user can save the format (including break lines and variable names) for future use.
It will take just a few minutes to import the text data into the PASW Statistics Data Editor. It is strongly recommended to check and edit (or create) variable attributes such as variable labels, value labels and missing values for all variables, and then to save the data file again.
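The role of the data dictionary in Step 4 can be illustrated with a short Python sketch, which is not part of PASW. The column positions are those listed in the data dictionary above; the function names and the test record are made-up assumptions for illustration. Each variable is cut out of the line by its first and last column, which is what the break lines do in the wizard.

```python
# (variable name, first column, last column), 1-based and inclusive,
# copied from the data dictionary for "Data(Fix).txt".
LAYOUT = [
    ("HV005", 1, 8), ("HV009", 9, 10), ("HV024", 11, 12), ("HV025", 13, 13),
    ("HV026", 14, 14), ("HV218", 15, 16), ("HV219", 17, 17), ("HV220", 18, 19),
    ("HV270", 20, 20), ("HV271", 21, 28), ("HV101", 29, 30), ("HV102", 31, 31),
    ("HV103", 32, 32), ("HV104", 33, 33), ("HV105", 34, 35), ("HV106", 36, 36),
    ("HV107", 37, 38), ("HV108", 39, 40), ("HV109", 41, 41), ("HV110", 42, 42),
    ("SH08", 43, 43), ("SH15", 44, 44),
]

def check_layout(layout):
    """Verify that the columns are contiguous (no gap, no overlap); return the record width."""
    expected = 1
    for name, first, last in layout:
        if first != expected or last < first:
            raise ValueError(f"gap or overlap at {name}")
        expected = last + 1
    return expected - 1

def parse_line(line, layout=LAYOUT):
    """Cut one fixed width line into a dict of raw (string) values."""
    return {name: line[first - 1:last].strip() for name, first, last in layout}

# Made-up 44-character record: the character in column p is the last digit of p.
record = "".join(str(p % 10) for p in range(1, 45))
case = parse_line(record)
print(check_layout(LAYOUT))            # total record width
print(case["HV104"], case["HV105"])    # columns 33 and 34-35
```

Running a contiguity check of this kind before importing is one way to catch a missing or misplaced break line early.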
4.4 Importing Data from Microsoft Access Databases
Data from databases that use Open Database Connectivity (ODBC) drivers can be read directly by PASW Statistics if the respective drivers are installed on the computer. Commonly used ODBC drivers are provided with the PASW installation package. Among these, Microsoft Access is the most widely used database system, and a step-by-step guide to importing data from MS Access is presented in this section. The same steps, with minor variations, can be followed to import data from databases created on other platforms.
Before importing data from an MS Access database, check whether the database contains a table in flat file format (like a worksheet) with all the variables to be imported. If the data to import are spread over several database tables, that is, the variables are located in different tables, it is better first to create a single table containing all the variables in MS Access before importing.
To begin, click the following on the main menu:
1. File;
2. Open Database; and
3. New Query;
A “Database Wizard” window will appear for identifying the ODBC data source. All available ODBC data sources will be listed in the right pane; click the one that matches the database to be imported. If there is no appropriate source, a driver for that particular source must be installed or added before importing data from that database.
Normally, “MS Access Database” is in the list, and:
4. Select MS Access Database from ODBC Data Sources; and
5. Click Next to continue.
The first time, the “ODBC Driver Login” window will appear. If this import procedure has been run before, the Wizard may skip this step.
6. Click Browse to browse the folders and files, and select the correct database file to open; and
7. Click OK to open that database file.
At this point, the user can also set up a new link. Then the “Database Wizard” window will come up with two panes: “Available Tables” on the left and “Retrieve Fields in This Order” on the right.
8. Click a table name to expand it and double-click the field name(s) to select them, or double-click the table name to select all variables in that table; and
9. Click “Finish” to start importing all cases (53,413 cases) from the database.
To import selected fields (variables), expand the table and double-click the desired field names.
To import all variables, just double-click the table name.
It is important to save the data file after the import process.
10. Alternatively, one can click “Next” to go to a further step, where the cases to import can be selected based on some criteria (filtering). The following example shows how to import only the cases where the age of the household member is between 6 and 14 years. Here, only 12,621 cases will be imported instead of the 53,413 cases in the entire database. It is important to save the data file at the end of the importing process.
HV105 is “Age of household member”, and the criterion is “Age > 5 and Age < 15”, that is, “5 < Age < 15”.
11. Pressing “Next” again allows variable names to be redefined and string variables to be auto-recoded before pressing “Finish” to start importing. Although all variables have been imported, PASW Statistics assigns F8.2 (numeric format with a total width of 8 characters, including 2 decimal places) to all numeric variables, and A255 (alphanumeric format; up to 255 characters) to string variables. Therefore, it is important to adjust the formats of all variables, and also to set column widths so that they display appropriately. Moreover, it is recommended to recode string variables for easier analysis. The following section explains how to refine imported datasets.
If there are several tables in the source database file, one can link through identification fields and import variables from different tables (please see: online tutorial on PASW Data Manipulation). However, it is more convenient to link tables and create a special table with all required variables in MS Access (or in the original database software) before importing into PASW Statistics.
5. TIPS AND EXERCISES
5.1 Tips: Do and Don’t
i) Do: check whether any previous version of PASW Statistics, SPSS for Windows or SPSS/PC+ has already been installed on the computer.
   Don’t: install any version of PASW Statistics without checking for an existing working installation of PASW Statistics.
ii) Do: check whether any installed PASW Statistics is a licensed version.
   Don’t: uninstall any licensed version of PASW Statistics before ensuring that the license can be transferred to the new PASW software.
iii) Do: uninstall the existing PASW Statistics, SPSS for Windows or SPSS/PC+ if the new software has a valid license, or if it has been decided to use it for evaluation, which is allowed for 14 or 21 days.
   Don’t: install a new version of PASW Statistics before completing the un-installation process.
iv) Do: study and become familiar with the PASW Statistics components and the survey files, including data, questionnaire and codebook, before conducting any analysis.
   Don’t: change anything in the dataset! Also, do not start analysis with a new dataset before understanding the questionnaire and codebook of the survey.
v) Do: familiarize yourself with the data file, especially if it is in a format other than PASW. For text data files, review the file in a text editor such as Word or Notepad, and check whether the first line consists of variable names and which separation character (blank, comma, tab, etc.) has been used.
   Don’t: save the original data file after reviewing it in MS Word or any text viewer, to avoid altering the format and characters.
5.2 Self-evaluation
Are you able to explain background information on some popular statistical analysis software packages to your colleagues?
Very well / Somewhat well / Not so much / Almost None
Do you understand why SPSS / PASW is chosen as the statistical software for assisting EFA monitoring?
Very well / Somewhat well / Not so much / Almost None
Can you install an evaluation version of PASW Statistics without any assistance?
Certainly / Somewhat certain / Not so much / Not at all
Can you explain the following basic components of PASW to your friends:
o Output Viewers: Very well / Somewhat well / Not so much / Almost None
o Pivot Tables: Very well / Somewhat well / Not so much / Almost None
o Charts: Very well / Somewhat well / Not so much / Almost None
o Export Outputs: Very well / Somewhat well / Not so much / Almost None
o Online Help: Very well / Somewhat well / Not so much / Almost None
Are you confident that you can import data from the following sources into PASW:
o Microsoft Excel: Confident / Somewhat confident / Not so much / Not at all
o Delimited text files: Confident / Somewhat confident / Not so much / Not at all
o Fixed width text files: Confident / Somewhat confident / Not so much / Not at all
o Access databases: Confident / Somewhat confident / Not so much / Not at all
5.3 Questions and Hands-on Exercises
i) Provide three reasons why PASW Statistics is appropriate for analyzing census and household survey data to assist EFA monitoring.
ii) What are the key components of PASW Statistics?
iii) Open the “B2_a.txt” file in any text editor and record on a blank sheet (a) how many variables there are in this file, and (b) which separation character has been used.
iv) Import the “B2_a.txt” file into the PASW Data Editor and review the characteristics of the new dataset.
v) Connect to the internet and (a) find available household survey data files for your country; (b) download the most recent survey data file; (c) find and review the questionnaire and codebook for that survey; (d) note down the variables which are useful for calculating education indicators, especially for EFA monitoring; and (e) prepare for importing the data, if needed.
Checking, Editing and Preparing Household Survey Data for Analysis
Purpose and learning outcomes:
To gain knowledge on defining data and checking data quality with PASW
To understand basic techniques of data validation
To understand how to prepare datasets for conducting effective data analyses
Contents:
1. Metadata Preparation
1.1 Defining Data: Setting Variable Properties
1.2 Setting and Editing Metadata through Wizard
1.3 Copying File and Variable Properties
2. Data Manipulation
2.1 Changing, Inserting and Deleting Data, Cases and Variables
2.2 Computing New Variables
2.3 Recoding
3. Data Preparation
3.1 Selecting Cases
3.2 Sorting Cases
3.3 Rearranging Variables
4. Data validation
4.1 Validation with Single-Variable Rules
4.2 Cross-Variable Rules
4.3 Multi-Case Rules
5. Tips and Exercises
5.1 Tips: Do and Don’t
5.2 Self-evaluation
5.3 Hands-on Exercises
One of the most famous computer and ICT terms is GIGO, “Garbage in, garbage out”. It simply means that if the dataset under analysis is prone to errors, the outputs generated from it are unreliable or unusable. Therefore, after loading a dataset, keep in mind that it is not yet ready for producing analytical outputs. The PASW Data Editor can only display the contents; it cannot ensure the quality of the data. To conduct meaningful analyses, it is also important to understand the data collection procedure, the questionnaire and coding rules, and how the dataset was prepared and distributed. Moreover, only if the data in the set are defined properly can the data analyst understand them correctly and conduct meaningful analyses. Therefore, the logical steps after loading a dataset include:
Metadata preparation:
Defining data: This step is required when data were imported from other formats such as Excel, text or databases. When importing data from those formats, only the data values with variable names and, at most, the defined missing values will be in the new PASW dataset. In this case, data management should begin with defining the data: providing appropriate variable names and value labels, and setting missing values and the measurement level for each and every variable.
Editing data definition: Work on any PASW dataset should begin with reviewing the variables in the dataset and determining their valid values, labels and measurement levels. Identify combinations of variable values that are impossible but commonly miscoded. Define validation rules based on this information. This is a time-consuming task, but worthwhile to ensure the quality of the data.
Data preparation: Even if the active dataset is reliable (clean, or with good data quality), it may not perfectly fit the type of analyses to be performed. The active dataset may require manipulations such as sorting, aggregation, creation of new variables, conditional selection of cases, and sometimes merging of datasets.
Data validation: Run basic checks and checks against the defined validation rules to identify invalid cases, variables and data values. When invalid data are found, investigate and correct the cause. If correction is impossible, decide whether to omit the entire case, or to include the case but set the invalid values as missing or as a special category. Once the dataset is clean and well prepared, it is ready to be analyzed with the PASW modules.
The following sections highlight the tools provided in the PASW base system for metadata preparation, data preparation and data validation. This section focuses on metadata preparation, while data manipulation, preparation and validation are discussed in Sections 2, 3 and 4 respectively.
1.1 Defining Data: Setting Variable Properties
In PASW, the metadata or data dictionary is part of the dataset. It covers properties such as the variable label, value labels, formats and measurement level (scale, ordinal or nominal). When data are obtained from other sources such as Excel, text files or Access databases, only the variable name, format (numeric or string, width and decimal places) and data values are imported. A few more properties, such as missing values, can be assigned while importing from databases; however, there will be no description of the variable (variable label) and no meaning attached to the data values (value labels), especially when codes, rather than text, were imported from the source. Examples follow.
In the above dataset, the variable “HV104 (Sex of household member)” has the values 1 and 2 only. However, users cannot know what 1 and 2 stand for, since 1 could stand for "Male" or "Female" depending on the coding scheme.
Therefore, it is impossible to answer a simple question such as “How many household members are female?” from the above frequency table created by PASW Statistics.
Similarly, from the above frequency table of HV106, no one can tell: “What is HV106?”, “What do the valid values 0, 1, 2, 3, 8 and 9 stand for?” and “Why do the codes jump from 3 to 8, and where are 4, 5, 6 and 7?” To answer such questions, the next step after importing data or opening an existing data file is to specify, or check and edit, the variable label, value labels, missing values and measurement level for each and every variable in the dataset. For entering variable labels, value labels and missing values, the codebook (or the survey questionnaire, if the codes are printed on it) is essential. To define a variable label, just click the appropriate cell and type it in directly, as follows.
To define the value labels, select “Variable View” in the PASW Statistics Data Editor. Then follow the steps below:
1. Click the cell under “Values”; the “Value Labels” window will pop up;
2. Type the code in the “Value” box;
3. Type the appropriate label in the “Label” box;
4. Press the “Add” button; the value and its label will appear in the space below;
5. Repeat Steps 2, 3 and 4 until all value labels have been defined, and press “OK” after entering the last valid code to complete the definition of the value labels.
Note: Starting from version 17.0, PASW Statistics allows the spelling of value labels to be checked (click the “Spelling” tab). Similarly, users can define missing values by clicking the cell under “Missing” and following a procedure similar to that for defining value labels.
Running the same analysis (frequencies) on the variable “HV106” after defining the variable label, value labels and missing values produces the following output, which is easier to understand and ready to be placed in a report or presentation.
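The difference that value labels and missing values make to a frequency table can be sketched in a few lines of Python. This is an illustration only, not PASW output: it uses HV104 with the codes given in this module (1=Male, 2=Female, 9 as the missing value), the data values are made up, and the function name is a hypothetical one.

```python
from collections import Counter

# Codes as described in this module for HV104 / HV219.
VALUE_LABELS = {1: "Male", 2: "Female"}
MISSING = {9}

def frequency_table(values, labels, missing):
    """Return rows of (code, label, count, valid percent); missing codes get no percent."""
    counts = Counter(values)
    valid_total = sum(n for v, n in counts.items() if v not in missing)
    rows = []
    for value in sorted(counts):
        n = counts[value]
        if value in missing:
            rows.append((value, "Missing", n, None))
        else:
            label = labels.get(value, str(value))  # unlabelled codes show the raw code
            rows.append((value, label, n, round(100 * n / valid_total, 1)))
    return rows

hv104 = [1, 2, 2, 1, 2, 9, 2, 1]  # made-up data values
rows = frequency_table(hv104, VALUE_LABELS, MISSING)
for row in rows:
    print(row)
```

With an empty `labels` dictionary the same function prints only the raw codes, which reproduces the problem described above: the counts are correct but cannot be interpreted.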
Within Variable View, all properties (or definitions) of the variables, such as name, type and measurement level, can be added, changed or removed as required. By default, PASW automatically assigns the measurement level of imported variables as “scale” for numeric variables and “nominal” for string variables. This is insufficient for some advanced analyses, and thus the measurement level of the variables must be checked and changed where necessary. For example, the measurement level of the variable “HV106” can be changed from nominal to ordinal, which is more suitable for the variable.
1.2 Setting and Editing Metadata through Wizard
PASW provides a wizard-like method for setting variable properties for new variables, and also for checking and editing the properties of existing variables in a dataset. The "Metadata Wizard" can also be applied to imported data files instead of setting the properties manually as described in the previous section. The steps in this procedure are:
1. Click “Data” on the main menu bar; and
2. Select “Define Variable Properties…”.
The “Define Variable Properties” window will pop up for choosing the variables to be defined. For demonstration purposes, just two variables, HV219 “Sex of head of household” and HV104 “Sex of household member”, are selected in the following example.
3. Click the variable name(s) to select the variable(s) to be defined;
4. Double-click or click to move the variable name to the “Variables to scan” pane on the right; repeat Steps 3 and 4 until all required variables have been placed in the right pane;
5. After selecting all variables, click “Continue” to start scanning the variables.
A new “Define Variable Properties” window will appear and show the scanned results for each variable. In this window, one can set: (i) the variable label (type into the blank space provided); (ii) the data type (select from the dropdown), width and decimal places (type in); and (iii) the measurement level (select from the dropdown). After completing the variable HV219, select HV104 and follow the same procedure described in (i), (ii) and (iii). Then:
6. Complete the setting of variable properties by clicking “OK”.
Type-in Variable label and Value label
Click and select measurement level and variable type
Alternatively, after setting the properties for HV219, they can be copied to HV104, since both variables are of the same nature and use the same codes: 1=Male and 2=Female (i.e. the same value labels). To copy variable properties, except the variable label, from HV219 to HV104: (a) Press the “To Other Variables...” button. Then, in the “Apply Labels and Level to” window: (b) Select the variable HV104; and (c) Click “Copy” to copy the variable properties.
All properties of the variable HV219, except the variable label, are copied to HV104. Thus, (d) type in the variable label for HV104, and click “OK” to complete the process. It should be noted that variable properties can be copied only among the variables scanned during the same session.
The dataset will then appear in the Variable View as follows:
Variable properties should be set for all variables in the dataset for easier understanding and effective analyses.
Tip: Sometimes the source data file contains data in text format for some variables, such as “male” or “female” instead of 1 and 0. In this case, it is essential to code such variables for easier analysis. PASW Statistics provides automatic coding through the AUTORECODE command. For detailed information on the AUTORECODE command, please refer to the “Base User Guide” for PASW Statistics 17.0.
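The idea behind automatic recoding can be sketched in Python as follows. This is a simplified illustration and not the AUTORECODE command itself: it assumes that distinct text values are numbered consecutively in sorted order and that blanks are left uncoded, and the function name is hypothetical. The exact behaviour of AUTORECODE should be checked in the user guide.

```python
def autorecode(values):
    """Map distinct strings to consecutive integers 1..n in sorted order.

    Returns the recoded list and a {code: original text} dictionary that
    can serve as value labels. Blank strings are returned as None.
    """
    categories = sorted({v for v in values if v != ""})
    codes = {cat: i for i, cat in enumerate(categories, start=1)}
    recoded = [codes.get(v) for v in values]
    labels = {code: cat for cat, code in codes.items()}
    return recoded, labels

sex_text = ["male", "female", "female", "male", ""]  # made-up data values
sex_code, sex_labels = autorecode(sex_text)
print(sex_code)
print(sex_labels)
```

Note that codes assigned in sorted order ("female" before "male") need not agree with the coding scheme of the questionnaire, such as 1=Male and 2=Female, so the resulting value labels should always be compared with the codebook.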
1.3 Copying File and Variable Properties
Copying variable and file properties from a well-defined data file to another data file is an easy task in PASW Statistics. “Copy Data Properties” in the “Data” menu provides the ability to use an external PASW Statistics data file as a template for defining file and variable properties in the active dataset. Similarly, properties of variables in the active dataset can be copied to other variables in the same dataset. The “Copy Data Properties” wizard allows the user to:
• Copy selected file properties from an external data file or open dataset to the active dataset. File properties include: documents, file labels, multiple response sets, variable sets, and weighting.
• Copy selected variable properties from an external data file or open dataset to matching variables in the active dataset. Variable properties include: value labels, missing values, level of measurement, variable labels, print and write formats, alignment, and column width used in the Data Editor.
• Copy selected variable properties from one variable in (i) an external data file, (ii) an open dataset, or (iii) the active dataset to many variables in the active dataset.
• Create new variables in the active dataset based on selected variables in an external data file or open dataset.
When copying data properties, the following general rules apply:
• If an external data file is used as the source, it must be in PASW Statistics format;
• Undefined (empty) properties in the source dataset do not overwrite defined properties in the target dataset; and
• Variable properties are copied from the source variable only to target variables of a matching type: string (alphanumeric) or numeric (including numeric, date, and currency).
Variable properties can be copied from the source file to matching variables in the active dataset. Variables "match" if both the variable name and type (string or numeric) are the same. For string variables, the defined length must also be the same.
Moreover, variables that are not in the active dataset can be created using the properties of selected variables in the source file. To do this, the source list must be updated to display all variables in the source data file. If source variables are selected that do not exist in the active dataset (based on variable name), new variables will be created in the active dataset with the variable names and properties from the source data file. If the active dataset contains no variables (a blank, new dataset), all variables in the source data file are displayed, and new variables based on the selected source variables are automatically created in the active dataset. This is the easiest way to create a new dataset (like an Excel worksheet) for direct data entry; the dataset without data can also be shared as an electronic codebook.
To copy the data file properties and variable properties, which may be required after importing from other file formats, first select the “Variable View” of the “Data Editor” and follow the steps below:
1. Click “Data” on the main menu bar; and
2. Select “Copy Data Properties…”; the “Copy Data Properties” wizard will appear;
3. Click the “Browse” button in the bottom right area and select the PASW data file to be used as the source of the properties; OR type in the file name with its full path, for example, “C:\PASW Training\Sample\Data1.sav”.
4. Then click “Next” to proceed to Step 2 of the Wizard. The Wizard will scan both the source and target datasets, and display the matching variables from the source file in the left pane and from the active dataset in the right pane. The number of selected variables is displayed at the bottom of the list.
5. Click “Finish” to copy with the default settings, or “Next” to change the settings.
The following settings can be changed in Steps 3 and 4 of the Wizard.
If the Wizard is followed step by step, a summary of what will be copied is displayed in Step 5. After the “Finish” button is pressed, whether at the end of Step 2, 3, 4 or 5, the active dataset will have the selected properties as in the source PASW data file.
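The general rules for copying variable properties (variables match on name and type, string variables must also have the same length, and undefined source properties never overwrite defined target properties) can be expressed as a small Python sketch. The dictionary layout, the function names and the sample values are assumptions made for illustration; this is not how PASW stores its data dictionary.

```python
def is_match(src, tgt):
    """Variables match if the type is the same; strings also need the same width."""
    if src["type"] != tgt["type"]:
        return False
    if src["type"] == "string":
        return src.get("width") == tgt.get("width")
    return True

def copy_variable_properties(source, target):
    """Copy defined properties from source to matching target variables.

    Both arguments are {variable name: {property: value}} dictionaries.
    Returns the names of the variables that received properties.
    """
    copied = []
    for name, src in source.items():
        tgt = target.get(name)
        if tgt is None or not is_match(src, tgt):
            continue  # no variable of that name, or type/width differs
        for prop, value in src.items():
            if prop in ("type", "width"):
                continue
            if value in (None, "", {}, []):
                continue  # undefined in the source: keep the target's value
            tgt[prop] = value
        copied.append(name)
    return copied

# Made-up source (template) and target (active) dictionaries.
source = {
    "HV104": {"type": "numeric", "label": "Sex of household member",
              "value_labels": {1: "Male", 2: "Female"}, "missing": [9]},
    "HV024": {"type": "numeric", "label": "Division", "value_labels": {}, "missing": []},
    "NAME": {"type": "string", "width": 20, "label": "Name"},
}
target = {
    "HV104": {"type": "numeric", "label": "", "value_labels": {}, "missing": []},
    "HV024": {"type": "numeric", "label": "", "value_labels": {1: "A"}, "missing": []},
    "NAME": {"type": "string", "width": 10, "label": ""},
}
print(copy_variable_properties(source, target))
print(target["HV104"])
```

In this example HV024 keeps its existing value labels because the source has none defined, and NAME is skipped because the string lengths differ.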
Alternatively, properties can be copied from an open dataset, if more than one dataset is open. Just select “An open dataset” as the “Source of the properties” in Step 1, and follow the same steps. Here, new variables from the source dataset will be added to the active dataset if “Create matching variables in the active dataset if they do not already exist” is ticked. All variables (press <Ctrl>A) or only some variables (click the variable names while holding the <Control> key) can be selected from the source list. In this case, the bottom of the list for the active dataset will display both (i) the number of matching variables, 12 in this example; and (ii) the number of variables to be created, 10 in this example.
In the above example, 10 new variables will be added to the active dataset, with the same variable names and properties, by copying the properties of all variables from the source dataset. It should be noted that the data values are not copied to the active dataset. PASW Statistics also allows variable properties to be copied from one variable to another in the same dataset. For example, in the sample dataset, two variables, sex of head of household (HV219) and sex of household member (HV104), share the same codes, “1=Male” and “2=Female”, and 9 as the missing value. If the codes have been entered and the missing value has been defined for the head of household (HV219), those properties can be copied to household member (HV104). To do this, select the third option under “Choose the source of the properties”, which is “The active dataset”, in Step 1 of the Wizard. Then click the source variable, and then click the target variable(s). As usual, the user must hold the <Control> key while clicking additional variable names. After selecting all target variables, just click “Finish” to begin the copying process.
Note: With this option, the variable labels of the target variables become the same as that of the source variable, so the user must type in appropriate variable labels for the target variables.
Surveys can provide very rich information. However, most survey datasets are not yet ready for analysis and for producing the output tables needed to construct EFA monitoring indicators.
Preparing for data analysis
The following two steps are essential after setting variable properties, in order to conduct an appropriate and productive data analysis: (1) list the prospective outputs and lay out suitable analytical methods; and (2) check which outputs can be generated directly from the existing datasets, and which may require further manipulation such as sorting; calculation or creation of new variables (temporary or permanent); transformation (coding, grouping, etc.); and creation of new datasets (aggregating, subsetting and merging the existing datasets).
Example: The working dataset contains data extracted from a household survey, with personal records of all household members and the variables age, sex, schooling status, and the class/grade currently attended. The requirement is to produce the “age-specific enrolment rate (ASER) for children aged 6 to 14 by sex” from this dataset. It is impossible to compute the ASER directly from the working dataset, since (a) the total number of children aged 6 to 14 by single year of age and sex (the denominator), and (b) the number of children aged 6 to 14 currently attending school by single year of age and sex (the numerator) are not available in the current dataset. This task requires the following steps:
(a) Extracting the cases aged 6 to 14 only;
(b) Counting all children, whether attending school or not, by age and sex, for the denominator;
(c) Counting the children who are currently attending school by age and sex, for the numerator; and
(d) Calculating the ASER by age and sex.
Step (a) can be carried out with the “case selection” command, the “aggregate” command is suitable for Steps (b) and (c), and the “compute” command creates the new variable, ASER, in Step (d).
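The four steps of the ASER example can be sketched in Python to show how selection, aggregation and computation fit together. This is an illustration under stated assumptions, not PASW syntax: the field names, the function name and the records are made up, the rate is expressed as a percentage, and sample weights are ignored.

```python
from collections import defaultdict

def aser_by_age_sex(members, min_age=6, max_age=14):
    """Age-specific enrolment rate (percent) by single year of age and sex."""
    total = defaultdict(int)      # denominator: all children in the age range
    attending = defaultdict(int)  # numerator: children currently attending school
    for m in members:
        if not (min_age <= m["age"] <= max_age):  # step (a): select cases
            continue
        key = (m["age"], m["sex"])
        total[key] += 1                            # step (b): count all children
        if m["attending"]:
            attending[key] += 1                    # step (c): count those attending
    # step (d): compute the rate for every age/sex group
    return {key: 100.0 * attending[key] / total[key] for key in sorted(total)}

# Made-up household member records.
members = [
    {"age": 6, "sex": "F", "attending": True},
    {"age": 6, "sex": "F", "attending": False},
    {"age": 6, "sex": "M", "attending": True},
    {"age": 14, "sex": "M", "attending": False},
    {"age": 15, "sex": "F", "attending": True},   # outside the age range
    {"age": 5, "sex": "M", "attending": False},   # outside the age range
]
aser = aser_by_age_sex(members)
print(aser)
```

In a real survey analysis the counts in steps (b) and (c) would normally be weighted by the sample weight, which this sketch leaves out for brevity.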
PASW allows data transformations ranging from as simple as collapsing categories for analysis, to more advanced tasks, such as creating new variables based on complex equations and conditional statements. In this chapter some important techniques of data manipulation and transformation will be discussed.
2.1 Changing, Inserting and Deleting Data, Cases and Variables
In the PASW Statistics Data Editor, it is simple to change the value of a specific cell, or the properties of a variable, such as name, type, label, value labels and measurement scale.
Changing the identification (or properties) of a variable: To change a property of a variable, for example the variable name, select the cell with the variable name to be changed in “Variable View” and type in the new name. All variable properties can be changed in this way in “Variable View”. Caution is needed when changing variable types: if a string variable is changed to numeric, all alphanumeric data values become missing values (“.”), and only blanks (zero-length strings) are obtained if the variable is changed back to string type later. This may also happen with some other data types. To change data values, select “Data View”, locate the cell and type in the new value, one cell after another, as in a spreadsheet program.
Adding variables or cases to an existing dataset: For example, a variable for education level, “EdLevel”, should be added to give a better understanding of the educational attainment of all household members. To add a new variable, select “Variable View” and right-click the row number where the new variable is to be inserted. PASW Statistics will insert the variable before the existing variable on that row, with the name “Var00001”, “Var00002”, “Var00003”, and so on. The type of a newly created variable is numeric with F8.2 format (8 digits, 2 decimal places). There will be no variable label and no value labels. The user can input or import the variable attributes, as presented in the above section, for new variables, including variable name, type, width and decimal places, variable label and measurement level. Value labels should also be defined where applicable.
Select row and click RIGHT mouse to get pop-up, and click “Insert Variable”
Type in variable name, and edit properties as necessary!
A new variable can also be inserted in “Data View” by right-clicking the name of the existing variable before which the new variable is to be inserted. Then, go to “Variable View” and change the properties. To delete a variable, click the variable name (in Data View) or the row number (in Variable View) and press the “Delete” key. Inserting cases can be carried out only in “Data View”. Select the row (or several contiguous rows) where the new case(s) are to be inserted, right-click and select “Insert Cases”. Similarly, selecting case(s) and pressing the “Delete” key will delete the selected cases. Alternatively, use the Clear command in the Edit menu.
2.2 Computing New Variables
Use Compute to get values for a variable, either an existing one or a newly created one, based on numeric transformations of other variables. Creating new variables from existing variables is a common and essential task in data analysis.
Example: In many annual school censuses, the total service of primary school teachers is recorded in months for better accuracy. However, it often has to be summarized or related to other variables in years. A new variable “service in years” must then be computed as “service in months” divided by 12.
Case study: The sample dataset extracted from the “Bangladesh Demographic and Health Survey 2007” contains the highest education level (HV106) and the highest year of education (HV107) for all household members. However, there is no educational attainment in the usual “Grade” or “Grade-level” form, that is, “Primary 2” or “Secondary 4” and so on. To study the highest grade-level attended by adult household members (aged 15 and above), a new variable “Grade” must be calculated from the two existing variables as: Grade = HV106 * 10 + HV107, for HV106 = 0, 1, 2, 3 and HV107 not equal to 98; and Grade = Missing, if HV106 = 8 (Don’t know) or HV107 = 98 (Don’t know).
To calculate the new variable “Grade”, the “Compute Variable” command is available under the “Transform” menu in the Data Editor. To create a new variable:
1. Click “Transform” on the main menu bar; and
2. Click the “Compute Variable” item, and the “Compute Variable” window will appear.
Compute only for the cases which are not “unknown” for both education variables and Age > 15,
3. Fill-in “Target Variable” name, and optionally, the type and label of new variable can also be set by clicking the button under target variable name;
4. Set the numeric expression using the existing variables together with numbers, PASW Statistics built-in functions, and operators such as +, -, >, <, etc.;
5. If only the cases which meet certain criteria are to be included, press the button located at the lower left corner of the window and fill in the conditions; and
6. Click “OK” to complete the task.
A new variable, “Grade”, has been added to the current dataset, at the end of the variable list. Although a new variable name was provided here, the result variable of the “Compute” command can also take an existing variable name. After creating a new variable, it is important to define it thoroughly by setting labels, missing values and measurement level.
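The “Grade” rule from the case study can be written as a small function to make the conditions explicit. This is a Python sketch rather than PASW syntax; the function name and the use of None for a missing value are assumptions made for illustration.

```python
def compute_grade(age, hv106, hv107):
    """Grade = HV106 * 10 + HV107 for adults with known education codes.

    Returns None (standing in for a missing value) when the case is
    younger than 15, when HV106 is not one of 0-3 (e.g. 8 = don't know),
    or when HV107 is absent or 98 (don't know).
    """
    if age is None or age < 15:
        return None
    if hv106 not in (0, 1, 2, 3):
        return None
    if hv107 is None or hv107 == 98:
        return None
    return hv106 * 10 + hv107

# Hypothetical cases: (age, HV106, HV107)
examples = [(20, 1, 2), (30, 2, 4), (20, 8, 0), (20, 1, 98), (10, 1, 3)]
grades = [compute_grade(*case) for case in examples]
print(grades)
```

A result of 12 reads as “Primary 2” and 24 as “Secondary 4”, following the coding described in the case study.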
2.3 Recoding
RECODE changes, rearranges, or consolidates the values of an existing variable. RECODE can be executed on a value-by-value basis or for a range of values. Recoding is a common task in data preparation. Sometimes, values (or categories or codes) in a nominal or ordinal variable require regrouping for further analyses. For example, grouping single-year population into school-going age groups is essential to calculate education indicators. Sometimes, data entered in text format, for example area names, should be changed into numeric values for ease of analysis. These tasks can be carried out with the following PASW commands: 1. Automatic Recode; 2. Recode into Same Variables; and 3. Recode into Different Variables.
Automatic Recode
This is useful for string variables with a limited number of different values, for example, male or female; urban, suburban, rural or remote. When the existing categorization of a variable is no longer needed after recoding, the “Recode into Same Variables” option can be selected; otherwise select “Recode into Different Variables” to keep the original variable. To perform automatic recoding:
1. Click “Transform” on the main menu bar;
2. Click “Automatic Recode”, and a new window will appear;
3. Select one variable and send it to the area under “Variable -> New Name”;
4. Type an appropriate name for the recoded variable in the “New Name” box;
5. Click the “Add New Name” button; repeat Steps 3, 4, and 5 for all variables to recode;
6. Select whether to recode starting from the “Lowest value” or the “Highest value”;
7. Select whether to “use the same recoding scheme for all (selected) variables”, and whether to “treat string values as user-missing” or not; and
8. Click “OK” to complete the task.
Then, two new variables “Division” and “SES” will be added to the current dataset with the following coding schemes (codes and value labels).
In some cases, more than one variable shares the same values; for example, “Sex of head of household (HV219)” and “Sex of household member (HV104)” must have only two valid values, “Male” and “Female”. Similarly, several variables can take just “Yes”, “No” and non-response or missing value; for example, “Usual resident (HV102)”, “Slept last night (HV103)” and “Member still in school (HV110)” are such variables in the sample dataset. To recode such a group of variables, just tick the checkbox “Use the same recoding scheme for all (selected) variables” in Step 7. The following exhibit shows the automatic recoding of two variables, HV103 and HV102.
All properties are the same for both variables
“Automatic Recode” is simple and useful for exploring a newly imported file, or for beginners.
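The idea behind automatic recoding can be illustrated in a few lines: the distinct string values are sorted, numbered consecutively, and the original strings are kept as value labels. The sketch below is Python, not PASW syntax; the function name is hypothetical, and the handling of blank strings and user-missing values is deliberately left out.

```python
def auto_recode(values, start_from_highest=False):
    """Assign consecutive integer codes (1, 2, 3, ...) to distinct strings.

    Returns (codes, value_labels): the recoded list and a dictionary
    mapping each new code back to the original string.
    """
    distinct = sorted(set(values), reverse=start_from_highest)
    code_of = {text: code for code, text in enumerate(distinct, start=1)}
    value_labels = {code: text for text, code in code_of.items()}
    return [code_of[text] for text in values], value_labels

# Hypothetical string variable with a limited number of categories
area = ["Rural", "Urban", "Rural", "Remote"]
codes, labels = auto_recode(area)
print(codes)
print(labels)
```

Applying one shared `code_of` mapping to several variables would correspond to the “use the same recoding scheme” option.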
Recode into Different Variables
“Recode into Different Variables” is the most useful recoding procedure for general users. In this procedure, users can set all the recode options, and both the old and the new variables are kept in the dataset. Before manual recoding, it is important to look at the frequency distribution of the variable under study. The variable “Highest education level (HV106)” will be used as an example in this section. The frequency table for the variable HV106 is as follows:
Here, 6 different items, “9”, “DK”, “Higher”, “No education, preschool”, “Primary” and “Secondary”, are listed as valid values of the variable. According to the codebook of the DHS survey, “9” represents a missing value and “DK” represents “Do not know”. Since the variable under study is educational attainment, it is valid only for those aged 6 and above. Thus, it is logical to code as follows for the population (household members) aged 6 and above:
0 = No education, preschool
1 = Primary
2 = Secondary
3 = Higher
8 = DK
9 = (system) missing value
To do this:
1. Click “Transform” on the main menu bar;
2. Click “Recode into Different Variables” and a new window will appear;
3. Select the variable “Highest education level (HV106)” and send it to the area “Input Variable -> Output Variable:”;
4. Input a new “Name” and an appropriate variable “Label” for the output variable, and click the “Change” button to set the new variable name and label;
5. Click the “Old and New Values” button and a new window will appear. In the “Old and New Values” window: (i) type in the old value (or a range), e.g. “Primary”; (ii) type in the new value, e.g. “1”; (iii) press the “Add” button to add the transformation rule; and (iv) repeat the above steps for all pairs of values and click “Continue” to complete the selection and return to the main recode window;
6. Click the “If…” button and a new window will appear for case selection. In the “If Cases” window: (a) select the “Include if case satisfies condition:” button; (b) construct (or type in) the condition, e.g. “HV105 > 5”; and (c) click “Continue” to return to the main recode window; and
7. Click “OK” in the “Recode into Different Variables” window to complete the task.
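The old-to-new value pairs and the “If” condition of this recode can be summarized as a lookup table plus a guard. The following Python sketch is only an illustration of the logic; the names RECODE_MAP and recode_ed_level are hypothetical, and None stands in for a system-missing value.

```python
# Old (text) value -> new (numeric) code, as listed in the coding scheme.
# The old value "9" is intentionally absent so that it becomes missing.
RECODE_MAP = {
    "No education, preschool": 0,
    "Primary": 1,
    "Secondary": 2,
    "Higher": 3,
    "DK": 8,
}

def recode_ed_level(hv106_text, hv105_age):
    """Recode HV106 into a new numeric variable for cases with HV105 > 5."""
    # "If" condition: only household members aged 6 and above are recoded
    if hv105_age is None or hv105_age <= 5:
        return None
    # Values without a rule (including "9") fall through to missing
    return RECODE_MAP.get(hv106_text)

print(recode_ed_level("Primary", 10))
print(recode_ed_level("Primary", 4))
```

Because the original HV106 value is passed in and never modified, the old variable stays intact, which is the point of recoding into a different variable.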
Step 5 (ii)
Step 6 (a) (b)
After creating a new variable with recode command, all necessary properties must be set to the new variable, such as variable format (type, width and decimal places), value labels, missing values, etc. The new variable can be observed as following:
Since age (HV105) is < 6 yr, EdLevel is “Missing” Since age (HV105) is > 6 yr, EdLevel code is “0”
Just set width and decimal places
No value labels yet!
Similar steps are carried out for “Recode into Same Variables”.
Visual Binning PASW Statistics also provides “Visual Binning” under “Transform” menu to perform automatic creation of new variables based on grouping contiguous values of existing variables into a limited number of distinct categories. Visual Binning can assist to: • Create categorical variables from continuous scale variables. For example, a scale variable “age” to create a new categorical variable that contains 5-year age groups. • Collapse a large number of ordinal categories into a smaller set of categories. For example, collapse the twenty 5-year age groups into 5 groups: 0-19, 20-39, 40-59, 60-79, and 80+. To conduct visual binning, first select a scale variable (HV105 Age of household members) and follow the steps below: 1. Click “Transform” on main menu bar; and 2. Click “Visual Binning” and a new window will appear; 3. In the “Visual Binning” window: (i) select the scale variable(s) to bin and move those variables into “Variables to Bin” pane; and (ii) click “Continue” button when complete selecting;
Step 3 2 (i)
PASW Statistics will analyze the selected variables, and present a graphical distribution of the variable after binning in the new “Visual Binning” window. Here, 4. Input an appropriate “name” for the binned variable; 5. Input variable “label” for the binned variable; and 6. Click on the “Make Cutpoints…” button to define cutting points for the binning; and “Make Cutpoints” window will appear to set cutpoints; Cut points can be constructed based on three options: (i) equal width intervals; (ii) equal percentiles based on scanned cases; and (iii) cutpoints at mean and selected standard deviations (1 or 2 or 3 SD) based on scanned cases. Generally, making cutpoints with equal width intervals is more common and suitable in analyzing household surveys on education.
In the “Make Cutpoints” window: 7. Input “4” as first cutpoint location since the first age group of common 5-year interval is 0-4; 8. Input “5” as the Width (or class interval), and the “number of cutpoints” will be filled automatically, 19 in this example; 9. Click “Apply” and Visual Binning window will appear with set intervals. Then, in the main “Visual Binning” window: 10. Click “Make Labels” button to generate value labels automatically and the user can change labels as appropriate; and 11. Finally, click “OK” to create a new binned variable called “Age”. As usual, properties of the new binned variable must be checked and changed as necessary.
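The effect of equal-width cutpoints can be checked with a short calculation. The sketch below is Python, not PASW syntax; it assumes integer ages and assumes each cutpoint is the upper end of its interval (so the first bin holds ages 0-4). The function names are hypothetical.

```python
from bisect import bisect_left

def make_cutpoints(first=4, width=5, count=19):
    """Equal-width cutpoints: 4, 9, 14, ..., 94 for the defaults."""
    return [first + width * i for i in range(count)]

def bin_age(age, cutpoints):
    """Return the 1-based bin number; each cutpoint closes its bin."""
    return bisect_left(cutpoints, age) + 1

def bin_label(bin_no, cutpoints):
    """Build a label such as '5 - 9' for integer-valued ages."""
    if bin_no == 1:
        return f"0 - {cutpoints[0]}"
    if bin_no == len(cutpoints) + 1:
        return f"{cutpoints[-1] + 1}+"
    return f"{cutpoints[bin_no - 2] + 1} - {cutpoints[bin_no - 1]}"

cutpoints = make_cutpoints()
for age in (0, 4, 5, 37, 94, 95):
    b = bin_age(age, cutpoints)
    print(age, b, bin_label(b, cutpoints))
```

Nineteen cutpoints produce twenty bins, the last one being open-ended.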
The frequency table of the variable “Age” is as follows:
For effective data analysis, users must prepare the dataset efficiently. The most frequently used data preparation techniques include sorting and selecting cases. After checking and editing the dataset, setting the variable properties, and recoding as necessary, the dataset is ready for the preparation of data analyses. Before making any analysis: (1) list the prospective outputs and lay out suitable analytical methods for them; and (2) check which outputs can be generated directly from the existing datasets, and which may require further manipulation such as sorting; calculation/creation of new variables (temporary and/or permanent); transformation (coding, grouping, etc.); and creation of new datasets (aggregation, subsetting and merging the existing datasets).
Example: The working dataset contains data extracted from a household survey with personal records of all household members. The variables include: age, sex, schooling status, and the class/grade currently attending; and the requirement is to produce the “age-specific enrolment rate (ASER) for the children aged 6 to 14 by sex”. In this situation, it is impossible to compute ASER directly from the working dataset, since the analyst first needs a dataset with: (a) the total number of children aged 6 to 14 by single year of age by sex [the denominator]; and (b) the number of children aged 6 to 14 who are currently attending school by single year of age by sex [the numerator]. This requires: (a) selection of cases (extracting the cases aged 6-14); (b) aggregation of personal data to get grouped data by age and sex, that is, counting all children, whether attending school or not, and the children who are currently attending school, by age and sex; and (c) calculation of ASER by age and sex. [Note: The calculation is much easier and simpler if the “Custom Tables” option is installed.]
PASW Statistics allows data transformations ranging from as simple as collapsing categories for analysis, to more advanced tasks, such as creating new variables based on complex equations and conditional statements. In this chapter some important techniques of data manipulation and transformation will be discussed.
3.1 Selecting Cases
Select Cases provides several methods for selecting a subgroup of cases based on criteria that include variables and complex expressions. Users can also select a random sample of cases. Selection of cases is essential whenever a specific subset of data is to be analyzed based on set criteria, for example, to study the percentage of “out-of-school girls aged 6-14”. To do this:
1. Click “Data” on the main menu bar;
2. Click “Select Cases”, which is the second last item on the list;
3. The “Select Cases” window will appear; select “If condition is satisfied”;
4. Click the “If” button and a new window, “Select Cases: If”, will appear;
5. Construct the selection statement using variables, operators and functions; then click “Continue”;
6. Select the output option: i. Filter out unselected cases; ii. Copy selected cases to a new dataset (provide the new dataset name); or iii. Delete unselected cases; and
7. Click the “OK” button and a new Data Editor window will appear with the selected cases.
There are three output options:
“Filter out unselected cases”: cross-signs (X) are put on the unselected cases, as the following picture shows. The unselected cases will not be used in subsequent analyses; to return to the full dataset, run Select Cases again with the “All cases” option.
“Copy selected cases to a new dataset”: this creates a new dataset and leaves the current dataset intact. Users can switch between the original dataset and the newly created dataset, or use both datasets together through PASW syntax.
“Delete unselected cases”: this deletes all unselected cases from the current dataset. With this option the original dataset cannot be recovered, so it is important to save the original dataset beforehand; the sub-dataset containing only the selected cases should also be saved under an appropriate name as soon as the selection process is complete.
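The difference between the three output options can be shown on a small list of records. This is a Python sketch with hypothetical field names and values, not PASW syntax; the selection condition is “girls aged 6-14”, from which the share of out-of-school girls is then calculated.

```python
# Hypothetical household-member records
cases = [
    {"id": 1, "age": 5, "sex": "F", "in_school": False},
    {"id": 2, "age": 8, "sex": "F", "in_school": False},
    {"id": 3, "age": 10, "sex": "M", "in_school": False},
    {"id": 4, "age": 14, "sex": "F", "in_school": True},
    {"id": 5, "age": 12, "sex": "F", "in_school": False},
]

def is_selected(case):
    """Selection condition: girls aged 6 to 14."""
    return case["sex"] == "F" and 6 <= case["age"] <= 14

# Option i: filter. All cases stay; a flag marks which ones are used.
filter_flags = [is_selected(c) for c in cases]

# Option ii: copy the selected cases to a new dataset; original untouched.
new_dataset = [dict(c) for c in cases if is_selected(c)]

# Option iii: delete unselected cases (done on a working copy here,
# because the deletion cannot be undone on the original).
working = list(cases)
working[:] = [c for c in working if is_selected(c)]

out_of_school = sum(1 for c in new_dataset if not c["in_school"])
pct_out_of_school = 100.0 * out_of_school / len(new_dataset)
print(filter_flags, round(pct_out_of_school, 1))
```

Only the third option shrinks the dataset being worked on, which is why saving the original first matters.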
Unselected cases Selected cases
The following cross-tabulation provides the percentage of out-of-school girls aged 6-14 in single year.
3.2 Sorting Cases
SORT CASES reorders the sequence of cases in the active dataset based on the values of one or more variables. Cases can be sorted in ascending or descending order, or in a combination of ascending and descending order for different variables, based on one to all variables in the dataset. In the sample dataset, households can be sorted by wealth index to observe the characteristics of households of similar wealth status. Moreover, some PASW Statistics commands require a pre-sorted dataset; for example, the “aggregate” command requires a dataset sorted by the break variable(s). Sorting can be carried out through the “Sort Cases” command under the “Data” menu as follows:
1. Click “Data” on the main menu bar;
2. Click “Sort Cases”, and the “Sort Cases” window will appear;
3. Select the first key variable, send it to the “Sort by” pane and set the “Sort Order”; repeat Step 3 for all key variables in order of importance; and
4. Click the “OK” button to start sorting.
The following example sorts the current dataset by two variables: “Education in single year (HV108)” in ascending order and “Age of head of household (HV220)” in descending order.
Sorted data:
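A two-key sort with mixed directions, as in the HV108/HV220 example, can be sketched as follows. This is Python rather than PASW syntax, and the five records are hypothetical values used only to show the resulting order.

```python
# Hypothetical cases with the two sort keys from the example
cases = [
    {"HV108": 5, "HV220": 40},
    {"HV108": 0, "HV220": 60},
    {"HV108": 5, "HV220": 55},
    {"HV108": 0, "HV220": 35},
    {"HV108": 10, "HV220": 30},
]

# First key HV108 ascending; second key HV220 descending.
# Negating a numeric key reverses its order within ties of the first key.
sorted_cases = sorted(cases, key=lambda c: (c["HV108"], -c["HV220"]))

for c in sorted_cases:
    print(c["HV108"], c["HV220"])
```

The order of the keys in the tuple plays the same role as the order of variables in the “Sort by” pane.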
3.3 Rearranging Variables
Relocating variables does not have any impact on the results of data analyses. However, it makes it easier to decide which variables to use to get the required outputs. Sometimes the original dataset does not provide the variables in a convenient order; for example, education-related variables may be spread over several locations. On other occasions, linked variables are so far apart that the linkage cannot be observed visually. In such cases, the associated or linked variables, or the variables under investigation, can be grouped into a new dataset or moved to the top of the variable list.
Relocating Variables
To move a variable from its current position to a new one, just click the selected variable and drag-and-drop it at the desired position in “Variable View” or “Data View”. For example, to place “Line number of head of household (HV218)” in the second position in the list:
1. Select the variable by clicking on the row number (HV218 at row 6) in Variable View; and
2. “Drag and drop” it at the desired location (in this example, after the first variable in the list). A red hairline shows the position where the dragged variable will land if the user drops it at that moment.
Thin Red Line shows the destination
At new location after moving
Variable Sets
When a dataset contains many variables, it is recommended to define and use “Variable Sets”. Define Variable Sets, under the Utilities menu, creates subsets of variables to display in the Data Editor and in the variable lists of dialog boxes. Defined variable sets are saved with PASW format data files. A variable set can be defined with any combination of numeric and string variables, and a variable can belong to multiple sets. The order of variables in the set has no effect on the display order of the variables in the Data Editor or in the variable lists of dialog boxes. In the following example, two variable sets, “Education” and “HH_Head”, are defined, with nine variables in the “Education” set and eight in the other, and four variables in common. To create a variable set:
1. Click “Utilities” on the main menu bar;
2. Click “Define Variable Sets”, and a new window will appear;
3. In the “Define Variable Sets” window, first enter the set name following the PASW naming convention (up to 64 bytes long; any characters, including blanks, are valid);
4. Select and put variables into the “Variables in Set” pane;
5. Click the “Add Set” button to create the variable set; define as many sets as needed by repeating Steps 3-5; and
6. Click the “Close” button to complete the creation of variable sets.
It is strongly recommended to save the dataset under a new name after defining the variable sets. In this example, the dataset is saved as “BDPR50FL2.sav”.
To use a variable set: 1. Click “Utilities” on main menu bar; and 2. Click “Use Variable Sets”, and a new window with the list of variable sets will appear; The list of available variable sets includes all variable sets defined, plus two built-in sets: (i) ALLVARIABLES: contains all variables in the data file, including new variables created during a session; (ii) NEWVARIABLES: contains only new variables created during the current session; (iii) Education: the first user-defined variable set containing 9 variables; and (iv) HH_Head: the second user-defined variable set containing 9 variables. 3. In “Use Variable Sets” window, first, check the desired variable set(s) and uncheck all others under “Select variable sets to apply”; At least one variable set must be selected. If ALLVARIABLES is selected, any other selected sets will not have any effect, since this set contains all variables. In this example, “Education” variable set is selected. 4. Click “OK” to complete selection and the following new Data View will appear. 5. To get all variables back, click “Show All Variables” under Utilities menu.
Display 9 variables of Education set
4. DATA VALIDATION
Validate Data helps to identify suspicious and invalid cases, variables, and data values in the active dataset.
Why is data validation required? With rapidly expanding computing power and increasing storage capacity at reasonable cost, many recent surveys have been designed to collect more items (which results in more variables) with better coverage (i.e., a larger sample size and thus more cases in PASW). This creates more workload for the data handlers: coding staff, entry clerks, and data editors. Generally, with time pressure to complete the task on the one hand and inefficiencies in the training and recruitment of staff on the other, the quality of the data transmitted from data manager to analyst is in question. In some cases, surveys were planned without a step to check the coding, and with no step at all to verify the data entered. Education data analysts are expected to obtain survey data concerning education from various sources, and thus have no way to recheck the coding or the data entry. Therefore, it is important to use validation rules to check the validity and consistency of the data before using the dataset.
Validation rules
Generally, there are three types of rules for validating a dataset: 1. Single-variable rules; 2. Cross-variable rules; and 3. Multi-case rules. In PASW Statistics 17.0, these rules are not available in the base system, but are part of the optional “Data Preparation” add-on module. However, these tasks can be carried out through common PASW Statistics commands. This is easier if the user understands the PASW syntax (programming) language. The first two types, single-variable rules and cross-variable rules, require an understanding of “case selection”, which was discussed in the previous section. The third type of rule is more complicated and may need several steps of data manipulation, such as creating temporary variables, matching, aggregation and selection of cases.
PASW Statistics provided a procedure: “Identify Duplicate Cases” in “Data” menu to identify duplicate cases in a data file which is the most important part of the third, multi-case rules. This section will introduce simplest data validation procedures, but those are powerful in pointing out improper or invalid cases and values.
4.1 Validation with Single-Variable Rules
Validation rules which check internal inconsistencies, such as invalid values and cases within a variable, are known as Single-Variable Rules. These rules consist of a set of checks applied to one variable. Normally, checks for out-of-range or invalid values and for missing values fall into this category. Examples are a value of 5 entered for the “highest education level (HV106)”, where the only valid codes are 0, 1, 2, 3 and 8; or values other than 1 and 2 (or “Male” and “Female”) entered in the variable “sex of household members (HV104)”.
Checking validity consists of three stages, followed by editing of the invalid cases. The first stage in validating a variable is obtaining the valid values or ranges from the codebook; for example, the valid values for HV104 (sex) are 1 and 2 only, so any other values are invalid. The second stage is constructing a frequency table. If no invalid values are displayed in the frequency table, the variable under observation is “valid” under the single-variable rule. If irrelevant values are observed in the frequency table, for example “3” in the variable representing sex, it is necessary to identify where these erroneous cases are. Thus, the third stage is using “select cases” to separate out and inspect the irrelevant cases. To check the validity of “sex of household members (HV104)”, follow the steps:
1. Click “Analyze” on the main menu bar;
2. Click “Descriptive Statistics”;
Invalid values for Sex
3. Then, click “Frequencies”; and
4. In the “Frequencies” window, select the variable to study (HV104) and click “OK” to construct the frequency table.
In the above frequency table, 5 cases with the values 3, 4, and 5 are invalid. Therefore, it is necessary to check which cases contain such invalid values by conducting the third stage: “case selection” of the invalid cases. To select invalid cases:
1. Click “Data” on the main menu bar;
2. Click “Select Cases”;
3. In the “Select Cases” window, check the option button “If condition is satisfied” and click the “If” button;
4. In the “Select Cases: If” window, type in the criterion “not (HV104=1 or HV104=2)” or “~(HV104=1 | HV104=2)” and click “Continue”;
5. Check the “Copy selected cases to a new dataset” option button and provide the new dataset name, e.g. “Invalid_Cases”; and
6. Click “OK” to execute the case selection command.
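The three stages of a single-variable check (valid codes from the codebook, frequency table, selection of invalid cases) can be sketched as below. This is Python, not PASW syntax; the case numbers and values are hypothetical, and treating a missing value (None) as “not selected” is an assumption about how the condition should behave.

```python
from collections import Counter

# Stage 1: valid codes for HV104 (sex), taken from the codebook
VALID_SEX = {1, 2}

# Hypothetical cases: (case number, HV104)
cases = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 5), (6, None), (7, 4)]

# Stage 2: frequency table of the observed values
frequencies = Counter(value for _, value in cases)

# Stage 3: select cases where not (HV104 = 1 or HV104 = 2).
# Missing values are left out here and would be reviewed separately.
invalid_cases = [
    (case_no, value)
    for case_no, value in cases
    if value is not None and value not in VALID_SEX
]

print(dict(frequencies))
print(invalid_cases)
```

Keeping the case number next to each invalid value makes it possible to go back to the questionnaire or source file for correction.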
Set in Step 4
The output, new dataset contains only 5 invalid cases (after moving variable HV104 to second position to get a better view) as below:
Invalid values for Sex
In this case, the user must decide, and act, whether to erase the entire case from the dataset, to change the invalid values to “missing values”, or to check other datasets which hold different values and correct the invalid values in the current dataset.
4.2 Cross-Variable Rules
Rules for checking inconsistencies in a variable through the values of other variables in the same case are called Cross-Variable Rules. With cross-variable rules, users have to use cross-tabulations instead of frequency tables to determine whether invalid cases exist, and to apply a slightly different rule for the conditional selection of invalid cases. In the sample dataset, the “highest educational level (HV106)” has no invalid cases if it is checked alone using the frequency tables command. However, when it is cross-checked against the “age of the household members (HV105)”, there are a few suspicious entries, as follows:
NO visible invalid values
Reference Invalid On the margin Valid
From the above cross-tabulation of age and highest education level, one can easily judge that 2 cases of “age 4 in primary education” and 1 case of “age 12 in higher education” are invalid. Moreover, there are a few more cases which are not reliable (or on the margin) in all education levels. There are a few options for developing cross-variable validation rules:
Option 1: to sift out all suspicious cases (invalid and marginal ones): i) with primary education at age 5 or below (the official entrance age is 6); ii) with secondary education at age 10 or below (the official starting age is 6+5=11); and iii) with higher education at age 15 or below (the official starting age is 6+5+5=16).
Option 2: to review only the certainly invalid cases, one can use the following cross-variable rules with a grace period (early entrance) of one year: i) with primary education at age 4 or below (the official entrance age is 6, but 5 can be allowed); ii) with secondary education at age 9 or below (the official starting age is 6+5=11); and iii) with higher education at age 14 or below (the official starting age is 6+5+5=16).
Then, the “If” statements to be used in case selection are:
Option 1: (HV105 <= 5 and HV106 = 1) or (HV105 <= 10 and HV106 = 2) or (HV105 <= 15 and HV106 = 3)
Option 2: (HV105 < 5 and HV106 = 1) or (HV105 < 10 and HV106 = 2) or (HV105 < 15 and HV106 = 3)
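The two “If” statements translate directly into two predicates, which makes it easy to see which ages fall between the options. The following is a Python sketch, not PASW syntax; the function names are hypothetical, and the codes 1, 2, 3 for primary, secondary and higher education follow the HV106 coding used above.

```python
def is_suspicious_option1(hv105, hv106):
    """Option 1: invalid and marginal cases."""
    return (
        (hv105 <= 5 and hv106 == 1)
        or (hv105 <= 10 and hv106 == 2)
        or (hv105 <= 15 and hv106 == 3)
    )

def is_invalid_option2(hv105, hv106):
    """Option 2: certainly invalid cases (one-year grace period)."""
    return (
        (hv105 < 5 and hv106 == 1)
        or (hv105 < 10 and hv106 == 2)
        or (hv105 < 15 and hv106 == 3)
    )

# Hypothetical (age, education level) pairs
pairs = [(4, 1), (5, 1), (10, 2), (12, 3), (16, 3)]
for age, level in pairs:
    print(age, level,
          is_suspicious_option1(age, level),
          is_invalid_option2(age, level))
```

Cases flagged by Option 1 but not by Option 2, such as age 5 in primary education, are the marginal ones.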
And the following outputs will be obtained after running appropriate case selection procedures as presented in the previous section. Option 1: Both invalid and marginal cases
Case Number Age
Option 2: Only certainly invalid cases
Ed. Level to be checked & corrected
4.3 Multi-Case Rules
A Multi-Case Rule is a user-defined rule that can be applied to a single variable or a combination of variables in a group of cases. Multi-case rules are defined by a procedure (a sequence of logical expressions) that flags invalid cases. The most common and useful application of multi-case rules is checking whether there are duplicates in the dataset: a household member entered twice or more, two heads in a single household, two persons in the same household with the same personal ID, and so on. PASW Statistics allows checking for duplicate cases and inspection of unusual cases. Follow the steps below to check for duplicate cases:
1. Click “Data” on the main menu bar;
2. Select “Identify Duplicate Cases”, and a new window will appear;
3. Select the variables used to identify duplicate cases (or press Ctrl+A to select all and release the unnecessary variables) and send them to the space below “Define matching cases by:”;
4. Set the options: (a) “Sort within matching groups”: select the variable(s) from the remaining ones in the list as the key for sorting within the matching groups; (b) “Sort”: if a key variable for sorting is selected, define the sort order; (c) “Variables to create”: tick the check box if a frequency table showing how many duplicates are detected is wanted, or to point out which cases are the duplicates; then also specify: i. which is the primary case, the first or the last case among the duplicates; and ii. whether to count all duplicate cases sequentially or to count only the non-primary cases (the primary case is not considered a duplicate);
(d) Tick “Move matching cases to the top” to review duplicates easier; and (e) Tick “Display frequencies for created variables” if required; 5. Click “OK” to proceed. With the above set options, the result of checking duplicate cases is displayed in the following frequency table:
The above frequency table shows that there are 6 duplicates among the 1,889 cases. These may all belong to a single group (one primary case and six duplicates, so that 7 cases are the same in all variables), or there may be 6 pairs of duplicates (6 primary cases with one duplicate each). The dataset should be reviewed to understand the nature of the duplicates and to decide how to deal with them. The following exhibit shows the groups of duplicates displayed on top of the dataset.
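The duplicate check described above can also be sketched outside PASW. The following Python example is only an illustration under stated assumptions: the function name `flag_duplicates`, the field names and the small household roster are hypothetical, and the two flags merely imitate the idea of a primary-case indicator and a sequential count within each matching group.

```python
from collections import defaultdict


def flag_duplicates(cases, key_fields, primary="last"):
    """Return one flag dict per case: 'primary' (1/0) and 'match_sequence'."""
    # Group case positions by the values of the matching variables.
    groups = defaultdict(list)
    for index, case in enumerate(cases):
        key = tuple(case[field] for field in key_fields)
        groups[key].append(index)

    flags = [None] * len(cases)
    for indices in groups.values():
        # The primary case is either the first or the last case in its group.
        primary_index = indices[-1] if primary == "last" else indices[0]
        for position, index in enumerate(indices, start=1):
            flags[index] = {
                "primary": 1 if index == primary_index else 0,
                # 0 when the case has no match, otherwise 1, 2, 3... in the group
                "match_sequence": position if len(indices) > 1 else 0,
            }
    return flags


# Hypothetical roster: the second and fourth rows share the same IDs.
cases = [
    {"household_id": 101, "person_id": 1, "age": 34},
    {"household_id": 101, "person_id": 2, "age": 9},
    {"household_id": 102, "person_id": 1, "age": 51},
    {"household_id": 101, "person_id": 2, "age": 9},
]
flags = flag_duplicates(cases, ["household_id", "person_id"], primary="first")
duplicate_count = sum(1 for flag in flags if flag["primary"] == 0)
print("Duplicate (non-primary) cases:", duplicate_count)
```

Counting the size of each matching group in the same way would show whether the duplicates form one large group or several pairs.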
Exhibit annotations: each matching group is marked as a primary case followed by its duplicate case(s); the values of all variables are the same in both cases.
After validation checks, the dataset should be edited as and where necessary. After data validation and preparation, the next step is analyzing “clean data” using appropriate PASW procedures under “Analyze” menu.
5. TIPS AND EXERCISES
5.1 Tips: Do and Don’t
i) Do: request documents such as project proposals, questionnaire sets, codebooks, documents on fieldwork, and survey reports when approaching agencies/departments to get survey data.
   Don’t: judge the usefulness on the spot; do not leave behind any survey documents and datasets that are available in the survey agencies/departments.
ii) Do: understand, check and edit the metadata (a set of data that describes and gives information about other data) before using a secondary dataset.
   Don’t: leave any variable without a proper definition: variable label, value labels, missing values and measurement level (scale, ordinal and nominal).
iii) Do: save the dataset with an appropriate filename whenever changes have been made, and record properly what changes were made from the earlier version.
   Don’t: save the current dataset under the original filename after making changes; do not replace the original data file with edited ones.
iv) Do: copy variable properties whenever available.
   Don’t: leave them as they are after copying variable properties (they must be checked and edited as necessary).
v) Do: define and use variable sets for ease of analysis, and create new subset datasets by selecting variables as well as cases.
   Don’t: change variable type and measurement level without sound understanding.
vi) Do: recode string variables into numeric codes using “automatic recode”, and use “visual binning” for continuous variables (or numeric variables with several different values) to reduce the number of items.
   Don’t: recode into the same variable, since it is irreversible (and the original variable can easily be deleted when it is no longer needed).
vii) Do: validate data through single-variable and multiple-variable rules and check for duplicate cases before conducting any analysis.
   Don’t: change the values in the dataset based on imagination or self-imposed assumptions. Always contact the primary data source for correction, or omit those cases if there are not many.
5.2 Self-evaluation
Do you understand how to set variable properties in PASW Statistics? Very well / Somewhat well / Not so much / Almost none
Are you confident that you can do the following in an active dataset?
o Compute a new variable: Confident / Somewhat confident / Not so much / Not at all
o Recode into a different variable: Confident / Somewhat confident / Not so much / Not at all
o Select cases with girls under 15: Confident / Somewhat confident / Not so much / Not at all
o Sort cases by wealth index factor score and highest education attained: Confident / Somewhat confident / Not so much / Not at all
o Check erroneous values in a variable (validate with single/cross variable rule): Confident / Somewhat confident / Not so much / Not at all
o Check the existence of duplicate cases in the dataset: Confident / Somewhat confident / Not so much / Not at all
Do you understand visual binning? Very well / Somewhat well / Not so much / Almost none
5.3 Hands-on Exercises
1) Import the attached “data1(tab).dat” and define all variables appropriately.
2) From the dataset obtained from Exercise 1 above, recode all string variables.
3) Create single-variable rules to check the validity of three education related variables.
4) Create two multi-variable rules to check the validity of (i) current schooling status of household members, and (ii) education in single year of household members.
5) Find duplicate cases in the current dataset and propose how to handle those cases.
Basic Data Analysis Techniques in PASW Statistics
Purpose and learning outcomes:
To introduce basic data analysis techniques in PASW
To understand how to use PASW to get the required outputs (tables and charts)
To know how to interpret PASW output
Contents:
1. Reports
   1.1 Codebook
   1.2 Case Summaries: Listing Selected Cases
   1.3 OLAP Cubes (Online Analytical Processing Cubes)
2. Descriptive Statistics
   2.1 Frequencies
   2.2 Descriptive
   2.3 Explore
   2.4 Crosstabs
   2.5 Ratio Statistics
3. Tips and Exercises
   3.1 Tips: Do and Don’t
   3.2 Self-evaluation
   3.3 Hands-on Exercises
4. Annexe: Web Links for Further Study on SPSS/PASW Statistics
1. REPORTS
Procedures in the REPORT command group can provide all univariate statistics available in other procedures. In addition, computations involving aggregated statistics are directly accessible only in the REPORT procedures. Codebook and OLAP Cubes are among the most essential procedures for education data analysts.
The first command under the ANALYZE menu is REPORT. The REPORT procedures can provide all univariate statistics available in DESCRIPTIVES and the subpopulation means available in MEANS. In addition, some statistics available in the report procedures, such as computations involving aggregated statistics, are not directly accessible in any other command procedure. By default, REPORT provides a complete report format, but a variety of table elements can be customized, including column widths, titles, footnotes, and spacing. Because it is flexible and the output has so many components, it is often efficient to preview report output using a small number of cases until the format that best suits the needs is found, especially when listing individual cases.
The group of REPORT commands comprises Codebook, OLAP Cubes, and Summarize, which contains “Case Summaries”, “Report Summaries in Rows” and “Report Summaries in Columns”.
Codebook: This procedure reports the dictionary information and summary statistics for all or specified variables and multiple response sets in the active dataset.
Summarize procedure (or Case Summaries): “Case Summaries” produces subgroup statistics for variables within categories of one or more grouping variables. All levels of the grouping variables are cross-tabulated. Summary statistics for each variable across all categories are also displayed. The order in which the statistics are displayed can be chosen. Moreover, data values in each category can be listed or suppressed. With large datasets, either only the first n cases or all cases can be listed.
Report Summaries in Rows: Produces reports in which different summary statistics are laid out in rows. Case listings are also available, with or without summary statistics.
Report Summaries in Columns: Produces summary reports in which different summary statistics appear in separate columns.
OLAP Cubes (Online Analytical Processing Cubes): Calculates totals, means, and other univariate statistics for continuous summary variables within categories of one or more categorical grouping variables. A separate layer in the table is created for each category of each grouping variable.
1.1 Codebook
Codebook reports dictionary information such as variable names, variable labels, value labels, and missing values. It also provides summary statistics for all or specified variables and multiple response sets in the active dataset. Summary statistics produced by Codebook for nominal and ordinal variables and for multiple response sets include counts and percents. For scale variables, summary statistics include mean, standard deviation, and quartiles. As such, Codebook is very useful for preliminary analysis.
To obtain a codebook of the current dataset:
1. Click “Analyze” on main menu bar;
2. Click “Reports”; and
3. Click again “Codebook”, and a new window will appear with the complete variable list.
4. Select and send the variables to the “Codebook Variables” pane. Here, just three variables with different measurement scales are chosen; and
5. Click “OK” to proceed with the default settings for Output and Statistics.
The output table obtained by above procedure for the first variable is:
Since the first selected variable “HV009 – Number of household members” is a scale variable, the statistics produced for the variable are mean, standard deviation and three quartile values.
However, since the other variables, “HV024 – Division” (nominal) and “HV025 – Type of place of residence” (ordinal), are categorical, only the count and percentage of each valid value (category) are provided as statistics.
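The way the summary statistics depend on the measurement level can be illustrated with a short Python sketch. It is not PASW output: the function name `codebook_summary` and the sample values are hypothetical, and the quartile method used by Python's `statistics.quantiles` is assumed here only for illustration and may differ from the one PASW applies.

```python
import statistics
from collections import Counter


def codebook_summary(values, level):
    """Summarise one variable according to its measurement level."""
    valid = [value for value in values if value is not None]
    if level == "scale":
        # Scale variables: mean, standard deviation and three quartiles.
        q1, q2, q3 = statistics.quantiles(valid, n=4)
        return {
            "mean": statistics.mean(valid),
            "std_dev": statistics.stdev(valid),
            "quartiles": (q1, q2, q3),
        }
    # Nominal and ordinal variables: count and percent of each valid value.
    counts = Counter(valid)
    return {
        value: (count, 100.0 * count / len(valid))
        for value, count in sorted(counts.items())
    }


# Hypothetical values for a scale variable and a nominal variable.
household_size = [3, 4, 4, 5, 6, 7, 8]
division = ["Dhaka", "Dhaka", "Sylhet", "Barisal"]

scale_summary = codebook_summary(household_size, "scale")
nominal_summary = codebook_summary(division, "nominal")
print(scale_summary)
print(nominal_summary)
```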
In the Codebook procedure, measurement level of variables can be changed temporarily by clicking right-mouse button after pointing on the variable. The following exhibit shows changing measurement level of “HV270 – Wealth index” from “ordinal” to “scale”. Keep in mind that changing from “ordinal” to “scale” type is temporary and only useful in the codebook procedure.
Select and click right-mouse button
Click here to change “Ordinal” to “Scale”
The following are the options available in the Codebook command at its default settings.
The following output table is the codebook for “HV009 – Number of household members” after: (i) changing the measurement level to “Ordinal”; (ii) selecting “Measurement level” and “Weight status”; and (iii) displaying only “Percent” in the statistics option.
File information set in (ii)
Real measurement level “Scale” is displayed
Display only “percent” as set in (iii)
(i) These values are displayed because of changing to “Ordinal” temporarily
1.2 Case Summaries: Listing Selected Cases
Occasionally, a listing of selected cases with a limited number of variables is required for validity (error) checking, reporting, printing and presentation purposes. Case Summaries, under Reports, is useful for filtering and listing the cases with specified characteristics, for example, to list “20 out-of-school children aged 6-14 from the lowest socio-economic status from the sample households” with their age, sex, highest education level, and so on. It should be noted that, before running Case Summaries, the dataset must be (A) limited to the household members aged 6-14 who are out of school (use “Select Cases”), and (B) sorted in ascending order by “Wealth index factor score” (use “Sort Cases”). The following exhibits briefly explain the preparatory steps before executing “Case Summaries”.
Exhibit annotations: A - Select Cases (condition for selecting only “out-of-school” members; condition for selecting those aged 6-14 only). B - Sort Cases. The resulting dataset contains the selected cases only.
After conducting the preparatory tasks, the “PASW Statistics Data Editor” shows the “Out_of_School_6_to_14” dataset with the selected cases sorted in ascending order of “HV271 – Wealth index factor score”. The original sample dataset contains altogether 53,413 cases while the filtered dataset contains only 974 cases containing out-of-school children aged 6-14 only.
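The preparatory logic (select cases, sort them, then list the first n) can be sketched in Python as follows. The record layout, the field names `age`, `in_school` and `wealth_score`, and the five sample members are hypothetical and serve only to show the order of operations; they do not come from the survey dataset.

```python
def list_poorest_out_of_school(members, limit=20):
    """Select out-of-school members aged 6-14, sort by wealth score, keep the first `limit`."""
    # Step A: keep only household members aged 6-14 who are out of school.
    selected = [
        member for member in members
        if 6 <= member["age"] <= 14 and not member["in_school"]
    ]
    # Step B: sort in ascending order of the wealth index factor score.
    selected.sort(key=lambda member: member["wealth_score"])
    # Listing: limit the cases to the first `limit`.
    return selected[:limit]


# Hypothetical household members.
members = [
    {"id": 1, "age": 10, "in_school": False, "wealth_score": -95883},
    {"id": 2, "age": 16, "in_school": False, "wealth_score": -120000},
    {"id": 3, "age": 8, "in_school": True, "wealth_score": -110000},
    {"id": 4, "age": 14, "in_school": False, "wealth_score": -102597},
    {"id": 5, "age": 6, "in_school": False, "wealth_score": 25000},
]
listing = list_poorest_out_of_school(members, limit=2)
print([member["id"] for member in listing])
```

Member 2 is excluded by age and member 3 by schooling status, so only the two poorest of the remaining three members are listed.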
After completing the data preparation work, follow the steps to execute the “Case Summaries” command:
1. Click “Analyze” on main menu bar;
2. Click “Reports”; and
3. Click again on “Case Summaries”. Then, a new window will appear with the complete list of variables in the current dataset.
4. In the “Case Summaries” window, select the variables in the desired sequence (for example, HV105, HV104, HV101 and HV109);
5. Set the number of cases to display in “Limit cases to first”, for example, 20;
6. Click the “OK” button to create a case summary report.
The output table of the above procedure is as follows:
The following table was copied from the PASW Statistics Viewer and pasted directly into MS Word. A few minor touches were then applied to the output layout in MS Word.
Listing of Out-of-school Children from Poorest Households a
Sex     No.   Case Number  Division    Age of household members  Relationship to head  Educational attainment  Wealth index factor score (5 decimals)
Male    1     1            Sylhet      14                        Son/daughter          Incomplete primary      -102597
Male    2     4            Dhaka       12                        Son/daughter          Incomplete primary      -97182
Male    3     8            Sylhet      12                        Son/daughter          Incomplete primary      -95883
Male    4     9            Sylhet      12                        Other relative        Complete primary        -95883
Male    5     10           Rajshahi    11                        Son/daughter          Incomplete primary      -95868
Male    6     11           Dhaka       12                        Son/daughter          Incomplete primary      -95747
Male    7     12           Barisal     9                         Son/daughter          Incomplete primary      -95184
Male    8     14           Rajshahi    10                        Son/daughter          Incomplete primary      -94601
Male    9     15           Barisal     12                        Son/daughter          Incomplete primary      -94539
Male    10    17           Rajshahi    12                        Son/daughter          Incomplete secondary    -94185
Male    11    19           Rajshahi    13                        Son/daughter          Incomplete secondary    -93875
Male    12    20           Rajshahi    14                        Son/daughter          Incomplete primary      -93649
Male    Mean                           11.92                                                                   -95766.08
Female  1     2            Dhaka       10                        Son/daughter          Incomplete primary      -97793
Female  2     3            Barisal     8                         Son/daughter          Incomplete primary      -97330
Female  3     5            Dhaka       10                        Son/daughter          Incomplete primary      -97182
Female  4     6            Barisal     13                        Son/daughter          Incomplete primary      -96696
Female  5     7            Chittagong  14                        Grandchild            Incomplete primary      -96592
Female  6     13           Dhaka       11                        Son/daughter          Complete primary        -95028
Female  7     16           Dhaka       12                        Son/daughter          Incomplete primary      -94331
Female  8     18           Dhaka       10                        Son/daughter          Incomplete primary      -93976
Female  Mean                           11.00                                                                   -96116.00
Total   Mean                           11.55                                                                   -95906.05
a. Limited to first 20 cases.
The next table shows the same list of 20 out-of-school children, but by “Division”.
“Report Summaries in Rows” produces reports in which different summary statistics are laid out in rows. Case listings are also available, with or without summary statistics. Similarly, “Report Summaries in Columns” provides summary reports in which different summary statistics appear in separate columns. The outputs of both commands are in text format, so pivot table techniques cannot be applied to them. Moreover, all such outputs can be created with the “Case Summaries” command described before.
The following table shows the summary statistics obtained from the “Case Summaries” command without displaying individual cases. The variable selected for the summary statistics is the number of years effectively studied by a household member (“HV108 – Education in single year”). The report provides such statistics as: (i) number of cases; (ii) mean years of study (average of HV108); (iii) standard error of mean; and (iv) median years of study, by: a. sex, b. residence, and c. division, without listing individual cases.
Case Summaries: Education in single years, by Residence, Division and Sex of household member
Urban, Barisal, Male: Mean 3.36; Std.Err. Mean 0.387; Median 3.00
Urban, Barisal, Female: N 17; Mean 3.71; Std.Err. Mean 0.605; Median 4.00
Urban, Barisal, Total: Mean 3.51; Std.Err. Mean 0.338; Median 3.00
Total (Urban + Rural): Barisal, Chittagong, Dhaka
Std.Err. Mean = Standard error of mean.
1.3 OLAP Cubes (Online Analytical Processing Cubes)
The OLAP Cubes procedure can produce a variety of summary statistics for summary variables within categories of one or more grouping variables. It creates a separate layer for each category of every grouping variable in the table. The summary variables are quantitative (continuous variables measured on an interval or ratio scale), and the grouping variables are categorical. The values of categorical variables can be numeric or string.
OLAP Cubes provides a wide variety of summary statistics such as: sum, number of cases, mean, median, grouped median, standard error of the mean, minimum, maximum, range, variable value of the first category of the grouping variable, variable value of the last category of the grouping variable, standard deviation, variance, kurtosis, standard error of kurtosis, skewness, standard error of skewness, percentage of total cases, percentage of total sum, percentage of total cases within grouping variables, percentage of total sum within grouping variables, geometric mean, and harmonic mean. Some of the optional subgroup statistics, such as the mean and standard deviation, are based on normal theory and are appropriate for quantitative variables with symmetric distributions. OLAP Cubes uses pivot table techniques, but with specific statistics and output options which cannot be obtained from other procedures such as cross-tabulation.
Example: OLAP Cubes
Among the variables in the sample dataset, only “HV108 – Education in single year” is an education related continuous (interval or ratio scale) variable. Since a continuous variable must be selected as the “Summary” variable, HV108 is selected in this example. The following exhibits demonstrate how “OLAP Cubes” is useful in exploring the “average number of study years by the adult household members” by four grouping variables: Sex, Age Group, Residence and Division.
Before using “OLAP Cubes” procedure, only the adult household members (aged 15 and above) must be selected using “Case Selection”. Preparing Dataset for analyzing adults only
After selecting only adults:
1. Click “Analyze” on main menu bar;
2. Click “Reports”; and
3. Click “OLAP Cubes”, and a new window will appear with the complete variable list.
4. In the “OLAP Cubes” window, select Summary and Grouping variables as planned;
5. Click “Statistics” to set the desired summary statistics:
   a. By default, six summary statistics are selected (these can be left as they are);
   b. Users can double-click any unselected statistic to select it, and vice versa;
   c. Click “Continue” when the selection of summary statistics is complete;
6. Click the “Differences” button to compute absolute or percentage differences for all measures selected in the Statistics dialog box. This step is optional.
The “Differences” dialog box allows calculating percentage and absolute differences:
Differences between Variables: calculates differences between pairs of variables. At least two summary variables must be selected before specifying differences between variables.
Differences between Groups of Cases: calculates differences between pairs of groups defined by a grouping variable. One or more grouping variables must be selected in the main dialog box before specifying differences between groups.
The differences are calculated between summary statistic values by subtracting the value of the “minus” variable/category from the value of the first in the pair. Percentage differences use the value of the summary statistic of the second (the “minus”) as the denominator.
7. Click the “Title” button to create custom table titles. This step is optional. A title for the output table or a caption (added below the table) can be entered in this step. If the title or caption extends over more than one line, insert \n for wrapping (a line break in the text). Enter an appropriate title and caption, and click the “Continue” button when completed.
8. Click the “OK” button on the “OLAP Cubes” window to start creating the table with the set options.
When the OLAP Cube is complete, the following output is placed in the output viewer. The default output provides three summary statistics: number of cases (N), mean, and standard error of mean for “HV108 – Education in single year” for the entire sample (valid cases).
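As a rough illustration of what the procedure computes, the Python sketch below derives the number of cases, the mean and the standard error of the mean for each category of one grouping variable, and then applies the difference rule described above (the “minus” value is subtracted and also serves as the denominator of the percentage difference). The function names, the field names and the six sample records are hypothetical, and the reserved group name "Total" is an assumption of this sketch.

```python
import math
import statistics
from collections import defaultdict


def group_statistics(records, group_field, summary_field):
    """N, mean and standard error of the mean per group, plus a 'Total' row."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_field]].append(record[summary_field])
        groups["Total"].append(record[summary_field])
    result = {}
    for name, values in groups.items():
        n = len(values)
        # Standard error of the mean = sample standard deviation / sqrt(n).
        std_err = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
        result[name] = {"n": n, "mean": statistics.mean(values), "std_err_mean": std_err}
    return result


def differences(first, minus):
    """Absolute and percentage difference; 'minus' is the denominator."""
    absolute = first - minus
    return absolute, 100.0 * absolute / minus


# Hypothetical adults with their years of education.
adults = [
    {"sex": "Male", "years": 6}, {"sex": "Male", "years": 8}, {"sex": "Male", "years": 10},
    {"sex": "Female", "years": 4}, {"sex": "Female", "years": 5}, {"sex": "Female", "years": 6},
]
stats = group_statistics(adults, "sex", "years")
gap = differences(stats["Male"]["mean"], stats["Female"]["mean"])
print(stats)
print(gap)
```

In this made-up sample the male mean is 8 years and the female mean is 5 years, so the absolute difference is 3 years and the percentage difference is 60 percent of the female ("minus") mean.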
User can change categories: from the default “total” to any item in the Dropdown list
Title Layer Statistics
Although this table seems simple and unattractive, one can select for each and every category of “Grouping variables” as in the Pivot tables. To do this, double-click on the table in Output Viewer, and then, click on the dropdown icon and select any category in the list. The following exhibit shows the statistics for the “Males aged 15-29 who lived in the urban areas”.
Again, one can pivot the output table to make it more attractive, as follows:
2. DESCRIPTIVE STATISTICS
The most frequently used procedures in PASW Statistics are the Descriptive Statistics. From making an initial analysis and checking validity to extracting education data and constructing indicators from a household survey, “Descriptive Statistics” are essential. Although “Report” could provide similar statistics, “Descriptive Statistics” are user-friendly and provide a wider variety of charts.
2.1 Frequencies
Frequencies is the procedure with which to start analyzing a dataset. It provides statistics and graphical displays that are useful for describing all different types of variables. The “Frequencies” procedure can produce such statistics as: frequencies (counts), percentages, cumulative percentages, mean, median, mode, sum, standard deviation, variance, range, minimum and maximum values, standard error of the mean, skewness and kurtosis (both with standard errors), quartiles and percentiles. Moreover, it can produce bar charts, pie charts, and histograms. For better display in the output table and charts, distinct values can be arranged in ascending or descending order of category labels or of their counts. The frequencies report can be suppressed when a variable has many distinct values. Charts produced by this command can be labeled with frequencies (default) or percentages.
To produce a simple frequency table:
1. Click “Analyze” on main menu bar;
2. Click “Descriptive Statistics”; and
3. Click again “Frequencies”, and a new window will appear with the complete variable list.
4. Select (categorical) variables to produce frequency tables (each variable will have a table);
5. Click the “Format” button, and set the output formats:
   a. how to order categories in the frequency table: ascending or descending order of values or counts;
   b. how to organize the outputs if more than one variable is selected; and
   c. whether to display or suppress the table for variables with many categories (by setting a maximum);
6. Click the “OK” button to start creating the frequency tables with the selected charts and format.
The following outputs are obtained from the steps presented in the above exhibit.
It should be noted that only two frequency tables are generated although three variables are selected. This is because PASW Statistics suppressed the frequency table of “HV105 – Age of household member”, since its number of categories (roughly 100) exceeds the set maximum of 15. Generally, there are two key purposes in using Frequencies: (a) to get frequency tables of categorical variables with a limited number of different items, for example, sex, educational attainment, age group, etc.; and (b) to get summary statistics of continuous variables without a frequency table (i.e., for variables measured on interval or ratio scales whose values vary widely). Moreover, bar charts, pie charts and histograms can be created automatically for categorical variables with a limited number of different items by clicking the “Charts” button and then selecting the chart type and options after the current Step 5. In the above example, “Pie chart” is appropriate for reviewing the gender composition (HV104) of the sample population, while “Bar chart” should be used for the education levels (HV109). Therefore, those two variables should not be processed together in the same run.
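The behaviour described above (a table of counts, percentages and cumulative percentages that is suppressed when a variable has more categories than a set maximum) can be sketched in Python. The function name `frequency_table` and the sample values are hypothetical; only the maximum of 15 categories is taken from the example in the text.

```python
from collections import Counter


def frequency_table(values, max_categories=15):
    """Rows of (value, count, percent, cumulative percent), or None if suppressed."""
    counts = Counter(values)
    # Suppress the table when there are more categories than the set maximum.
    if len(counts) > max_categories:
        return None
    total = len(values)
    rows = []
    running = 0
    for value in sorted(counts):  # ascending order of values
        running += counts[value]
        rows.append(
            (value, counts[value], 100.0 * counts[value] / total, 100.0 * running / total)
        )
    return rows


# Hypothetical data: a two-category variable and an age variable with 100 distinct values.
sex = ["Male", "Female", "Female", "Male"]
age = list(range(100))

sex_table = frequency_table(sex)
age_table = frequency_table(age)
print(sex_table)
print(age_table)
```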
Similarly, one can choose types of statistics to be displayed by clicking “Statistics” button after selecting charts. The following exhibits show the outputs for the variable “HV105 – Age of household member” without frequency table by age.
2.2 Descriptives
Descriptives computes univariate statistics, such as mean, standard deviation, minimum, and maximum, for numeric variables and displays them in a single table for better comparison. Because it does not sort values into a frequency table, it is an efficient means of computing summary statistics for continuous variables. Almost all statistics provided in DESCRIPTIVES can also be obtained from other procedures such as FREQUENCIES, MEANS, and EXAMINE. Although “Frequencies” could also provide univariate statistics, “Descriptives” displays summary statistics for several variables in a single table. It can also calculate and save the standardized values (Z-scores). Variables can be ordered by the size of their means (ascending or descending), alphabetically, or by the order in which the user selects the variables (default).
When Z-scores are saved, they are added to the current dataset and are available for analyses and listings. When variables are recorded in different units (e.g., “household members” and “education in single years”), the Z-score transformation places the variables on a common scale for easier visual comparison. Moreover, “Descriptives” is efficient for large files with tens of thousands of cases.
To use Descriptives:
1. Click “Analyze” on main menu bar;
2. Click “Descriptive Statistics”; and
3. Click again “Descriptives”, and a new window will appear with the complete variable list;
4. Select continuous (interval or ratio scale) or dichotomous (just 0 and 1) variables;
5. Click “Options”, (i) select the preferred statistics from the lists, (ii) define the order of the variables to be displayed in the output table, and (iii) click “Continue”;
6. Optionally, tick “Save standardized values as variables” to save the Z-scores (standardized values) of the selected variable(s) in the current dataset; and
7. Click the “OK” button to start calculating summary descriptive statistics.
Note: Two scale variables, “HV105 – Age of household members” and “HV108 – Education in single years”, and one dichotomous nominal variable, “HV110 – Member still in school”, are used in this example.
In calculating the descriptive statistics (and also in most statistical analyses), it is important to check and edit the variables under study to contain only valid values in the analysis. For example, in the variable “HV108 – Education in single years”, code 97 is used for “Inconsistent values”, code 98 represents “DK or Do not know”, and code 99 is “missing values”. In this case, 97, 98 and 99 are not valid years of study and should not be in the analyses, therefore, put all those codes into “missing values” to be excluded from computing statistics (see Module B2 to edit missing values).
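The effect of leaving such codes in the data can be shown with a small Python sketch. The codes 97, 98 and 99 are those described above for HV108; the eight sample values and the helper name `valid_values` are hypothetical.

```python
import statistics

# Codes described in the text for HV108: inconsistent, do not know, missing.
MISSING_CODES = {97, 98, 99}


def valid_values(values, missing_codes=MISSING_CODES):
    """Drop the codes that must be treated as missing values."""
    return [value for value in values if value not in missing_codes]


# Hypothetical years of education, including three special codes.
years = [0, 5, 8, 10, 12, 97, 98, 99]

naive_mean = statistics.mean(years)                # special codes wrongly treated as valid
clean_mean = statistics.mean(valid_values(years))  # special codes excluded
print(naive_mean, clean_mean)
```

With so few cases, the three special codes raise the mean from 7 years to more than 41 years, which is why they must be declared as missing before any statistics are computed.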
Two similar “Descriptive Statistics” tables are presented in the above example: (i) constructed with the default missing values, that is, using the codes 97 and 98 as valid; and (ii) constructed after setting 97 and 98 as missing. Since the number of cases is large, the differences in the summary statistics are minimal. However, if the same calculation is conducted for a subset with a limited number of cases, the differences could be substantial. In the above output table, the mean value 0.61 of the variable “member still in school” can be interpreted as “61% of 20,540 persons are still in school”. The following example presents all available statistics (set in the options) in “Descriptives”.
It should be noted that the descriptive statistics calculated for the variable “HV024 – Division” are useless in any analysis. “HV024” is just a nominal variable with codes 1 to 6, representing 6 divisions of Bangladesh, and its mean value of 3.48 has no meaning. One of the significant features of “Descriptives” is its ability to save standardized values (Z-scores) for the selected variables to be used in further analyses. To add the Z-scores of a variable to the current dataset, just tick the checkbox next to “Save standardized values as variables”. PASW will then add new variables whose names are formed by prefixing the letter Z to the original variable name; for example, the new variable for the Z-score of “HV009” is simply “ZHV009”.
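A minimal Python sketch of the same idea follows. It assumes that a Z-score is the value minus the mean, divided by the sample standard deviation (n - 1 in the denominator), and it copies the naming convention described above by prefixing "Z" to the variable name. The function name `add_z_scores` and the three sample cases are hypothetical.

```python
import statistics


def add_z_scores(cases, variable):
    """Add a standardized copy of `variable` to every case; return the new variable name."""
    values = [case[variable] for case in cases]
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample standard deviation (assumed)
    new_name = "Z" + variable           # e.g. HV009 becomes ZHV009
    for case in cases:
        case[new_name] = (case[variable] - mean) / std_dev
    return new_name


# Hypothetical households with 2, 4 and 6 members.
cases = [{"HV009": 2}, {"HV009": 4}, {"HV009": 6}]
new_variable = add_z_scores(cases, "HV009")
z_values = [case[new_variable] for case in cases]
print(new_variable, z_values)
```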
Newly created variable
2.3 Explore
Explore produces summary statistics and graphical displays, either for all cases or separately for groups of cases. It is particularly useful in data screening, outlier identification, description, assumption checking, and characterizing differences among subpopulations (groups of cases). Data screening aims to examine the existence of unusual values, extreme values, data gaps, or other peculiarities. By exploring data, users can determine whether the statistical techniques under consideration for further analyses are appropriate or not. It may help in deciding whether to transform the data (in case the technique requires a normal distribution) or to use nonparametric tests.
Dependent variables or variables to be explored [List (a) in following chart] can be quantitative (interval or ratio-level measurements). Factor variables [List (b)], with short string or numeric values, will break the dependent variables into groups of cases. The factor variables should have a reasonable number of distinct values, generally not more than 10 categories. The case label variable [List (c): only one variable allowed], used to label outliers in boxplots, can be short string, long string (only the first 15 bytes are used), or numeric.
To analyze with Explore:
1. Click “Analyze” on main menu bar;
2. Click “Descriptive Statistics”; and
3. Click again “Explore”. A new “Explore” window will appear with the complete variable list;
4. Select continuous (interval or ratio scale) variables to produce univariate statistics. Because “Explore” produces voluminous outputs, just one variable, “HV108 – Education in single years”, with simple (mostly default) settings is used in the following example;
5. Click “Statistics”, set the preferred statistics from the lists, and click “Continue”;
6. Click “Plots”, set the preferred types of plots from the lists, and click “Continue”;
7. Click “Options”, set how to handle the missing values, and click “Continue”;
8. Select the “Display” option (only statistics or plots, or both) on the “Explore” window; and
9. Click the “OK” button to start “Explore”, and the following outputs will be displayed.
By selecting all statistics and available charts, exploring “HV108 – Education in single year” factored by “HV024 – Division” produced altogether 33 tables and charts, from “Case Processing Summary” to “Spread-versus-Level Plot”.
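Outlier identification, one of the uses of Explore mentioned in this section, can be illustrated with the common boxplot convention of flagging values that lie more than 1.5 times the interquartile range beyond the quartiles. This convention, the quartile method of Python's `statistics.quantiles`, the function name `boxplot_outliers` and the sample values are all assumptions of this sketch and are not taken from the PASW output.

```python
import statistics


def boxplot_outliers(values, multiplier=1.5):
    """Return the values lying outside the boxplot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1                        # interquartile range
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    return [value for value in values if value < lower_fence or value > upper_fence]


# Hypothetical years of education; 30 is an implausibly high value.
years = [0, 2, 3, 4, 5, 5, 6, 7, 8, 30]
outliers = boxplot_outliers(years)
print(outliers)
```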
2.4 Crosstabs
Crosstabs is useful for investigating the relationship between two or more categorical variables by providing information about the intersection of variables. “Frequencies” and “Explore” are efficient in analyzing univariate statistics, but those procedures cannot provide information on the relationship between categorical variables. For example, Frequencies could provide “number of household heads by education level” and “number of household heads by sex” or “number of households by economic status (wealth index)”, but cannot provide “number of female headed households in the poorest category”.
In Crosstabs, use the values of a numeric or short string variable to define the categories of each variable. For example, codes “1 and 2” or “male and female” or “M and F” are valid for the variable “sex”. Ordinal variables can be either numeric codes that represent categories, for example, numeric codes “1 to 5” for the variable “Wealth Index” representing “1 = poorest, 2 = poorer, 3 = middle, 4 = richer, and 5 = richest”, or string values “a to e” as “a = richest, b = richer, c = middle, d = poorer, and e = poorest”. In PASW Statistics, the alphabetic order of string values is assumed to reflect the true order of the categories. Therefore, if a string variable with codes “L, M, H” representing “low, medium and high” is used, the order of the categories in the output will be “H, L, M” and the results might be misinterpreted. In general, it is more reliable to use numeric codes and provide appropriate value labels to represent ordinal data.
Selection of Variables
For cross-tabulation, at least one variable each must be selected for the rows and the columns of the output table. Other variables can then be added as layers; these are known as “factor” variables.
The variables used in the Crosstabs procedure must be categorical (measured on a nominal or ordinal scale) with a limited number of values (generally, fewer than 10 different values). On the other hand, discrete scale variables can also be used to get statistics if the range of values is not too large; the table output can be suppressed. The factor variables must be categorical.
Statistics Option
In Crosstabs, statistics and measures of association are computed for two-way tables only. If a table is formed in multi-ways as “row, column, and layer (control) variables”, the Crosstabs procedure forms one panel of associated statistics and measures for each value of the layer (or each combination of values for two or more control variables). For example, if “sex” is a layer factor for a table of “educational attainment” against “wealth index”, the results for the two-way table are computed separately for the males and for the females.
Crosstabs is one of the procedures producing a variety of statistics, such as:
Chi-square tests of independence/association: for 2 x 2 tables, one can select the Pearson chi-square, the likelihood-ratio chi-square, Fisher's exact test, and Yates' corrected chi-square (continuity correction). For tables with any number of rows and columns, select Chi-square to calculate the Pearson chi-square and the likelihood-ratio chi-square.
Spearman's rank correlation coefficient (rho): calculated if both rows and columns contain ordinal variables (numeric data only).
Pearson’s correlation coefficient (r): a measure of linear association, calculated when both row and column variables are quantitative.
For more explanations on statistics please see "PASW Statistics 17 Base User Guide".
Cells Display Option
By default, Crosstabs displays the “count”, or the number of cases actually observed in each cell. Optionally, the number of “expected” cases can be selected for display.
Similarly, row, column and total percentages can be displayed in the cells together with the observed number of cases (count).
To uncover the patterns in the data that contribute to a chi-square test, three types of residuals (deviates) that measure the difference between observed and expected frequencies can be displayed.
Unstandardized: the difference between an observed value and the expected value.
Standardized: the residual divided by an estimate of its standard deviation. Standardized residuals, also known as Pearson residuals, have a mean of 0 and a standard deviation of 1.
Adjusted standardized: the residual for a cell (observed minus expected value) divided by an estimate of its standard error.
Non-integer weights Option
Cell counts are normally integer values. However, if the dataset is weighted by a variable with fractional values (e.g. 1.25), cell counts can be fractional. Counts can then be truncated or rounded, either before or after the cell counts are calculated, or fractional cell counts can be used for both table display and statistical calculations.
Using Crosstabs
Follow the steps:
1. Click “Analyze” on the main menu bar;
2. Click “Descriptive Statistics”;
3. Click “Crosstabs”, and a new “Crosstabs” window will appear with the complete variable list;
4. Select categorical variables (or scale variables with a limited number of different values) and send them to rows, columns and layers (click “Next” to add another layer). Layer variables can be organized either all on the same layer (one set of tables per layer variable) or on different layers (just one set of tables with cross-layer cells).
5. Select the appropriate statistics to be calculated. In this example, no statistics are selected, although both row and column variables are ordinal and thus chi-square, correlations, Gamma and Kendall's tau would be appropriate;
6. Select the contents of the cells in the cross-tabulation;
7. Set the row order: ascending or descending;
In this example, all settings in Steps 5 through 9 are left at their defaults.
8. Set whether to get the clustered bar charts; 9. Set whether to suppress tables (or display the main crosstab table); and 10. Click “OK” to start constructing tables and charts as selected. In this example, no optional settings are set and just two tables, (1) Case Processing Summary, and (2) basic cross-tabulation table with simple counts in cells, are produced. In cross-tabulation the missing values are handled list-wise (across variables), and thus it is important to observe the “number of valid cases” in the “case processing summary” statistics.
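The list-wise handling of missing values described above can be illustrated outside PASW with a minimal Python sketch. The records below are invented for illustration only (they are not taken from the training dataset), and `None` is assumed to stand for a missing value; the sketch shows why the “number of valid cases” can be smaller than the number of records.

```python
from collections import Counter

# Hypothetical records: (sex of household head, wealth index).
# None marks a missing value; these data are illustrative only.
records = [
    (1, 1), (2, 1), (1, 3), (2, 5), (1, None),
    (None, 2), (2, 1), (1, 3), (1, 5), (2, 3),
]

# List-wise handling: a case is dropped if ANY table variable is missing.
valid = [(sex, wealth) for sex, wealth in records
         if sex is not None and wealth is not None]

# Each crosstab cell is the number of valid cases with that (row, column) pair.
cells = Counter(valid)

rows = sorted({sex for sex, _ in valid})
cols = sorted({wealth for _, wealth in valid})

print(f"Valid cases: {len(valid)} of {len(records)}")
for r in rows:
    print(r, [cells[(r, c)] for c in cols])
```

Two of the ten records are excluded because one of the two variables is missing, so the table is built from eight valid cases.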
If different cell display options, such as number of observed and expected counts; row, column and total percentages, and residuals, are selected in the Step 6, the following crosstab table is created after using pivoting capabilities offered in PASW statistics and a few minor touches.
Step 6 Newly selected options
Click here and select what the cells should display
Note: The original output table is huge and difficult to read, since all statistics are placed together. It was edited as follows: (1) a long value label was shortened; (2) the variable label of HV026 was hidden; and (3) “Statistics” was moved to “LAYER” in the “Pivoting Trays”.
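The expected counts and the three residuals can be reproduced by hand for a small table. The sketch below uses the usual textbook formulas, which are an assumption here because the module does not spell them out: expected = row total x column total / n; standardized = (observed - expected) / sqrt(expected); adjusted = (observed - expected) / sqrt(expected x (1 - row total / n) x (1 - column total / n)). The 2 x 2 counts are invented for illustration.

```python
import math

# Hypothetical 2 x 2 table of observed counts (illustrative only).
observed = [[30, 20],
            [10, 40]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]


def cell_stats(i, j):
    """Return expected count and the three residuals for cell (i, j)."""
    expected = row_totals[i] * col_totals[j] / n
    unstandardized = observed[i][j] - expected
    standardized = unstandardized / math.sqrt(expected)
    adjusted = unstandardized / math.sqrt(
        expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
    )
    return expected, unstandardized, standardized, adjusted


# The Pearson chi-square is the sum of squared standardized residuals.
chi_square = sum(
    cell_stats(i, j)[2] ** 2
    for i in range(len(observed))
    for j in range(len(observed[0]))
)

print(cell_stats(0, 0))
print(round(chi_square, 3))
```

Cells with large standardized or adjusted residuals are the ones that contribute most to the chi-square value.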
The following tables present percentage distribution of households within “Place of residence” and within “Wealth index” by “Sex of household head”, which are extracted from the above pivot table.
By selecting both “Display clustered bar charts” and “Suppress tables” options, the following charts will be produced without producing any output tables:
No output tables except “Case Processing Summary”
2.5 Ratio Statistics
Ratio Statistics provides a comprehensive list of summary statistics for describing the ratio between two scale variables with positive values. Outputs can be sorted by the values of a grouping variable in ascending or descending order. Grouping variables must be measured at the nominal or ordinal level, and it is better to use numeric codes or short strings. The ratio statistics report can be suppressed in the output, and the results can be saved to an external file. The procedure provides statistics on: central tendency (median, mean, weighted mean); confidence intervals for the mean and median; measures of dispersion (AAD: average absolute deviation; COD: coefficient of dispersion; PRD: price-related differential or index of regressivity; median-centered coefficient of variation; mean-centered coefficient of variation; standard deviation; range; minimum and maximum values); and the concentration index (the percentage of ratios that fall within a user-specified range, or within a user-specified percentage of the median ratio).
Practical Example: In analyzing household survey data on participation in general education, take the total number of children aged 6-15 (var1) and the number who are currently attending primary school (var2), with sex (or urban/rural residence, division, etc.) as the grouping variable. The age-specific enrolment ratio for children aged 6-15 can then be calculated by sex, and the variation in the distribution of ratios between males and females can also be observed. However, the dataset has no variable that yields the “number of children at age x” when summed within the grouping variable. Therefore, a new variable must be created, say “pop”, with the value 1 for each child aged 6-15. Use the “Compute” command as follows:
And, define the variable label (“Population aged 6-15”) and format (Display: 5 and Decimal: 0).
Warning: Caution must be taken in using DHS survey data for the “current schooling status”. DHS asks the question “Whether xx is still in school or not?” only of those who have ever been to school, so those who have never been to school are omitted or treated as “missing”. To obtain the correct “current schooling status” of every person, another variable must be created, say “schooling”, from “HV110 (Member still in school)”, by setting “schooling = 1” for the cases with “HV110 = 1” and “schooling = 0” for all other cases. The new variable “schooling” can be created by using the “Compute” command twice: first, compute “schooling = 0” for all cases; then compute “schooling = 1” for those who are currently attending school, that is, HV110 = 1. Finally, set appropriate properties for the new variable.
Computing first time without IF condition
Computing second time with IF condition
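The logic of the two “Compute” passes, and of the group ratio they make possible, can be sketched in Python. This is only an illustration under stated assumptions: the records are invented, `None` is assumed to represent a missing HV110 value, and `attendance_ratio` is a hypothetical helper that returns the sum of “schooling” divided by the sum of “pop” within a group (the weighted-mean form of the ratio).

```python
# Hypothetical household-member records with DHS-style variable names.
# HV104 = sex (1 male, 2 female), HV105 = age, HV110 = still in school.
# HV110 is None (missing) for members who have never been to school.
members = [
    {"HV104": 1, "HV105": 7,  "HV110": 1},
    {"HV104": 1, "HV105": 12, "HV110": None},
    {"HV104": 2, "HV105": 9,  "HV110": 1},
    {"HV104": 2, "HV105": 15, "HV110": 0},
    {"HV104": 2, "HV105": 30, "HV110": None},
    {"HV104": 1, "HV105": 6,  "HV110": 1},
]

# Keep only children aged 6-15.
children = [m for m in members if 6 <= m["HV105"] <= 15]

for m in children:
    m["pop"] = 1            # every child counts once in the denominator
    m["schooling"] = 0      # first pass: no condition, everyone gets 0
    if m["HV110"] == 1:     # second pass: only those still in school
        m["schooling"] = 1


def attendance_ratio(sex_code):
    """Sum of 'schooling' over sum of 'pop' for one sex (hypothetical helper)."""
    group = [m for m in children if m["HV104"] == sex_code]
    return sum(m["schooling"] for m in group) / sum(m["pop"] for m in group)


print(attendance_ratio(1), attendance_ratio(2))
```

Because the first pass assigns 0 to everyone, children with a missing HV110 stay in the denominator instead of dropping out of the calculation.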
After creating the new variables, use Ratio Statistics as follows:
1. Click “Analyze” on the main menu bar;
2. Click “Descriptive Statistics”;
3. Click “Ratio”. A new “Ratio Statistics” window will appear with the complete variable list;
4. Select two scale variables for “Numerator” and “Denominator”, and a categorical (nominal or ordinal) variable for the “Group” variable;
5. Set whether to sort the group variable in ascending or descending order;
6. Set whether to display the results or not (that is, only to save them in a new file);
7. Set whether to save the results to a new data file for further analyses;
8. Click the “Statistics” button and select the required statistics in the “Statistics” window; and
9. Click the “OK” button to start computing the selected statistics.
The following exhibit shows both the “statistics options” selected and the “results” obtained.
Normally, the group variable is displayed on the rows and the statistics on the columns. If several statistics are chosen, the output table may be difficult to read or print. In that case, double-click the table to open the Pivot Table editor, then apply “Transpose Rows and Columns” under the “Pivot” menu to show the statistics on rows and the groups on columns, which makes the table easier to read.
TIPS AND EXERCISES
3.1 Tips: Do and Don't
i) Do: first, use the “codebook” procedure to become acquainted with the household survey dataset if complete documentation is unavailable.
Don't: waste time searching for or requesting the actual coding scheme, or running frequency tables for all variables.
ii) Do: study the survey questionnaire and the “codebook” to select the variables of interest, and make new datasets or variable sets for further analyses.
Don't: try selecting variables on a “trial and error” basis, without studying the proper survey documentation or codebook, when analyzing a newly available dataset.
iii) Do: become acquainted with the OLAP Cubes procedure; run several frequency and crosstab tables and practice using OLAP Cubes.
Don't: display several variables in multiple layers in one table, since it is difficult to get the essence of the statistics displayed, and the table becomes unusable or easily misinterpreted.
iv) Do: become expert in data preparation and management techniques such as computing new variables, selecting cases, creating new variable sets, data validation, etc.
Don't: waste time editing or correcting a secondary household survey dataset (obtained from other sources: departments, agencies, organizations, …).
v) Do: start the analysis by running “Frequencies” on every variable except the continuous (scale) variables with many different values. For the continuous (scale) variables, use the “Descriptive” procedure to explore their basic structure.
Don't: go into in-depth analyses or the calculation of ratio statistics before understanding the variables well.
vi) Do: crosstab variables with intrinsic linkages and export the outputs to spreadsheet software for better presentation, and create and present graphs and charts as appropriate in PASW or Excel.
Don't: create oversized crosstab tables with multiple layers (use the “pivot” technique to simplify the crosstab tables).
vii) Do: run the crosstab tables (or frequency tables) to get the baseline data correctly, and make further calculations and analyses in spreadsheet software.
Don't: run (and use the outputs of) the “Ratio Statistics” procedure if you are not sure that the process is perfectly correct.
3.2 Self-evaluation
Do you know when to use the codebook procedure in PASW Statistics?
Very well / Somewhat well / Not so much / Almost None
Are you confident that you can run the following procedures in an active dataset?
o Codebook: Confident / Somewhat confident / Not so much / Not at all
o OLAP Cubes: Confident / Somewhat confident / Not so much / Not at all
o Frequencies: Confident / Somewhat confident / Not so much / Not at all
o Crosstabs: Confident / Somewhat confident / Not so much / Not at all
o Ratio Statistics: Confident / Somewhat confident / Not so much / Not at all
Do you think you can demonstrate to your colleague how to run:
o Simple frequency tables: Definitely / Could be / Not so sure / Not at all
o Frequency tables with appropriate charts: Definitely / Could be / Not so sure / Not at all
o Simple crosstab tables: Definitely / Could be / Not so sure / Not at all
o Crosstab tables with layers: Definitely / Could be / Not so sure / Not at all
o Simple OLAP Cubes: Definitely / Could be / Not so sure / Not at all
o Pivoting crosstab tables: Definitely / Could be / Not so sure / Not at all
3.3 Hands-on Exercises
1) Import the attached “data1(tab).dat” and define all variables appropriately, and run the codebook procedure to check whether you have defined the dataset effectively.
2) From the dataset obtained in Exercise 1 above, recode all string variables, and run the codebook procedure to check whether you have recoded and defined the dataset effectively.
3) Begin data analysis with selected procedures of your choice to get education indicators which are useful for EFA monitoring.
4) Get a recent household survey dataset from your country, then note down the step-by-step procedure on how to make use of it in education planning, especially for EFA monitoring.
5) Follow the steps defined in the previous question and get the “data, information and indicators” which you have defined.
ANNEX: WEB LINKS FOR FURTHER STUDY ON SPSS/PASW STATISTICS
1. Central Michigan University. SPSS (PASW) On-Line Training Workshop. (See http://calcnet.mth.cmich.edu/org/spss/index.htm)
2. College of Humanities and Social Sciences. Topics in Multivariate Analysis. (See http://faculty.chass.ncsu.edu/garson/PA765/index.htm)
3. Creative Research Systems. Survey Research Aids. (See http://www.surveysystem.com/resource.htm)
4. East Carolina University. PASW/SPSS Lessons: Univariate Analysis. (See http://core.ecu.edu/psyc/wuenschk/SPSS/spss-lessons.htm)
5. Newcastle University. Statistics Support. (See http://www.ncl.ac.uk/iss/statistics/docs/)
6. Research Method Knowledge Base. (See http://www.socialresearchmethods.net/kb/index.php)
7. SPSS Web-Based Training. (See http://www.spss.com/training/wbt/)
8. Statistical Exercises Using PASW Statistics. (See http://www.brad.ac.uk/lss/documentation/pasw-statistics-v17-exercise/statisticalexercises-using-PASW%20Statistics-v17.pdf)
9. UCLA Academic Technology Services. Resources to help you learn and use SPSS. (See http://www.ats.ucla.edu/stat/spss/default.htm)
10. University of Toronto. SPSS Tutorial. (See http://www.psych.utoronto.ca/courses/c1/spss/toc.htm)
11. Visual Statistics Studio. (See http://www.visualstatistics.net/)
Using Microsoft Excel to Elaborate PASW Outputs for Better Presentation
Purpose and learning outcomes:
To know how to import PASW outputs into MS Excel 2007
To introduce data handling and data analysis using MS Excel 2007
To explore some advanced features of data presentation in MS Excel 2007
MS Excel 2007: Basics 1.1 Result-Oriented User Interface 1.2 New File Formats in Microsoft Office Excel 2007 1.3 Data Handling Capacity of Microsoft Office Excel 2007 1.4 Selected Statistical Functions in Microsoft Office Excel
Further Analyses and Presenting Outputs in MS Excel 2.1 Importing PASW Databases into Microsoft Office Excel 2.2 Creating Frequency and Crosstab Tables 2.3 PivotTables (OLAP Cubes) 2.4 Drawing Pivot Charts 2.5 Elaborating PASW Outputs for Better Presentation
Tips and Exercises 3.1 Tips: Do and Don’t 3.2 Self-evaluation 3.3 Hands-on Exercises
MICROSOFT OFFICE EXCEL 2007: BASICS
Nowadays, Microsoft Excel is the most widely used spreadsheet software in the world. The new results-oriented user interface is intended to make it easy to work in Excel 2007. Commands and features are organized on task-oriented tabs that contain logical groups of commands and features. Because the user interface has changed completely, even regular users need to familiarize themselves with its new features and look.
1.1 Result-Oriented User Interface
The layout of the main menu and the contents of the first menu tab, “Home”, are as follows:
Many dialog boxes are replaced with drop-down galleries that display the available options, and descriptive tooltips or sample previews are provided to help in choosing the right option. For example, clicking on “Paste” displays a drop-down gallery whose active options depend on which items are available in the clipboard:
(1) No items in clipboard
(2) After copying an Excel range
(3) After copying a picture / image
(4) After copying text from MS Word
The Office clipboard can store up to 24 items. If the mouse pointer rests on the icon located at the bottom right corner of the “Paste” menu, “instant help” on the “Clipboard” is displayed, and if the pointer rests on “Paste”, the tooltip is displayed as follows:
If the Clipboard area located at the bottom of the “Paste” menu is clicked, a clipboard pane is displayed with all the items currently kept in the clipboard.
Number of items kept in the clipboard
Clipboard is empty
Sample of items copied from Word Sample of items copied from Excel Thumbnail of the picture/image copied
Moreover, online help for the clipboard is available. For every activity being performed in the new user interface, whether it is formatting or analyzing data, Excel presents the tools, tips and help that are most useful for completing that task successfully. In this way, the user interface of Office Excel 2007 helps to obtain the desired results efficiently.
1.2 New File Formats in Microsoft Office Excel 2007
Previous versions of Excel (from Excel 2.1 to Excel 2003) use “.xls” for Excel (data) files, “.xla” for add-ins, and “.xlt” for templates. Files with the extension “.xls” can hold data sheets, chart sheets and macro sheets. In Excel 2003, “.xml” is used for XML-based spreadsheet or data files (XML = Extensible Markup Language). In Office Excel 2007, the following formats and file extensions are used to distinguish different file types and to improve security:
Excel Workbook (.xlsx): The default Office Excel 2007 XML-based file format. It cannot store VBA macro code or Microsoft Office Excel 4.0 macro sheets (.xlm).
Excel Workbook (code) (.xlsm): The Office Excel 2007 XML-based and macro-enabled file format. It stores VBA macro code or Excel 4.0 macro sheets (.xlm).
Excel Binary Workbook (.xlsb): The Office Excel 2007 Binary file format (BIFF12).
Template (.xltx): The default Office Excel 2007 file format for an Excel template. It cannot store VBA macro code or Excel 4.0 macro sheets (.xlm).
Template (code) (.xltm): The Office Excel 2007 macro-enabled file format for an Excel template. It stores VBA macro code or Excel 4.0 macro sheets (.xlm).
Excel Add-In (.xlam): The Office Excel 2007 XML-based and macro-enabled Add-In, a supplemental program that is designed to run additional code. It supports the use of VBA projects and Excel 4.0 macro sheets (.xlm).
Moreover, the following file types (or filename extensions) of previous versions of Excel are still valid Excel files in Office Excel 2007, and can be opened or saved without being converted to the 2007 format:
Excel 97-2003 Workbook (.xls): The Excel 97 - Excel 2003 Binary file format (BIFF8).
Excel 97-2003 Template (.xlt): The Excel 97 - Excel 2003 Binary file format (BIFF8) for an Excel template.
Excel 5.0/95 Workbook (.xls): The Excel 5.0/95 Binary file format (BIFF5).
XML Spreadsheet 2003 (.xml): XML Spreadsheet 2003 file format (XMLSS).
XML Data (.xml): XML Data format.
It should be noted that Excel files created in any version can be opened in Office Excel 2007 and saved back in their original file type. However, Office Excel 2007 files cannot be opened in earlier versions of MS Excel unless the optional Office update for file format conversion is installed.
1.3 Data Handling Capacity of Microsoft Office Excel 2007
To allow massive amounts of data to be explored in worksheets, Office Excel 2007 supports 1,048,576 rows by 16,384 columns per worksheet (2^34 cells, i.e., about 17 billion cells). Household survey datasets do not normally exceed this size, which allows more than one million cases across sixteen thousand variables. Therefore, practically any household survey dataset can be exported to Excel, and further analyses can be conducted in Excel 2007, which is much more familiar to most education planners and administrators.
As seen in the above exhibit, an Office Excel 2007 worksheet is “1 K” (1024) times larger than an Excel 2003 worksheet. Although Excel 2007 files can be opened in Excel 2003, the contents of Excel 2007 worksheets that lie outside the Excel 2003 boundaries (65,536 rows x 256 columns) cannot be retrieved into Excel 2003. Other improvements in Office Excel 2007 compared to Excel 2003 include the following: (a) the number of formatting types allowed in the same workbook is increased from 4 thousand to unlimited; (b) the number of cell references per cell is increased from 8 thousand to a number limited only by available memory; (c) memory management is increased from 1 GB to 2 GB; (d) up to 16 million colors are supported; and (e) dual processors and multithreaded chipsets are supported. With these improvements, the general performance of Excel has moved forward. Moreover, on computers with advanced chipsets, calculations are much faster in large, formula-intensive worksheets.
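The capacity figures quoted above can be checked with a few lines of arithmetic. The Python sketch below uses only the row and column limits stated in the text; `fits_in_worksheet` is a hypothetical helper, and the figure of 100 variables in the usage lines is an arbitrary illustration.

```python
# Worksheet limits as stated in the text.
ROWS_2007, COLS_2007 = 1_048_576, 16_384   # 2**20 rows, 2**14 columns
ROWS_2003, COLS_2003 = 65_536, 256         # 2**16 rows, 2**8 columns

cells_2007 = ROWS_2007 * COLS_2007
cells_2003 = ROWS_2003 * COLS_2003

print(cells_2007 == 2 ** 34)        # True
print(cells_2007 // cells_2003)     # 1024, the "1 K" factor


def fits_in_worksheet(cases, variables, rows=ROWS_2007, cols=COLS_2007):
    """Check whether a dataset fits on one sheet (hypothetical helper).

    One extra row is reserved for the variable names.
    """
    return cases + 1 <= rows and variables <= cols


# 53,413 is the number of records in the example database used in this module.
print(fits_in_worksheet(53_413, 100))
print(fits_in_worksheet(70_000, 100, ROWS_2003, COLS_2003))
```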
1.4 Selected Statistical Functions in Microsoft Office Excel 2007
There are altogether 346 built-in functions in 12 categories in Excel 2007. The categories are summarized below in descending order of the number of functions:

Sr.  Category                             Number  Per cent
1    Statistical functions                    82     23.7%
2    Math and trigonometry functions          60     17.3%
3    Financial functions                      53     15.3%
4    Engineering functions                    39     11.3%
5    Text functions                           27      7.8%
6    Date and time functions                  20      5.8%
7    Lookup and reference functions           18      5.2%
8    Information functions                    17      4.9%
9    Database functions                       12      3.5%
10   Cube functions                            7      2.0%
11   Logical functions                         6      1.7%
12   Add-in and Automation functions           5      1.4%
     Total                                   346    100.0%
It is difficult to say which Excel functions are required for analyzing household survey data and which are not, since this depends on the experience of the user and the types of output to be generated. The following functions are directly relevant to analyzing a database or refining PASW Statistics output tables:
DAVERAGE: Returns the average of selected database entries
DCOUNT: Counts the cells that contain numbers in a database
DCOUNTA: Counts nonblank cells in a database
DGET: Extracts from a database a single record that matches the specified criteria
DMAX: Returns the maximum value from selected database entries
DMIN: Returns the minimum value from selected database entries
DSTDEV: Estimates the standard deviation based on a sample of selected database entries
DSUM: Adds the numbers in the field column of records in the database that match the criteria
DVAR: Estimates variance based on a sample from selected database entries
AND: Returns TRUE if all of its arguments are TRUE
FALSE: Returns the logical value FALSE
IF: Specifies a logical test to perform
NOT: Reverses the logic of its argument
OR: Returns TRUE if any argument is TRUE
TRUE: Returns the logical value TRUE
ROUND: Rounds a number to a specified number of digits
ROUNDDOWN: Rounds a number down, toward zero
ROUNDUP: Rounds a number up, away from zero
SQRT: Returns a positive square root
SUBTOTAL: Returns a subtotal in a list or database
SUM: Adds its arguments
SUMIF: Adds the cells specified by a given criteria
SUMIFS: Adds the cells in a range that meet multiple criteria
SUMPRODUCT: Returns the sum of the products of corresponding array components
AVERAGE: Returns the average of its arguments
AVERAGEA: Returns the average of its arguments, including numbers, text, and logical values
AVERAGEIF: Returns the average (arithmetic mean) of all the cells in a range that meet a given criteria
AVERAGEIFS: Returns the average (arithmetic mean) of all cells that meet multiple criteria
COUNT: Counts how many numbers are in the list of arguments
COUNTA: Counts how many values are in the list of arguments
COUNTBLANK: Counts the number of blank cells within a range
COUNTIF: Counts the number of nonblank cells within a range that meet the given criteria
FREQUENCY: Returns a frequency distribution as a vertical array
GEOMEAN: Returns the geometric mean
GROWTH: Returns values along an exponential trend
HARMEAN: Returns the harmonic mean
MAX: Returns the maximum value in a list of arguments
MAXA: Returns the maximum value in a list of arguments: numbers, text, and logical values
MEDIAN: Returns the median of the given numbers
MIN: Returns the minimum value in a list of arguments
MINA: Returns the smallest value in a list of arguments: numbers, text, and logical values
MODE: Returns the most common value in a data set
PERCENTILE: Returns the k-th percentile of values in a range
QUARTILE: Returns the quartile of a data set
RANK: Returns the rank of a number in a list of numbers
STDEV: Estimates standard deviation based on a sample
STDEVA: Estimates standard deviation based on a sample, including numbers, text, and logical values
TREND: Returns values along a linear trend
TRIMMEAN: Returns the mean of the interior of a dataset
The detailed descriptions of these functions and examples can be seen in online help of Microsoft Excel 2007, and thus, will not be elaborated in this module.
FURTHER ANALYSES AND PRESENTING OUTPUTS IN MS EXCEL
With its extended data handling capacity, Microsoft Excel makes it possible to analyze any dataset from household surveys in support of EFA monitoring. However, it is much easier to use popular data analysis software such as PASW Statistics, then export the outputs to Excel, and elaborate and present them with MS Excel.
2.1 Importing PASW Database into Microsoft Office Excel
To read PASW Statistics (*.sav) data files directly in applications that support Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC), the PASW Statistics data file driver is required. PASW Statistics itself supports ODBC in the Database Wizard, providing the ability to leverage the Structured Query Language (SQL) when reading SAV data files in PASW Statistics. The PASW Statistics data file driver is packaged, together with other drivers that may be required for accessing different types of databases, in a “Data Access Pack (DAP)”, which can be downloaded from the PASW Statistics website. A version of DAP for Windows, “DAPWin32_5.3_SP2.exe” (file size: 36,624 KB), is provided on the training CD.
After DAP is installed, there will be an “SPSS Inc OEM Connect and ConnectXE for ODBC 5.3” program group in the “Start Menu Programs”. Click “ODBC Administrator”, and follow the steps below to get access to PASW Statistics data files (*.sav) from applications with ODBC capabilities:
1. Click the “File DSN” tab; and
2. Click the “Add” button to add a new data source.
The “Create New Data Source” dialog box will appear, listing all available drivers on the computer.
3. Select “SPSS Inc. 32-Bit Data Driver (*.sav)”; and
4. Click “Next”. The dialog will then request a new Data Source Name (DSN).
5. Type in an appropriate DSN (“SPSS-Training” in this example); and
6. Click “Next”.
7. The “Create New Data Source” dialog will provide summary information on the current settings. If it is correct, click “Finish” to complete the creation of the “file DSN”. At this point the program will ask for the location of the PASW Statistics data files; fill in the correct folder name with its complete “path”.
8. In this example, type in “c:\....\My Documents\SPSS Training\Sample”, where all sample datasets are stored, and click “OK”;
9. Click “OK” again to complete and exit from the “ODBC Data Source Administrator”.
After the new ODBC data source has been created, the newly defined file DSN, “SPSS-Training”, will be listed in Windows applications with ODBC capabilities. Any PASW data file (*.sav) located in the specified folder can be accessed from other applications, which can retrieve the full dataset through “existing ODBC connections” or part of it through “Microsoft Query”. When “Existing Connections” is clicked under the “Data” menu in Office Excel 2007, “SPSS-Training” is displayed as one of the existing external data sources for Excel (see “A” in the following exhibit). By selecting this connection, one can retrieve any whole dataset from the list. Similarly, when “From Other Sources” is clicked and “From Microsoft Query” is selected, “SPSS-Training” appears as a data source (see “B”), and by following the Wizard, users can retrieve part of a dataset: only the cases which satisfy the set conditions and only the selected variables.
In short, follow the steps below to import a complete PASW Statistics dataset into Excel 2007:
1. Click the “Data” tab;
2. Click the “Existing Connections” button to get the “Existing Connections” dialog box;
3. Select “SPSS-Training” from the list of available connections; and
4. Click “Open”, and a complete list of the PASW Statistics datasets in the specified folder (set while creating the file DSN “SPSS-Training”) will be displayed as “Tables”.
5. Select the dataset (by clicking on its name) and click the “OK” button;
6. In the import data window, select where to place the imported data: in the “Existing worksheet” (active worksheet) or in a “New worksheet”. If the “Existing worksheet” is selected, one can define the place to import the data (the default is $A$1).
7. Click “OK” to start the importing process, which will take a few minutes.
At the end of the importing process, the PASW dataset will be placed on the specified Excel worksheet with a name like “Table_SPSS_Training” and treated as an Excel “Database Table”.
Warning: Importing data into Excel (as well as into other databases) retrieves only the data values, not the metadata (labels, missing values, etc.). Therefore, the user must have the codebook of the dataset (and the survey questionnaire) before doing any analysis.
As usual, after the PASW datasets have been imported successfully, the Excel file with the imported databases must first be saved with an appropriate name. In this example, the file is saved with the name “Excel2.xlsx”. When the Excel file with the imported database is opened, Office Excel 2007 will issue a “Security Warning” with the message “Data connections have been disabled”, together with an “Option” tab. If the imported data needs to be updated from the source PASW dataset, or another dataset needs to be imported, the user must enable the data connection. Otherwise, the user can leave the data connection disabled.
In the Excel worksheet, the variable names are placed on the first row, with “Autofilter” enabled for all variables. The “Autofilter” feature can assist in checking for invalid entries and in selecting the cases which fulfil specified rules. If “Autofilter” is not required, it can be turned off by clicking on the filter button, and turned on again by clicking the button once more.
Example: To select the cases for children aged 6, click the down arrow next to “HV105”, clear the tick next to “Select All” (to unselect all values), tick the box next to 6 and click “OK”. In the following exhibit, the “status bar” (located at the bottom left corner) shows that there are altogether 53,413 records (or cases) in the database, of which only 1,302 records for children aged 6 are found and selected.
If a second variable, sex (HV104), is then filtered to show only “1 (Male)”, the following output will be obtained, with only 656 records (boys aged 6).
Selected value Non-selected value
Even if the entire worksheet is selected and copied while the filter is active, and then pasted on a new sheet, only the filtered records (the unhidden rows) will be pasted in the new worksheet. Unwanted variables can then be selected and deleted column by column to clean up the Excel database. The final result is exactly the same as that imported through “Microsoft Query”, which is more complicated for those who are not acquainted with manipulating databases (see the steps in the following exhibits).
Select the dataset and send entire set, or variable by variable to the right pane
Setting Condition 1 to import only the cases of children aged 6
The database can be sorted while importing with the selected variables
Setting Condition 2 to import only the cases of “boys”
Setting Option set the location of the imported database; and the query can be saved for future use!
The Output (Result) There are 656 cases (+1 row for variable names) in the imported database for the “aged 6 boys”
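The two successive conditions (age 6, then male) applied through Autofilter or Microsoft Query amount to a simple record filter. The Python sketch below shows that logic on a handful of invented records; only the variable names HV104 and HV105 and their codes come from the text, and the counts it prints are unrelated to the 1,302 and 656 records reported for the real dataset.

```python
# Hypothetical records; HV104 = sex (1 = male, 2 = female), HV105 = age.
records = [
    {"HV104": 1, "HV105": 6},
    {"HV104": 2, "HV105": 6},
    {"HV104": 1, "HV105": 7},
    {"HV104": 1, "HV105": 6},
    {"HV104": 2, "HV105": 10},
]

# Condition 1: keep only the children aged 6.
aged_6 = [r for r in records if r["HV105"] == 6]

# Condition 2: among those, keep only the boys.
aged_6_boys = [r for r in aged_6 if r["HV104"] == 1]

print(len(records), len(aged_6), len(aged_6_boys))
```

Applying the conditions one after another gives the same result as applying both at once, which is why filtering in the worksheet and filtering in the query produce the same database.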
2.2 Creating Frequency and Crosstab Tables
The Excel function “FREQUENCY” is useful for creating an entire frequency table from a range of cells or from a variable in a database table. “COUNTIFS”, on the other hand, can be used to get the appropriate value for a single cell of a frequency table or crosstab table.
Using FREQUENCY Function
“FREQUENCY” is a worksheet function in the “Statistical functions” category. It counts how often values occur within a range of values, and then returns a vertical array of numbers. For example, FREQUENCY can be used to count the number of males and females among the household members. Because FREQUENCY returns an array, it must be entered as an array formula. The following steps construct a table presenting the sex distribution of household members, both in absolute numbers and as a percentage distribution, using the FREQUENCY function. The variable used is “HV104”, with the codes “1=Male” and “2=Female”, in the imported database “SPSS_Training”.
1. Prepare the table structure, formulas and “bin” array as in the following exhibit;
2. Select cell “B3” and type in “=FREQUENCY(SPSS_Training[HV104],$G$3:$G$4)”;
3. Select the range “B3:B4”;
4. Press “F2” to get into formula editing mode, and press “<Ctrl><Shift>ENTER” to re-enter the formula as an array formula; and
5. Set the display formats of the number cells and the table, as necessary.
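The counting rule behind FREQUENCY can be sketched in Python for readers who want to see what the array formula computes. This is an illustrative re-implementation, not Excel's own code: it assumes numeric data and bins sorted in ascending order, counts each value in the first bin whose upper bound is greater than or equal to the value, and returns one extra element for values above the last bin. The sample data are invented.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Mimic the counting rule of FREQUENCY for ascending numeric bins."""
    counts = [0] * (len(bins) + 1)   # last slot: values above the top bin
    for value in data:
        # Index of the first bin whose upper bound is >= value.
        counts[bisect_left(bins, value)] += 1
    return counts


# Hypothetical HV104 values (1 = male, 2 = female) and the "bin" array 1, 2.
hv104 = [1, 2, 2, 1, 2, 1, 2]
print(frequency(hv104, [1, 2]))               # [3, 4, 0]

# Grouping ages into "up to 14", "15 to 24" and "25 and above".
print(frequency([5, 15, 24, 25, 40], [14, 24]))   # [1, 2, 2]
```

In the worksheet example only two result cells (B3:B4) are selected, so the extra element for values above the last bin is simply not displayed.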
Using COUNTIF or COUNTIFS Function
A frequency table can also be constructed with the COUNT functions. The frequency table above can be constructed as follows:
1. Prepare the table structure, formulas and “codes” as in the previous example;
2. Select cell “B3”, and type in “=COUNTIF(SPSS_Training[HV104],G3)”;
3. Copy “B3” and paste at “B4”; and
4. Apply the display formats of the number cells and the table, as necessary.
Note: The formula “=COUNTIFS(SPSS_Training[HV104],G3)” can also be used in this example. COUNTIF allows only one condition, while COUNTIFS can be used with multiple conditions.
Using COUNTIFS Function to construct a crosstab table
Although the “FREQUENCY” function cannot be used to construct a crosstab table, the “COUNTIFS” function can be used to obtain the value of each cell of the table. The following example shows how to construct a crosstab table of educational attainment (HV109) by sex of household members (HV104) for the population aged 15-24 (Age: HV105):
1. Prepare the table structure, formulas and “codes” for both variables;
2. Select cell “B5”, and type in:
=COUNTIFS(SPSS_Training[HV109],$I5,SPSS_Training[HV104],B$14,SPSS_Training[HV105],">14") - COUNTIFS(SPSS_Training[HV109],$I5,SPSS_Training[HV104],B$14,SPSS_Training[HV105],">24");
Here, the first COUNTIFS counts the population older than 14 (aged 15 and above) with the specified education level and sex, and the second COUNTIFS counts the population older than 24 (aged 25 and above) with the same characteristics. The difference between the two is therefore the population aged 15-24.
3. Copy “B5” and paste to the range “B4:C11”; and
4. Complete the formulas, apply the display formats, etc., as necessary to obtain the following output table.
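The subtraction trick in Step 2 can be checked with a small Python sketch: the count of persons older than 14 minus the count of persons older than 24 should equal a direct count of ages 15 to 24. The helper `countifs` is a hypothetical, rough analogue of the Excel function (not its actual implementation), and the `people` records are made up for illustration.

```python
def countifs(rows, **criteria):
    """Count rows in which every named field passes its test function."""
    return sum(
        all(test(row[field]) for field, test in criteria.items())
        for row in rows
    )


# Made-up household members: education (HV109), sex (HV104), age (HV105).
people = [
    {"HV109": 1, "HV104": 1, "HV105": 16},
    {"HV109": 1, "HV104": 1, "HV105": 30},
    {"HV109": 1, "HV104": 2, "HV105": 20},
    {"HV109": 2, "HV104": 1, "HV105": 24},
    {"HV109": 1, "HV104": 1, "HV105": 14},
    {"HV109": 1, "HV104": 1, "HV105": 15},
]


def cell(edu, sex):
    """Value of one crosstab cell for ages 15-24, using the difference of two counts."""
    same = {"HV109": lambda v: v == edu, "HV104": lambda v: v == sex}
    older_than_14 = countifs(people, HV105=lambda a: a > 14, **same)
    older_than_24 = countifs(people, HV105=lambda a: a > 24, **same)
    return older_than_14 - older_than_24


def cell_direct(edu, sex):
    """The same cell, counted directly with an age range condition."""
    return countifs(
        people,
        HV109=lambda v: v == edu,
        HV104=lambda v: v == sex,
        HV105=lambda a: 15 <= a <= 24,
    )


for edu in (1, 2):
    print(edu, [cell(edu, sex) for sex in (1, 2)])
```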
As described above, frequency and crosstab tables can be constructed in Microsoft Office Excel. However, constructing such tables is much more complicated if the sampling procedure requires “weighting”. In that case, construct the tables with “weight on” in PASW Statistics and export the outputs to Microsoft Office Excel for further elaboration and presentation.
2.3 PivotTables (OLAP Cubes)
Unweighted frequency and crosstab tables with multiple layers, which are useful in analyzing household survey data, can be constructed in Microsoft Office Excel with the PivotTable technique. A PivotTable is an interactive way to quickly summarize large amounts of data, to conduct in-depth analysis and to answer unanticipated questions about the data. It is especially designed for:
Querying large amounts of data in many user-friendly ways;
Subtotaling and aggregating numeric data, summarizing by categories and subcategories, and creating custom calculations and formulas;
Expanding and collapsing levels of data to focus the results, and drilling down to details from the summary data for areas of interest;
Moving rows to columns or columns to rows to see different summaries of the source data;
Filtering, sorting, grouping, and conditionally formatting the most useful and interesting subset of data to enable focus on the required information; and
Presenting concise, attractive, and annotated online or printed reports.
In a PivotTable, each column in the source data (or database) becomes a PivotTable field (a “field” in Excel is a “variable” in PASW Statistics) that summarizes multiple rows of information. A value field provides the values to be summarized. By default, data (of the variables) in the “Values” area summarize the underlying source data in the PivotTable using the SUM function for numeric variables and the COUNT function for text (string) variables.
To create a PivotTable, first define its source data, specify a location in the workbook or the database table, and lay out the fields as follows:
1. Select the sheet with the imported database and click the “Insert” tab in the main menu;
2. Click the “PivotTable” button to get the “Create PivotTable” dialog box;
3. Since the active worksheet contains the imported “SPSS_Training” database table, it will appear automatically in the “Table/Range” selection box. However, users can change the data source to another table or to a specific range (e.g., A1:C2000);
4. Select where to place the PivotTable: “New Worksheet” or “Existing Worksheet”. If “Existing Worksheet” is selected, the user should provide the first cell address. In this example, leave it as the default, “New Worksheet”; and
5. Click “OK” to create a new worksheet with “PivotTable creation tools”.
A new worksheet, equipped with tools to assist in creating a PivotTable, will then be created as follows:
Place mark for PivotTable 1
Newly added sheet
The following tools are available for creating, elaborating and editing the PivotTable.
6. From the “PivotTable Field List”, select variables (or fields) and drag and drop them to:
(a) Values: the variables on which the actual summarization (count, sum, etc.) is made;
(b) Row Labels: the variables to be displayed on the rows (can be nested);
(c) Column Labels: the variables to be displayed on the columns (can be nested);
(d) Report Filter: the variables to be used for filtering/subsetting the database.
As soon as a variable is dragged and dropped into a box, the PivotTable placeholder on the worksheet is replaced with an actual PivotTable with default settings.
Construction of a “PivotTable” will be demonstrated by creating a crosstab table of “Educational attainment by Sex for Population Aged 15-24”. To do this, first decide which variables (or fields) to put into which box (value, row, column or filter) to get the required table. In this example, educational attainment (HV109) is the key variable to be explored, and its education levels are also to be displayed in the rows. Therefore, drag HV109 from the PivotTable field list and drop it into both: (a) value (to count how many persons are in each category), and (b) row (to display education levels in rows). The following PivotTable showing the “frequency of HV109” will then be created:
The items displayed in rows can also be selected. For example, eight items (0, 1, 2, 3, 4, 5, 8, and (blank)) are displayed in cells A4 through A11. Since the code “8” represents “unknown” and “(blank)” is simply a “missing value”, these two items, or at least the item “(blank)”, should not be displayed in the frequency table. To hide “(blank)”, click on the dropdown next to “Row Labels”, uncheck the box next to “(blank)” and click “OK”. In this example, however, this refinement will be made only when finalizing the PivotTable.
The next step is to place “Sex (HV104)” into column box to obtain the following crosstab table:
Here, “value labels” can be typed directly into a PivotTable, and the new labels will replace the defaults. For example, the column label “1” can be replaced with “Male” and “2” with “Female”. This fine-tuning will be carried out only when finalizing the PivotTable. The current PivotTable represents the entire household population irrespective of age, but the requirement is only for the “population aged 15-24”. To fulfill this requirement, the cases must be filtered by “age”. Therefore, send the variable “age (HV105)” to the “filter” box. Note that, although the filtering variable is set, the table remains unchanged until a filter is actually applied. Therefore, click on the “dropdown” icon next to “(All)” in cell B2, tick the “Select Multiple Items” checkbox, and leave the ticks only for the ages 15 to 24 inclusive.
The exhibit above presents the PivotTable after tuning up the captions (value labels) and column widths. A PivotTable can be copied, in whole or in part, for use elsewhere. A PivotTable is especially useful if multiple tables with the same structure are required for different groups (e.g. for different ages), or if the same table is to be presented with selected rows and/or columns only. For example, the same table for adults (aged 15+) can be created by clicking the dropdown icon next to “(Multiple Items)” in cell B2, ticking “(All)” first, and then clearing the ticks next to “0”, “1”, “2”, …, “14” (see A). Similarly, to create a table for all adults but with “up to complete primary” education only, click the dropdown icon next to “Row Labels” and select only the first three categories (see B).
As seen in these examples, the PivotTable method is user-friendly, powerful and efficient in analyzing household survey data, especially for surveys applying “self-weighting” sampling designs.
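The idea behind the four PivotTable boxes (a value that is counted, a row field, a column field and a report filter) can be illustrated with a short Python sketch. It is not how Excel builds a PivotTable internally; the function name `pivot_count` and the records are assumptions made for illustration. Changing only the filter produces a second table with the same layout, which mirrors switching from ages 15-24 to ages 15+ in the report filter.

```python
from collections import defaultdict


def pivot_count(rows, row_field, col_field, report_filter=lambda r: True):
    """Count records per (row, column) combination among rows passing the filter."""
    table = defaultdict(lambda: defaultdict(int))
    for record in rows:
        if report_filter(record):
            table[record[row_field]][record[col_field]] += 1
    return {row: dict(cols) for row, cols in table.items()}


# Made-up household members: education (HV109), sex (HV104), age (HV105).
people = [
    {"HV109": 0, "HV104": 1, "HV105": 17},
    {"HV109": 1, "HV104": 2, "HV105": 22},
    {"HV109": 1, "HV104": 2, "HV105": 40},
    {"HV109": 2, "HV104": 1, "HV105": 15},
    {"HV109": 0, "HV104": 1, "HV105": 9},
]

# Same layout, two different report filters.
youth = pivot_count(people, "HV109", "HV104", lambda r: 15 <= r["HV105"] <= 24)
adults = pivot_count(people, "HV109", "HV104", lambda r: r["HV105"] >= 15)
print(youth)
print(adults)
```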
2.4 Drawing Pivot Charts
A PivotChart provides a graphical representation of the data in a PivotTable. The layout and data displayed in a PivotChart can be changed just as in a PivotTable. A PivotChart always has an associated PivotTable that uses a corresponding layout. The two have fields that correspond to each other; that is, when the position of a field is changed in the PivotTable, the corresponding field in the PivotChart also moves. In addition to the series, categories, data markers, and axes of standard charts, PivotChart reports have some specialized elements that correspond to the PivotTable, as follows:
Filter field: A field used to filter data by specific items. In the example, the “age” field displays data for all ages. To display data for a single age or selected ages, click the drop-down arrow next to (All) and then select one or more ages.
Values field: A field from the underlying source data that provides values to compare or measure. Depending on the source data of the report, the summary function can be changed to Average, Count, Product, or another calculation.
Series field: A field that is assigned to a series orientation in a PivotChart. The items in the field provide the individual data series. In a chart, series are represented in the legend.
Item: Items represent the unique entries in a column or row field, and appear in the dropdown lists for report filter fields, category fields, and series fields. Items in a category field appear as the labels on the category axis of the chart. Items in a series field are listed in the legend and provide the names of the individual data series.
Category field: A field from the source data assigned to a category orientation in a PivotChart report. It provides the individual categories for which data points are charted. In a chart, categories usually appear on the x-axis, or horizontal axis, of the chart.
Customizing the chart: The chart type and other options (such as the titles, the legend placement, the data labels, the chart location, and so on) can be changed.
A PivotChart can be created automatically when creating a PivotTable, or from an existing PivotTable. To create a PivotChart from an existing PivotTable, follow these steps:
1. Select any cell in the existing PivotTable; two new menu items, “Options” and “Design”, will be added (under the “PivotTable Tools” group) in the main menu;
2. Click “PivotChart” under the “Options” tab to get the “Insert Chart” dialog box;
3. Choose the “Chart Type” from the “Insert Chart” dialog box; and
4. Click “OK” to create a basic PivotChart together with a “PivotChart Filter Pane”.
A PivotChart created automatically is only a “draft”; in particular, it has no chart title. It must therefore be edited using the following tools, which become available when an active PivotChart is clicked:
For example, to add “Education Level by Sex (Aged 15+)” as the chart title above the chart, click on “Chart Title” under the “Layout” command, and select the third option, “Above Chart”. After a few other touch-ups, such as moving the legend, changing the chart design, and adding border lines to the plot area, the following PivotChart is complete and ready to use.
Another useful adjustment in both PivotTable and PivotChart is to display the “values” not in absolute numbers but in percentages. To review the percentage distribution of education level by sex for adults:
1. Click on the “dropdown” of the variable in the “value” area;
2. Select “Value Field Settings…”;
3. Select the “Show values as” tab in the “Value Field Settings” dialog box;
4. Select “% of column” in the “Show values as” dropdown list; and
5. Click “OK”.
Then, the following table and chart will be obtained after adjusting display formats, especially number of decimal places in percentages.
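What “% of column” computes can be stated in a few lines of Python: each cell is divided by the total of its own column, so every column sums to 100%. The counts below are made up for illustration, and the function name `percent_of_column` is hypothetical.

```python
# Made-up counts of adults by education level (rows) and sex (columns).
counts = {
    "No education": {"Male": 20, "Female": 30},
    "Primary": {"Male": 60, "Female": 50},
    "Secondary": {"Male": 20, "Female": 20},
}


def percent_of_column(table):
    """Express every cell as a percentage of its column total."""
    column_totals = {}
    for row in table.values():
        for col, n in row.items():
            column_totals[col] = column_totals.get(col, 0) + n
    return {
        label: {col: 100.0 * n / column_totals[col] for col, n in row.items()}
        for label, row in table.items()
    }


percentages = percent_of_column(counts)
for label, row in percentages.items():
    print(f"{label:<14}" + "".join(f"{value:>8.1f}" for value in row.values()))
```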
2.5 Elaborating PASW Outputs for Better Presentation
Although PivotTable and PivotChart are a user-friendly and efficient way to present household survey data in both tabular and graphical form, PASW Statistics provides a broader range of methods and options, and is capable of using “weights” for complex sampling techniques. On the other hand, Excel is more familiar to users, and makes it easier to carry out further analyses on output tables from different analyses. Therefore, the best blend is to analyse the dataset in PASW Statistics and to finalize the outputs in Office Excel. This section demonstrates, step by step, the use of “weights” in calculating the school-age population and the number of children currently attending school in PASW Statistics, and the calculation and presentation of age-specific enrolment rates in Microsoft Office Excel 2007.
Basics of Weighting
In a town with 2 Wards, there are 100 children aged 6-10 in Ward-1 and 50 in Ward-2. A survey on “schooling status” was conducted by selecting 25 of those children from Ward-1 and 20 from Ward-2. It was found that 5 children (out of 25) from Ward-1 and 6 children (out of 20) from Ward-2 were not currently in school. Therefore, the percentage of out-of-school children (say, POS) can be estimated as:
POS (Ward-1) = 5 / 25 x 100 = 20.0%, and
POS (Ward-2) = 6 / 20 x 100 = 30.0%.
The percentage of out-of-school children in the town can be estimated as:
POS (Ward 1+2) = 11 / 45 x 100 = 24.4%. …………. (1)
or as:
POS (Ward 1+2) = (20.0%+30.0%) / 2 = 25.0%. …………. (2)
Although the percentages of out-of-school children by Ward represent their respective Wards, neither of the percentages calculated above for the entire town is a correct estimate. The main reason is that the samples are not “self-weighting”, that is, they are unbalanced between the two Wards: the sampling fraction for Ward-1 is 25 / 100 or 25.0%, while that for Ward-2 is 20 / 50 or 40.0%. In other words, a child in the sample from Ward-1 represents 4 children, while a sample child from Ward-2 represents just 2.5 children. A correct estimate for the town should be calculated as follows: since POS (Ward-1) is 20.0%, 20 out-of-school children (20.0% x 100) are expected in Ward-1, and another 15 children (30.0% x 50) are expected in Ward-2. Therefore, there could be 35 out-of-school children out of 150 children aged 6-10, and the POS for the town is 35 / 150 x 100 = 23.3%.
Equivalently, the number of out-of-school children in Ward-1 and Ward-2 can be estimated as 5 x 4.0 = 20 (since one child in the sample represents 4 children in Ward-1) and 6 x 2.5 = 15. The numbers 4.0 and 2.5 are known as “sample weights”, and are normally provided in the datasets.
In PASW Statistics, it is easy to apply weights if they are provided in the dataset:
1. Click “Data” on the main menu;
2. Click “Weight Cases…” and the “Weight Cases” dialog box will appear;
3. In the “Weight Cases” dialog box, set “Weight cases by”;
4. Select the variable representing the “weight” (it is HV005 – Sample weight in the DHS dataset); and
5. Click “OK” to complete the weighting process.
When weighting is no longer necessary, select “Do not weight cases” in Step 3 above and click “OK” to stop weighting.
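The Ward-1 and Ward-2 arithmetic above can be reproduced with a short Python sketch. It uses only the figures given in the example; the dictionary layout and the function name `weighted_pos` are assumptions made for illustration. The sample weight of each Ward is computed as population divided by sample size (4.0 and 2.5).

```python
# Figures from the two-Ward example in the text.
wards = {
    "Ward-1": {"population": 100, "sampled": 25, "out_of_school": 5},
    "Ward-2": {"population": 50, "sampled": 20, "out_of_school": 6},
}


def weighted_pos(strata):
    """Weighted percentage of out-of-school children across all strata."""
    estimated_out = 0.0
    total_population = 0
    for stratum in strata.values():
        weight = stratum["population"] / stratum["sampled"]  # children represented per sampled child
        estimated_out += stratum["out_of_school"] * weight
        total_population += stratum["population"]
    return 100.0 * estimated_out / total_population


# Unweighted estimate (1): pool the two samples as if they were self-weighting.
unweighted = 100.0 * sum(w["out_of_school"] for w in wards.values()) / sum(
    w["sampled"] for w in wards.values()
)
weighted = weighted_pos(wards)

print(f"Unweighted POS: {unweighted:.1f}%")
print(f"Weighted POS:   {weighted:.1f}%")
```

The unweighted figure overstates the town-level percentage here because Ward-2, which has the higher out-of-school rate, is over-represented in the sample.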
The following tables represent population aged 6-10 by sex with and without weighting.
The differences due to weighting can be observed in the percentage distribution of population by age and sex. Similarly, the following tables present the weighted and unweighted numbers of children currently attending school (HV110 – Member still in school) by age and sex.
From these two sets of tables, one can calculate in Excel the proportion of children currently attending school by age and sex, or the percentage of out-of-school children by age and sex. Since it is easier to export all outputs from the PASW Statistics Viewer at once, first clear unnecessary outputs, such as logs, notes, and case processing summaries; then export to Excel:
1. Click “File” on the main menu;
2. Click “Export…” and the “Export Output” dialog box will appear;
3. In the “Export Output” dialog box: a) set “All” in “Objects to Export”; b) set “Excel (*.xls)” in “Document Type”; c) provide the “File Name” with its folder path; and d) click “OK” to begin exporting the outputs to Excel.
At the end of this process, an Excel file, “POS.xls”, will be placed in the specified folder with four cross-tabulation tables exported from PASW. The files can be seen in the following exhibit.
By adding three more columns to the last two tables and dividing the number of children in school by the total number of children of the respective age and sex, the required percentage of children in school can be obtained easily. This is not as simple in PASW. The percentages of children in school by age and sex are presented in the following tables, with rephrased titles and captions.
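The division described above (children in school divided by all children of the same age and sex) can be sketched in Python as follows. The counts are made up for illustration and are not taken from the exported tables; the function name `enrolment_rates` is hypothetical.

```python
# Made-up weighted counts by age and sex; not taken from the exported tables.
population = {
    6: {"Male": 200, "Female": 180},
    7: {"Male": 210, "Female": 190},
}
in_school = {
    6: {"Male": 150, "Female": 144},
    7: {"Male": 189, "Female": 171},
}


def enrolment_rates(attending, total):
    """Percentage of children in school for every age and sex combination."""
    return {
        age: {sex: 100.0 * attending[age][sex] / total[age][sex] for sex in total[age]}
        for age in total
    }


rates = enrolment_rates(in_school, population)
for age, by_sex in rates.items():
    print(age, {sex: round(rate, 1) for sex, rate in by_sex.items()})
```

The percentage of out-of-school children for each cell is simply 100 minus the corresponding rate.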
To demonstrate the visualization of data through charts, the percentage of children in school by age (or Age-Specific Enrolment Rate) will be presented in “3-D Clustered Column” and “Line” charts, which are appropriate for the data. To create a 3-D Clustered Column chart, follow these steps:
1. On the table, (a) select Cell “A36”, unmerge and type “Age” into Cell “B37”; similarly, (b) select Cell “A43”, unmerge and type “6-10” into Cell “B43”;
2. Select the “Data Source” to create the chart: age (B37:B43 - X-axis), percentage of children in school for male (F37:F43 - Series 1) and for female (G37:G43 - Series 2);
3. Click “Insert” on the main menu;
4. Click “Column” to get the list of available Column Charts. When the user places the mouse on “Column”, concise but useful information pops up: “Column charts are used to compare values across categories”. Similar information pops up when pointing at other chart types;
5. Click the “3-D Clustered Column” icon, the first one under the “3-D Column” group. A draft chart based on the provided data will then be displayed instantly.
6. The next step is to finalize the chart in Excel:
a. Click on the chart, and then click on “Layout” under “Chart Tools”;
b. Click “Chart Title”, select “Above Chart” to insert a space for the chart title, and type “Age-Specific Enrolment Rate by Sex, Aged 6-10” into that space; and
c. Click “Axis Titles”, set “Primary Horizontal Axis Title” to “Title Below Axis”, and type “Age” into the space that appears.
At this stage, the chart is usable. However, more polishing could be carried out, such as:
d. To change the location of the legend (just select, drag and drop at the new location);
e. To change the gap width between items (select one series, right-click to get the pop-up menu, click “Format data series”, and set “Gap width/depth”);
f. To change the series colour (select one series, right-click to get the pop-up menu, click “Format data series”, and set the colour in “Fill”);
g. To format any … (select that item, right-click to get the pop-up menu and set); and
h. To move or resize the chart, chart title, legend, etc.
The following chart will be obtained after a few final touches:
The same procedure is used to create a line graph, except that the selected data range should cover ages 6 to 10 only and exclude the total (aged 6-10). Line charts are normally used to display trends over time or age. Including the total (aged 6-10) in the series would therefore mislead viewers.
On the other hand, it is useful to include a series for both sexes in the line graph, and the differences can be observed more clearly if the rate axis begins at 60% instead of 0%. A few such adjustments to the above line chart yield the following final one.
TIPS AND EXERCISES
3.1 Tips: Do and Don’t
i) Do… export to Excel from PASW Statistics with data in “labels” as much as possible, rather than exporting only numeric data values;
Don’t… import PASW Statistics datasets to Excel without having the codebook (the coding scheme used in creating the dataset) or the questionnaire with codes.
ii) Do… practice importing a PASW Statistics dataset to Office Excel 2007 and check the correctness of the database table in Excel by constructing frequency tables;
Don’t… edit the imported database before saving it, or leave the computer with unsaved files.
iii) Do… autofilter on one or more fields (variables) to extract data meeting certain criteria or to review invalid cases (data validation);
Don’t… forget to release the autofilter from fields that are no longer in use; otherwise the cases will be wrongly filtered.
iv) Do… practice using, and use, PivotTable and PivotChart as and where appropriate;
Don’t… try to edit a PivotTable or PivotChart too much, or undo several times; it may hamper the computer’s performance or cause it to hang.
v) Do… use the PivotTable technique to create frequency and crosstab tables, and check the outputs thoroughly;
Don’t… trust computer outputs blindly. Don’t use those tables and charts for presentation or dissemination before completing a thorough check.
3.2 Self-evaluation
Are you able to work with Microsoft Excel 2007 to:
a. import SPSS dataset: Very well / Somewhat well / Not so much / Almost None
b. select some rows (cases) using auto-filter: Very well / Somewhat well / Not so much / Almost None
c. create frequency table: Very well / Somewhat well / Not so much / Almost None
d. construct a two-way (crosstab) table: Very well / Somewhat well / Not so much / Almost None
e. develop a Pivot Table: Very well / Somewhat well / Not so much / Almost None
f. create a Pivot Chart: Very well / Somewhat well / Not so much / Almost None
Are you confident that you can export selected output tables from PASW Statistics to Microsoft Office Excel 2007? Confident / Somewhat confident / Not so much / Not at all
Are you confident that you can elaborate PASW Statistics output tables in Microsoft Office Excel 2007? Confident / Somewhat confident / Not so much / Not at all
3.3 Hands-on Exercises
1) Import the attached “BDPR50FL(Validate).sav” into Excel.
2) Using the dataset obtained in Exercise 1 above, validate the database table in Excel for various errors, and recommend, with reasons, whether the imported database is valid to use.
3) Import the attached “BDPR50FL1.sav” into Excel and extract the cases of “out-of-school children aged 6-10”.
4) Create a PivotTable and PivotCharts to present the “percentage of out-of-school children aged 6-15” by Division.
Published on Apr 23, 2010