Section 1

Sampling & Types of Data

Learning Outcomes

At the end of this session, you should be able to:

Understand the rationale for the use of statistical techniques

Discuss approaches to developing sampling frameworks and methodologies

Define key terms in the use of statistical techniques

Understand the difference between different data types

Present numerical data effectively in graphical and tabular form


Data Analysis for Research

Introduction to Statistics

Introduction to Statistical Terms and Sampling Frameworks

1.0

Introduction - Why use Statistics?

There is far more to research than measurement and analysis of quantifiable facts. A prime tool in the study of how people exist in their environment is the very fact of the investigator’s common humanity: you know a lot about what people do because you are also a human being. And human actions and responses are affected by memory, prejudice and emotions which cannot be adequately quantified.

Even so, there are innumerable instances of relevant, quantified facts in geographical investigations: most questionnaire results contain some quantitative element, even if it is only how many people said ‘yes’ and how many people said ‘no’; international comparisons can often make use of data from the World Health Organization, World Bank, UNICEF or the United Nations Development Programme, amongst others; within Britain data from the census or health authorities exist for a wide range of areal units; in physical geography the geology, soils, vegetation, elevation, aspect and so on can all be quantified. You should not ignore these data.

You may feel that it is only necessary to present such information, perhaps using a table or a graph, and sometimes that may be enough. On the other hand, statistics will enable you to go much further in the understanding of the patterns and relationships displayed. Furthermore, they will help ascertain the quality of the information that you are using. This last point is perhaps the most important part of using statistics: for instance, your pie chart showing that 75% of respondents preferred Bognor to Barbados as a holiday destination may look impressive, but statistics will soon reveal that any conclusions to be drawn from the answers given by only four people are limited, to say the least. If the statistics can’t test your hypotheses, the fault may be in your hypotheses, or more likely in your data collection, but it certainly isn’t in the statistics.
Before considering the statistical manipulation of data, it is necessary to consider how the data is collected for use.

1.1

Sampling

You often have to make do with what information there is (if you are interested in the cultivation of mangelwurzels and the nineteenth-century agricultural survey did not record them, there is not much you can do about it), but ideally in research you can go and collect the information yourself. In such a case you can ensure that the information you collect is as useful as possible. Sometimes you will be able to collect all the relevant information (the census population of each ward in the county, for instance), but in many cases you will need to collect a sample. For example, you would not practicably be able to find the opinions of all the people in a county, but using an appropriate sampling technique you could collect information from a smaller but representative sample of that population.

© Dr Andrew Clegg

Sampling Strategies

1.2

Some Terms in Sampling

A variable is a property which can vary and be measured, for example temperature.

An observation or variate is a particular measure.

A population is the complete set of counts or measurements derived from all objects possessing one or more common characteristics. This can be infinite, as in the case of elevations in the field.

A sample is a part of a population.

1.3

Avoiding Bias

An important question to ask yourself at the start of sampling is ‘What do I want my sampling to be representative of?’ An example of where this might be important is in studying the patterns of farming in a region. For simplicity and clarity, let us assume that each farm only cultivates one crop. Selecting points on a map will tend to choose the bigger farms because they occupy a larger area. On the other hand, selecting farms from a list will tend to choose the smaller farms, because there are likely to be more of them within the same area. Therefore the first method will give a representative sample of the land use, the second of the farms. What can cause problems is using the first to find out about farms, or the second about land use.
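The contrast between the two methods can be made concrete with a small calculation. The sketch below uses three hypothetical farms (the names and areas are invented for illustration) and compares each farm's chance of selection when a random point on the map is chosen with its chance when a name is drawn from a list.

```python
# Hypothetical farms and their areas in hectares (illustrative values only).
farm_areas = {"Farm A": 400, "Farm B": 50, "Farm C": 50}

total_area = sum(farm_areas.values())

# Random point on the map: the chance of selection is proportional to area.
p_map_point = {name: area / total_area for name, area in farm_areas.items()}

# Name drawn from a list: every farm has the same chance of selection.
p_list = {name: 1 / len(farm_areas) for name in farm_areas}

for name in farm_areas:
    print(f"{name}: map point {p_map_point[name]:.2f}, list {p_list[name]:.2f}")
```

Under these assumed figures the one large farm dominates the map-based sample, while the list-based sample treats all three farms alike, which is why each method is representative of a different thing.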

1.4

Deciding on the Choice of Sampling Techniques

Before you start sampling, you need to consider whether a convenient sampling frame exists. An example of a sampling frame may be a list of names on an electoral register or a membership directory of a particular organisation. Even when sampling frames do exist, they are often incomplete or out of date. The integrity of the data set will therefore influence your choice of sampling technique. However, it is often possible to construct your own sampling framework, although this could be costly and time-consuming. For example, if investigating the distribution of farm shops in West Sussex, you could use the farm shops listed in the Yellow Pages as a provisional framework and then supplement this with fieldwork to check for any farm shops not listed there.

For an area it may be necessary to create a grid with x and y axes, so that the whole area under investigation can be referred to using co-ordinates, like grid references. In this instance, you need to achieve a balance between having too few cells to give precise or even usable results (remember that a co-ordinate reference refers to an area rather than a point) and having so many that the sampling process becomes too time-consuming. Such decisions must be made with specific reference to the particular investigation and the time and resources at your disposal. Indeed, when designing a sampling strategy for a research project it is important to ask yourself whether you can afford the time and money to carry out the sample collection.

When deciding on the sampling technique, you also need to decide on the size of the sample. As a general guideline, the larger the sample, the more confident we can be that the statistics derived from it will be similar to the population parameters. However, a large sample drawn from a poorly designed sampling frame may contain less information than a smaller but more carefully designed sample.

1.4.1

Random Sampling

The word random in this context does not mean haphazard. It refers to a definite method of selection aimed at eliminating bias as far as possible. A random sampling method should satisfy two important criteria: a) every individual must have an equal chance of inclusion in the sample throughout the sampling procedure; and b) the selection of a particular individual should not affect the chance of selection of any other individual. To put these criteria in more formal probability terms: the probabilities of inclusion in the sample must be equal and independent of each other. So, if the aim is to pick a random sample of 50 households from a population of 200, every household should have the same 50/200 or 0.25 probability of selection.

The simplest example of pure random sampling is a raffle or lottery. Thus to take a random sample of the population of the UK, the name of each resident would have to be written on a piece of paper, all the pieces of paper put in a giant drum and a random selection made: obviously not a practical method. More usually it is numbers, not names, which are used and, instead of picking these numbers out of a hat, a computer can be programmed to generate random number sequences. Alternatively tables of random numbers can be used. Computers use the last digits of their internal clock to ‘seed’ their random numbers (otherwise they would just keep repeating the same sequence), and similarly when using random number tables it is worthwhile picking a point somewhere in the table ‘at random’ and then sometimes reading up, or to the left, rather than always from left to right.
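As a minimal sketch of the household example, the Python snippet below draws 50 households from a numbered population of 200 using the standard library's random module. The household numbering and the seed value are assumptions made for illustration; a fixed seed is used only so that the draw can be repeated.

```python
import random

# Hypothetical population: 200 households numbered 1 to 200.
households = list(range(1, 201))

# A fixed seed makes the draw repeatable; leave it out to let Python
# seed the generator from a system source instead.
rng = random.Random(42)

# sample() draws without replacement, so no household is chosen twice.
sample = rng.sample(households, 50)

# Each household's probability of inclusion is 50/200.
inclusion_probability = len(sample) / len(households)
print(sorted(sample)[:5], inclusion_probability)
```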

1.4.2

Systematic Sampling

Systematic sampling is, as its name suggests, sampling according to a regular system. This involves choosing the first item at random and then selecting every nth item, where n is determined by the size of the sample required. For example, if a sample of 50 items is required from a population of 500 items, every 10th one would be selected. Provided that there are no characteristics in the population which recur every 10th item, the sample will be unbiased; indeed this may be thought of simply as a short-cut method of producing a random sample, since the population does not need to be numbered.
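The procedure can be sketched in a few lines of Python. The function name systematic_sample and the numbered population are hypothetical, chosen to mirror the 50-from-500 example; the interval is simply the population size divided by the sample size.

```python
import random

def systematic_sample(items, sample_size, rng):
    """Pick a random start, then take every k-th item after it."""
    k = len(items) // sample_size      # sampling interval (10 for 50 from 500)
    start = rng.randrange(k)           # random start within the first interval
    return items[start::k][:sample_size]

# Hypothetical population of 500 items numbered 1 to 500.
population = list(range(1, 501))
sample = systematic_sample(population, 50, random.Random(1))
print(sample[:5])
```

Only the starting point is random here; every later choice is fixed by the interval, which is why a pattern in the population that repeats at the same interval would bias the result.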

1.4.3

Stratified Sampling

It is possible, in some instances, to improve on simple random sampling by stratification of the population. This is particularly true where the population is heterogeneous (i.e. made up of dissimilar groups) and the population can be stratified into homogeneous (i.e. similar) classes. These classes should define mutually exclusive categories. For example, suppose a bakery makes three different types of loaf: large, small and cottage. If a simple random sample were taken of the daily output, it would be possible, although unlikely, for it to include only one type of loaf. Stratification of the population before sampling can prevent this and, if carried out as described below, can produce a sample which is truly representative of the population.

Assume that the bakery’s output is 50% large, 40% small and 10% cottage loaves. The different loaves divide the population into three strata. Now if a sample of 50 loaves is required it should contain 25 large, 20 small and 5 cottage, thus ensuring that the proportions of each type of loaf in the population are reflected in the sample. Within these constraints, however, selection should be made on a random basis.
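The bakery example can be sketched as follows. The daily output of 1,000 loaves and the loaf labels are assumptions made for illustration; only the 50/40/10 split and the sample size of 50 come from the text. Each stratum's share of the sample is proportional to its share of the population, and selection within a stratum is random.

```python
import random

def stratified_sample(strata, sample_size, rng):
    """Draw from each stratum in proportion to its share of the population."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding can leave the overall total slightly off when the shares
        # do not divide evenly; here they do.
        quota = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, quota)
    return sample

# Hypothetical daily output of 1,000 loaves: 50% large, 40% small, 10% cottage.
strata = {
    "large": [f"large-{i}" for i in range(500)],
    "small": [f"small-{i}" for i in range(400)],
    "cottage": [f"cottage-{i}" for i in range(100)],
}
sample = stratified_sample(strata, 50, random.Random(7))
counts = {name: len(chosen) for name, chosen in sample.items()}
print(counts)
```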

1.4.4

Multi-Stage Sampling

Surveys covering the whole UK are frequently required but, as you can imagine, simple random sampling or even stratified sampling will not give an easy solution. Where the population is very spread out, particularly geographically, simple random sampling will result in a dispersed sample, leading to a considerable amount of travelling and time. Consequently some method is needed to narrow the field down to a smaller area, with the resultant cost savings. Multi-stage sampling attempts to do this without adversely affecting the ‘randomness’ of the result.

The first step is to divide the population into manageable, convenient groups or areas, such as counties or local authority regions. Indeed, stratification of areas such as counties or local authorities by principal geographical regions is often introduced in order to minimise geographical bias (Clark et al, 1998, p. 84). A number of areas are then selected at random. If the number of areas selected is still too large or dispersed, then these areas can be broken down further to reduce the sample size to more manageable proportions. For example, having chosen a random sample of local authorities, each one may itself be divided into political wards or streets or households. Finally a simple random or systematic sample will be chosen.
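A minimal sketch of the stages is given below. The frame (10 local authorities, each with 5 wards of 40 households) and the function name multi_stage_sample are invented for illustration; a real survey would use actual administrative lists and might stratify the areas first.

```python
import random

def multi_stage_sample(areas, n_areas, n_wards, n_households, rng):
    """Randomly choose areas, then wards within them, then households."""
    chosen = []
    for area in rng.sample(sorted(areas), n_areas):
        wards = areas[area]
        for ward in rng.sample(sorted(wards), n_wards):
            for household in rng.sample(wards[ward], n_households):
                chosen.append((area, ward, household))
    return chosen

# Hypothetical frame: 10 local authorities, each with 5 wards of 40 households.
areas = {
    f"authority-{a}": {f"ward-{w}": list(range(40)) for w in range(5)}
    for a in range(10)
}
sample = multi_stage_sample(areas, n_areas=3, n_wards=2, n_households=5,
                            rng=random.Random(3))
print(len(sample), sample[0])
```

Fieldwork is then confined to three authorities and six wards rather than being spread across all fifty wards.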

1.4.5

Cluster Sampling

Cluster sampling can often be confused with multi-stage sampling as the first step appears identical. The important difference is that cluster sampling is used when the population has not been listed and it is the only way to obtain a sample. As an example, suppose that a survey is to be done on the proportion of elm trees attacked by Dutch elm disease in the UK. Obviously there is no list of the complete population of elm trees. Neither would it be possible to try to cover the whole population. To use cluster sampling in this case, the population could be divided into small ‘clusters’ by drawing a grid over the map of the country and choosing, at random, a few of these clusters for observation, each cluster being a small area. The investigators will then be asked to find as many elm trees as possible within each chosen area and note how many of them are diseased.
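The elm tree example can be sketched as below. The 5 by 5 grid, the 20 trees per cell and the disease pattern are entirely made up so that the snippet is self-contained; in practice the trees would only be discovered by fieldwork in the chosen cells.

```python
import random

def cluster_sample_proportion(clusters, n_clusters, rng):
    """Choose grid cells at random, record every tree in them, return the share diseased."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    trees = [tree for cell in chosen for tree in clusters[cell]]
    return sum(trees) / len(trees)

# Hypothetical grid: each cell holds 20 trees, True meaning diseased.
# The pattern is constructed so that a quarter of the trees in every cell are diseased.
clusters = {
    (x, y): [(x + y + i) % 4 == 0 for i in range(20)]
    for x in range(5)
    for y in range(5)
}
proportion = cluster_sample_proportion(clusters, 4, random.Random(5))
print(proportion)
```

Unlike the multi-stage approach, every element found in a chosen cluster is recorded; no further sampling takes place inside the cell.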

1.4.6

Non-Random Sampling

The previous paragraphs have been concerned with methods of random sampling, basically simple random sampling with several variations and refinements. The methods discussed in the previous sections share a number of key elements. These include: a) the chances of obtaining an unrepresentative sample are small; b) this chance decreases as the size of the sample increases; c) this chance can be calculated; and d) the sampling error can be measured and therefore the results can be interpreted.

Unfortunately occasions often arise when the selection of a random sample is not feasible. This may be because:

It would be too costly;

It would take too long; or

Not all the items in the population are known.

For these reasons the following research methods of non-random sampling are used, particularly in the field of market research.

1.4.6.1

Judgement Sampling

In this case an expert, or a team of experts, use their personal judgement to select what, in their opinion, is a truly representative sample. It certainly cannot be called a random sample as it involves human judgement, which could involve bias. On the other hand, the sampling process does not require any numbering of the population or random number tables. It can be done more quickly and economically than random sampling and, if carried out sensibly, can produce very good results. For example, in an interview situation, a researcher may pick individuals because of the nature of the response they are likely to give, and the responses the researcher is looking for.

1.4.6.2

Quota Sampling

This is the method most often used in market research where the data is collected by enumerators armed with questionnaires. To avoid the expense of having to ‘track down’ specific people chosen by random sampling methods, the enumerators are given a quota of, say, 400 people, and are told to interview all the people they can until their quota has been met. Such a quota is nearly always divided up into different types of people with sub-quotas for each type. For example, out of a quota of 400, the enumerator may be told to interview 250 working wives, 100 non-working wives and 50 unmarried women, and within each of these three classes to have 50% who smoke and 50% non-smokers. Using this technique, the researcher chooses which individuals within each category are included in the sample, and can therefore introduce an element of bias.

The main advantage of this method is that, if a respondent refuses to answer the questions for any reason, the interviewer will just look for another person in the same category. With true random sampling, once a sample item has been decided upon, it must be used. Any substitution results in a non-random sample.
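The quota in this example can be broken into its six sub-quota cells with a short calculation. The sketch below is illustrative only: the function names and the respondent record layout (a dictionary with "group" and "smoking" keys) are assumptions, and respondents are simply accepted in the order they are met until each cell is full.

```python
def quota_cells(groups, smoker_share=0.5):
    """Split each group's quota into smoker and non-smoker sub-quotas."""
    cells = {}
    for group, size in groups.items():
        smokers = round(size * smoker_share)
        cells[(group, "smoker")] = smokers
        cells[(group, "non-smoker")] = size - smokers
    return cells

def fill_quotas(respondents, cells):
    """Accept respondents in arrival order until every cell is full."""
    remaining = dict(cells)
    accepted = []
    for person in respondents:
        key = (person["group"], person["smoking"])
        if remaining.get(key, 0) > 0:
            remaining[key] -= 1
            accepted.append(person)
        if not any(remaining.values()):
            break
    return accepted, remaining

groups = {"working wives": 250, "non-working wives": 100, "unmarried women": 50}
cells = quota_cells(groups)
for cell, size in cells.items():
    print(cell, size)
```

Because fill_quotas takes whoever arrives first, nothing in it is random, which is exactly the weakness of the method described above.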

1.4.6.3

Convenience Sampling

As the name implies, the most important factor here is the ease of selecting the sample. No effort is made to introduce any element of randomness. An example of this is the quality controller who takes the first 20 items off the production line as his sample, a dangerous procedure as any fault occurring after this could remain unnoticed until the next sample is taken (maybe an hour later). For most purposes, this sampling method is simply not good enough but for some pilot surveys the savings in cost, time and effort outweigh the disadvantages. The aim of a pilot survey could be to establish the most satisfactory form of questionnaire to be used in the actual survey. Since the actual results would not be used it does not matter that the sample was not selected at random.

1.5

Summary

Sampling serves two purposes. One is the saving of time and effort in the collection of information. The second is the collection of information so that inferences and comparisons can be drawn using statistics. Although a simple subject, it is fundamental to much research, and needs to be done with care. Table 1 provides a summary of the key sampling methods that have been discussed.

Table 1: Sampling Methods

Judgemental
Description: Sampling elements are selected based on the interviewer’s experience that are likely to produce the required results.
Example: Several houses for sale in Belfast, perhaps with families known to the interviewer, are chosen subjectively.

Quota
Description: Sampling elements are selected subject to a predefined quota control.
Example: The quota is the first 30 homeowners selling their houses in Belfast who are also making an intra-urban move, and are aged between 20-40 years.

Systematic (first unit selected at random)
Description: Sampling elements in the sampling frame are numbered. First sampling unit is selected using random number tables. All other units are selected systematically k units away from the previous unit.
Example: Sampling frame of 600 homeowners selling their houses in Belfast. These houses for sale are ordered and numbered. A random number is selected for a start point, from which every tenth property is selected for inclusion in the sample.

Simple random
Description: Sample size of n elements selected from a sampling frame without replacement, such that every possible member of the population has an equal chance of being selected.
Example: All 600 houses for sale in the sampling frame are numbered 1-600. A sample of 30 units is selected using a random number table, excluding those numbers outside the range 1-600.

Stratified random
Description: Sampling frame divided into subgroups (strata) which are then each sampled using the simple random method.
Example: All 600 houses for sale come from lists provided by six estate agents. These are each randomly sampled for houses to include in the sample.

Multi-stage random
Description: Sampling frame divided into hierarchical levels (stages). Each level is sampled using a simple random method which selects the elements to be included at the next level.
Example: All 600 houses for sale are distributed to enumeration districts within several wards. A random sample of these wards is selected and of these random samples of both enumeration districts and finally houses for sale are selected.

Clustered random
Description: Sampling frame divided into hierarchical levels (stages). Levels are selected using random sampling similar to the multi-stage random method. However, all elements are selected at the final stage.
Example: Similar to the above method, except that all the houses for sale in a given enumeration district are selected.

[Source: KITCHIN, R. AND TATE, N. (2000): Conducting Research into Human Geography, Prentice Hall, London, p. 55.]

1.6

Types of Data

Normally when we think of data quality, we think about reliability or accuracy. In statistics, data have quality in terms of what they represent and how they can be manipulated. The four levels of measurement are: nominal/categorical, ordinal, interval and ratio. Each measurement is outlined below:

An ordinal variable can be ranked in order from highest to lowest, for example a league table. Alternatively, a questionnaire survey may ask respondents to rank satisfaction levels on a scale from ‘Strongly Agree’ to ‘Strongly Disagree’. Ordinal variables do not allow comparable measurements, for example ‘Strongly Agree’ is not worth double ‘Slightly Agree’.

Interval and Ratio variables are concerned with quantitative data. Interval variables are in the form of a scale which possesses a fixed but arbitrary interval and arbitrary origin. Addition or multiplication by a constant will not alter the interval nature of the observations (e.g. 10°C, 20°C, 30°C, 40°C). A ratio measurement is also made on a scale with an arbitrary interval, as with interval data, but the scale has a true zero origin. In these cases, where we are using numbers as we normally think of them, one value can be twice the size of another. For example, income is a ratio variable as a person can have no income. Ratio measurement commonly applies to metric quantities such as distance and mass, which possess a zero origin. [When importing data into SPSS, and using the Variable window, Interval and Ratio data are classed as Scale - see Descriptive section in this handbook].

Categorical or nominal variables are the lowest level and are variables where numerical values have been assigned to separate categories, often viewed as unique from one another. For example, gender (male/female), hair colour (blonde, brown, ginger, grey), or direction (north, east, south, west).

It is important to remember that data can only be converted from higher to lower quality, and data can only be treated ‘at their own level’. For instance, the numbers ‘1,2,3,4’ could be heights in metres (ratio), temperatures in degrees C (interval), the order of countries achieving Rostow’s ‘take off’ (ordinal) or the answer to ‘what is your favourite number?’ (nominal): they must not be treated at a higher level than their meaning. As Mulberg (2002) points out, ‘the thing to ask is if it makes sense to talk about one case being double another, or if there is a highest and a lowest’ (see Figure 1).

It is also important to understand the different types of data or variables, as this will influence the kind of statistical analysis that is possible. The levels of measurement are summarised in Table 2. In order to use parametric and non-parametric tests successfully later in the module, it is imperative that you understand the characteristics and differences between types of data. Please read through these notes carefully, and learn the different data types.
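The one-way nature of this conversion can be illustrated with a short sketch. The names and heights below are hypothetical; the snippet turns ratio-level heights into ordinal ranks, after which the original distances between the values can no longer be recovered.

```python
# Hypothetical heights in metres (ratio level).
heights = {"Ann": 1.62, "Bob": 1.80, "Cat": 1.71}

# Converting to ranks keeps the order but discards the distances (ordinal level).
ordered = sorted(heights, key=heights.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

print(ranks)
```

Knowing only that Bob is ranked first and Cat second says nothing about how much taller Bob is, so the ranks cannot be treated as if they were heights.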

Figure 1: Judging Levels of Measurement

Start: Does it make sense to talk about one number being double another?

Yes: Ratio Level

No: Does it make sense to talk about one number being higher or lower than another?

Yes: Ordinal Level

No: Nominal Level

[Source: Mulberg, 2002, p. 8]
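The two questions in Figure 1 can be written as a small decision function. This is only a sketch of the figure as reproduced here: the function name is hypothetical, and because the figure does not separate out interval data, neither does the code.

```python
def level_of_measurement(doubling_makes_sense, ordering_makes_sense):
    """Follow the two questions in Figure 1 to a level of measurement."""
    if doubling_makes_sense:
        return "ratio"
    if ordering_makes_sense:
        return "ordinal"
    return "nominal"

# Income: one income can be double another.
print(level_of_measurement(True, True))
# Satisfaction scale: can be ordered, but 'double' has no meaning.
print(level_of_measurement(False, True))
# Hair colour: neither doubling nor ordering makes sense.
print(level_of_measurement(False, False))
```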

Additional terms that you will encounter include: 

A discrete variable is a variable whose numerical values vary in steps or where the values are integer numbers. Normally such variables are associated with counts, for example you may count the number of firms, products or employees when conducting a survey. Discrete variables do not allow for decimal places.

A continuous variable is a variable which assumes a value that can be denoted on a continuous scale. Examples include weights, heights and age. In reality, continuous variables relate to specific values that lie at a point on a continuum. For example, a person’s age could be recorded in discrete form as being so many years, but in reality their age can be placed at a point on a continuum which reflects not only the number of years but also the number of days, minutes and seconds which have passed since the moment of their birth (Clark et al, 1998). Continuous variables allow for decimal places. Continuous variables can sometimes be described as demonstrating certain statistical properties that allow them to be used in parametric statistical tests. However, some continuous variables do not show these particular properties, and when this happens, the variables are thought suitable to be used in non-parametric tests (Mulberg, 2002).

Variables can also be classed as ‘dependent’ or ‘independent’. A dependent variable refers to a variable which is identified as having a relationship with, or dependence on, the value of one or more independent variables. For example, levels of car ownership may be directly dependent on a number of independent variables including average household income, age and the number of persons in the household.

Table 2: Data Quality

Nominal or Categorical
Description: Data assigned to discrete categories, in no natural order.
Examples: Clay, sandstone, granite; lifestyle groups, singles, retired.

Ordinal
Description: The categories associated with a variable can be rank-ordered. Objects can be ordered in terms of a criterion from highest to lowest.
Examples: Cities in order of population size; opinions regarding service or product quality.

Interval
Description: With ‘true’ interval variables, categories associated with a variable can be rank-ordered, as with an ordinal variable, but the distances between categories are equal; categories have no absolute zero point. Variables which strictly speaking are ordinal but which have a large number of categories, such as multiple-item questionnaire measures, are assumed to have similar properties to ‘true’ interval variables.
Examples: Temperature in degrees Celsius or Fahrenheit; goal difference.

Ratio
Description: Data with meaningful intervals and a true zero.
Examples: Age, distance.

When attempting to remember types of data use the abbreviation NOIR (nominal, ordinal, interval, ratio).

When using variables in statistical analysis, a further distinction is also drawn between descriptive and inferential statistics. Descriptive statistics relate to the sample created by the research or study process: they are the methods and techniques used to describe and summarise data. Measures of central tendency (mode, median, mean) are the most basic descriptive statistics, to which we can also add basic measures of dispersion including the maximum, minimum and range of values.
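As a brief illustration, the measures named above can be computed with Python's standard statistics module. The data set (numbers of employees at nine firms) is hypothetical and chosen only to keep the arithmetic easy to check by hand.

```python
import statistics

# Hypothetical sample: number of employees recorded at nine firms.
employees = [4, 7, 7, 9, 12, 15, 7, 20, 9]

summary = {
    "mode": statistics.mode(employees),      # most frequent value
    "median": statistics.median(employees),  # middle value when sorted
    "mean": statistics.mean(employees),      # arithmetic average
    "minimum": min(employees),
    "maximum": max(employees),
    "range": max(employees) - min(employees),
}

for measure, value in summary.items():
    print(f"{measure}: {value}")
```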

Inferential statistics refer to those techniques which are adopted to draw conclusions about the population to which the sample belongs and which enable inferences about the characteristics that might be expected in other samples as yet to be selected from that same population. Inferential statistics give greater analytical power and bring into play probability theory and other statistical tests and measures that will be discussed later in this handbook.

However, as Lindsay (1997) points out, the use of inferential statistics carries greater responsibility and as such any user must be aware of the following guidelines:

Sampling must be independent. This means that the data generation method should give every observation in the population an equal chance of selection, and the choice of any one case should not affect the selection or value of any other case;

The statistical test chosen should be fit for its purpose and appropriate for the type of data selected;

The user must interpret the results of the exercise properly. The numerical outcome of a statistical test is the result of an exercise.

1.7

Presenting Data

Presenting numerical data accurately is an important element of essays, reports, presentations and posters. The aim of the following section is to provide a few basic guidelines on how to incorporate graphs and tables effectively, and at the same time creatively, into your work.

1.7.1

Using Graphs and Charts

Computer spreadsheets such as Excel now allow you to produce a range of graphs and charts (bar charts, column charts, pie charts, graphs) quickly and easily. As such, graphs can be used effectively to enhance the quality of reports, essays, posters and presentations. Carefully thought-out graphs can bring to life data from tables and allow comparisons to be made quickly. However, poorly designed graphs can easily fail and weaken a piece of work. It is very common for students to rush in and produce a whole plethora of charts and graphs without giving much thought to the data set they are using or what type of output would be most appropriate. Therefore it is important to take your time and give careful consideration to what you actually want to achieve. First, ask yourself the following questions:

Is a graph or chart necessary?

Students often use diagrams as a means of ‘padding out’ work and as a result graphs not referred to in the text become ‘window-dressing’. Therefore carefully consider whether the graph is actually needed - ask yourself whether the graph helps the reader understand a particular point or aspect of the data. If it does, fine - but make sure that it is integrated and referred to fully in your discussion. If not, provide a simple verbal description.

What is the purpose/objective/outcome?

Are you producing a graph for an essay/report, poster or presentation? While the basic guidelines and formatting options are generic, you need to consider the overall purpose and intended audience. For example, graphs produced for a poster will be different to those produced for inclusion in an essay or a PowerPoint presentation. Carefully consider the importance of visual impact and clarity, and the type of media you are using.

What is the nature of the data set you are using?

Graphs often fail because an incorrect chart type has been used or the graph is too complicated.
Therefore before you start carefully consider the actual nature of the data set you are using. Above all you need to distinguish between ‘continuous’ data and ‘discrete’ quantities. A continuous quantity is that which can be chosen to any degree
of precision. Examples of continuous quantities include mass (kg), length (m), and time (s). Discrete quantities in contrast can only be expressed as integers (whole numbers), for example: 3 computers, 5 cars, 4 houses. In trying to decide if something is continuous or discrete, decide whether it is like a stream (continuous) or like people (discrete). Continuous variables are usually plotted on a graph as this demonstrates the existence of a causal relationship between the data points, whereas discrete data series are plotted as bar charts or histograms. In addition to the nature of the data set, also consider whether you are referring to absolute values or percentage distributions. This will have a significant influence on the chart type that you use.

Second, how complicated is the data set? Is it best represented as a graph or a table? Can the data be manipulated to make it easier to use, for example by reformatting columns or excluding columns? Be prepared to modify the data set if necessary. However, make sure that when you do this you do not alter the accuracy or the representativeness of the data set you are using. The following graphs highlight the issue of using appropriate chart types.

Figure 2: Car Sales for Rover, BMW, and Jaguar 1995-2000

[Source: Believe, M., 2001]

In Figure 2, car sales for leading manufacturers have been plotted for a 5-year time period. In this instance we are dealing with discrete data (as you cannot sell half a car!). However, the data has been plotted as a line graph - is this correct? The answer is YES as there is a logical year to year link and the ‘joining the dots’ technique illustrates the causal relationship between the x-axis variables. This data could have also been presented as a column chart. Compare this to Figure 3.

Figure 3:

Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex

[Source: Believe, M., 2001]

Figure 3 highlights the attitudes of residents to new housing development in West Sussex. Is this graph the most effective form of presentation? The answer is NO. In this instance joining the dots is not appropriate as there is no causal relationship between the x-axis variables. A column chart would have been more effective - see Figure 4.

Figure 4:

Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex

[Source: Believe, M., 2001] Š Dr Andrew Clegg

While Figure 4 is a definite improvement, is there any way of presenting the data more effectively so that it really highlights the differences in resident opinions between the different areas? The answer is YES. So far we have graphed the absolute values relating to resident opinions. If we were to change this to a percentage distribution we could present the data as a bar chart - see Figure 5.

Figure 5: Resident Opinions to the Development of New Housing in Greenfield Sites in West Sussex

[Source: Believe, M., 2001]

As you can see in Figure 5, utilising the percentage distribution really succeeds in highlighting the differences in residents' opinions.

Let us consider a further example. Figure 6 illustrates the mean monthly temperature and rainfall totals for Edinburgh. Is the graph appropriate? Again the answer is YES as there is a logical month-to-month link and the ‘joining the dots’ technique illustrates the causal relationship between the x-axis variables. However, although this graph allows us to compare monthly temperature and rainfall totals, the high values for temperature have masked the values for rainfall and a degree of accuracy has been lost. To overcome this we can change the type of the graph and plot temperature and rainfall on separate axes - see Figure 7.
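Returning to the percentage distribution used for Figure 5: the conversion from absolute values is simple arithmetic. Divide each category count by the total for its area and multiply by 100. The short Python sketch below illustrates the idea. The counts are invented purely for illustration (they are not the data behind Figures 3 to 5), and the function name `percentage_distribution` is my own.

```python
def percentage_distribution(counts):
    """Convert absolute counts per category into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute percentages for an empty sample")
    return {category: 100 * n / total for category, n in counts.items()}


# Hypothetical survey counts for two areas with very different sample sizes.
# These numbers are invented for illustration only.
responses = {
    "Area A": {"In favour": 30, "Against": 90, "Undecided": 30},
    "Area B": {"In favour": 10, "Against": 8, "Undecided": 2},
}

for area, counts in responses.items():
    shares = percentage_distribution(counts)
    summary = ", ".join(f"{opinion}: {share:.0f}%" for opinion, share in shares.items())
    print(f"{area} (n={sum(counts.values())}): {summary}")
```

With these made-up counts Area A has far more respondents than Area B, so a chart of absolute values would be dominated by Area A. The percentages put both areas on the same 0-100 scale, which is why the differences in opinion become easier to see.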

Figure 6: Mean Monthly Temperature (°C) and Rainfall (mm) for Edinburgh

[Source: Bartholomew, 1987]

Figure 7: Mean Monthly Temperature (°C) and Rainfall (mm) for Edinburgh

[Source: Bartholomew, 1987]

So far our discussion has concentrated on the use of line graphs, column and bar charts. Another type of chart frequently used is the pie chart. The overall total number of cases represented by the pie chart should equal the sample size, or aggregate to 100% where segments denote proportional frequencies (Riley et al., 1998, p. 172). Let us consider some specific examples.

Figure 8: The Distribution of Serviced Establishments in Torbay by Size

[Source: Clegg, 1997]

Figure 8 refers to the percentage distribution of serviced establishments in Torbay by size. When using pie charts it is important to remember that pie charts can only graph the percentage distribution of one specific variable and cannot be used to analyse time series data. For example, we could not use a pie chart to illustrate the car sales for Rover, BMW and Jaguar referred to in Figure 2. However, we could use a pie chart to analyse the market share of car sales for a specific year (see Figure 9).

Figure 9: Market Share of Car Sales for Rover, BMW and Jaguar in 1995

[Source: Believe, M., 2001]

By drawing and then combining two or more pie charts we could then compare market share for different years (see Figure 10).

Figure 10: Market Share of Car Sales for Rover, BMW and Jaguar in 1995 and 1999

1995: Rover 27%, Jaguar 41%, BMW 32%
1999: Jaguar 32%, Rover 41%, BMW 27%

[Source: Believe, M., 2001]

Programs such as Excel will only allow you to draw one pie chart at a time. However, once drawn, you can arrange a number of pie charts on a worksheet and print them out. Alternatively, you can cut and paste Excel charts into Word or Publisher.

Clearly, using the most appropriate type of graph is very important to ensure that the data is presented accurately. In addition to the type of chart it is also important to ensure that the graph is presented effectively.
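The two pie chart rules discussed above (the segments should aggregate to 100%, and each pie shows one variable for one year) can be checked with a few lines of Python. This is only a sketch: the percentages are those printed in Figure 10, while the function names and the rounding tolerance of one percentage point are my own assumptions.

```python
# Market shares (%) as printed in Figure 10.
share_1995 = {"Rover": 27, "BMW": 32, "Jaguar": 41}
share_1999 = {"Rover": 41, "BMW": 27, "Jaguar": 32}


def check_pie_total(shares, tolerance=1):
    """Return True if the segments add up to 100%, allowing for rounding.

    The tolerance of 1 percentage point is an assumed allowance for rounding.
    """
    return abs(sum(shares.values()) - 100) <= tolerance


def change_in_share(before, after):
    """Change in each segment between two pie charts, in percentage points."""
    return {name: after[name] - before[name] for name in before}


for year, shares in (("1995", share_1995), ("1999", share_1999)):
    print(year, "segments total", sum(shares.values()), "- valid:", check_pie_total(shares))

for name, points in change_in_share(share_1995, share_1999).items():
    print(f"{name}: {points:+d} percentage points")
```

Comparing the two years in this way gives the change in each manufacturer's share in percentage points, which is the comparison the reader is being asked to make by eye when two pie charts are placed side by side.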

1.7.2

Producing Graphs

When producing graphs a number of basic rules and guidelines need to be considered. These are:

Is the graph completely self-explanatory?

Is the graph clearly titled, labelled and sourced?

The axes should be labelled, and clear indication given as to the scales being used, and the numerical quantities being referred to;

All dates and time periods should be explicitly stated in the title, and on the appropriate axis;

In titles do not write ‘A Graph Showing....’. This is obvious - instead refer to the specific content of the graph (see examples given in this section);

The source of the data should be included, especially if they are drawn from published material.

Are elements of the graph distinguishable?

When using charts it is important that the different data series are clearly distinguishable otherwise the graph will be meaningless;

Consider carefully the number of data series you intend to graph. Too much data will over complicate a graph and reduce its impact;

When using pie charts it is recommended that the number of segments should not be too large. Too many segments make charts confusing and difficult to read;

If charts are to be included in a black and white report, avoid shadings that involve colours as the distinctions will be lost. Try to keep the use of colours to a minimum: use one colour in different shades;

Ensure that each segment of the pie chart is clearly labelled and that the percentage values have been added to indicate quickly which are the principal groups and by how much;

Avoid repetition; if labels and percentage values have been added to a pie chart there is no need to include the legend.
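The guidelines above lend themselves to a simple checklist. As a rough sketch only, the Python function below reviews a chart described as a plain dictionary. The dictionary keys, the function name and the limit of six pie segments are all my own assumptions (the guidelines do not give a figure for 'too large'), and the axis labels in the example are hypothetical.

```python
def review_chart(spec, max_pie_segments=6):
    """Return a list of warnings for a chart described by a plain dictionary.

    max_pie_segments is an assumed threshold, not a rule from the guidelines.
    """
    problems = []
    title = spec.get("title", "").strip()
    if not title:
        problems.append("missing title")
    elif title.lower().startswith(("a graph showing", "graph showing")):
        problems.append("title should describe the content of the graph")
    if spec.get("type") == "pie":
        if len(spec.get("segments", [])) > max_pie_segments:
            problems.append("too many pie segments")
        if spec.get("segment_labels") and spec.get("legend"):
            problems.append("legend repeats the segment labels")
    else:
        # Charts with axes need both axes labelled.
        for axis in ("x_label", "y_label"):
            if not spec.get(axis):
                problems.append(f"missing {axis}")
    if not spec.get("source"):
        problems.append("missing source")
    return problems


draft = {"type": "line", "title": "A Graph Showing Car Sales",
         "x_label": "Year", "y_label": "", "source": ""}
final = {"type": "column",
         "title": "Resident Opinions to the Development of New Housing "
                  "in Greenfield Sites in West Sussex",
         "x_label": "Area", "y_label": "Number of residents",
         "source": "Believe, M., 2001"}

for name, spec in (("draft", draft), ("final", final)):
    print(name, "->", review_chart(spec) or "no problems found")
```

A checklist of this kind cannot judge whether a graph is genuinely self-explanatory, so treat it as a prompt for your own review and not as a substitute for it.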

1.7.3

2D or 3D Graph Formats

Excel and similar packages allow you to enhance the quality of graphs by making them 3D. However, the use of 3D formatting needs to be treated with caution. If you are producing graphs on A4 for a presentation 3D charts can work effectively. However, if you are preparing graphs for inclusion in an essay or report 3D charts may not be appropriate and you may be better off with a standard 2D version. There are no hard and fast rules on this issue and, ultimately, the type of chart produced and the type of formatting applied will depend on the nature of the data set used. Let me illustrate this by referring to examples included in this section. Below is Figure 4, showing resident attitudes to housing development in West Sussex. At the moment this is a standard 2D column chart. Let us convert it into a 3D chart.

[Figure 4 in 2D and 3D formats]

Do you think this chart is effective? It looks good but is not quite as easy to read as the standard chart. It is noticeable that in order to create a 3D chart Excel has to shrink the original chart. This is where problems lie, as in making the graph smaller the overall impact of the graph is diminished. Let us try another example. Below is Figure 8, which refers to the distribution of serviced accommodation in Torbay. As before, let us convert this into a 3D chart.

[Figure 8 in 2D and 3D formats]

In this instance the 3D chart is actually quite effective and has enhanced the standard 2D chart considerably. The basic rule seems to be that simple 2D charts can be converted into 3D charts quite effectively. However, the more detailed and complicated the standard chart the less effective it becomes when you make it 3D. Your best option is to experiment with different data sets and formatting options to find the most effective form of presentation.

1.7.4

Using Tables

In addition to charts, tables are also an effective way of presenting information. Again when producing tables a number of guidelines can be followed:

Consider the purpose of presenting the data as a table as there may be better ways of presenting it;

Avoid the temptation of just photocopying tables out of text books and sticking them into essays. Tables often contain information superfluous to the reader. Be prepared to modify data sets so that only relevant information is included in your table;

Make sure that tables are completely self-explanatory. Provide a table number and title for each table. If abbreviations are used when labelling then provide a key;

Make sure that the content of the table is fully referred to in the text - make sure that tables are not basically ‘window-dressing’;

Allow sufficient space when designing the table for all figures to be clearly written;

Make sure that the table/data is fully sourced.

Again let me illustrate with a number of examples.

Table 2: Visits Abroad by UK Residents 1994-1997

Number of Visits (000’s) by Area of Destination

Year                   Total   North America   Western Europe   Rest of World
1994                  39,630           2,927           32,375           4,328
1995                  41,345           3,120           33,821           4,404
1996                  42,050           3,584           33,566           4,900
1997                  45,957           3,594           37,060           5,303
% Change 1996/1997        +9               0              +10              +8

[Source: ETB, 1999]

Table 2 is an example of a table I created for the Arun Tourism Strategy document. Does the table meet the guidelines highlighted above? The answer is YES. The table is clear, well laid out, titled, sourced and self-explanatory. Shading has also been used to try and enhance the visual impact of the table.
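The final row of Table 2 can be reproduced from the 1996 and 1997 figures using percentage change = (new - old) / old x 100. A minimal Python sketch follows, using only the values printed in Table 2; the function name `percent_change` is my own.

```python
# Visits abroad by UK residents ('000), copied from Table 2.
visits = {
    "Total":          {1996: 42050, 1997: 45957},
    "North America":  {1996: 3584,  1997: 3594},
    "Western Europe": {1996: 33566, 1997: 37060},
    "Rest of World":  {1996: 4900,  1997: 5303},
}


def percent_change(old, new):
    """Percentage change from the old value to the new value."""
    return (new - old) / old * 100


# Round to the nearest whole per cent, as in the table.
changes = {area: round(percent_change(v[1996], v[1997])) for area, v in visits.items()}

for area, pct in changes.items():
    print(f"{area}: {pct}%")
```

Rounded to the nearest whole number the results are 9, 0, 10 and 8, which agree with the % Change 1996/1997 row. Recalculating a derived row like this is a quick way of checking that a table you have modified is still accurate.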

Now consider Table 3 which refers to regional tourism spending in England in 1997. Again this is a clear table that for the purposes of the tourism strategy had to contain a lot of detail. If you were using this table to illustrate patterns of regional spending it could be simplified to show the most obvious or important patterns. For example in Table 3 it is evident that tourism spending is highest in the West Country (24%) and lowest in Northumbria and Cumbria (3% each).

Table 3: The Regional Distribution of Tourism Spending in England, 1997

Destination               All  Holidays         Short         Long  Business     VFR
                      Tourism                Holidays     Holidays  and Work
                                         (1-3 nights)  (4+ nights)
England               £11,665    £7,725        £2,505       £5,215    £2,055  £1,415
                            %         %             %            %         %       %
Cumbria                     3         5             5            5         1       1
Northumbria                 3         3             3            3         3       5
North West England          9         8            11            6        12      10
Yorkshire                   8         8             7            8         9      10
Heart of England           11         9            14            7        15      16
East of England            13        14            11           15        14      12
London                      9         6            13            2        15      17
West Country               24        30            17           37        10      10
Southern                   11        10            10           11         3       9
South East England          9         8             9            7        10      12

[Source: ETB, 1998]

The table could therefore be easily modified to reinforce this message (see Table 4). Notice that in the amended Table 4, I have also changed the title so that the content of the new table becomes self-explanatory and reflects the actual purpose of the table. Table 3 could also have been modified by removing specific columns, thereby emphasising the patterns of spending in particular market areas.

Table 4: Selected Regional Differentials in the Distribution of Tourism Spending in England, 1997

Destination               All  Holidays         Short         Long  Business     VFR
                      Tourism                Holidays     Holidays  and Work
                                         (1-3 nights)  (4+ nights)
England               £11,665    £7,725        £2,505       £5,215    £2,055  £1,415
                            %         %             %            %         %       %
Northumbria                 3         3             3            3         3       5
East of England            13        14            11           15        14      12
West Country               24        30            17           37        10      10
South East England          9         8             9            7        10      12

[Source: ETB, 1998]
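Picking out the rows for a simplified table such as Table 4 can also be done programmatically. The sketch below uses the 'All Tourism' column of Table 3. The helper names are my own, and the choice of which regions to keep remains a judgement about the message you want the table to convey.

```python
# "All Tourism" share of spending (%) by destination, copied from Table 3.
all_tourism = {
    "Cumbria": 3, "Northumbria": 3, "North West England": 9, "Yorkshire": 8,
    "Heart of England": 11, "East of England": 13, "London": 9,
    "West Country": 24, "Southern": 11, "South East England": 9,
}


def regions_with(shares, pick):
    """Return every region whose share equals pick(values), e.g. max or min."""
    target = pick(shares.values())
    return [region for region, share in shares.items() if share == target]


def select_rows(shares, wanted):
    """Keep only the wanted regions, in the order given."""
    return {region: shares[region] for region in wanted}


highest = regions_with(all_tourism, max)
lowest = regions_with(all_tourism, min)
print("Highest share:", highest)
print("Lowest share:", lowest)

# The four regions retained in Table 4.
table_4_rows = select_rows(
    all_tourism,
    ["Northumbria", "East of England", "West Country", "South East England"],
)
for region, share in table_4_rows.items():
    print(f"{region}: {share}%")
```

Note that the lowest share is tied between two regions, which is exactly the kind of detail that is easy to lose when a table is simplified. Whenever you drop rows or columns, check that the message of the shortened table is still supported by the full data set.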


BML224 Statistics Workbook
