DESCRIPTIVE STUDIES

What is the Importance of Descriptive Statistics?

Imagine that you have a dataset containing the ages of a thousand people and need to make sense of it. Reading all thousand recorded data points is impractical, and even if you did, it would be hard to summarize the important trends in the data and reach a conclusion. Any interesting features in the data would remain hidden.

In this chapter, we describe some methods for organizing and presenting data so that we can more easily answer our questions of interest. Collectively, these methods are called Descriptive Statistics.

Central Tendency

A measure of Central Tendency is a summary statistic that identifies the center point of a dataset. Such measures indicate where most values in a distribution fall and are often referred to as the distribution's central position. In statistics, the mean, median, and mode are the three most common measures of central tendency. Each of these measures has its own uses, but the mean is the most important average in both descriptive and inferential statistics.

Measure: Mean

Definition: The total of numbers divided by how many numbers there are.

Example: In the following dataset: 11, 4, 11, 19, 11, 7, 8, 21, 7

The total: 11 + 4 + 11 + 19 + 11 + 7 + 8 + 21 + 7 = 99

There are 9 numbers

The mean = 99/9 = 11

Measure: Median

Definition: The number which is in the middle, or the middle value (50th percentile).

Example: In the following dataset: 18, 12, 4, 26, 14, 18, 5

Data arranged from minimum to maximum: 4, 5, 12, 14, 18, 18, 26

Median = 14

Measure: Mode

Definition: The number that appears the most.

Example: In the following dataset: 11, 4, 11, 19, 11, 7, 8, 21, 7

Mode = 11 (as it appears three times in the data)
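The three worked examples above can be checked with a few lines of Python. This is a minimal sketch using the standard library's statistics module; the variable names are illustrative and not part of the original text.

```python
import statistics

# Datasets taken from the worked examples above.
mean_data = [11, 4, 11, 19, 11, 7, 8, 21, 7]
median_data = [18, 12, 4, 26, 14, 18, 5]

mean_value = statistics.mean(mean_data)        # 99 / 9 = 11
median_value = statistics.median(median_data)  # middle of the sorted values = 14
mode_value = statistics.mode(mean_data)        # most frequent value = 11

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

Note that statistics.median sorts the data internally, so the values do not need to be arranged from minimum to maximum first.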

Averages for Qualitative and Ranked Data

• The mode is always appropriate for Nominal Data (e.g. Gender and Nationality).

• Percentage and frequency are used to describe Categorical (Qualitative) Data (e.g. 80% of the sample have type 1 diabetes; or 93 out of 112 participants have type 1 diabetes).
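As a small illustration of reporting categorical data, the sketch below converts the frequency from the second example (93 out of 112 participants) into a percentage. The variable names are assumptions made for this example.

```python
# Frequency taken from the example above: 93 of 112 participants.
with_condition = 93
sample_size = 112

# Percentage = frequency / total * 100
percentage = with_condition / sample_size * 100

print(f"{with_condition} of {sample_size} participants ({percentage:.1f}%)")
```

Reporting both the frequency and the percentage lets the reader see the sample size behind the proportion.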

The Normal/Gaussian Distribution

Many biological phenomena are approximately normally distributed. This means that a value drawn at random from a dataset measuring one of those phenomena is most likely to fall near a certain value (µ: the mean of the dataset; the Central Tendency).

This probability decreases in either direction away from that value along the x axis (also called the horizontal spread). This is best represented by the Gaussian Curve (See Figure 1).

[Figure 1: The Gaussian curve. The image is Public Domain.]

Describing a dataset in these terms hints at its characteristics. For example, for blood pressure, most of the population is located at the center of the curve, where the blood pressure is normal, while small portions are located at the right segment of the curve (hypertensive) and at the left segment (hypotensive).

Gaussians have defining properties that allow us to use them to describe the data efficiently.

These properties are all incorporated in the mathematical equation of their curve. The curve always has a symmetrical, bell-shaped form describing a continuous variable. It peaks at the point midway along the horizontal spread and then tapers off gradually in either direction from the peak, without actually touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely. The last notable feature of Gaussians is that the mean, median, and mode all have the same value, located midway along the horizontal spread.
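The properties listed above can be checked numerically. The sketch below implements the standard normal density formula, which the text refers to but does not state; the function name normal_pdf and the chosen values of mu and sigma are assumptions for illustration only.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal curve with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Illustrative parameters (not from the article).
mu, sigma = 0.0, 1.0

peak = normal_pdf(mu, mu, sigma)         # height at the mean
left = normal_pdf(mu - 1.5, mu, sigma)   # 1.5 units below the mean
right = normal_pdf(mu + 1.5, mu, sigma)  # 1.5 units above the mean
far = normal_pdf(mu + 6.0, mu, sigma)    # far out in the right tail

print("symmetric:", math.isclose(left, right))
print("peak is highest:", peak > left)
print("tail is small but positive:", 0 < far < left)
```

The three checks correspond to symmetry, the single peak at the mean, and tails that approach the horizontal axis without touching it.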

Dispersion

Variability (dispersion) refers to how a set of data is spread out and provides a way to explain how sets of data differ. In a dataset, the four main ways to explain variability are: range, interquartile range, variance, and, most importantly, standard deviation.

Range

The range is the difference between the largest and smallest scores.

Example: Consider two datasets, where dataset 1 spans 20 to 38 and dataset 2 spans 11 to 52. Dataset 1 has a range of 38 - 20 = 18, while dataset 2 has a range of 52 - 11 = 41. Dataset 2 has a broader range and, hence, more variability than dataset 1 (even though both datasets have the same mean).
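Computing a range takes one subtraction. The sketch below applies it to the two datasets from the central tendency examples earlier in this article; the helper name value_range is an assumption for illustration.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

# Datasets reused from the mean and median examples.
dataset_a = [11, 4, 11, 19, 11, 7, 8, 21, 7]
dataset_b = [18, 12, 4, 26, 14, 18, 5]

range_a = value_range(dataset_a)  # 21 - 4 = 17
range_b = value_range(dataset_b)  # 26 - 4 = 22

print("range of dataset A:", range_a)
print("range of dataset B:", range_b)
```

Because the range depends only on the two most extreme values, a single unusual observation can change it substantially.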

Interquartile Range

The Interquartile Range (IQR) is similar to the range, only instead of stating the range for the whole dataset, you’re giving the range for the “middle fifty” percent of the values. It’s sometimes more useful than the range because it tells you where the central half of your values lie and is not affected by extreme values. The formula is IQR = Q3 – Q1, where Q3 is the third quartile (75th percentile) and Q1 is the first quartile (25th percentile).

Example: Figure 2 shows the IQR, represented by the box. The whiskers (the lines coming out from either side of the box) extend to the minimum and the maximum, covering the first and the last quarter of the data.
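The formula IQR = Q3 – Q1 can be sketched with the standard library's statistics.quantiles, applied to the dataset from the median example. Several conventions exist for computing quartiles; this example uses Python's default ("exclusive") method, so other software may report slightly different values.

```python
import statistics

# Dataset reused from the median example.
data = [18, 12, 4, 26, 14, 18, 5]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1

print("Q1:", q1)
print("Q2 (median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
```

With this method, Q1 = 5, Q3 = 18, and the IQR is 13; the second quartile matches the median of 14 found earlier.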

Standard Deviation

The Standard Deviation (SD) tells you how closely the data are clustered around the mean. A small SD means that your data are closely grouped, leading to a taller, narrower bell curve, while a large SD tells you that your data are more spread out. See Figure 3 for a graphic interpretation of the distribution of data as a function of the SD.
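The text does not give a formula, so the following is a sketch under the usual definition: the SD is the square root of the variance, and the variance is the average squared distance from the mean. It uses the dataset from the mean example and shows both the population version (divide by n) and the sample version (divide by n - 1).

```python
import math
import statistics

# Dataset reused from the mean example (mean = 11).
data = [11, 4, 11, 19, 11, 7, 8, 21, 7]

mean_value = statistics.mean(data)
squared_deviations = [(x - mean_value) ** 2 for x in data]

# Population variance divides by n; sample variance divides by n - 1.
population_variance = sum(squared_deviations) / len(data)
sample_variance = sum(squared_deviations) / (len(data) - 1)

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print("sum of squared deviations:", sum(squared_deviations))
print("population SD:", round(population_sd, 3))
print("sample SD:", round(sample_sd, 3))
```

The manual results agree with statistics.pstdev(data) and statistics.stdev(data), which implement the population and sample versions respectively.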

Practice Practice Practice!

Suppose that, after an IQ test, I ask you where a score of 90 lies within the distribution of test scores.

Undoubtedly, you will have no answer, since 90 is just a raw score and you weren’t provided with any other information about the distribution.

In contrast, if we told you that the mean of this test is 75, you would intuitively determine that your score is higher than the average of the test takers. In numbers, you subtracted the mean from your score (90 - 75 = 15), obtained a positive difference, and concluded that your score is greater than the mean.

But before you get too proud of yourself, you need to figure out how far your score is from the mean. Effectively, knowing your position relative to the mean helps you know your position within the distribution. Now you see where you are located among your peers!

Let’s try the same subtraction using other scores, say a score of 50. Subtract the mean (75) from it and the result is a negative number (50 - 75 = -25). Thus, the score is less than the mean. How about a score of 110 on this test? Again, you subtract the mean; the result is positive (110 - 75 = 35), so the score is greater than the mean.

In these three exercises, we have established that the sign of the difference indicates the direction of your score from the mean, while the magnitude of the difference denotes its distance from the mean. When the data are normally distributed, this distance also indicates where the score sits within the distribution. The differences from the mean we just calculated are called Deviation Scores, and their equation is (X - µ).

The X represents your raw score (the datapoint you are calculating for), while the µ represents the population mean.
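The three exercises can be reproduced with the deviation score equation (X - µ). This is a minimal sketch; the function and variable names are illustrative.

```python
def deviation_score(x, mu):
    """Deviation score: raw score minus the mean (X - mu)."""
    return x - mu

mean_score = 75           # mean of the test, from the example
scores = [90, 50, 110]    # raw scores used in the three exercises

deviations = {x: deviation_score(x, mean_score) for x in scores}

for x, d in deviations.items():
    if d > 0:
        direction = "above"
    elif d < 0:
        direction = "below"
    else:
        direction = "at"
    print(f"score {x}: deviation {d:+d}, {direction} the mean")
```

The sign gives the direction from the mean and the absolute value gives the distance, matching the results 15, -25, and 35 worked out above.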

Describing Data with Charts

To see “what’s going on with the data,” an appropriate chart is almost always a good idea.

A chart will often reveal previously unsuspected features of the data. Which chart is appropriate depends primarily on the type of data you are dealing with, as well as on the particular features of it you want to explore. In addition, a chart can often illustrate or explain a complex situation for which simple text, or a table, might be unorganized or too long.

What makes a good graph?

Edward Tufte set out the foundations of elegant data presentation with some principles for good data visualization:

1. Show the Data.

2. Direct the Reader to Think About the Data Being Presented.

3. Avoid Distorting the Data.

4. Present Many Numbers, With Minimum Ink.

5. Make Large Datasets Coherent.

6. Encourage the Eye to Compare Different Pieces of Data.

7. Reveal the Underlying Message of the Data.

We also need to avoid chart junk: do not use patterns, 3-D effects, or shadows. Also, do not hide effects or create false impressions of what the data show.

We can use different charts for different data or hypotheses. A Bar Chart is a graph with rectangular bars (see Figure 4). The graph usually compares different categories. Usually, the horizontal (x) axis represents the categories and the vertical (y) axis represents a value for those categories.

Another type is the Pie Chart (see Figure 5): a circular pie split into sectors, one for each category, so that the area of each sector is proportional to the frequency of that category in the dataset.

[Figure 5: Pie chart titled "Pet Ownership" with sectors for Dogs, Cats, Fish, Rabbits, and Rodents.]

The last common type of chart we are going to talk about is the Histogram (see Figure 6).

[Figure 6: Histogram titled "Histogram of arrivals".]

A histogram is similar to a bar chart, but there should be no gaps between the bars because the variable is continuous. The width of each bar of the histogram corresponds to a range of values for the variable. The histogram should be labeled carefully to make it clear where the boundaries lie.
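To make the idea of bar width concrete, the sketch below groups values into bins of equal width and counts how many fall in each, which is the calculation behind a histogram. It reuses the dataset from the mean example; the bin width of 5 and the helper name bin_counts are assumptions chosen for illustration.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count values per bin; each bin covers [lower, lower + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Dataset reused from the mean example.
data = [11, 4, 11, 19, 11, 7, 8, 21, 7]
bin_width = 5  # illustrative choice

counts = bin_counts(data, bin_width)

# Text rendering: one row per bin, labeled with its boundaries.
for lower, n in counts.items():
    print(f"[{lower}, {lower + bin_width}): {'#' * n}")
```

Each label states the lower boundary as included and the upper boundary as excluded, so a value such as 10 is counted in exactly one bin. Bins containing no values are omitted by this sketch; a plotted histogram would show them as zero-height bars.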
